Title: | Interactive and Reproducible Data Cleaning |
Version: | 1.0.5 |
Description: | Flexible and efficient cleaning of data with interactivity. 'datacleanr' facilitates best practices in data analyses and reproducibility with built-in features and by translating interactive/manual operations to code. The package is designed for interoperability, and so seamlessly fits into reproducible analyses pipelines in 'R'. |
License: | GPL-3 |
Suggests: | testthat (≥ 2.1.0) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.1 |
URL: | https://github.com/the-Hull/datacleanr |
BugReports: | https://github.com/the-Hull/datacleanr/issues |
Imports: | shiny (≥ 1.5.0), summarytools (≥ 0.9.6), dplyr (≥ 1.0.2), rlang (≥ 0.4.9), DT (≥ 0.16), magrittr (≥ 2.0.1), plotly (≥ 4.9.2.1), grDevices, stats, purrr (≥ 0.3.4), glue (≥ 1.4.2), formatR (≥ 1.7), RColorBrewer (≥ 1.1.2), clipr (≥ 0.7.1), rstudioapi (≥ 0.13), utils, lubridate (≥ 1.7.9.2), shinyWidgets (≥ 0.5.4), htmlwidgets (≥ 1.5.3), tools, fs (≥ 1.5.0), shinyFiles (≥ 0.8.0), bslib |
Depends: | R (≥ 3.6) |
NeedsCompilation: | no |
Packaged: | 2025-05-10 10:13:46 UTC; ahurl |
Author: | Alexander Hurley |
Maintainer: | Alexander Hurley <agl.hurley@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-05-10 18:10:05 UTC |
datacleanr: Interactive and Reproducible Data Cleaning
Description
Flexible and efficient cleaning of data with interactivity. 'datacleanr' facilitates best practices in data analyses and reproducibility with built-in features and by translating interactive/manual operations to code. The package is designed for interoperability, and so seamlessly fits into reproducible analyses pipelines in 'R'.
Author(s)
Maintainer: Alexander Hurley agl.hurley@gmail.com (ORCID) [copyright holder]
Other contributors:
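See Also
Useful links:
- https://github.com/the-Hull/datacleanr
- Report bugs at https://github.com/the-Hull/datacleanr/issues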
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Applies grouping to data set conditionally
Description
Applies grouping to data set conditionally
Usage
apply_data_set_up(df, group)
Arguments
df |
data frame |
group |
supply reactive output from group selector |
Value
returns df either grouped or not
Return x and y limits of "group-subsetted" dframe
Description
Used for adjusting layout of plotly plot based on selected groups in group_selector_table; currently used in viz tab
Usage
calc_limits_per_groups(dframe, group_index, xvar, yvar, scaling = 0.02)
Arguments
dframe |
dataframe/tibble, grouped/ungrouped |
group_index |
numeric, group indices for which to return lims |
xvar |
character, name of x var for plot (must exist in dframe) |
yvar |
character, name of y var for plot (must exist in dframe) |
scaling |
numeric, 1 +/- |
Value
list with xlim and ylim
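The idea of widening the limits of selected groups can be sketched in base R. This is only an illustration, not the package's internal code: the helper name `padded_limits`, the `group_col` argument and the interpretation of `scaling` as a multiplicative factor of 1 -/+ `scaling` are assumptions.

```r
# Hypothetical helper: x/y limits of selected groups, widened by 1 -/+ scaling.
padded_limits <- function(dframe, group_col, group_index, xvar, yvar,
                          scaling = 0.02) {
  # integer group index per row, following the order of the factor levels
  idx <- as.integer(factor(dframe[[group_col]]))
  keep <- idx %in% group_index
  widen <- function(v) {
    r <- range(v, na.rm = TRUE)
    # assumes positive limits; negative values would need a different rule
    r * c(1 - scaling, 1 + scaling)
  }
  list(
    xlim = widen(dframe[[xvar]][keep]),
    ylim = widen(dframe[[yvar]][keep])
  )
}

# limits for the first group (setosa) of iris
lims <- padded_limits(iris, "Species", group_index = 1,
                      xvar = "Sepal.Length", yvar = "Sepal.Width")
```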
Check for internet connection
Description
Check for internet connection
Usage
can_internet(url = "http://www.google.com")
Arguments
url |
character, valid path to url - user responsible |
Value
logical - TRUE or FALSE
check if a filter statement is valid
Description
check if a filter statement is valid
Usage
check_individual_statement(df, statement)
Arguments
df |
data frame / tibble to be filtered |
statement |
character string, |
Value
logical, did filter statement work?
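One way to test whether a filter statement works is to parse and evaluate it against the data frame and catch any error. The sketch below uses base R only; the helper name `statement_is_valid` is hypothetical, and the package may perform its check differently.

```r
# Hypothetical validity check: TRUE if the statement parses, evaluates
# against df without error, and yields a logical of usable length.
statement_is_valid <- function(df, statement) {
  result <- tryCatch(
    eval(parse(text = statement), envir = df, enclos = baseenv()),
    error = function(e) NULL
  )
  is.logical(result) && length(result) %in% c(1L, nrow(df))
}

statement_is_valid(iris, "Sepal.Width > 3")    # column exists
statement_is_valid(iris, "Spec == 'setosa'")   # column 'Spec' does not exist
```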
datacleanr server function
Description
datacleanr server function
Usage
datacleanr_server(input, output, session, dataset, df_name, is_on_disk)
Arguments
input , output , session |
standard |
dataset |
data.frame, tibble or data.table that needs cleaning |
df_name |
character, name of dataset or file_path passed into shiny app |
is_on_disk |
logical, whether df was read from file |
Interactive and reproducible data cleaning
Description
Launches the datacleanr app for interactive and reproducible cleaning.
See Details for more information.
Usage
dcr_app(dframe, browser = TRUE)
Arguments
dframe |
Character, a string naming a |
browser |
logical, should app start in OS's default browser? (default |
Details
datacleanr provides an interactive data overview, and allows reproducible filtering and (manual, interactive) visual outlier detection and annotation across multiple app tabs:
- Overview and Set-up: set groups (see below) and generate an exploratory summary of dframe
- Filtering: provide and apply filter statements (groupwise, see below and filter_scoped_df)
- Visualization and Annotating: interactive visualization allowing outlier highlighting, annotating and before/after histograms of displayed (numeric) variables
- Extraction: generates Reproducible Recipe and outputs
For data sets exceeding 1.5 million rows, we suggest splitting the data, if possible, by a grouping factor.
This is because at this volume interactive visualizations using plotly stretch the limits of what modern web browsers can handle.
A simple example using iris is:
iris_split <- split(iris, iris$Species)
dcr_app(iris_split[[1]])
# or
lapply(iris_split, dcr_app)
Extensive documentation is provided on each of the tabs for individual procedures in help links.
datacleanr relies on generating a column of unique IDs (.dcrkey) and on subsetting dframe into sub-groups (generated in-app, added as column .dcrindex) for filtering and visualization.
These groups are composed of unique combinations of columns in the data set (must be factor) and are passed to group_by. They are carried through the app for exploratory analyses (tab Overview and Set-up), filtering (tab Filtering) and plotting (tab Visualization).
These groups should ideally be chosen to facilitate a convenient filtering and viewing/cleaning process.
For example, a data set with time series of multiple sensors could be grouped by sensor and/or additional columns,
such that periods of interest can be visualized and cleaned simultaneously in the interactive plot.
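The two helper columns can be pictured with a small base-R sketch. It mimics the idea of a unique row key and an integer group index; the app creates these columns itself, and the numbering produced here is an assumption that need not match the app's for multi-column groupings.

```r
df <- iris

# unique row identifier
df$.dcrkey <- seq_len(nrow(df))

# integer index of the group each row belongs to; grouping columns must be
# factors. lex.order = TRUE orders groups by the first column first.
grouping <- c("Species")
df$.dcrindex <- as.integer(
  interaction(df[grouping], drop = TRUE, lex.order = TRUE)
)

# number of rows per group index
table(df$.dcrindex)
```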
Filtering is achieved by providing expressions that evaluate to TRUE/FALSE, and can be applied to the entire data set, or individual/all groups via scoped filtering (see filter_scoped_df).
The interactive visualization allows selecting and deselecting points with lasso and box select tools, as well as interactive zooming (toolbar or clicking on legend items or group overview table, see tab in-app) and panning (toolbar and hover over plot's axes). Data formats supported are:
- Observational (numeric), timeseries (POSIXct) and categorical data in x and y dimensions/axes
- Observational (numeric) data in z dimension (point size)
- Spatial data, when lon and lat in decimal degrees are present in x and y.
Displaying spatial data requires a Mapbox account, from which an access token needs to be copied into your .Renviron (e.g. MAPBOX_TOKEN=your_copied_token).
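Whether the token is visible to the current R session can be checked before launching the app. A minimal sketch, assuming the variable name MAPBOX_TOKEN from the example above; the helper name `has_mapbox_token` is hypothetical.

```r
# TRUE when a non-empty MAPBOX_TOKEN is set for this R session
has_mapbox_token <- function() nzchar(Sys.getenv("MAPBOX_TOKEN"))

if (!has_mapbox_token()) {
  message("MAPBOX_TOKEN is not set; add it to .Renviron and restart R.")
}
```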
Note that when a column .dcrflag (logical, TRUE/FALSE) is present in dframe, respective observations are given contrasting symbols (FALSE = circle, TRUE = star-triangle).
This column is employed as a cross-referencing tool, e.g. for other outlier detection or data-processing algorithms that were applied prior.
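As an illustration, a .dcrflag column could be derived from any prior outlier rule. The sketch below uses the boxplot rule (1.5 times the interquartile range) from base R; the choice of rule and variable is purely an example.

```r
flagged <- iris

# values of Sepal.Width lying beyond the boxplot whiskers
out_vals <- boxplot.stats(flagged$Sepal.Width)$out

# logical flag: TRUE for observations identified by the prior rule
flagged$.dcrflag <- flagged$Sepal.Width %in% out_vals

# flagged could then be passed on, e.g. dcr_app(flagged)
```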
The tab Extraction provides code to reproduce the entire procedure (a Reproducible Recipe), which
- can be copied, or sent directly to an active RStudio script when used interactively (i.e. when dframe is an object in R's environment),
- can be saved to disk with intermediate outputs (filter statements and selected outliers), where file names are based on the input file and configurable suffixes, when dframe is a path.
Value
When datacleanr is ended by clicking on Close in the app's navigation bar, a list is invisibly returned with the following items:
- df_name: character, object name/file path passed into dcr_app
- dcr_df: tibble, filtered data set with additional columns .dcrkey, .dcrindex, .annotation; the latter is NA for non-outliers, an empty string for outliers without annotation, and a custom string for annotated outliers
- dcr_selected_outliers: data.frame, contains the outlier .dcrkey, the .annotation and a selection_count (integer, count incrementer) column
- dcr_groups: character, a vector defining the groups (via group_by) used throughout datacleanr
- dcr_condition_df: tibble, with columns filter (character, statement used for filtering) and group (list, of integers), defining groups that correspond to .dcrindex
- dcr_code: character string, containing Reproducible Recipe
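The three states encoded in .annotation can be separated with a few lines of base R. The data frame below is a hand-made stand-in for dcr_df (its values are hypothetical); only the column semantics follow the description above.

```r
# toy stand-in for the returned dcr_df; values are made up for illustration
toy <- data.frame(
  .dcrkey = 1:4,
  .annotation = c(NA, "", "sensor drift", NA),
  stringsAsFactors = FALSE
)

# outliers carry a non-NA annotation (possibly an empty string)
is_outlier <- !is.na(toy$.annotation)

# annotated outliers additionally carry a non-empty string
is_annotated <- is_outlier & nzchar(toy$.annotation)

toy$.dcrkey[is_outlier]    # keys of all outliers
toy$.dcrkey[is_annotated]  # keys of annotated outliers
```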
Initial checks for data set
Description
Initial checks for data set
Usage
dcr_checks(dframe)
Arguments
dframe |
dframe supplied to |
extend brewer palette
Description
extend brewer palette
Usage
extend_palette(n)
Arguments
n |
numeric, number of colors |
Value
color vector of length n
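A common way to obtain more colors than a qualitative palette offers is to interpolate between its colors. The sketch below hard-codes the eight ColorBrewer "Dark2" colors and uses grDevices::colorRampPalette; the helper name `stretch_palette` and the choice of base palette are assumptions, not necessarily what the package does.

```r
# Hypothetical helper: return n colors, interpolating when n exceeds the
# size of the base palette.
stretch_palette <- function(n,
                            base = c("#1B9E77", "#D95F02", "#7570B3",
                                     "#E7298A", "#66A61E", "#E6AB02",
                                     "#A6761D", "#666666")) {
  if (n <= length(base)) {
    return(base[seq_len(n)])
  }
  grDevices::colorRampPalette(base)(n)
}

cols <- stretch_palette(20)
```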
Apply filter based on a statement, scoped to dplyr groups
Description
Apply filter based on a statement, scoped to dplyr groups
Usage
filter_scoped(dframe, statement, scope_at = NULL)
Arguments
dframe |
data.frame/tbl, grouped or ungrouped |
statement |
character, statement for filtering (only VALID expressions; use |
scope_at |
numeric, group indices to apply filter statements to |
Value
List, containing item filtered_df, a data.frame filtered based on statements and scope.
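The notion of scope can be illustrated without dplyr: the statement is evaluated separately within each group in scope, while all other groups are left untouched. This base-R sketch is not the package's implementation; the helper name `filter_in_scope` and its `group_col` argument are hypothetical.

```r
# Hypothetical helper: apply a filter statement group by group, but only to
# the groups whose index is listed in scope_at (NULL = all groups).
filter_in_scope <- function(dframe, group_col, statement, scope_at = NULL) {
  idx <- as.integer(factor(dframe[[group_col]]))
  expr <- parse(text = statement)
  keep <- rep(TRUE, nrow(dframe))
  groups <- sort(unique(idx))
  if (!is.null(scope_at)) {
    groups <- intersect(groups, scope_at)
  }
  for (g in groups) {
    rows <- which(idx == g)
    # evaluate within the group, so summaries such as mean() are group-wise
    res <- eval(expr, envir = dframe[rows, , drop = FALSE], enclos = baseenv())
    keep[rows] <- !is.na(res) & res
  }
  dframe[keep, , drop = FALSE]
}

# keep above-average sepal lengths in group 1 (setosa) only
scoped <- filter_in_scope(iris, "Species",
                          "Sepal.Length > mean(Sepal.Length)",
                          scope_at = 1)
```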
Filter / Subset data dplyr-groupwise
Description
filter_scoped_df subsets rows of a data frame based on grouping structure (see group_by). Filtering statements are provided in a separate tibble where each row combines a logical expression with a list of the groups to which the expression should be applied (corresponding to the indices from cur_group_id).
Usage
filter_scoped_df(dframe, condition_df)
Arguments
dframe |
A grouped or ungrouped |
condition_df |
A |
Details
This function is applied in the "Filtering" tab of the datacleanr app, and in the reproducible code recipe in the "Extract" tab.
Note that multiple checks for valid statements are performed in the app (and only valid operations are printed in the "Extract" tab). It is therefore not advisable to manually alter this code or use this function interactively.
Value
An object of the same type as dframe. The output is a subset of the input, with groups and rows appearing in the same order, and an additional column .dcrindex representing the group indices.
The output may have fewer groups than the input, depending on subsetting.
Examples
# set-up condition_df
cdf <- dplyr::tibble(
  statement = c(
    "Sepal.Width > quantile(Sepal.Width, 0.1)",
    "Petal.Width > quantile(Petal.Width, 0.1)",
    "Petal.Length > quantile(Petal.Length, 0.8)"
  ),
  scope_at = list(NULL, NULL, c(1, 2))
)

fdf <- filter_scoped_df(
  dplyr::group_by(
    iris,
    Species
  ),
  condition_df = cdf
)

# Example of invalid expression:
# column 'Spec' does not exist in iris
# "Spec == 'setosa'"
Identify columns carrying non-numeric values
Description
Identify columns carrying non-numeric values
Usage
get_factor_cols_idx(x)
Arguments
x |
data.frame |
Value
logical, is column in x non-numeric?
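A column-wise check of this kind can be written in one line of base R. This is a sketch of the idea rather than the package's code.

```r
# named logical vector: TRUE where the column is not numeric
non_numeric <- vapply(iris, function(col) !is.numeric(col), logical(1))

# names of candidate grouping columns
names(non_numeric)[non_numeric]
```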
Handle outlier trace
Description
Single outlier trace is added to plotly; interactive select/deselect was implemented by adjusting selected_points, and subsequently adding, or deleting+adding the (modified) trace at the end of the existing JS data array. Requires tracemap with trace names and corresponding indices.
Simple check for re-execution was implemented by passing on the selection keys to compare against on pertinent plotly_event.
Usage
handle_add_outlier_trace(
sp,
dframe,
ok,
selectors,
trace_map,
source = "scatterselect",
session
)
Arguments
sp |
selected points |
dframe |
plot data |
ok |
reactive, old keys |
selectors |
reactive input selectors |
trace_map |
numeric, max trace id |
source |
plotly source |
session |
active session |
Wrapper for adjusting axis lims and hiding traces
Description
Wrapper for adjusting axis lims and hiding traces
Usage
handle_restyle_traces(
source_id,
session,
dframe,
scaling = 0.05,
xvar,
yvar,
trace_map,
max_id_group_trace,
input_sel_rows,
flush = TRUE
)
Arguments
source_id |
character, plotly source id |
session |
session object |
dframe |
data frame/tibble (grouped/ungrouped) |
scaling |
numeric, 1 +/- scaling applied to x lims for xvar and yvar |
xvar |
character, name of xvar, must be in dframe |
yvar |
character, name of yvar, must be in dframe |
trace_map |
matrix, with columns for trace name (col 1) and trace id (col 2) |
max_id_group_trace |
numeric, max id of plotly trace from original data (not outlier traces) |
input_sel_rows |
numeric, input from DT grouptable |
flush |
character, |
Value
Used for its side effect - no return
Handle selection of outliers (with select - unselect capacity)
Description
Handle selection of outliers (with select - unselect capacity)
Usage
handle_sel_outliers(sel_old_df, sel_new)
Arguments
sel_old_df |
data.frame of selection info |
sel_new |
data.frame, event data from plotly, must have column |
Value
updated selection data frame
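The select/unselect behaviour can be thought of as a toggle on point keys: keys that were already selected and are selected again are dropped, new keys are added. The sketch below works on plain key vectors rather than data frames and is an assumption about the intended logic, not the package's implementation; `toggle_selection` is a hypothetical name.

```r
# Hypothetical toggle: re-selected keys are removed, new keys are appended
toggle_selection <- function(old_keys, new_keys) {
  c(setdiff(old_keys, new_keys), setdiff(new_keys, old_keys))
}

# key 3 was selected before and is selected again, so it is dropped;
# key 4 is new and gets added
toggle_selection(c(1, 2, 3), c(3, 4))
```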
Provide trace ids to set to invisible
Description
Provide trace ids to set to invisible
Usage
hide_trace_idx(trace_map, max_groups, selected_groups)
Arguments
trace_map |
matrix, with cols trace name (col 1), trace id (col 2) |
max_groups |
numeric, number of groups in grouptable |
selected_groups |
groups highlighted in grouptable |
Details
Provides the indices (JS notation, starting at 0) of the traces that are set to visible = 'legendonly' through plotly.restyle
Make grouping overview table
Description
Make grouping overview table
Usage
make_group_table(dframe)
Arguments
dframe |
data.frame |
Value
tibble with one row per group
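A one-row-per-group overview can be approximated in base R by counting observations per group. The column names `n_obs` and the content of the table are assumptions for illustration; the table produced in the app may hold different columns.

```r
# count observations per group, one row per group
group_counts <- as.data.frame(
  table(Species = iris$Species),
  responseName = "n_obs"
)

# running group index, analogous to .dcrindex
group_counts$.dcrindex <- seq_len(nrow(group_counts))
```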
Wrapper for saving files
Description
Wrapper for saving files
Usage
make_save_filepath(save_dir, input_filepath, suffix, ext)
Arguments
save_dir |
character, selected save dir |
input_filepath |
character, original file path to folder |
suffix |
character, e.g. 'CLEAN' or 'cleaning_script' |
ext |
character, file extension, no dot!! |
Value
OS-conform file path for saving
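Composing such a path from its parts might look as follows. The helper name `build_save_path` and the underscore separator between file stem and suffix are assumptions; only the argument meanings are taken from the table above.

```r
# Hypothetical helper: <save_dir>/<input file stem>_<suffix>.<ext>
build_save_path <- function(save_dir, input_filepath, suffix, ext) {
  stem <- tools::file_path_sans_ext(basename(input_filepath))
  file.path(save_dir, paste0(stem, "_", suffix, ".", ext))
}

build_save_path("out", "data/raw/iris.csv", suffix = "CLEAN", ext = "Rds")
```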
Server Module: apply / reset filter
Description
Server Module: apply / reset filter
Usage
module_server_apply_reset(input, output, session, df_filtered, df_original)
Arguments
input , output , session |
standard |
df_filtered |
reactive, filtered df |
df_original |
reactive, original df |
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_box_str_filter(input, output, session, selector, actionbtn)
Arguments
input , output , session |
standard |
selector |
character, html selector for placement |
actionbtn |
reactive, action button counter |
Server Module: checkbox rendering
Description
Server Module: checkbox rendering
Usage
module_server_checkbox(input, output, session, text)
Arguments
input , output , session |
standard |
text |
Character, appears next to checkbox (or coerced) |
Server Module: filter info text and filtered df output
Description
Server Module: filter info text and filtered df output
Usage
module_server_df_filter(input, output, session, dframe, condition_df)
Arguments
input , output , session |
standard |
dframe |
data frame/tibble for filtering |
condition_df |
data frame/tibble with filtering conditions and grouping scope |
Value
df, either filtered or original, based on validity of statements in condition_df
Server Module: Selection Annotator
Description
Server Module: Selection Annotator
Usage
module_server_extract_code(
input,
output,
session,
df_label,
filter_df,
gvar,
statements,
sel_points,
overwrite,
is_on_disk,
out_path
)
Arguments
input , output , session |
standard |
df_label |
string, name of original df input |
filter_df |
reactiveValue data frame with filter statements and scoping lvl |
gvar |
reactive character, grouping vars for |
statements |
reactive, lgl, vector of working statements |
sel_points |
reactiveValue, data frame with selected point keys, annotations, and selection count |
overwrite |
reactive value, TRUE/FALSE from checkbox input |
is_on_disk |
Logical, whether df represented by |
out_path |
reactive, List, with character strings providing directory paths and file names for saving/reading in code output |
Server Module: Extraction File selection menu
Description
Server Module: Extraction File selection menu
Usage
module_server_extract_code_fileconfig(
input,
output,
session,
df_label,
is_on_disk,
has_processed
)
Arguments
input , output , session |
standard |
df_label |
character, name of original df input |
is_on_disk |
Logical, whether df represented by |
has_processed |
reactive, logical, TRUE if filtered / selected points |
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_filter_str(input, output, session, dframe)
Arguments
input , output , session |
standard |
dframe |
data frame passed into dcr app |
Details
provides UI text box element
Server Module: Selection Annotator
Description
Server Module: Selection Annotator
Usage
module_server_group_relayout_buttons(input, output, session, startscatter)
Arguments
input , output , session |
standard |
startscatter |
reactive, actionbutton value |
Details
provides UI text box element
Value
reactive values with input xvar, yvar and actionbutton counter
Server Module: group selection
Description
Server Module: group selection
Usage
module_server_group_select(input, output, session, dframe)
Arguments
input , output , session |
standard |
dframe |
data frame for filtering |
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_group_selector_table(input, output, session, df, df_label, ...)
Arguments
input , output , session |
standard |
df |
data frame (either from overview or filtering tab) |
df_label |
character, original input data frame |
... |
arguments passed to |
Details
provides UI text box element
Server Module: dynamic histogram output for n vars str filter condition
Description
Server Module: dynamic histogram output for n vars str filter condition
Usage
module_server_histograms(
input,
output,
session,
dframe,
selector_inputs,
sel_points
)
Arguments
input , output , session |
standard |
dframe |
df |
selector_inputs |
reactive vals from above-plot controls, |
sel_points |
reactive, provides .dcrkey of selected points |
Details
provides UI buttons for deleting last / entire outlier selection
Value
reactive values with input xvar, yvar and actionbutton counter
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_lowercontrol_btn(
input,
output,
session,
selector_inputs,
action_track
)
Arguments
input , output , session |
standard |
selector_inputs |
reactive vals from above-plot controls, used to determine if plot is a map (lon/lat) |
action_track |
reactive, logical - has plot been pressed? |
Details
provides UI buttons for deleting last / entire outlier selection
Value
reactive values with input xvar, yvar and actionbutton counter
Server Module: DT for annotation
Description
Server Module: DT for annotation
Usage
module_server_plot_annotation_table(input, output, session, dframe, sel_points)
Arguments
input , output , session |
standard |
dframe |
df used for plotting |
sel_points |
numeric, vector of .dcrkeys selected in plot |
Value
df with .dcrkeys and annotations
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_plot_selectable(
input,
output,
session,
selector_inputs,
df,
sel_points,
mapstyle
)
Arguments
input , output , session |
standard |
selector_inputs |
reactive, output from module_plot_selectorcontrols |
df |
reactive df |
sel_points |
reactive, provides .dcrkey of selected points |
mapstyle |
reactive, selected mapstyle from below-plot controls |
Details
Provides the plot. Note that the data set needs a column .dcrkey, which is added in the initial processing step.
Server Module: box for str filter condition
Description
Server Module: box for str filter condition
Usage
module_server_plot_selectorcontrols(input, output, session, df)
Arguments
input , output , session |
standard |
df |
df (not reactive - prevent re-execution of observer) |
Details
provides UI text box element
Value
reactive values with input xvar, yvar and actionbutton counter
Server Module: data summary
Description
Server Module: data summary
Usage
module_server_summary(
input,
output,
session,
dframe,
df_label,
start_clicked,
group_var_check
)
Arguments
input , output , session |
standard |
dframe |
reactive, input data frame |
df_label |
character, name of initial data set |
start_clicked |
reactive holding start action button |
group_var_check |
reactive holding group check output |
Server Module: Selection Annotator
Description
Server Module: Selection Annotator
Usage
module_server_text_annotator(input, output, session, sel_data)
Arguments
input , output , session |
standard |
sel_data |
reactive df |
Details
provides UI text box element
Value
reactive values with input xvar, yvar and actionbutton counter
UI Module: Apply/Reset Filtering
Description
UI Module: Apply/Reset Filtering
Usage
module_ui_apply_reset(id)
Arguments
id |
Character, identifier for variable selection |
UI Module: box for str filter condition
Description
UI Module: box for str filter condition
Usage
module_ui_box_str_filter(id, actionbtn)
Arguments
id |
Character, identifier for variable selection |
actionbtn |
reactive, action button counter |
UI Module: data summary
Description
UI Module: data summary
Usage
module_ui_checkbox(id, cond_id)
Arguments
id |
shiny standard |
cond_id |
character, |
UI Module: filter info text output
Description
UI Module: filter info text output
Usage
module_ui_df_filter(id)
Arguments
id |
character, shiny namespacing |
Value
UI text element giving number of failed filters and percent of filtered rows
UI Module: Extraction Text output
Description
UI Module: Extraction Text output
Usage
module_ui_extract_code(id)
Arguments
id |
Character string |
UI Module: Extraction File selection menu
Description
UI Module: Extraction File selection menu
Usage
module_ui_extract_code_fileconfig(id)
Arguments
id |
Character string |
UI Module: box for str filter condition
Description
UI Module: box for str filter condition
Usage
module_ui_filter_str(id)
Arguments
id |
Character string |
UI Module: Grouptable Relayout Buttons
Description
UI Module: Grouptable Relayout Buttons
Usage
module_ui_group_relayout_buttons(id)
Arguments
id |
Character string |
UI Module: group selection
Description
UI Module: group selection
Usage
module_ui_group_select(id)
Arguments
id |
Character, identifier for variable selection |
UI Module: box for str filter condition
Description
UI Module: box for str filter condition
Usage
module_ui_group_selector_table(id)
Arguments
id |
Character string |
UI Module: dynamic histogram output for n vars
Description
UI Module: dynamic histogram output for n vars
Usage
module_ui_histograms(id)
Arguments
id |
Character string |
UI Module: Delete selection buttons
Description
UI Module: Delete selection buttons
Usage
module_ui_lowercontrol_btn(id)
Arguments
id |
Character string |
UI Module: DT for annotation
Description
UI Module: DT for annotation
Usage
module_ui_plot_annotation_table(id)
Arguments
id |
Character string |
UI Module: plotly plot
Description
UI Module: plotly plot
Usage
module_ui_plot_selectable(id)
Arguments
id |
Character string |
UI Module: selector controls
Description
UI Module: selector controls
Usage
module_ui_plot_selectorcontrols(id)
Arguments
id |
Character string |
UI Module: data summary
Description
UI Module: data summary
Usage
module_ui_summary(id)
Arguments
id |
shiny standard |
UI Module: Selection Annotator
Description
UI Module: Selection Annotator
Usage
module_ui_text_annotator(id)
Arguments
id |
Character string |
Method for printing dcr_code output
Description
Method for printing dcr_code output
Usage
## S3 method for class 'dcr_code'
print(x, ...)
Arguments
x |
character, code output from |
... |
additional arguments passed to |
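For readers unfamiliar with S3 print methods, the general pattern is sketched below with a made-up class `recipe_code`, so that nothing from the package is overridden. The method writes the code lines to the console and returns its input invisibly; how the package formats its output may differ.

```r
# S3 print method for a hypothetical class holding lines of code
print.recipe_code <- function(x, ...) {
  cat(unclass(x), sep = "\n")
  invisible(x)
}

recipe <- structure(
  c("library(dplyr)", "iris_clean <- filter(iris, Sepal.Width > 2)"),
  class = "recipe_code"
)

print(recipe)
```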
Split data.frame/tibble based on grouping
Description
Split data.frame/tibble based on grouping
Usage
split_groups(dframe)
Arguments
dframe |
data.frame |
Value
list of data frames