Title: Data Quality in Epidemiological Research
Version: 2.5.1
Description: Data quality assessments guided by a 'data quality framework introduced by Schmidt and colleagues, 2021' <doi:10.1186/s12874-021-01252-7> target the data quality dimensions integrity, completeness, consistency, and accuracy. The scope of applicable functions rests on the availability of extensive metadata which can be provided in spreadsheet tables. Either standardized (e.g. as 'html5' reports) or individually tailored reports can be generated. For an introduction to the specification of corresponding metadata, please refer to the 'package website' https://dataquality.qihs.uni-greifswald.de/VIN_Annotation_of_Metadata.html.
License: BSD_2_clause + file LICENSE
URL: https://dataquality.qihs.uni-greifswald.de/
BugReports: https://gitlab.com/libreumg/dataquier/-/issues
Depends: R (≥ 3.6.0)
Imports: dplyr (≥ 1.0.2), emmeans, ggplot2 (≥ 3.5.0), lme4, lubridate, MASS, MultinomialCI, parallelMap, patchwork (≥ 1.3.0), R.devices, rlang, robustbase, qmrparser, utils, rio, readr, scales, withr, lifecycle, units, methods
Suggests: openxlsx2, GGally, grDevices, jsonlite, cli, whoami, anytime, cowplot (≥ 0.9.4), digest, DT (≥ 0.23), flexdashboard, flexsiteboard, htmltools, knitr, markdown, parallel, parallelly, rJava, rmarkdown, rstudioapi, testthat (≥ 3.1.9), tibble, vdiffr, pkgload, Rdpack, callr, colorspace, plotly, ggvenn, htmlwidgets, future, processx, R6, shiny, xml2, mgcv, rvest, textutils, dbx, ggpubr, grImport2, rsvg, stringdist, rankICC, nnet, ordinal, storr, reticulate
VignetteBuilder: knitr
Encoding: UTF-8
KeepSource: FALSE
Language: en-US
RoxygenNote: 7.3.2
Config/testthat/parallel: true
Config/testthat/edition: 3
Config/testthat/start-first: dq_report_by_sm, dq_report2, dq_report_by_arguments, dq_report_by_s, int_encoding_errors, dq_report_by_pipesymbol_list, dq_report_by_m, plots, acc_loess, com_item_missingness, dq_report_by_na, dq_report_by_directories, con_limit_deviations, con_contradictions_redcap, com_segment_missingness, util_correct_variable_use
BuildManual: TRUE
NeedsCompilation: no
Packaged: 2025-03-05 17:44:09 UTC; struckmanns
Author: University Medicine Greifswald [cph], Elisa Kasbohm
Maintainer: Stephan Struckmann <stephan.struckmann@uni-greifswald.de>
Repository: CRAN
Date/Publication: 2025-03-05 18:10:02 UTC
The dataquieR package about Data Quality in Epidemiological Research
Description
For a quick start, please read dq_report2 and maybe the vignettes or the package's website.
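A minimal quick-start sketch (the file names are assumptions; see the package website for real example data):

```r
library(dataquieR)

# Assumption: "study_data.xlsx" holds the measurements and
# "meta_data_v2.xlsx" is a workbook-like metadata file with
# item_level, segment_level, ... sheets as described in the vignettes.
study_data <- rio::import("study_data.xlsx")
report <- dq_report2(
  study_data   = study_data,
  meta_data_v2 = "meta_data_v2.xlsx"
)
print(report)  # render the standardized report
```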
Options
This package features the following options():
Author(s)
Maintainer: Stephan Struckmann stephan.struckmann@uni-greifswald.de (ORCID)
Authors:
Elisa Kasbohm elisa.kasbohm@uni-greifswald.de (ORCID)
Elena Salogni elena.salogni@uni-greifswald.de (ORCID)
Joany Marino joany.marino@uni-greifswald.de (ORCID)
Adrian Richter richtera@uni-greifswald.de (ORCID)
Carsten Oliver Schmidt carsten.schmidt@uni-greifswald.de (ORCID)
Other contributors:
University Medicine Greifswald [copyright holder]
German Research Foundation (DFG SCHM 2744/3-1, SCHM 2744/9-1, SCHM 2744/3-4) [funder]
National Research Data Infrastructure for Personal Health Data: (NFDI 13/1) [funder]
European Union’s Horizon 2020 programme (euCanSHare, grant agreement No. 825903) [funder]
References
See Also
Useful links:
Report bugs at https://gitlab.com/libreumg/dataquier/-/issues
Other options: dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
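These options are set with base R's options() before running dataquieR functions; for example (the values shown are illustrative, consult each option's documentation for its actual default and allowed values):

```r
# Illustrative values only -- the defaults may differ.
options(
  dataquieR.MAX_LABEL_LEN = 30,
  dataquieR.CONDITIONS_WITH_STACKTRACE = TRUE
)
getOption("dataquieR.MAX_LABEL_LEN")
```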
Write single results from a dataquieR_resultset2 report
Description
Write single results from a dataquieR_resultset2 report
Usage
## S3 replacement method for class 'dataquieR_resultset2'
x$el <- value
Arguments
x |
the report |
el |
the index |
value |
the single result |
Value
the dataquieR
result object
Extract elements of a dataquieR
Result Object
Description
Extract elements of a dataquieR
Result Object
Usage
## S3 method for class 'dataquieR_result'
x$...
Arguments
x |
the |
... |
arguments passed to the implementation for lists. |
Value
the element of the dataquieR
result object with all messages
still attached
See Also
Access single results from a dataquieR_resultset2 report
Description
Access single results from a dataquieR_resultset2 report
Usage
## S3 method for class 'dataquieR_resultset2'
x$el
Arguments
x |
the report |
el |
the index |
Value
the dataquieR
result object
Holds Indicator / Descriptor assignments from the manual at run-time
Description
Holds Indicator / Descriptor assignments from the manual at run-time
Usage
..indicator_or_descriptor
Format
An object of class environment
of length 0.
Holds parts of the manual at run-time
Description
Holds parts of the manual at run-time
Usage
..manual
Format
An object of class environment
of length 0.
Access elements from a dataquieR_resultset2
Description
does so, but similar to [ for lists.
Usage
.access_dq_rs2(x, els)
Arguments
x |
the |
els |
the selector (character, number or logical) |
Value
the sub-list of x
Write elements from a dataquieR_resultset2
Description
does so, but similar to [ for lists.
Usage
.access_dq_rs2(x, els) <- value
Arguments
x |
the |
els |
the selector (character, number or logical) |
value |
|
Value
the modified x
Get Access to Utility Functions
Description
Usage
.get_internal_api(fkt, version = API_VERSION, or_newer = TRUE)
Arguments
fkt |
function name |
version |
version number to get |
Value
an API object
Roxygen-Template for indicator functions
Description
Roxygen-Template for indicator functions
Usage
.template_function_indicator(
resp_vars,
study_data,
label_col,
item_level,
meta_data,
meta_data_v2,
meta_data_dataframe,
meta_data_segment,
dataframe_level,
segment_level
)
Arguments
resp_vars |
variable the names of the measurement variables, if
missing or |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
meta_data_segment |
data.frame – optional: Segment level metadata |
dataframe_level |
data.frame alias for |
segment_level |
data.frame alias for |
Value
invisible(NULL)
Make normalizations of v2.0 item_level metadata.
Description
Requires the referred missing-tables to be available via prep_get_data_frame.
Usage
.util_internal_normalize_meta_data(
meta_data = "item_level",
label_col = LABEL,
verbose = TRUE
)
Arguments
meta_data |
data.frame old name for |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
verbose |
logical display all estimated decisions, defaults to |
Variable-argument roles
Description
A variable-argument role is the intended use of an argument of an indicator function – an argument that refers to variables.
In general, for the table .variable_arg_roles, the suffix _var means that one variable is allowed, while _vars means that more than one is allowed. The default sets of arguments for util_correct_variable_use/util_correct_variable_use2 are defined from the point of usage, e.g., if NAs could occur in the list of variable names, the function should be able to remove the respective response variables from the output rather than disallow them by setting allow_na to FALSE.
Usage
.variable_arg_roles
Format
An object of class tbl_df
(inherits from tbl
, data.frame
) with 14 rows and 9 columns.
See Also
Version of the API
Description
Version of the API
Usage
API_VERSION
Format
An object of class package_version
(inherits from numeric_version
) of length 1.
See Also
Cross-item level metadata attribute name
Description
The allowable direction of an association. The input is a string that can be either "positive" or "negative".
Usage
ASSOCIATION_DIRECTION
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
The allowable form of association. The string specifies the form based on a selected list.
Usage
ASSOCIATION_FORM
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
The metric underlying the association in ASSOCIATION_RANGE. The input is a string that specifies the analysis algorithm to be used.
Usage
ASSOCIATION_METRIC
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
Specifies the allowable range of an association. The inclusion of the endpoints follows standard mathematical notation using round brackets for open intervals and square brackets for closed intervals. Values must be separated by a semicolon.
Usage
ASSOCIATION_RANGE
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
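The interval notation of ASSOCIATION_RANGE can be sketched as a cross-item level metadata row (all column values below are illustrative):

```r
# Hypothetical cross-item level metadata row: square brackets include
# the endpoint, round brackets exclude it, values separated by ";".
cross_item_level <- data.frame(
  VARIABLE_LIST         = "SBP_1 | SBP_2",  # illustrative variable group
  ASSOCIATION_METRIC    = "pearson",        # illustrative metric name
  ASSOCIATION_RANGE     = "[0.5; 1]",       # closed interval from 0.5 to 1
  ASSOCIATION_DIRECTION = "positive"
)
```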
Cross-item level metadata attribute name
Description
Specifies the unique IDs for cross-item level metadata records
Usage
CHECK_ID
Format
An object of class character
of length 1.
Details
if missing, dataquieR
will create such IDs
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
Specifies the unique labels for cross-item level metadata records
Usage
CHECK_LABEL
Format
An object of class character
of length 1.
Details
if missing, dataquieR
will create such labels
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
types of value codes
Description
types of value codes
Usage
CODE_CLASSES
Format
An object of class list
of length 3.
Default Name of the Table featuring Code Lists
Description
Default Name of the Table featuring Code Lists
Metadata sheet name containing VALUE_LABEL_TABLES. This metadata sheet can contain value labels of several VALUE_LABEL_TABLE entries as well as missing and jump tables.
Usage
CODE_LIST_TABLE
Format
An object of class character
of length 1.
Only existence is checked, order not yet used
Description
Only existence is checked, order not yet used
Usage
CODE_ORDER
Format
An object of class character
of length 1.
Cross-item level metadata attribute name
Description
Note: in some prep_-functions, this field is named RULE
Usage
CONTRADICTION_TERM
Format
An object of class character
of length 1.
Details
Specifies a contradiction rule. Use REDCap-like syntax; see the online vignette.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
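A REDCap-like contradiction rule might look like the following cross-item level metadata row (variable labels and value labels are illustrative):

```r
# Hypothetical contradiction rule in REDCap-like syntax:
# flag records that report a pregnancy for male participants.
cross_item_level <- data.frame(
  CHECK_LABEL        = "pregnant male",
  VARIABLE_LIST      = "SEX_0 | PREGNANT_0",
  CONTRADICTION_TERM = '[SEX_0] = "males" and [PREGNANT_0] = "yes"',
  CONTRADICTION_TYPE = "logical"
)
```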
Cross-item level metadata attribute name
Description
Specifies the type of a contradiction. According to the data quality concept, there are logical and empirical contradictions, see online vignette
Usage
CONTRADICTION_TYPE
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
For contradiction rules, the required pre-processing steps can be given. Note: MISSING_LABEL will not work for non-factor variables.
Usage
DATA_PREPARATION
Format
An object of class character
of length 1.
Details
Allowed preparation steps: LABEL, LIMITS, MISSING_NA, MISSING_LABEL, MISSING_INTERPRET
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Data Types
Description
Data Types of Study Data
In the metadata, the following entries are allowed for the variable attribute DATA_TYPE:
Usage
DATA_TYPES
Format
An object of class list
of length 4.
Details
- integer for integer numbers
- string for text/string/character data
- float for decimal/floating point numbers
- datetime for timepoints
Data Types of Function Arguments
As function arguments, dataquieR uses additional type specifications:
- numeric is a numerical value (float or integer), but it is not an allowed DATA_TYPE in the metadata. However, some functions may accept float or integer for specific function arguments. This is where we use the term numeric.
- enum allows one element out of a set of allowed options, similar to match.arg.
- set allows a subset out of a set of allowed options, similar to match.arg with several.ok = TRUE.
- variable Function arguments of this type expect a character scalar that specifies one variable using the variable identifier given in the metadata attribute VAR_NAMES or, if label_col is set, using the metadata attribute given in that argument. Labels can easily be translated using prep_map_labels.
- variable list Function arguments of this type expect a character vector that specifies variables using the variable identifiers given in the metadata attribute VAR_NAMES or, if label_col is set, using the metadata attribute given in that argument. Labels can easily be translated using prep_map_labels.
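In item-level metadata, DATA_TYPE entries could look like this (all rows are illustrative):

```r
# Illustrative item-level metadata excerpt: one DATA_TYPE entry per
# variable, drawn from the allowed values in DATA_TYPES.
item_level <- data.frame(
  VAR_NAMES = c("v00001", "v00002", "v00003", "v00004"),
  LABEL     = c("AGE_0", "SEX_0", "SBP_0", "EXAM_DT_0"),
  DATA_TYPE = c("integer", "string", "float", "datetime")
)
```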
See Also
All available data types, mapped from their respective R types
Description
All available data types, mapped from their respective R types
Usage
DATA_TYPES_OF_R_TYPE
Format
An object of class list
of length 14.
See Also
Data frame level metadata attribute name
Description
Name of the data frame
Usage
DF_CODE
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
Number of expected data elements in a data frame (numeric). The check is only conducted if a number is entered.
Usage
DF_ELEMENT_COUNT
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
The name of the data frame containing the reference IDs to be compared with the IDs in the study data set.
Usage
DF_ID_REF_TABLE
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
All variables that are to be used as one single ID variable (combined key) in a data frame.
Usage
DF_ID_VARS
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
Name of the data frame
Usage
DF_NAME
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
The type of check to be conducted when comparing the reference ID table with the IDs delivered in the study data files.
Usage
DF_RECORD_CHECK
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
Number of expected data records in a data frame (numeric). The check is only conducted if a number is entered.
Usage
DF_RECORD_COUNT
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
Defines expectancies on the uniqueness of the IDs across the rows of a data frame, or the number of times some ID can be repeated.
Usage
DF_UNIQUE_ID
Format
An object of class character
of length 1.
See Also
Data frame level metadata attribute name
Description
Specifies whether identical data is permitted across rows in a data frame (excluding ID variables)
Usage
DF_UNIQUE_ROWS
Format
An object of class character
of length 1.
See Also
All available probability distributions for acc_shape_or_scale
Description
- uniform for a uniform distribution
- normal for a Gaussian distribution
- gamma for a gamma distribution
Usage
DISTRIBUTIONS
Format
An object of class list
of length 3.
Descriptor Function
Description
A function that returns some figure or table to assess data quality, but it does not return a value correlating with the magnitude of a data quality problem. It's the opposite of an Indicator.
The object Descriptor
only contains the name used internally to tag
such functions.
Usage
Descriptor
Format
An object of class character
of length 1.
See Also
Cross-item level metadata attribute name
Description
Defines the measurement variable to be used as a known gold standard. Only one variable can be defined as the gold standard.
Usage
GOLDSTANDARD
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Indicator Function
Description
A function that returns some value that correlates with the magnitude of a certain class of data quality problems. Typically, in dataquieR, such functions return a SummaryTable that features columns whose names start with a short abbreviation describing the specific semantics of the value (e.g., PCT for a percentage or COR for a correlation) followed by the public name of the indicator according to the data quality concept DQ_OBS, e.g., com_qum_nonresp for the item-non-response rate. A name could therefore be PCT_com_qum_nonresp.
The object Indicator only contains the name used internally to tag such functions.
Usage
Indicator
Format
An object of class character
of length 1.
See Also
An exception class assigned for exceptions caused by long variable labels
Description
An exception class assigned for exceptions caused by long variable labels
Usage
LONG_LABEL_EXCEPTION
Format
An object of class character
of length 1.
Cross-item level metadata attribute name
Description
Select whether to compute acc_multivariate_outlier.
Usage
MULTIVARIATE_OUTLIER_CHECK
Format
An object of class character
of length 1.
Details
You can leave the cell empty; then the behavior depends on the setting of the option dataquieR.MULTIVARIATE_OUTLIER_CHECK. If this column is missing, this is the same as having all cells empty and dataquieR.MULTIVARIATE_OUTLIER_CHECK set to "auto".
See also MULTIVARIATE_OUTLIER_CHECKTYPE.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
Select which outlier criteria to compute; see acc_multivariate_outlier.
Usage
MULTIVARIATE_OUTLIER_CHECKTYPE
Format
An object of class character
of length 1.
Details
You can leave the cell empty; then all checks will apply. If you enter a set of methods, the maximum for N_RULES changes. See also UNIVARIATE_OUTLIER_CHECKTYPE.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, N_RULES, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item and item level metadata attribute name
Description
Select how many violated outlier criteria make an observation an outlier; see acc_multivariate_outlier.
Usage
N_RULES
Format
An object of class character
of length 1.
Details
You can leave the cell empty; then all applied checks must deem an observation an outlier for it to be flagged. See UNIVARIATE_OUTLIER_CHECKTYPE and MULTIVARIATE_OUTLIER_CHECKTYPE for the selected outlier criteria.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, REL_VAL, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Cross-item level metadata attribute name
Description
Specifies the type of reliability or validity analysis. The string specifies the analysis algorithm to be used, and can be either "inter-class" or "intra-class".
Usage
REL_VAL
Format
An object of class character
of length 1.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, VARIABLE_LIST, meta_data_cross, util_normalize_cross_item()
Scale Levels
Description
Scale Levels of Study Data according to Stevens's Typology
In the metadata, the following entries are allowed for the variable attribute SCALE_LEVEL:
Usage
SCALE_LEVELS
Format
An object of class list
of length 5.
Details
- nominal for categorical variables
- ordinal for ordinal variables (i.e., comparison of values is possible)
- interval for interval scales, i.e., distances are meaningful
- ratio for ratio scales, i.e., ratios are meaningful
- na for variables that contain, e.g., unstructured texts, json, xml, ..., to distinguish them from variables that still need to have the SCALE_LEVEL estimated by prep_scalelevel_from_data_and_metadata()
Examples
- sex, eye color – nominal
- income group, education level – ordinal
- temperature in degrees Celsius – interval
- body weight, temperature in Kelvin – ratio
See Also
Segment level metadata attribute name
Description
The name of the data frame containing the reference IDs to be compared with the IDs in the targeted segment.
Usage
SEGMENT_ID_REF_TABLE
Format
An object of class character
of length 1.
See Also
Deprecated segment level metadata attribute name
Description
The name of the data frame containing the reference IDs to be compared with the IDs in the targeted segment.
Usage
SEGMENT_ID_TABLE
Format
An object of class character
of length 1.
Details
Please use SEGMENT_ID_REF_TABLE
Segment level metadata attribute name
Description
All variables that are to be used as one single ID variable (combined key) in a segment.
Usage
SEGMENT_ID_VARS
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
true or false to suppress crude segment missingness output (Completeness/Misg. Segments in the report). By default, the output is computed if more than one segment is available in the item-level metadata.
Usage
SEGMENT_MISS
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
The name of the segment participation status variable
Usage
SEGMENT_PART_VARS
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
The type of check to be conducted when comparing the reference ID table with the IDs in a segment.
Usage
SEGMENT_RECORD_CHECK
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
Number of expected data records in each segment (numeric). The check is only conducted if a number is entered.
Usage
SEGMENT_RECORD_COUNT
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
Segment level metadata attribute name
Usage
SEGMENT_UNIQUE_ID
Format
An object of class character
of length 1.
See Also
Segment level metadata attribute name
Description
Specifies whether identical data is permitted across rows in a segment (excluding ID variables)
Usage
SEGMENT_UNIQUE_ROWS
Format
An object of class character
of length 1.
See Also
Character used by default as a separator in metadata such as missing codes
Description
According to our metadata concept, this single character is "|".
Usage
SPLIT_CHAR
Format
An object of class character
of length 1.
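For example, several missing codes in one item-level metadata cell are separated by this character (the variable name and codes below are illustrative):

```r
# Illustrative: two missing codes for one variable, separated by "|"
# (SPLIT_CHAR); MISSING_LIST is a well-known item-level metadata column.
item_level <- data.frame(
  VAR_NAMES    = "v00042",
  MISSING_LIST = "99980 | 99983"
)
```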
Valid unit symbols according to units::valid_udunits()
Description
like m, g, N, ...
See Also
Other UNITS: UNIT_IS_COUNT, UNIT_PREFIXES, UNIT_SOURCES, WELL_KNOWN_META_VARIABLE_NAMES
Is a unit a count according to units::valid_udunits()
Description
see column def therein
Details
like %, ppt, ppm
See Also
Other UNITS: UNITS, UNIT_PREFIXES, UNIT_SOURCES, WELL_KNOWN_META_VARIABLE_NAMES
Valid unit prefixes according to units::valid_udunits_prefixes()
Description
like k, m, M, c, ...
See Also
Other UNITS: UNITS, UNIT_IS_COUNT, UNIT_SOURCES, WELL_KNOWN_META_VARIABLE_NAMES
Maturity stage of a unit according to units::valid_udunits()
Description
see column source_xml therein, i.e., base, derived, accepted, or common
See Also
Other UNITS: UNITS, UNIT_IS_COUNT, UNIT_PREFIXES, WELL_KNOWN_META_VARIABLE_NAMES
Item level metadata attribute name
Description
Select which outlier criteria to compute; see acc_univariate_outlier.
Usage
UNIVARIATE_OUTLIER_CHECKTYPE
Format
An object of class character
of length 1.
Details
You can leave the cell empty; then all checks will apply. If you enter a set of methods, the maximum for N_RULES changes. See also MULTIVARIATE_OUTLIER_CHECKTYPE.
See Also
WELL_KNOWN_META_VARIABLE_NAMES
Requirement levels of certain metadata columns
Description
These levels are cumulatively used by the function prep_create_meta and
related in the argument level
therein.
Usage
VARATT_REQUIRE_LEVELS
Format
An object of class list
of length 5.
Details
currently available:
'COMPATIBILITY' = "compatibility"
'REQUIRED' = "required"
'RECOMMENDED' = "recommended"
'OPTIONAL' = "optional"
'TECHNICAL' = "technical"
Cross-item level metadata attribute name
Description
Specifies a group of variables for multivariate analyses. Separated by |, please use variable names from VAR_NAMES or a label as specified in label_col, usually LABEL or LONG_LABEL.
Usage
VARIABLE_LIST
Format
An object of class character
of length 1.
Details
if missing, dataquieR
will create such IDs from CONTRADICTION_TERM,
if specified.
See Also
Other meta_data_cross: ASSOCIATION_DIRECTION, ASSOCIATION_FORM, ASSOCIATION_METRIC, ASSOCIATION_RANGE, CHECK_ID, CHECK_LABEL, CONTRADICTION_TERM, CONTRADICTION_TYPE, DATA_PREPARATION, GOLDSTANDARD, MULTIVARIATE_OUTLIER_CHECK, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, REL_VAL, meta_data_cross, util_normalize_cross_item()
Variable roles can be one of the following:
Description
- intro a variable holding consent-data
- primary a primary outcome variable
- secondary a secondary outcome variable
- process a variable describing the measurement process
- suppress a variable added on the fly when computing sub-reports, e.g., by dq_report_by, to have all referred variables available even if they are not part of the currently processed segment. They will only be fully assessed in their real segment's report.
Usage
VARIABLE_ROLES
Format
An object of class list
of length 5.
Well-known metadata column names, names of metadata columns
Description
Names of the variable attributes in the metadata frame holding:
- the names of the respective observers and devices
- lower and upper limits for plausible values
- lower and upper limits for allowed values
- the variable name (column name, e.g., v0020349) used in the study data
- the variable name used for processing (readable name, e.g., RR_DIAST_1) and in parameters of the QA functions
- the variable label, long label, and short label
- the variable data type (see also DATA_TYPES)
- re-codes for the definition of lists of event categories, missing lists and jump lists as CSV strings
For valid units see UNITS.
Usage
WELL_KNOWN_META_VARIABLE_NAMES
Format
An object of class list
of length 58.
Details
all entries of this list will be mapped to the package's exported NAMESPACE environment directly, i.e. they are available directly by their names too:
See Also
meta_data_segment for STUDY_SEGMENT
Other UNITS: UNITS, UNIT_IS_COUNT, UNIT_PREFIXES, UNIT_SOURCES
Examples
print(WELL_KNOWN_META_VARIABLE_NAMES$VAR_NAMES)
# print(VAR_NAMES) # should usually also work
Write to a report
Description
Overwriting of elements is only supported list-wise.
Usage
## S3 replacement method for class 'dataquieR_resultset2'
x[...] <- value
Arguments
x |
a 'dataquieR_resultset2 |
... |
if this contains only one entry and this entry is not named
or its name is |
value |
new value to write |
Value
nothing, stops
Extract Parts of a dataquieR
Result Object
Description
Extract Parts of a dataquieR
Result Object
Usage
## S3 method for class 'dataquieR_result'
x[...]
Arguments
x |
the |
... |
arguments passed to the implementation for lists. |
Value
the sub-list of the dataquieR
result object with all messages
still attached
See Also
Get a subset of a dataquieR
dq_report2
report
Description
Get a subset of a dataquieR
dq_report2
report
Usage
## S3 method for class 'dataquieR_resultset2'
x[row, col, res, drop = FALSE, els = row]
Arguments
x |
the report |
row |
the variable names, must be unique |
col |
the function-call-names, must be unique |
res |
the result slot, must be unique |
drop |
drop, if length is 1 |
els |
used, if in list-mode with named argument |
Value
a list with results, depending on drop
and the number of results,
the list may contain all requested results in sub-lists. The order
of the results follows the order of the row/column/result-names given
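For illustration, subsetting a report might look as follows (a hedged sketch; the report object, the variable labels and the indicator-function name are placeholders for objects from your own dq_report2() run):

```r
## Not run:
report <- dq_report2(study_data, meta_data_v2 = "meta_data_v2.xlsx")  # hypothetical inputs
report["SBP_0", "acc_margins"]   # one variable, one indicator function
report["SBP_0", ]                # all results for one variable
report[, "acc_margins"]          # one indicator function for all variables
## End(Not run)
```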
Set a single result from a dataquieR 2
report
Description
Set a single result from a dataquieR 2
report
Usage
## S3 replacement method for class 'dataquieR_resultset2'
x[[el]] <- value
Arguments
x |
the report |
el |
the index |
value |
the single result |
Value
the dataquieR
result object
Extract Elements of a dataquieR
Result Object
Description
Extract Elements of a dataquieR
Result Object
Usage
## S3 method for class 'dataquieR_result'
x[[...]]
Arguments
x |
the dataquieR_result object |
... |
arguments passed to the implementation for lists. |
Value
the element of the dataquieR
result object with all messages
still attached
See Also
Get a single result from a dataquieR 2
report
Description
Get a single result from a dataquieR 2
report
Usage
## S3 method for class 'dataquieR_resultset2'
x[[el]]
Arguments
x |
the report |
el |
the index |
Value
the dataquieR
result object
Plots and checks for distributions for categorical variables
Description
To complete
Usage
acc_cat_distributions(
resp_vars = NULL,
group_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable the name of the measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
To complete
Value
A list with:
- SummaryPlot: ggplot2::ggplot for the response variable in resp_vars.
See Also
Plots and checks for distributions
Description
Data quality indicator checks "Unexpected location" and "Unexpected proportion" with histograms.
Usage
acc_distributions(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
check_param = c("any", "location", "proportion"),
plot_ranges = TRUE,
flip_mode = "noflip",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the names of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
check_param |
enum any | location | proportion. Which type of check should be conducted (if possible): a check on the location of the mean or median value of the study data, a check on proportions of categories, or either of them if the necessary metadata is available. |
plot_ranges |
logical Should the plot show ranges and results from the data quality checks? (default: TRUE) |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped, or
auto-flipped? Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
A list with:
- SummaryTable: data.frame containing data quality checks for "Unexpected location" (FLG_acc_ud_loc) and "Unexpected proportion" (FLG_acc_ud_prop) for each response variable in resp_vars.
- SummaryData: a data.frame containing data quality checks for "Unexpected location" and/or "Unexpected proportion" for a report.
- SummaryPlotList: list of ggplot2::ggplots for each response variable in resp_vars.
Algorithm of this implementation:
1. If no response variable is defined, select all variables of type float or integer in the study data.
2. Remove missing codes from the study data (if defined in the metadata).
3. Remove measurements deviating from (hard) limits defined in the metadata (if defined).
4. Exclude variables containing only NA or only one unique value (excluding NAs).
5. Perform the check for "Unexpected location" if defined in the metadata (needs a LOCATION_METRIC (mean or median) and a LOCATION_RANGE (range of expected values for the mean or median, respectively)).
6. Perform the check for "Unexpected proportion" if defined in the metadata (needs a PROPORTION_RANGE (range of expected values for the proportions of the categories)).
7. Plot histogram(s).
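A minimal call running the steps above might look like this (hedged sketch; the study data frame and the variable labels are placeholders for your own study data and item-level metadata):

```r
## Not run:
res <- acc_distributions(
  resp_vars  = c("SBP_0", "DBP_0"),  # hypothetical variable labels
  study_data = study_data,
  item_level = item_level,
  label_col  = "LABEL"
)
res$SummaryPlotList  # one histogram per response variable
## End(Not run)
```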
See Also
ECDF plots for distribution checks
Description
Data quality indicator checks "Unexpected location" and "Unexpected proportion" if a grouping variable is included: Plots of empirical cumulative distributions for the subgroups.
Usage
acc_distributions_ecdf(
resp_vars = NULL,
group_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot",
dataquieR.max_group_var_levels_in_plot_default),
n_obs_per_group_min = getOption("dataquieR.min_obs_per_group_var_in_plot",
dataquieR.min_obs_per_group_var_in_plot_default)
)
Arguments
resp_vars |
variable list the names of the measurement variables |
group_vars |
variable list the name of the observer, device or reader variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
n_group_max |
maximum number of categories to be displayed individually
for the grouping variable ( |
n_obs_per_group_min |
minimum number of data points per group to create
a graph for an individual category of the |
Value
A list with:
- SummaryPlotList: list of ggplot2::ggplots for each response variable in resp_vars.
See Also
Plots and checks for distributions – Location
Description
Data quality indicator checks "Unexpected location" and "Unexpected proportion" with histograms.
Usage
acc_distributions_loc(
resp_vars = NULL,
study_data,
label_col = VAR_NAMES,
item_level = "item_level",
check_param = "location",
plot_ranges = TRUE,
flip_mode = "noflip",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the names of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
check_param |
enum any | location | proportion. Which type of check should be conducted (if possible): a check on the location of the mean or median value of the study data, a check on proportions of categories, or either of them if the necessary metadata is available. |
plot_ranges |
logical Should the plot show ranges and results from the data quality checks? (default: TRUE) |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped, or
auto-flipped? Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
A list with:
- SummaryTable: data.frame containing data quality checks for "Unexpected location" (FLG_acc_ud_loc) and "Unexpected proportion" (FLG_acc_ud_prop) for each response variable in resp_vars.
- SummaryData: a data.frame containing data quality checks for "Unexpected location" and/or "Unexpected proportion" for a report.
- SummaryPlotList: list of ggplot2::ggplots for each response variable in resp_vars.
Algorithm of this implementation:
1. If no response variable is defined, select all variables of type float or integer in the study data.
2. Remove missing codes from the study data (if defined in the metadata).
3. Remove measurements deviating from (hard) limits defined in the metadata (if defined).
4. Exclude variables containing only NA or only one unique value (excluding NAs).
5. Perform the check for "Unexpected location" if defined in the metadata (needs a LOCATION_METRIC (mean or median) and a LOCATION_RANGE (range of expected values for the mean or median, respectively)).
6. Perform the check for "Unexpected proportion" if defined in the metadata (needs a PROPORTION_RANGE (range of expected values for the proportions of the categories)).
7. Plot histogram(s).
See Also
Plots and checks for distributions – only
Description
Usage
acc_distributions_only(
resp_vars = NULL,
study_data,
label_col = VAR_NAMES,
item_level = "item_level",
flip_mode = "noflip",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the names of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped, or
auto-flipped? Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
A list with:
- SummaryTable: data.frame containing data quality checks for "Unexpected location" (FLG_acc_ud_loc) and "Unexpected proportion" (FLG_acc_ud_prop) for each response variable in resp_vars.
- SummaryData: a data.frame containing data quality checks for "Unexpected location" and/or "Unexpected proportion" for a report.
- SummaryPlotList: list of ggplot2::ggplots for each response variable in resp_vars.
Algorithm of this implementation:
1. If no response variable is defined, select all variables of type float or integer in the study data.
2. Remove missing codes from the study data (if defined in the metadata).
3. Remove measurements deviating from (hard) limits defined in the metadata (if defined).
4. Exclude variables containing only NA or only one unique value (excluding NAs).
5. Perform the check for "Unexpected location" if defined in the metadata (needs a LOCATION_METRIC (mean or median) and a LOCATION_RANGE (range of expected values for the mean or median, respectively)).
6. Perform the check for "Unexpected proportion" if defined in the metadata (needs a PROPORTION_RANGE (range of expected values for the proportions of the categories)).
7. Plot histogram(s).
See Also
Plots and checks for distributions – Proportion
Description
Data quality indicator checks "Unexpected location" and "Unexpected proportion" with histograms.
Usage
acc_distributions_prop(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
check_param = "proportion",
plot_ranges = TRUE,
flip_mode = "noflip",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the names of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
check_param |
enum any | location | proportion. Which type of check should be conducted (if possible): a check on the location of the mean or median value of the study data, a check on proportions of categories, or either of them if the necessary metadata is available. |
plot_ranges |
logical Should the plot show ranges and results from the data quality checks? (default: TRUE) |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped, or
auto-flipped? Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
A list with:
- SummaryTable: data.frame containing data quality checks for "Unexpected location" (FLG_acc_ud_loc) and "Unexpected proportion" (FLG_acc_ud_prop) for each response variable in resp_vars.
- SummaryData: a data.frame containing data quality checks for "Unexpected location" and/or "Unexpected proportion" for a report.
- SummaryPlotList: list of ggplot2::ggplots for each response variable in resp_vars.
Algorithm of this implementation:
1. If no response variable is defined, select all variables of type float or integer in the study data.
2. Remove missing codes from the study data (if defined in the metadata).
3. Remove measurements deviating from (hard) limits defined in the metadata (if defined).
4. Exclude variables containing only NA or only one unique value (excluding NAs).
5. Perform the check for "Unexpected location" if defined in the metadata (needs a LOCATION_METRIC (mean or median) and a LOCATION_RANGE (range of expected values for the mean or median, respectively)).
6. Perform the check for "Unexpected proportion" if defined in the metadata (needs a PROPORTION_RANGE (range of expected values for the proportions of the categories)).
7. Plot histogram(s).
See Also
Extension of acc_shape_or_scale to examine uniform distributions of end digits
Description
This implementation contrasts the empirical distribution of a measurement variable against assumed distributions. The approach is adapted from the idea of rootograms (Tukey (1977)), which is also applicable for count data (Kleiber and Zeileis (2016)).
Usage
acc_end_digits(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable the names of the measurement variables, mandatory |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
a list with:
- SummaryTable: data.frame with the columns Variables and FLG_acc_ud_shape
- SummaryPlot: ggplot2 distribution plot comparing expected with observed distribution
ALGORITHM OF THIS IMPLEMENTATION:
1. This implementation is restricted to data of type float or integer.
2. Missing codes are removed from resp_vars (if defined in the metadata).
3. The user must specify the column of the metadata containing the probability distribution (currently only: normal, uniform, gamma).
4. Parameters of each distribution can be estimated from the data or are specified by the user.
5. A histogram-like plot contrasts the empirical vs. the technical distribution.
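A minimal call might look as follows (hedged sketch; the study data frame and the variable label are placeholders for your own study data and item-level metadata):

```r
## Not run:
res <- acc_end_digits(
  resp_vars  = "SBP_0",              # hypothetical variable label
  study_data = study_data,
  item_level = item_level,
  label_col  = "LABEL"
)
res$SummaryPlot
## End(Not run)
```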
See Also
Smoothes and plots adjusted longitudinal measurements and longitudinal trends from logistic regression models
Description
The following R implementation executes calculations for the quality indicator "Unexpected location". Local regression (LOESS) is a versatile statistical method to explore an averaged course of time series measurements (Cleveland, Devlin, and Grosse 1988). In the context of epidemiological data, repeated measurements using the same measurement device or by the same examiner can be considered a time series. LOESS makes it possible to explore changes in these measurements over time.
Usage
acc_loess(
resp_vars,
group_vars = NULL,
time_vars,
co_vars = NULL,
study_data,
label_col = VAR_NAMES,
item_level = "item_level",
min_obs_in_subgroup = 30,
resolution = 80,
comparison_lines = list(type = c("mean/sd", "quartiles"), color = "grey30", linetype =
2, sd_factor = 0.5),
mark_time_points = getOption("dataquieR.acc_loess.mark_time_points",
dataquieR.acc_loess.mark_time_points_default),
plot_observations = getOption("dataquieR.acc_loess.plot_observations",
dataquieR.acc_loess.plot_observations_default),
plot_format = getOption("dataquieR.acc_loess.plot_format",
dataquieR.acc_loess.plot_format_default),
meta_data = item_level,
meta_data_v2,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot",
dataquieR.max_group_var_levels_in_plot_default),
enable_GAM = getOption("dataquieR.GAM_for_LOESS", dataquieR.GAM_for_LOESS.default),
exclude_constant_subgroups =
getOption("dataquieR.acc_loess.exclude_constant_subgroups",
dataquieR.acc_loess.exclude_constant_subgroups.default),
min_bandwidth = getOption("dataquieR.acc_loess.min_bw",
dataquieR.acc_loess.min_bw.default),
min_proportion = getOption("dataquieR.acc_loess.min_proportion",
dataquieR.acc_loess.min_proportion.default)
)
Arguments
resp_vars |
variable the name of the continuous measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
time_vars |
variable the name of the variable giving the time of measurement |
co_vars |
variable list a vector of covariables for adjustment, for example age and sex. Can be NULL (default) for no adjustment. |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
min_obs_in_subgroup |
integer (optional argument) If |
resolution |
numeric the maximum number of time points used for plotting the trend lines |
comparison_lines |
list type and style of lines with which trend
lines are to be compared. Can be mean +/- 0.5
standard deviation (the factor can be specified
differently in |
mark_time_points |
logical mark time points with observations (caution, there may be many marks) |
plot_observations |
logical show observations as scatter plot in the
background. If there are |
plot_format |
enum AUTO | COMBINED | FACETS | BOTH. Return the plot
as one combined plot for all groups or as
facet plots (one figure per group). |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
n_group_max |
integer maximum number of categories to be displayed
individually for the grouping variable ( |
enable_GAM |
logical Can LOESS computations be replaced by generalized additive models to reduce memory consumption for large datasets? |
exclude_constant_subgroups |
logical Should subgroups with constant values be excluded? |
min_bandwidth |
numeric lower limit for the LOESS bandwidth, should be greater than 0 and less than or equal to 1. In general, increasing the bandwidth leads to a smoother trend line. |
min_proportion |
numeric lower limit for the proportion of the smaller group (cases or controls) for creating a LOESS figure, should be greater than 0 and less than 0.4. |
Details
If mark_time_points
or plot_observations
is selected, but would result in
plotting more than 400 points, only a sample of the data will be displayed.
Limitations
The application of LOESS requires model fitting, i.e. the smoothness
of a model is subject to a smoothing parameter (span).
Particularly in the presence of interval-based missing data, high
variability of measurements combined with a low number of
observations in one level of the group_vars
may distort the fit.
Since our approach handles data without knowledge
of such underlying characteristics, finding the best fit is complicated if
computational costs are to be kept minimal. The default of
LOESS in R uses a span of 0.75, which provides reasonable fits in most cases.
The function acc_loess
adapts the span for each level of the group_vars
(with at least as many observations as specified in min_obs_in_subgroup
and with at least three time points) based on the respective
number of observations.
LOESS consumes a lot of memory for larger datasets. That is why acc_loess
switches to a generalized additive model with integrated smoothness
estimation (gam
by mgcv
) if there are 1000 observations or more for
at least one level of the group_vars
(similar to geom_smooth
from ggplot2
).
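A minimal call might look as follows (hedged sketch; the study data frame and all variable labels are placeholders for your own study data and item-level metadata):

```r
## Not run:
res <- acc_loess(
  resp_vars  = "SBP_0",              # hypothetical variable labels
  group_vars = "OBSERVER_0",
  time_vars  = "EXAM_DT_0",
  co_vars    = c("AGE_0", "SEX_0"),
  study_data = study_data,
  item_level = item_level,
  label_col  = "LABEL"
)
res$SummaryPlotList
## End(Not run)
```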
Value
a list with:
- SummaryPlotList: list with two plots if plot_format = "BOTH", otherwise one of the two figures described below:
  - Loess_fits_facets: The plot contains LOESS-smoothed curves for each level of the group_vars in a separate panel. Added trend lines represent mean and standard deviation or quartiles (specified in comparison_lines) for moving windows over the whole data.
  - Loess_fits_combined: This plot combines all curves into one panel. Given a low number of levels in the group_vars, this plot eases comparisons. However, if the number of levels increases, this plot may be too crowded and unclear.
See Also
Estimate marginal means, see emmeans::emmeans
Description
This function examines the impact of so-called process variables on a measurement variable. This implementation combines a descriptive and a model-based approach. Process variables that can be considered in this implementation must be categorical. It is currently not possible to consider more than one process variable within one function call. The measurement variable can be adjusted for (multiple) covariables, such as age or sex, for example.
The marginal means approach rests on model-based results, i.e., a significantly different marginal mean depends on sample size. Particularly in large studies, small and irrelevant differences may become significant. The contrary holds if the sample size is low.
Usage
acc_margins(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
study_data,
label_col,
item_level = "item_level",
threshold_type = "empirical",
threshold_value,
min_obs_in_subgroup = 5,
min_obs_in_cat = 5,
dichotomize_categorical_resp = TRUE,
cut_off_linear_model_for_ord = 10,
meta_data = item_level,
meta_data_v2,
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default),
include_numbers_in_figures = getOption("dataquieR.acc_margins_num",
dataquieR.acc_margins_num_default),
n_violin_max = getOption("dataquieR.max_group_var_levels_with_violins",
dataquieR.max_group_var_levels_with_violins_default)
)
Arguments
resp_vars |
variable the name of the measurement variable |
group_vars |
variable list len=1-1. the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
threshold_type |
enum empirical | user | none. In case |
threshold_value |
numeric a multiplier or absolute value (see
|
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
min_obs_in_cat |
integer This optional argument specifies the minimum
number of observations that is required to include
a category (level) of the outcome ( |
dichotomize_categorical_resp |
logical Should nominal response variables always be transformed to binary variables? |
cut_off_linear_model_for_ord |
integer from=0. This optional argument
specifies the minimum number of observations for
individual levels of an ordinal outcome ( |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations? Note that ordinal grouping variables will not be reordered. |
include_numbers_in_figures |
logical Should the figure report the number of observations for each level of the grouping variable? |
n_violin_max |
integer from=0. This optional argument specifies
the maximum number of levels of the |
Details
Limitations
Selecting the appropriate distribution is complex. Dozens of continuous,
discrete or mixed distributions are conceivable in the context of
epidemiological data. Their exact exploration is beyond the scope of this
data quality approach. The present function uses the help function
util_dist_selection, the assigned SCALE_LEVEL
and the DATA_TYPE
to discriminate the following cases:
continuous data
binary data
count data with <= 20 distinct values
count data with > 20 distinct values (treated as continuous)
nominal data
ordinal data
Continuous data and count data with more than 20 distinct values are analyzed
by linear models. Count data with up to 20 distinct values are modeled by a
Poisson regression. For binary data, the implementation uses logistic
regression.
Nominal response variables will either be transformed to binary variables or
analyzed by multinomial logistic regression models. The latter option is only
available if the argument dichotomize_categorical_resp
is set to FALSE
and if the package nnet
is installed. The transformation to a binary
variable can be user-specified using the metadata columns RECODE_CASES
and/or RECODE_CONTROL
. Otherwise, the most frequent category will be
assigned to cases and the remaining categories to controls.
For ordinal response variables, the argument cut_off_linear_model_for_ord
controls whether the data is analyzed in the same way as continuous data:
If every level of the variable has at least as many observations as specified
in the argument, the data will be analyzed by a linear model. Otherwise,
the data will be modeled by an ordinal regression, if the package ordinal
is installed.
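A minimal call might look as follows (hedged sketch; the study data frame and all variable labels are placeholders for your own study data and item-level metadata):

```r
## Not run:
res <- acc_margins(
  resp_vars  = "SBP_0",              # hypothetical variable labels
  group_vars = "OBSERVER_0",
  co_vars    = c("AGE_0", "SEX_0"),
  study_data = study_data,
  item_level = item_level,
  label_col  = "LABEL"
)
res$SummaryPlot
## End(Not run)
```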
Value
a list with:
- SummaryTable: data.frame underlying the plot
- ResultData: data.frame
- SummaryPlot: ggplot2::ggplot() margins plot
See Also
Calculate and plot Mahalanobis distances
Description
A standard tool to detect multivariate outliers is the Mahalanobis distance. This approach is very helpful for interpreting the plausibility of a measurement given the value of another. Here, the Mahalanobis distance itself is used as a univariate measure, and the same rules are applied for the identification of outliers as for univariate outliers:
- the classical approach from Tukey: 1.5 * IQR from the 1st (Q25) or 3rd (Q75) quartile.
- the 3SD approach, i.e., any measurement of the Mahalanobis distance not in the interval mean(x) +/- 3 * sigma is considered an outlier.
- the approach from Hubert for skewed distributions, which is embedded in the R package robustbase.
- a completely heuristic approach named sigma-gap.

For further details, please see the vignette for univariate outliers.
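The underlying quantity can be illustrated with base R alone (a sketch of the distance itself on simulated data, not of dataquieR's implementation):

```r
set.seed(1)
x <- cbind(rnorm(100), rnorm(100))           # two simulated continuous measurements
md2 <- mahalanobis(x, colMeans(x), cov(x))   # squared Mahalanobis distances
# e.g., apply the 3SD rule to the distances:
which(abs(md2 - mean(md2)) > 3 * sd(md2))
```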
Usage
acc_multivariate_outlier(
variable_group = NULL,
id_vars = NULL,
label_col = VAR_NAMES,
study_data,
item_level = "item_level",
n_rules = 4,
max_non_outliers_plot = 10000,
criteria = c("tukey", "3sd", "hubert", "sigmagap"),
meta_data = item_level,
meta_data_v2,
scale = getOption("dataquieR.acc_multivariate_outlier.scale",
dataquieR.acc_multivariate_outlier.scale_default),
multivariate_outlier_check = TRUE
)
Arguments
variable_group |
variable list the names of the continuous measurement variables building a group, for that multivariate outliers make sense. |
id_vars |
variable optional, an ID variable of the study data. If not specified row numbers are used. |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
n_rules |
numeric from=1 to=4. the number of rules that must be violated to classify as outlier |
max_non_outliers_plot |
integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic. |
criteria |
set tukey | 3SD | hubert | sigmagap. a vector with methods to be used for detecting outliers. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see
|
scale |
logical Should min-max-scaling be applied per variable? |
multivariate_outlier_check |
logical really check, pipeline use, only. |
Value
a list with:
- SummaryTable: data.frame underlying the plot
- SummaryPlot: ggplot2::ggplot outlier plot
- FlaggedStudyData: data.frame containing the original data frame with the additional columns tukey, 3SD, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
ALGORITHM OF THIS IMPLEMENTATION:
1. Implementation is restricted to variables of type float.
2. Remove missing codes from the study data (if defined in the metadata).
3. The covariance matrix is estimated for all variables from variable_group.
4. The Mahalanobis distance of each observation is calculated: MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)
5. The four rules mentioned above are applied on this distance for each observation in the study data.
6. An output data frame is generated that flags each outlier.
7. A parallel coordinate plot indicates respective outliers.
See Also
Identify univariate outliers by four different approaches
Description
A classical but still popular approach to detect univariate outlier is the
boxplot method introduced by Tukey 1977. The boxplot is a simple graphical
tool to display information about continuous univariate data (e.g., median,
lower and upper quartile). Outliers are defined as values deviating more
than 1.5 \times IQR
from the 1st (Q25) or 3rd (Q75) quartile. The
strength of Tukey's method is that it makes no distributional assumptions
and thus is also applicable to skewed or non-mound-shaped data
(Marsh and Seo, 2006). Nevertheless, this method tends to identify frequent
measurements which are falsely interpreted as true outliers.
A somewhat more conservative approach in terms of symmetric and/or normal
distributions is the 3SD approach, i.e. any measurement not in
the interval of mean(x) +/- 3 * \sigma
is considered an outlier.
Both methods mentioned above are not ideally suited to skewed distributions.
As many biomarkers, such as laboratory measurements, follow skewed
distributions, the methods above may be insufficient. The approach of Hubert
and Vandervieren 2008 adjusts the boxplot for the skewness of the
distribution. This approach is implemented in several R packages such as
robustbase::mc
which is used in this implementation of dataquieR
.
Another completely heuristic approach is also included to identify outliers. It is based on the assumption that the distances between measurements of the same underlying distribution should be homogeneous. To comprehend this approach:
- consider an ordered sequence of all measurements;
- between these measurements, all distances are calculated;
- the occurrence of larger distances between two neighboring measurements may then indicate a distortion of the data. For the heuristic definition of a large distance, 1 * sigma has been chosen.
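The Tukey rule from the criteria above can be reproduced in a few lines of base R (illustrative only, not the package's internal code):

```r
set.seed(2)
x <- c(rnorm(100), 12)                       # simulated data with one artificial outlier
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])                 # 1.5 * IQR
which(x < q[1] - fence | x > q[2] + fence)   # indices of flagged values
```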
Note that the plots are not deterministic, because they use ggplot2::geom_jitter.
Usage
acc_robust_univariate_outlier(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
exclude_roles,
n_rules = length(unique(criteria)),
max_non_outliers_plot = 10000,
criteria = c("tukey", "3sd", "hubert", "sigmagap"),
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the continuous measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
exclude_roles |
variable roles a character (vector) of variable roles not included |
n_rules |
integer from=1 to=4. the number of rules that must be violated to flag a variable as containing outliers. The default is 4, i.e. all. |
max_non_outliers_plot |
integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic. |
criteria |
set tukey | 3SD | hubert | sigmagap. a vector with methods to be used for detecting outliers. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Details
Hint: The function is designed for unimodal data only.
Value
a list with:
- SummaryTable: data.frame with the columns Variables, Mean, SD, Median, Skewness, Tukey (N), 3SD (N), Hubert (N), Sigma-gap (N), NUM_acc_ud_outlu, Outliers, low (N), Outliers, high (N), Grading
- SummaryData: data.frame with the columns Variables, Mean, SD, Median, Skewness, Tukey (N), 3SD (N), Hubert (N), Sigma-gap (N), Outliers (N), Outliers, low (N), Outliers, high (N)
- SummaryPlotList: ggplot2::ggplot univariate outlier plots
ALGORITHM OF THIS IMPLEMENTATION:
Select all variables of type float in the study data
Remove missing codes from the study data (if defined in the metadata)
Remove measurements deviating from limits defined in the metadata
Identify outliers according to the approaches of Tukey (Tukey 1977), 3SD (Saleem et al. 2021), Hubert (Hubert and Vandervieren 2008), and SigmaGap (heuristic)
An output data frame is generated which indicates the number of possible outliers and the direction of deviations (Outliers, low; Outliers, high) for all methods, and a summary score which sums up the deviations of the different rules
A scatter plot is generated for all examined variables, flagging observations according to the number of violated rules (step 5).
See Also
Compare observed versus expected distributions
Description
This implementation contrasts the empirical distribution of a measurement variable against assumed distributions. The approach is adapted from the idea of rootograms (Tukey 1977), which is also applicable to count data (Kleiber and Zeileis 2016).
Usage
acc_shape_or_scale(
resp_vars,
study_data,
label_col,
item_level = "item_level",
dist_col,
guess,
par1,
par2,
end_digits,
flip_mode = "noflip",
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable the name of the continuous measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
dist_col |
variable attribute the name of the variable attribute in meta_data that provides the expected distribution of a study variable |
guess |
logical estimate parameters |
par1 |
numeric first parameter of the distribution if applicable |
par2 |
numeric second parameter of the distribution if applicable |
end_digits |
logical internal use. check for end digits preferences |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped or
auto-flipped. Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Value
a list with:
- ResultData: data.frame underlying the plot
- SummaryPlot: ggplot2::ggplot probability distribution plot
- SummaryTable: data.frame with the columns Variables and FLG_acc_ud_shape
ALGORITHM OF THIS IMPLEMENTATION:
This implementation is restricted to data of type float or integer.
Missing codes are removed from resp_vars (if defined in the metadata)
The user must specify the column of the metadata containing the expected probability distribution (currently only: normal, uniform, gamma)
Parameters of each distribution can be estimated from the data or are specified by the user
A histogram-like plot contrasts the empirical vs. the technical distribution
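The contrast of empirical versus expected counts can be sketched in a few lines of base R. This is an illustrative sketch only, not the internal dataquieR implementation; the data are simulated and the parameter estimation mimics the `guess` option:

```r
# simulated measurement variable (for illustration only)
set.seed(3)
x <- rnorm(200, mean = 5, sd = 2)

# estimate the distribution parameters from the data (analogue of guess = TRUE)
par1 <- mean(x)
par2 <- sd(x)

# observed counts per histogram bin vs. counts expected under N(par1, par2)
h <- hist(x, plot = FALSE)
observed <- h$counts
expected <- diff(pnorm(h$breaks, mean = par1, sd = par2)) * length(x)
```

Large discrepancies between `observed` and `expected` per bin would hint at a distributional mismatch, which is what the rootogram-style plot visualizes.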
See Also
Identify univariate outliers by four different approaches
Description
A classical but still popular approach to detect univariate outliers is the
boxplot method introduced by Tukey (1977). The boxplot is a simple graphical
tool to display information about continuous univariate data (e.g., median,
lower and upper quartile). Outliers are defined as values deviating more
than 1.5 \times IQR
from the 1st (Q25) or 3rd (Q75) quartile. The
strength of Tukey's method is that it makes no distributional assumptions
and thus is also applicable to skewed or non mound-shaped data
(Marsh and Seo, 2006). Nevertheless, this method tends to flag frequently
occurring measurements, which are then falsely interpreted as true outliers.
A somewhat more conservative approach in terms of symmetric and/or normal
distributions is the 3SD approach, i.e. any measurement not in
the interval of mean(x) +/- 3 * \sigma
is considered an outlier.
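The two rules above can be sketched in a few lines of base R (an illustrative sketch on simulated data, not the internal dataquieR implementation):

```r
# simulated example: 100 draws from a standard normal plus one extreme value
set.seed(7)
x <- c(rnorm(100), 8)

# Tukey's rule: outside [Q25 - 1.5 * IQR, Q75 + 1.5 * IQR]
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
tukey_out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# 3SD rule: outside mean(x) +/- 3 * sd(x)
sd_out <- abs(x - mean(x)) > 3 * sd(x)
```

The injected value 8 is flagged by both rules here; on skewed data the two rules typically disagree, which motivates the skewness-adjusted boxplot discussed next.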
Both methods mentioned above are not ideally suited to skewed distributions.
Because many biomarkers, such as laboratory measurements, follow skewed
distributions, the methods above may be insufficient. The approach of Hubert
and Vandervieren (2008) adjusts the boxplot for the skewness of the
distribution. This approach is implemented in several R packages such as
robustbase::mc
which is used in this implementation of dataquieR
.
Another, completely heuristic approach is also included to identify outliers. This approach is based on the assumption that the distances between measurements of the same underlying distribution should be homogeneous. To comprehend this approach:
consider an ordered sequence of all measurements.
between these measurements all distances are calculated.
the occurrence of larger distances between two neighboring measurements may then indicate a distortion of the data. For the heuristic definition of a large distance,
1 * \sigma
has been chosen.
Note that the plots are not deterministic, because they use ggplot2::geom_jitter.
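The sigma-gap heuristic described above can be sketched as follows (an illustrative base-R sketch on simulated data, not the internal dataquieR implementation):

```r
# simulated example: 50 draws from a standard normal plus one far-off value
set.seed(42)
x <- sort(c(rnorm(50), 10))

# distances between neighboring ordered measurements
gaps <- diff(x)

# a gap exceeding 1 * sigma is considered "large"
suspect <- which(gaps > 1 * sd(x))

# values beyond the last large gap are outlier candidates
candidates <- x[x > x[max(suspect)]]
```

Here the gap between the largest regular value and 10 far exceeds one standard deviation, so 10 becomes the only outlier candidate.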
Usage
acc_univariate_outlier(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
exclude_roles,
n_rules = length(unique(criteria)),
max_non_outliers_plot = 10000,
criteria = c("tukey", "3sd", "hubert", "sigmagap"),
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the continuous measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
exclude_roles |
variable roles a character (vector) of variable roles not included |
n_rules |
integer from=1 to=4. the number of rules that must be violated to flag a variable as containing outliers. The default is 4, i.e. all. |
max_non_outliers_plot |
integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic. |
criteria |
set tukey | 3SD | hubert | sigmagap. a vector with methods to be used for detecting outliers. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Details
Hint: The function is designed for unimodal data only.
Value
a list with:
- SummaryTable: data.frame with the columns Variables, Mean, SD, Median, Skewness, Tukey (N), 3SD (N), Hubert (N), Sigma-gap (N), NUM_acc_ud_outlu, Outliers, low (N), Outliers, high (N), Grading
- SummaryData: data.frame with the columns Variables, Mean, SD, Median, Skewness, Tukey (N), 3SD (N), Hubert (N), Sigma-gap (N), Outliers (N), Outliers, low (N), Outliers, high (N)
- SummaryPlotList: ggplot2::ggplot univariate outlier plots
ALGORITHM OF THIS IMPLEMENTATION:
Select all variables of type float in the study data
Remove missing codes from the study data (if defined in the metadata)
Remove measurements deviating from limits defined in the metadata
Identify outliers according to the approaches of Tukey (Tukey 1977), 3SD (Saleem et al. 2021), Hubert (Hubert and Vandervieren 2008), and SigmaGap (heuristic)
An output data frame is generated which indicates the number of possible outliers and the direction of deviations (Outliers, low; Outliers, high) for all methods, and a summary score which sums up the deviations of the different rules
A scatter plot is generated for all examined variables, flagging observations according to the number of violated rules (step 5).
See Also
Utility function to compute model-based ICC depending on the (statistical) data type
Description
This function is still under construction. It is designed to run for any statistical data type as follows:
Variables with only two distinct values will be modeled by mixed effects logistic regression.
Nominal variables will be transformed to binary variables. This can be user-specified using the metadata columns RECODE_CASES and/or RECODE_CONTROL. Otherwise, the most frequent category will be assigned to cases and the remaining categories to control. As for other binary variables, the ICC will be computed using a mixed effects logistic regression.
Ordinal variables will be analyzed by linear mixed effects models, if every level of the variable has at least as many observations as specified in the argument cut_off_linear_model_for_ord. Otherwise, the data will be modeled by a mixed effects ordered regression, if the package ordinal is available.
Metric variables with integer values are analyzed by linear mixed effects models.
For variables with data type float, the existing implementation acc_varcomp is called, which also uses linear mixed effects models.
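For the metric case, the idea behind a model-based ICC can be illustrated with a one-way ANOVA approximation in base R. This is a sketch under simplifying assumptions (balanced groups, simulated data, made-up names); the function itself fits mixed effects models:

```r
# simulated example: 5 examiners, 20 measurements each, with examiner effects
set.seed(1)
k <- 5
n <- 20
examiner <- factor(rep(seq_len(k), each = n))
y <- rnorm(k * n) + rep(rnorm(k, sd = 0.5), each = n)

# one-way ANOVA mean squares: between examiners and residual
ms <- anova(lm(y ~ examiner))[["Mean Sq"]]
var_between <- max((ms[1] - ms[2]) / n, 0)

# ICC: share of total variance attributable to the grouping (examiner)
icc <- var_between / (var_between + ms[2])
```

A large ICC indicates that a substantial part of the variation is attributable to examiners, devices, or readers rather than to the participants themselves.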
Usage
acc_varcomp(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
study_data,
label_col,
item_level = "item_level",
min_obs_in_subgroup = 10,
min_subgroups = 5,
cut_off_linear_model_for_ord = 10,
threshold_value = lifecycle::deprecated(),
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable the name of the measurement variable |
group_vars |
variable the name of the examiner, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex, for adjustment |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is
required to include a subgroup (level) of the
|
min_subgroups |
integer from=0. This optional argument specifies
the minimum number of subgroups (level) of the
|
cut_off_linear_model_for_ord |
integer from=0. This optional argument
specifies the minimum number of observations for
individual levels of an ordinal outcome
( |
threshold_value |
Deprecated. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Details
Not yet described
Value
The function returns two data frames, 'SummaryTable' and 'SummaryData', that differ only in the names of the columns.
Convert a full dataquieR
report to a data.frame
Description
Deprecated
Usage
## S3 method for class 'dataquieR_resultset'
as.data.frame(x, ...)
Arguments
x |
Deprecated |
... |
Deprecated |
Value
Deprecated
Convert a full dataquieR
report to a list
Description
Deprecated
Usage
## S3 method for class 'dataquieR_resultset'
as.list(x, ...)
Arguments
x |
Deprecated |
... |
Deprecated |
Value
Deprecated
Inefficient way to convert a report to a list; try prep_set_backend()
Description
Inefficient way to convert a report to a list; try prep_set_backend()
Usage
## S3 method for class 'dataquieR_resultset2'
as.list(x, ...)
Arguments
x |
|
... |
not used |
Value
Data frame with contradiction rules
Description
Two versions exist: the newer one is used by con_contradictions_redcap, the older one by con_contradictions. Both are described here.
See Also
Summarize missingness columnwise (in variable)
Description
Item-Missingness (also referred to as item nonresponse (De Leeuw et al. 2003)) describes the missingness of single values, e.g., blanks or empty data cells in a data set. Item-Missingness occurs, for example, if a respondent does not provide information for a certain question, a question is overlooked by accident, a programming failure occurs, or a provided answer was missed while entering the data.
Usage
com_item_missingness(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
show_causes = TRUE,
cause_label_df,
include_sysmiss = TRUE,
threshold_value,
suppressWarnings = FALSE,
assume_consistent_codes = TRUE,
expand_codes = assume_consistent_codes,
drop_levels = FALSE,
expected_observations = c("HIERARCHY", "ALL", "SEGMENT"),
pretty_print = lifecycle::deprecated(),
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
show_causes |
logical if TRUE, then the distribution of missing codes is shown |
cause_label_df |
data.frame missing code table. If missing codes have labels the respective data frame can be specified here or in the metadata as assignments, see cause_label_df |
include_sysmiss |
logical Optional, if TRUE system missingness (NAs) is evaluated in the summary plot |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100 |
suppressWarnings |
logical warn about consistency issues with missing and jump lists |
assume_consistent_codes |
logical if TRUE and no labels are given and the same missing/jump code is used for more than one variable, the labels assigned for this code are treated as being the same for all variables. |
expand_codes |
logical if TRUE, code labels are copied from other variables, if the code is the same and the label is set somewhere |
drop_levels |
logical if TRUE, do not display unused missing codes in the figure legend. |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. If ALL, all
observations are expected to comprise
all study segments. If SEGMENT, the
|
pretty_print |
logical deprecated. If you want to have a human
readable output, use |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Value
a list with:
- SummaryTable: data frame about item missingness per response variable
- SummaryData: data frame about item missingness per response variable, formatted for users
- SummaryPlot: ggplot2 heatmap plot, if show_causes was TRUE
- ReportSummaryTable: data frame underlying SummaryPlot
ALGORITHM OF THIS IMPLEMENTATION:
Lists of missing codes and, if applicable, jump codes are selected from the metadata
The number of system missings (NA) in each variable is calculated
The number of used missing codes is calculated for each variable
The number of used jump codes is calculated for each variable
Two result dataframes (1: on the level of observations, 2: a summary for each variable) are generated
- OPTIONAL: if show_causes is selected, one summary plot for all resp_vars is provided
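The counting steps of this algorithm can be sketched in base R. This is a toy sketch; the codes 99998/99999 are hypothetical stand-ins for missing codes that would normally be defined in the metadata:

```r
# toy study data with system missings (NA) and coded missings
study <- data.frame(v1 = c(1, NA, 99998, 3),
                    v2 = c(NA, NA, 2, 99999))
missing_codes <- c(99998, 99999)  # hypothetical missing-code list

# number of system missings (NA) per variable
sysmiss <- vapply(study, function(x) sum(is.na(x)), integer(1))

# number of used missing codes per variable
codemiss <- vapply(study, function(x) sum(x %in% missing_codes), integer(1))
```

Both counts together give the denominator corrections needed to report item missingness rates per variable.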
See Also
Compute Indicators for Qualified Item Missingness
Description
Usage
com_qualified_item_missingness(
resp_vars,
study_data,
label_col = NULL,
item_level = "item_level",
expected_observations = c("HIERARCHY", "ALL", "SEGMENT"),
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. Report the
number of observations expected using
the old |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Value
A list with:
- SummaryTable: data.frame containing data quality checks for "Non-response rate" (PCT_com_qum_nonresp) and "Refusal rate" (PCT_com_qum_refusal) for each response variable in resp_vars.
- SummaryData: a data.frame containing data quality checks for "Non-response rate" and "Refusal rate" for a report
Examples
## Not run:
prep_load_workbook_like_file("inst/extdata/Metadata_example_v3-6.xlsx")
clean <- prep_get_data_frame("item_level")
clean <- subset(clean, `Metadata name` == "Example" &
!dataquieR:::util_empty(VAR_NAMES))
clean$`Metadata name` <- NULL
clean[, "MISSING_LIST_TABLE"] <- "missing_matchtable1"
prep_add_data_frames(item_level = clean)
clean <- prep_get_data_frame("missing_matchtable1")
clean <- clean[clean$`Metadata name` == "Example", , FALSE]
clean <-
clean[suppressWarnings(as.character(as.integer(clean$CODE_VALUE)) ==
as.character(clean$CODE_VALUE)), , FALSE]
clean$CODE_VALUE <- as.integer(clean$CODE_VALUE)
clean <- clean[!is.na(clean$`Metadata name`), , FALSE]
clean$`Metadata name` <- NULL
prep_add_data_frames(missing_matchtable1 = clean)
ship <- prep_get_data_frame("ship")
number_of_mis <- ceiling(nrow(ship) / 20)
resp_vars <- sample(colnames(ship), ceiling(ncol(ship) / 20), FALSE)
mistab <- prep_get_data_frame("missing_matchtable1")
valid_replacement_codes <-
  mistab[mistab$CODE_INTERPRET != "I", CODE_VALUE, drop = TRUE]
# sample only replacement codes on item level; "I" uses the actual values
for (rv in resp_vars) {
values <- sample(as.numeric(valid_replacement_codes), number_of_mis,
replace = TRUE)
if (inherits(ship[[rv]], "POSIXct")) {
values <- as.POSIXct(values, origin = min(as.POSIXct(Sys.Date()), 0))
}
ship[sample(seq_len(nrow(ship)), number_of_mis, replace = FALSE), rv] <-
values
}
com_qualified_item_missingness(resp_vars = NULL, ship, "item_level", LABEL)
com_qualified_item_missingness(resp_vars = "Diabetes Age onset", ship,
"item_level", LABEL)
com_qualified_item_missingness(resp_vars = NULL, "study_data", "meta_data",
LABEL)
study_data <- ship
meta_data <- prep_get_data_frame("item_level")
label <- LABEL
## End(Not run)
Compute Indicators for Qualified Segment Missingness
Description
Usage
com_qualified_segment_missingness(
label_col = NULL,
study_data,
item_level = "item_level",
expected_observations = c("HIERARCHY", "ALL", "SEGMENT"),
meta_data = item_level,
meta_data_v2,
meta_data_segment,
segment_level
)
Arguments
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. Report the
number of observations expected using
the old |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
meta_data_segment |
data.frame Segment level metadata |
segment_level |
data.frame alias for meta_data_segment |
Value
A list with:
- SegmentTable: data.frame containing data quality checks for "Non-response rate" (PCT_com_qum_nonresp) and "Refusal rate" (PCT_com_qum_refusal) for each segment.
- SegmentData: a data.frame containing data quality checks for "Unexpected location" and "Unexpected proportion" per segment for a report
Summarizes missingness for individuals in specific segments
Description
This implementation can be applied in two use cases:
participation in study segments is not recorded by respective variables, e.g. a participant's refusal to attend a specific examination is not recorded.
participation in study segments is recorded by respective variables.
Use case (1) will be common in smaller studies. For the calculation of segment missingness it is assumed that study variables are nested in respective segments. This structure must be specified in the static metadata. The R-function identifies all variables within each segment and returns TRUE if all variables within a segment are missing, otherwise FALSE.
Use case (2) assumes a more complex structure of study data and metadata.
The study data comprise so-called intro-variables (either TRUE/FALSE or codes
for non-participation). The column PART_VAR
in the metadata is
filled by variable-IDs indicating for each variable the respective
intro-variable. This structure has the benefit that subsequent calculation of
item missingness obtains correct denominators for the calculation of
missingness rates.
Usage
com_segment_missingness(
study_data,
item_level = "item_level",
strata_vars = NULL,
group_vars = NULL,
label_col,
threshold_value,
direction,
color_gradient_direction,
expected_observations = c("HIERARCHY", "ALL", "SEGMENT"),
exclude_roles = c(VARIABLE_ROLES$PROCESS),
meta_data = item_level,
meta_data_v2,
segment_level,
meta_data_segment
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
strata_vars |
variable the name of a variable used for stratification, defaults to NULL for not grouping output |
group_vars |
variable the name of a variable used for grouping, defaults to NULL for not grouping output |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100 |
direction |
enum low | high. "high" or "low", i.e. are deviations above/below the threshold critical. This argument is deprecated and replaced by color_gradient_direction. |
color_gradient_direction |
enum above | below. "above" or "below", i.e. are deviations above or below the threshold critical? (default: above) |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. If ALL, all
observations are expected to comprise
all study segments. If SEGMENT, the
|
exclude_roles |
variable roles a character (vector) of variable roles not included |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
segment_level |
data.frame alias for meta_data_segment |
meta_data_segment |
data.frame Segment level metadata. Optional. |
Details
Implementation and use of thresholds
This implementation uses one threshold to discriminate critical from non-critical values. If direction is above, then all values below the threshold_value are normal (displayed in dark blue in the plot and flagged with GRADING = 0 in the data frame). All values above the threshold_value are considered critical. The more they deviate from the threshold, the more the displayed color shifts to dark red. All critical values are highlighted with GRADING = 1 in the summary data frame. By default, the highest values are always shown in dark red, irrespective of the absolute deviation.
If direction is below, then all values above the threshold_value are normal (displayed in dark blue, GRADING = 0).
Hint
This function does not support a resp_vars argument but exclude_roles to specify variables not relevant for detecting a missing segment.
List function.
Value
a list with:
- ResultData: data frame about segment missingness
- SummaryPlot: ggplot2 heatmap plot: a heatmap-like graphic that highlights critical values depending on the respective threshold_value and direction.
- ReportSummaryTable: data frame underlying SummaryPlot
See Also
Counts all individuals with no measurements at all
Description
This implementation examines a crude version of unit missingness or unit-nonresponse (Kalton and Kasprzyk 1986), i.e. if all measurement variables in the study data are missing for an observation it has unit missingness.
The function can be applied on stratified data. In this case strata_vars must be specified.
Usage
com_unit_missingness(
id_vars = NULL,
strata_vars = NULL,
label_col,
study_data,
item_level = "item_level",
meta_data = item_level,
meta_data_v2
)
Arguments
id_vars |
variable list optional, a (vectorized) call of ID-variables that should not be considered in the calculation of unit- missingness |
strata_vars |
variable optional, a string or integer variable used for stratification |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Details
This implementation calculates a crude rate of unit-missingness. This type of missingness may have several causes and is an important research outcome. For example, unit-nonresponse may be selective regarding the targeted study population, or technical reasons such as record-linkage may cause unit-missingness.
It has to be discriminated from segment and item missingness, since different causes and mechanisms may be the reason for unit-missingness.
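The crude rate can be sketched in base R. The column names and the id-variable handling below are hypothetical and only illustrate the idea:

```r
# toy study data: row 2 carries an ID but no measurements at all
study <- data.frame(id = 1:4,
                    bp = c(120, NA, NA, 130),
                    hr = c(60, NA, 70, NA))
id_vars <- "id"  # hypothetical ID column, excluded from the check

# a unit is missing if all measurement variables (non-ID columns) are NA
meas <- setdiff(names(study), id_vars)
unit_missing <- rowSums(!is.na(study[meas])) == 0
rate <- 100 * mean(unit_missing)
```

Excluding the ID columns matters: an ID value alone must not make a unit count as observed, which is the inverse logic of id_vars described in the Hint below.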
Hint
This function does not support a resp_vars argument but id_vars, which follow a roughly inverse logic: id_vars with values do not prevent a row from being considered missing, because an ID is the only hint for a unit that otherwise would not occur in the data at all.
List function.
Value
A list with:
- FlaggedStudyData: data.frame with id-only rows flagged in a column Unit_missing
- SummaryData: data.frame with numbers and percentages of unit missingness
See Also
Checks user-defined contradictions in study data
Description
This approach considers it a contradiction if impossible combinations of data are observed for one participant. For example, if the age of a participant is recorded repeatedly, the value of age is (unfortunately) not able to decline. Most cases of contradictions rest on the comparison of two variables.
Important to note: each value that is used for comparison may represent a possible characteristic on its own, but the combination of these two values is considered to be impossible. The approach does not consider implausible or inadmissible values.
Usage
con_contradictions(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
threshold_value,
check_table,
summarize_categories = FALSE,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100 |
check_table |
data.frame contradiction rules table. Table defining contradictions. See details for its required structure. |
summarize_categories |
logical Needs a column 'tag' in the
|
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to a workbook-like metadata file |
Details
Algorithm of this implementation:
Select all variables in the data with defined contradiction rules (static metadata column CONTRADICTIONS)
Remove missing codes from the study data (if defined in the metadata)
Remove measurements deviating from limits defined in the metadata
Assign label to levels of categorical variables (if applicable)
Apply contradiction checks on predefined sets of variables
Measurements fulfilling contradiction rules are identified. To this end, two output data frames are generated:
on the level of observation to flag each contradictory value combination, and
a summary table for each contradiction check.
A summary plot illustrating the number of contradictions is generated.
List function.
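A minimal sketch of such a contradiction rule in base R follows. The variables and the rule are hypothetical; dataquieR reads the actual rules from the metadata instead:

```r
# toy data: age recorded at two visits; it must not decline
study <- data.frame(id = 1:3,
                    age_v1 = c(50, 61, 45),
                    age_v2 = c(51, 60, 46))

# observation-level flags for the hypothetical rule "age_v2 < age_v1"
flag <- study$age_v2 < study$age_v1
FlaggedStudyData <- cbind(study, flag_age_decline = flag)

# one summary row per contradiction check
SummaryTable <- data.frame(check = "age_v2 < age_v1",
                           flagged = sum(flag),
                           percent = 100 * mean(flag))
```

This mirrors the two outputs described in the Value section: per-observation flags and a per-check summary.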
Value
If summarize_categories
is FALSE
:
A list with:
- FlaggedStudyData: The first output of the contradiction function is a data frame of similar dimension regarding the number of observations in the study data. In addition, for each applied check on the variables an additional column is added which flags observations with a contradiction given the applied check.
- SummaryTable: The second output summarizes this information into one data frame. This output can be used to provide an executive overview on the amount of contradictions. This output is meant for automatic digestion within pipelines.
- SummaryData: The third output is the same as SummaryTable but for human readers.
- SummaryPlot: The fourth output visualizes summarized information of SummaryData.
If summarize_categories is TRUE, other objects are returned: one per category, named by that category (e.g. "Empirical"), containing a result for contradictions within that category only. Additionally, the slot all_checks contains a result as it would have been returned with summarize_categories set to FALSE. Finally, a slot SummaryData is returned containing sums per category and an according ggplot2::ggplot in SummaryPlot.
See Also
Checks user-defined contradictions in study data
Description
This approach considers it a contradiction if impossible combinations of data are observed for one participant. For example, if the age of a participant is recorded repeatedly, the value of age is (unfortunately) not able to decline. Most cases of contradictions rest on the comparison of two variables.
Important to note: each value that is used for comparison may represent a possible characteristic on its own, but the combination of these two values is considered to be impossible. The approach does not consider implausible or inadmissible values.
Usage
con_contradictions_redcap(
study_data,
item_level = "item_level",
label_col,
threshold_value,
meta_data_cross_item = "cross-item_level",
use_value_labels,
summarize_categories = FALSE,
meta_data = item_level,
cross_item_level,
`cross-item_level`,
meta_data_v2
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100 |
meta_data_cross_item |
data.frame contradiction rules table. Table defining contradictions. See online documentation for its required structure. |
use_value_labels |
logical Deprecated in favor of DATA_PREPARATION.
If set to |
summarize_categories |
logical Needs a column |
meta_data |
data.frame old name for item_level |
cross_item_level |
data.frame alias for meta_data_cross_item |
meta_data_v2 |
character path to a workbook-like metadata file |
`cross-item_level` |
data.frame alias for meta_data_cross_item |
Details
Algorithm of this implementation:
Remove missing codes from the study data (if defined in the metadata)
Remove measurements deviating from limits defined in the metadata
Assign label to levels of categorical variables (if applicable)
Apply contradiction checks (given as REDCap-like rules in a separate metadata table)
Measurements fulfilling contradiction rules are identified. To this end, two output data frames are generated:
on the level of observation to flag each contradictory value combination, and
a summary table for each contradiction check.
A summary plot illustrating the number of contradictions is generated.
List function.
Value
If summarize_categories
is FALSE
:
A list with:
- FlaggedStudyData: The first output of the contradiction function is a data frame of similar dimension regarding the number of observations in the study data. In addition, for each applied check on the variables an additional column is added which flags observations with a contradiction given the applied check.
- VariableGroupData: The second output summarizes this information into one data frame. This output can be used to provide an executive overview on the amount of contradictions.
- VariableGroupTable: A subset of VariableGroupData used within the pipeline.
- SummaryPlot: The third output visualizes summarized information of SummaryData.
If summarize_categories
is TRUE
, other objects are returned:
A list with one element Other
, a list with the following entries:
One per category named by that category (e.g. "Empirical") containing a
result for contradiction checks within that category only. Additionally, in the
slot all_checks
, a result as it would have been returned with
summarize_categories
set to FALSE
. Finally, in
the top-level list, a slot SummaryData
is
returned containing sums per Category and an according ggplot2::ggplot in
SummaryPlot
.
See Also
Online Documentation for the function, meta_data_cross, Online Documentation for the required cross-item-level metadata
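As a quick orientation, a hypothetical call might look like the sketch below; the variable names, the rule-table columns, and the `[VAR]`-style REDCap rule syntax are illustrative assumptions, not data shipped with the package (see the online documentation for the actually required structure):

```r
# Illustrative study data: row 3 claims more smoking years than years of age
sdt <- data.frame(AGE = c(25, 30, 17), SMOKING_YEARS = c(5, 2, 20))
# Hypothetical cross-item-level rule table
rules <- data.frame(
  CHECK_LABEL = "smoked longer than alive",
  VARIABLE_LIST = "AGE | SMOKING_YEARS",
  CONTRADICTION_TERM = "[SMOKING_YEARS] > [AGE]"
)
md <- data.frame(VAR_NAMES = c("AGE", "SMOKING_YEARS"),
                 DATA_TYPE = "integer",
                 LABEL = c("Age", "Smoking years"))
res <- con_contradictions_redcap(study_data = sdt, item_level = md,
                                 meta_data_cross_item = rules,
                                 threshold_value = 0)
res$VariableGroupData  # one summary row per contradiction rule
```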
Detects variable levels not specified in metadata
Description
For each categorical variable, value lists should be defined in the metadata. This implementation examines whether all observed levels in the study data are valid.
Usage
con_inadmissible_categorical(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
threshold_value = 0,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see prep_load_workbook_like_file |
Details
Algorithm of this implementation:
- Remove missing codes from the study data (if defined in the metadata)
- Interpret variable-specific VALUE_LABELS as supplied in the metadata
- Identify measurements not corresponding to the expected categories. To this end, two output data frames are generated:
  - one on the level of observations, flagging each undefined category, and
  - a summary table for each variable.
- Values not corresponding to defined categories are removed in a data frame of modified study data
Value
a list with:
- SummaryData: data frame summarizing inadmissible categories with the columns:
  - Variables: variable name/label
  - OBSERVED_CATEGORIES: the categories observed in the study data
  - DEFINED_CATEGORIES: the categories defined in the metadata
  - NON_MATCHING: the categories observed but not defined
  - NON_MATCHING_N: the number of observations with categories not defined
  - NON_MATCHING_N_PER_CATEGORY: the number of observations for each of the unexpected categories
- SummaryTable: data frame for the dataquieR pipeline reporting the number and percentage of inadmissible categorical values
- ModifiedStudyData: study data having inadmissible categories removed
- FlaggedStudyData: study data having cases with inadmissible categories flagged
See Also
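A minimal illustrative call could look like the following sketch; the metadata columns and the `code = label` syntax in VALUE_LABELS are assumptions for illustration (see the online documentation for the exact metadata format):

```r
# Illustrative example: the level 9 is observed but not defined
sdt <- data.frame(MARRIED = c(0, 1, 1, 9))
md <- data.frame(VAR_NAMES = "MARRIED",
                 DATA_TYPE = "integer",
                 SCALE_LEVEL = "nominal",
                 VALUE_LABELS = "0 = no | 1 = yes",
                 LABEL = "Married")
res <- con_inadmissible_categorical(study_data = sdt, item_level = md,
                                    label_col = "LABEL")
res$SummaryData  # NON_MATCHING should report the undefined level 9
```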
Detects variable levels not specified in standardized vocabulary
Description
For each categorical variable, value lists should be defined in the metadata. This implementation examines whether all observed levels in the study data are valid.
Usage
con_inadmissible_vocabulary(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
threshold_value = 0,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
threshold_value |
numeric from=0 to=100. a numerical value ranging from 0-100. |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see prep_load_workbook_like_file |
Details
Algorithm of this implementation:
- Remove missing codes from the study data (if defined in the metadata)
- Interpret variable-specific VALUE_LABELS as supplied in the metadata
- Identify measurements not corresponding to the expected categories. To this end, two output data frames are generated:
  - one on the level of observations, flagging each undefined category, and
  - a summary table for each variable.
- Values not corresponding to defined categories are removed in a data frame of modified study data
Value
a list with:
- SummaryData: data frame summarizing inadmissible categories with the columns:
  - Variables: variable name/label
  - OBSERVED_CATEGORIES: the categories observed in the study data
  - DEFINED_CATEGORIES: the categories defined in the metadata
  - NON_MATCHING: the categories observed but not defined
  - NON_MATCHING_N: the number of observations with categories not defined
  - NON_MATCHING_N_PER_CATEGORY: the number of observations for each of the unexpected categories
  - GRADING: indicator TRUE/FALSE whether inadmissible categorical values were observed (more than indicated by the threshold_value)
- SummaryTable: data frame for the dataquieR pipeline reporting the number and percentage of inadmissible categorical values
- ModifiedStudyData: study data having inadmissible categories removed
- FlaggedStudyData: study data having cases with inadmissible categories flagged
See Also
Examples
## Not run:
sdt <- data.frame(DIAG = c("B050", "B051", "B052", "B999"),
MED0 = c("S01XA28", "N07XX18", "ABC", NA), stringsAsFactors = FALSE)
mdt <- tibble::tribble(
~ VAR_NAMES, ~ DATA_TYPE, ~ STANDARDIZED_VOCABULARY_TABLE, ~ SCALE_LEVEL, ~ LABEL,
"DIAG", "string", "<ICD10>", "nominal", "Diagnosis",
"MED0", "string", "<ATC>", "nominal", "Medication"
)
con_inadmissible_vocabulary(NULL, sdt, item_level = mdt, label_col = LABEL)
prep_load_workbook_like_file("meta_data_v2")
il <- prep_get_data_frame("item_level")
il$STANDARDIZED_VOCABULARY_TABLE[[11]] <- "<ICD10GM>"
il$DATA_TYPE[[11]] <- DATA_TYPES$INTEGER
il$SCALE_LEVEL[[11]] <- SCALE_LEVELS$NOMINAL
prep_add_data_frames(item_level = il)
r <- dq_report2("study_data", dimensions = "con")
r <- dq_report2("study_data", dimensions = "con",
advanced_options = list(dataquieR.non_disclosure = TRUE))
r
## End(Not run)
Detects variable values exceeding limits defined in metadata
Description
Inadmissible numerical values can be of type integer or float. This implementation requires the definition of intervals in the metadata to examine the admissibility of numerical study data.
This helps identify inadmissible measurements according to hard limits (for multiple variables).
Usage
con_limit_deviations(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
limits = NULL,
flip_mode = "noflip",
return_flagged_study_data = FALSE,
return_limit_categorical = TRUE,
meta_data = item_level,
meta_data_v2,
show_obs = TRUE
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
limits |
enum HARD_LIMITS | SOFT_LIMITS | DETECTION_LIMITS. what limits from metadata to check for |
flip_mode |
enum default | flip | noflip | auto. Should the plot be in default orientation, flipped, not flipped, or auto-flipped? Not all options are always supported. In general, this can be controlled by setting the dataquieR.flip_mode option. |
return_flagged_study_data |
logical return FlaggedStudyData |
return_limit_categorical |
logical if TRUE return limit deviations also for categorical variables |
meta_data |
data.frame old name for item_level |
meta_data_v2 |
character path to workbook like metadata file, see prep_load_workbook_like_file |
show_obs |
logical Should (selected) individual observations be marked in the figure for continuous variables? |
Details
Algorithm of this implementation:
- Remove missing codes from the study data (if defined in the metadata)
- Interpret variable-specific intervals as supplied in the metadata
- Identify measurements outside the defined limits. To this end, two output data frames are generated:
  - one on the level of observations, flagging each deviation, and
  - a summary table for each variable.
- A list of plots is generated, one for each variable examined for limit deviations. The histogram-like plots indicate the respective limits as well as the deviations.
- Values exceeding the limits are removed in a data frame of modified study data
Value
a list with:
- FlaggedStudyData: data.frame related to the study data by a 1:1 relationship, i.e., for each observation it is checked whether the value lies below or above the limits. Optional, see return_flagged_study_data.
- SummaryTable: data.frame summarizing limit deviations for each variable.
- SummaryData: data.frame summarizing limit deviations for each variable for a report.
- SummaryPlotList: list of ggplot2::ggplots. The plot for each variable is either a histogram (continuous) or a barplot (discrete).
- ReportSummaryTable: heatmap-like data frame about limit violations
See Also
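A hypothetical invocation might look as follows; the interval notation in HARD_LIMITS is an assumption for illustration (see the online documentation for the exact metadata format):

```r
# Illustrative example: 310 lies outside the hard limits [60; 270]
sdt <- data.frame(SBP = c(120, 135, 310, 95))
md <- data.frame(VAR_NAMES = "SBP",
                 DATA_TYPE = "float",
                 HARD_LIMITS = "[60; 270]",
                 LABEL = "Systolic BP")
res <- con_limit_deviations(resp_vars = "SBP", study_data = sdt,
                            item_level = md, label_col = "LABEL",
                            limits = "HARD_LIMITS")
res$SummaryTable  # reports the number of limit deviations per variable
```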
contradiction_functions
Description
Helper functions to detect abnormalities
Usage
contradiction_functions
Format
An object of class list of length 11.
Details
Contradiction checks involving 2 variables:
- A_not_equal_B, if A != B
- A_greater_equal_B, if A >= B
- A_greater_than_B, if A > B
- A_less_than_B, if A < B
- A_less_equal_B, if A <= B
- A_present_not_B, if A & is.na(B)
- A_present_and_B, if A & !(is.na(B))
- A_present_and_B_levels, if A & B %in% {set of levels}
- A_levels_and_B_gt_value, if A %in% {set of levels} & B > value
- A_levels_and_B_lt_value, if A %in% {set of levels} & B < value
- A_levels_and_B_levels, if A %in% {set of levels} & B %in% {set of levels}
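Each rule reduces to a vectorized logical comparison over two study-data columns; for instance, A_present_not_B flags observations where A is observed but B is missing. A base-R sketch of that logic (not the package's internal implementation):

```r
A <- c(1, NA, 3)
B <- c(NA, 2, 3)
# A present while B missing: contradiction in the first row only
flagged <- !is.na(A) & is.na(B)
flagged  # TRUE FALSE FALSE
```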
description of the contradiction functions
Description
description of the contradiction functions
Usage
contradiction_functions_descriptions
Format
An object of class list of length 11.
Log Level
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Add stack-trace in condition messages (to be deprecated)
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Metadata describes more than the current study data
Description
- none: no check will be performed on the match of variables and records available in the study data and described in the metadata
- exact: there must be a 1:1 match between the study data and the metadata regarding data frames, segments, variables, and records
- subset_u: study data are a subset of the metadata. All variables from the study data are expected to be present in the metadata, but one or more variables in the metadata may be absent from the study data. In this case, a variable present in the study data but not in the metadata would produce an issue.
- subset_m: metadata are a subset of the study data. All variables in the metadata are expected to be present in the study data, but one or more variables in the study data may be absent from the metadata.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
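For example, to tolerate metadata that describe variables not (yet) present in the study data, the check type can be relaxed before building a report (the option name is taken from this help topic; a sketch, not a complete workflow):

```r
# Study data may be a subset of the metadata
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "subset_u")
```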
Set caller for error conditions (to be deprecated)
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Enable to switch to a general additive model instead of LOESS
Description
If this option is set to TRUE, time course plots will use generalized additive models (GAM) instead of LOESS when the number of observations exceeds a specified threshold. LOESS computations for large datasets have a high memory consumption.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
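For example, to reduce memory consumption on large datasets, the GAM fallback can be enabled globally (the option name is taken from this help topic; a sketch, not a complete workflow):

```r
# Use generalized additive models instead of LOESS for large datasets
options(dataquieR.GAM_for_LOESS = TRUE)
```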
Maximum length for variable labels
Description
All variable labels will be shortened to fit this maximum length. Cannot be larger than 200 for technical reasons.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Maximum length for value labels
Description
Value labels are restricted to this length.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Set caller for message conditions (to be deprecated)
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Default availability of multivariate outlier checks in reports
Description
can be
- TRUE: for cross-item_level groups with MULTIVARIATE_OUTLIER_CHECK empty, do a multivariate outlier check
- FALSE: for cross-item_level groups with MULTIVARIATE_OUTLIER_CHECK empty, don't do a multivariate outlier check
- "auto": for cross-item_level groups with MULTIVARIATE_OUTLIER_CHECK empty, do multivariate outlier checks if there is no entry in the column CONTRADICTION_TERM
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Assume, all VALUE_LABELS are HTML escaped
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Set caller for warning conditions (to be deprecated)
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Exclude subgroups with constant values from LOESS figure
Description
If this option is set to TRUE
, time course plots will only show subgroups
with more than one distinct value. This might improve the readability of
the figure.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Display time-points in LOESS plots
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Lower limit for the LOESS bandwidth
Description
The value should be greater than 0 and less than or equal to 1. In general, increasing the bandwidth leads to a smoother trend line.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Lower limit for the proportion of cases or controls required to create a smoothed time-trend figure
Description
The value should be greater than 0 and less than 0.4. If the proportion of cases or controls is lower than the specified value, the LOESS figure will not be created for the specified binary outcome.
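A hedged example of setting this threshold (0.1 is an illustrative value): with the option below, the LOESS figure for a binary outcome would be skipped whenever the proportion of cases or controls falls below 10%:

```r
# Skip the smoothed time-trend figure for binary outcomes whose
# case/control proportion is below 0.1 (illustrative threshold)
options(dataquieR.acc_loess.min_proportion = 0.1)
```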
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Default plot format for acc_loess()
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Display observations in LOESS plots
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Include number of observations for each level of the grouping variable in the 'margins' figure
Description
If this option is set to FALSE, the figures created by acc_margins will not include the number of observations for each level of the grouping variable. This can be used to obtain clean static plots.
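One way to obtain such a clean static plot is to change the option only temporarily, e.g. with withr (an import of dataquieR); the acc_margins() call itself is left as a commented placeholder, since its arguments depend on your data and metadata:

```r
# Temporarily hide the per-level observation counts in 'margins' figures
withr::with_options(
  list(dataquieR.acc_margins_num = FALSE),
  {
    # acc_margins(...)  # produces the figure without observation counts
  }
)
```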
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Sort levels of the grouping variable in the 'margins' figures
Description
If this option is set to TRUE, the levels of the grouping variable in the figure are sorted in descending order according to the number of observations, so that levels with more observations are easier to identify. Otherwise, the original order of the levels is retained.
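For example, to sort grouping-variable levels by descending observation count in all subsequent 'margins' figures (a plain options() call; the default behavior is restored by setting it back):

```r
# Sort levels of the grouping variable by number of observations
options(dataquieR.acc_margins_sort = TRUE)
```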
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Apply min-max scaling in parallel coordinates figure to inspect multivariate outliers
Description
Logical: TRUE or FALSE.
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Color for empirical contradictions
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Color for logical contradictions
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Call browser() on errors
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Removal of hard limits from data before calculating descriptive statistics
Description
Can be one of:
- TRUE: values outside hard limits are removed from the data before descriptive statistics are calculated
- FALSE: values outside hard limits are not removed from the original data
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Disable automatic post-processing of dataquieR function results
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Try to avoid fallback to string columns when reading files
Description
If a file does not declare column data types, or declares data types per cell, the type that matches the majority of the sampled cells in a column is chosen as that column's data type.
Details
This may hide data type problems, but it can also fix them, so that prep_get_data_frame() works better.
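A minimal sketch of enabling this behavior before reading study data (the option is simply switched on globally):

```r
# Guess each column's type from the majority of sampled cells
# instead of falling back to string columns when types are unclear
options(dataquieR.fix_column_type_on_read = TRUE)
getOption("dataquieR.fix_column_type_on_read")
```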
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Flip mode to use for figures
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
When converting MISSING_LIST/JUMP_LIST to a MISSING_LIST_TABLE, create one missing-code list per item
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Control how the label_col argument is used
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Name of the data.frame that defines the display format for grading values
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Name of the data.frame that provides the GRADING_RULESET
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Control whether dataquieR tries to guess missing codes from the study data in the absence of metadata
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Language suffix for metadata label columns
Description
TODO
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Maximum number of levels of the grouping variable shown individually in figures
Description
If there are more examiners or devices than can be shown individually, they will be collapsed into "other".
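A minimal sketch of adjusting this cap (10 is an illustrative value, not a documented default):

```r
# Show at most 10 grouping-variable levels individually (illustrative);
# additional examiners/devices are collapsed into "other"
options(dataquieR.max_group_var_levels_in_plot = 10)
```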
See Also
Other options: dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Maximum number of levels of the grouping variable shown with individual histograms ('violins') in 'margins' figures
Description
If there are more examiners or devices, the figure will be reduced to box-plots to save space.
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Minimum number of observations per grouping variable that is required to include an individual level of the grouping variable in a figure
Description
Levels of the grouping variable with fewer observations than specified here will be excluded from the figure.
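The three plot-related options above work together and can be set like any base R option. A minimal sketch, assuming illustrative values (the numbers below are assumptions chosen for demonstration, not the package's documented defaults):

```r
# Illustrative settings for the grouping-variable plot options; the values
# are assumptions for demonstration, not package defaults.
options(
  dataquieR.max_group_var_levels_in_plot = 20,      # further levels collapse into "other"
  dataquieR.max_group_var_levels_with_violins = 10, # beyond this, fall back to box-plots
  dataquieR.min_obs_per_group_var_in_plot = 30      # sparser levels are excluded
)
# Inspect the currently active value
getOption("dataquieR.min_obs_per_group_var_in_plot")
```

Setting these before calling the report functions changes how grouping variables (e.g., examiners or devices) are rendered in the resulting figures.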
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Remove all observation-level real data from reports
Description
TODO
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Function to call on progress increase
Description
TODO
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Function to call on progress message update
Description
TODO
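Both progress options above take a function. A hedged sketch of how such callbacks might be registered; the callback signatures (a numeric progress value and a character message, respectively) are assumptions, as they are not documented on this page:

```r
# Assumed callback signatures, for illustration only.
options(
  dataquieR.progress_fkt = function(progress) {
    # called on progress increase; 'progress' is assumed to be numeric
    message("progress: ", progress)
  },
  dataquieR.progress_msg_fkt = function(msg) {
    # called on progress message update; 'msg' is assumed to be character
    message(msg)
  }
)
```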
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Number of levels to consider a variable ordinal in the absence of SCALE_LEVEL
Description
TODO
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_metriclevels, dataquieR.testdebug
Number of levels to consider a variable metric in the absence of SCALE_LEVEL
Description
TODO
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.testdebug
Disable all interactive, metadata-based provision of function arguments
Description
TODO
See Also
Other options:
dataquieR, dataquieR.CONDITIONS_LEVEL_TRHESHOLD, dataquieR.CONDITIONS_WITH_STACKTRACE, dataquieR.ELEMENT_MISSMATCH_CHECKTYPE, dataquieR.ERRORS_WITH_CALLER, dataquieR.GAM_for_LOESS, dataquieR.MAX_LABEL_LEN, dataquieR.MAX_VALUE_LABEL_LEN, dataquieR.MESSAGES_WITH_CALLER, dataquieR.MULTIVARIATE_OUTLIER_CHECK, dataquieR.VALUE_LABELS_htmlescaped, dataquieR.WARNINGS_WITH_CALLER, dataquieR.acc_loess.exclude_constant_subgroups, dataquieR.acc_loess.mark_time_points, dataquieR.acc_loess.min_bw, dataquieR.acc_loess.min_proportion, dataquieR.acc_loess.plot_format, dataquieR.acc_loess.plot_observations, dataquieR.acc_margins_num, dataquieR.acc_margins_sort, dataquieR.acc_multivariate_outlier.scale, dataquieR.col_con_con_empirical, dataquieR.col_con_con_logical, dataquieR.debug, dataquieR.des_summary_hard_lim_remove, dataquieR.dontwrapresults, dataquieR.fix_column_type_on_read, dataquieR.flip_mode, dataquieR.force_item_specific_missing_codes, dataquieR.force_label_col, dataquieR.grading_formats, dataquieR.grading_rulesets, dataquieR.guess_missing_codes, dataquieR.lang, dataquieR.max_group_var_levels_in_plot, dataquieR.max_group_var_levels_with_violins, dataquieR.min_obs_per_group_var_in_plot, dataquieR.non_disclosure, dataquieR.progress_fkt, dataquieR.progress_msg_fkt, dataquieR.scale_level_heuristics_control_binaryrecodelimit, dataquieR.scale_level_heuristics_control_metriclevels
Internal constructor for the internal class dataquieR_resultset.
Description
creates an object of the class dataquieR_resultset.
Usage
dataquieR_resultset(...)
Arguments
... |
properties stored in the object |
Details
The class features the following methods:
- as.data.frame.dataquieR_resultset
- as.list.dataquieR_resultset
- print.dataquieR_resultset
- summary.dataquieR_resultset
Value
an object of the class dataquieR_resultset.
See Also
Class dataquieR_resultset2.
Description
Class dataquieR_resultset2.
See Also
Verify an object of class dataquieR_resultset
Description
Deprecated
Usage
dataquieR_resultset_verify(...)
Arguments
... |
Deprecated |
Value
Deprecated
Compute Pairwise Correlations
Description
Works on variable groups (cross-item_level), which are expected to show a Pearson correlation.
Usage
des_scatterplot_matrix(
label_col,
study_data,
item_level = "item_level",
meta_data_cross_item = "cross-item_level",
meta_data = item_level,
meta_data_v2,
cross_item_level,
`cross-item_level`
)
Arguments
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data_cross_item |
|
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
cross_item_level |
data.frame alias for |
`cross-item_level` |
data.frame alias for |
Details
Descriptor # TODO: This can be an indicator
Value
a list with the slots:
- SummaryPlotList: for each variable group a ggplot2::ggplot object with pairwise correlation plots
- SummaryData: a table with the columns VARIABLE_LIST, cors, max_cor, and min_cor
- SummaryTable: like SummaryData, but machine-readable and with stable column names
Examples
## Not run:
devtools::load_all()
prep_load_workbook_like_file("meta_data_v2")
des_scatterplot_matrix("study_data")
## End(Not run)
Compute Descriptive Statistics
Description
generates a descriptive overview of the variables in resp_vars
.
Usage
des_summary(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
hard_limits_removal = getOption("dataquieR.des_summary_hard_lim_remove",
dataquieR.des_summary_hard_lim_remove_default),
...
)
Arguments
resp_vars |
variable the name of the measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
hard_limits_removal |
logical if TRUE values outside hard limits are removed from the data before calculating descriptive statistics. The default is FALSE |
... |
arguments to be passed to all called indicator functions if applicable. |
Details
TODO
Value
a list with:
- SummaryTable: data.frame
- SummaryData: data.frame
See Also
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2")
xx <- des_summary(study_data = "study_data", meta_data =
prep_get_data_frame("item_level"))
util_html_table(xx$SummaryData)
util_html_table(des_summary(study_data = prep_get_data_frame("study_data"),
meta_data = prep_get_data_frame("item_level"))$SummaryData)
## End(Not run)
Compute Descriptive Statistics - categorical variables
Description
generates a descriptive overview of the categorical variables (nominal and
ordinal) in resp_vars
.
Usage
des_summary_categorical(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
hard_limits_removal = getOption("dataquieR.des_summary_hard_lim_remove",
dataquieR.des_summary_hard_lim_remove_default),
...
)
Arguments
resp_vars |
variable the name of the categorical measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
hard_limits_removal |
logical if TRUE values outside hard limits are removed from the data before calculating descriptive statistics. The default is FALSE |
... |
arguments to be passed to all called indicator functions if applicable. |
Details
TODO
Value
a list with:
- SummaryTable: data.frame
- SummaryData: data.frame
See Also
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2")
xx <- des_summary_categorical(study_data = "study_data", meta_data =
prep_get_data_frame("item_level"))
util_html_table(xx$SummaryData)
util_html_table(des_summary_categorical(study_data = prep_get_data_frame("study_data"),
meta_data = prep_get_data_frame("item_level"))$SummaryData)
## End(Not run)
Compute Descriptive Statistics - continuous variables
Description
generates a descriptive overview of continuous variables (ratio and interval) in resp_vars
.
Usage
des_summary_continuous(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
hard_limits_removal = getOption("dataquieR.des_summary_hard_lim_remove",
dataquieR.des_summary_hard_lim_remove_default),
...
)
Arguments
resp_vars |
variable the name of the continuous measurement variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
hard_limits_removal |
logical if TRUE values outside hard limits are removed from the data before calculating descriptive statistics. The default is FALSE |
... |
arguments to be passed to all called indicator functions if applicable. |
Details
TODO
Value
a list with:
- SummaryTable: data.frame
- SummaryData: data.frame
See Also
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2")
xx <- des_summary_continuous(study_data = "study_data", meta_data =
prep_get_data_frame("item_level"))
util_html_table(xx$SummaryData)
util_html_table(des_summary_continuous(study_data = prep_get_data_frame("study_data"),
meta_data = prep_get_data_frame("item_level"))$SummaryData)
## End(Not run)
Get the dimensions of a dq_report2
result
Description
Get the dimensions of a dq_report2
result
Usage
## S3 method for class 'dataquieR_resultset2'
dim(x)
Arguments
x |
a |
Value
dimensions
Names of DQ dimensions
Description
a vector of data quality dimensions. The supported dimensions are Completeness, Consistency and Accuracy.
Usage
dimensions
Format
An object of class character
of length 3.
Value
Only a definition, not a function, so no return value
See Also
Names of a dataquieR
report object (v2.0)
Description
Names of a dataquieR
report object (v2.0)
Usage
## S3 method for class 'dataquieR_resultset2'
dimnames(x)
Arguments
x |
the result object |
Value
the names
Dimension Titles for Prefixes
Description
order does matter, because it defines the order in the dq_report2
.
Usage
dims
Format
An object of class character
of length 5.
See Also
Generate a full DQ report
Description
Deprecated
Usage
dq_report(...)
Arguments
... |
Deprecated |
Value
Deprecated
Generate a full DQ report, v2
Description
Generate a full DQ report, v2
Usage
dq_report2(
study_data,
item_level = "item_level",
label_col = LABEL,
meta_data_segment = "segment_level",
meta_data_dataframe = "dataframe_level",
meta_data_cross_item = "cross-item_level",
meta_data_item_computation = "item_computation_level",
meta_data = item_level,
meta_data_v2,
...,
dimensions = c("Completeness", "Consistency"),
cores = list(mode = "socket", logging = FALSE, cpus = util_detect_cores(),
load.balancing = TRUE),
specific_args = list(),
advanced_options = list(),
author = prep_get_user_name(),
title = "Data quality report",
subtitle = as.character(Sys.Date()),
user_info = NULL,
debug_parallel = FALSE,
resp_vars = character(0),
filter_indicator_functions = character(0),
filter_result_slots = c("^Summary", "^Segment", "^DataTypePlotList",
"^ReportSummaryTable", "^Dataframe", "^Result", "^VariableGroup"),
mode = c("default", "futures", "queue", "parallel"),
mode_args = list(),
notes_from_wrapper = list(),
storr_factory = NULL,
amend = FALSE,
cross_item_level,
`cross-item_level`,
segment_level,
dataframe_level,
item_computation_level,
.internal = rlang::env_inherits(rlang::caller_env(), parent.env(environment()))
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional: Data frame level metadata |
meta_data_cross_item |
data.frame – optional: Cross-item level metadata |
meta_data_item_computation |
data.frame optional. computation rules for computed variables. |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
... |
arguments to be passed to all called indicator functions if applicable. |
dimensions |
dimensions Vector of dimensions to address in the report. Allowed values in the vector are Completeness, Consistency, and Accuracy. The generated report will only cover the listed data quality dimensions. Accuracy is computationally expensive, so this dimension is not enabled by default. Completeness should be included if Consistency is included, and Consistency should be included if Accuracy is included, to avoid misleading detections (e.g., of missing codes as outliers); please refer to the data quality concept for more details. Integrity is always included. |
cores |
integer number of CPU cores to use, or a named list with arguments for parallelMap::parallelStart, or NULL if parallel processing has already been started by the caller. Can also be a cluster. |
specific_args |
list named list of arguments specifically for one of the called functions; the names of the list elements correspond to the indicator functions whose calls should be modified. The elements are lists of arguments. |
advanced_options |
list options to set during report computation,
see |
author |
character author for the report documents. |
title |
character optional argument to specify the title for the data quality report |
subtitle |
character optional argument to specify a subtitle for the data quality report |
user_info |
list additional info stored with the report, e.g., comments, title, ... |
debug_parallel |
logical print blocks currently evaluated in parallel |
resp_vars |
variable list the name of the measurement variables for the report. If missing, all variables will be used. Only item level indicator functions are filtered, so far. |
filter_indicator_functions |
character regular expressions, only if an indicator function's name matches one of these, it'll be used for the report. If of length zero, no filtering is performed. |
filter_result_slots |
character regular expressions, only if an indicator function's result's name matches one of these, it'll be used for the report. If of length zero, no filtering is performed. |
mode |
character work mode for parallel execution. default is
"default", the values mean:
- default: use |
mode_args |
list of arguments for the selected |
notes_from_wrapper |
list a list containing notes about changed labels
by |
storr_factory |
function |
amend |
logical if there is already data in. |
cross_item_level |
data.frame alias for |
segment_level |
data.frame alias for |
dataframe_level |
data.frame alias for |
item_computation_level |
data.frame alias for
|
.internal |
logical internal use, only. |
`cross-item_level` |
data.frame alias for |
Details
See dq_report_by for a way to generate stratified or split reports easily.
Value
a dataquieR_resultset2 that can be printed, creating an HTML report.
See Also
Examples
## Not run:
prep_load_workbook_like_file("inst/extdata/meta_data_v2.xlsx")
meta_data <- prep_get_data_frame("item_level")
meta_data_cross <- prep_get_data_frame("cross-item_level")
x <- dq_report2("study_data", dimensions = NULL, label_col = "LABEL")
xx <- pbapply::pblapply(x, util_eval_to_dataquieR_result, env = environment())
xx <- pbapply::pblapply(tail(x), util_eval_to_dataquieR_result, env = environment())
xx <- parallel
cat(vapply(x, deparse1, FUN.VALUE = character(1)), sep = "\n", file = "all_calls.txt")
rstudioapi::navigateToFile("all_calls.txt")
eval(x$`acc_multivariate_outlier.Blood pressure checks`)
prep_load_workbook_like_file("meta_data_v2")
rules <- tibble::tribble(
~resp_vars, ~RULE,
"BMI", '[BODY_WEIGHT_0]/(([BODY_HEIGHT_0]/100)^2)',
"R", '[WAIST_CIRC_0]/2/[pi]', # in m^3
"VOL_EST", '[pi]*([WAIST_CIRC_0]/2/[pi])^2*[BODY_HEIGHT_0] / 1000', # in l
)
prep_load_workbook_like_file("ship_meta_v2")
prep_add_data_frames(computed_items = rules)
r <- dq_report2("ship", dimensions = NULL, label_col = "LABEL")
## End(Not run)
Generate a stratified full DQ report
Description
Generate a stratified full DQ report
Usage
dq_report_by(
study_data,
item_level = "item_level",
meta_data_segment = "segment_level",
meta_data_dataframe = "dataframe_level",
meta_data_cross_item = "cross-item_level",
meta_data_item_computation = "item_computation_level",
missing_tables = NULL,
label_col,
meta_data_v2,
segment_column = NULL,
strata_column = NULL,
strata_select = NULL,
selection_type = NULL,
segment_select = NULL,
segment_exclude = NULL,
strata_exclude = NULL,
subgroup = NULL,
resp_vars = character(0),
id_vars = NULL,
advanced_options = list(),
storr_factory = NULL,
amend = FALSE,
...,
output_dir = NULL,
input_dir = NULL,
also_print = FALSE,
disable_plotly = FALSE,
view = TRUE,
meta_data = item_level,
cross_item_level,
`cross-item_level`,
segment_level,
dataframe_level,
item_computation_level
)
Arguments
study_data |
data.frame the data frame that contains the measurements:
it can be an R object (e.g., |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional if |
meta_data_cross_item |
data.frame – optional: Cross-item level metadata |
meta_data_item_computation |
data.frame – optional: Computed items metadata |
missing_tables |
character the name of the data frame containing the
missing codes, it can be a vector if more
than one table is provided. Example:
|
label_col |
variable attribute the name of the column in the metadata containing the labels of the variables |
meta_data_v2 |
character path or file name of the workbook like
metadata file, see
|
segment_column |
variable attribute name of a metadata attribute usable to split the report in sections of variables, e.g. all blood-pressure related variables. By default, reports are split by STUDY_SEGMENT if available and no segment_column nor strata_column or subgroup are defined. To create an un-split report please write explicitly the argument 'segment_column = NULL' |
strata_column |
variable name of a study variable to stratify the
report by, e.g. the study centers.
Both labels and |
strata_select |
character if given, the strata of strata_column are limited to the content of this vector. A character vector or a regular expression can be provided (e.g., "^a.*$"). This argument can not be used if no strata_column is provided |
selection_type |
character optional, can only be specified if a
|
segment_select |
character if given, the levels of segment_column are limited to the content of this vector. A character vector or a regular expression (e.g., ".*_EXAM$") can be provided. This argument can not be used if no segment_column is provided. |
segment_exclude |
character optional, can only be specified if a
|
strata_exclude |
character optional, can only be specified if a
|
subgroup |
character optional, to define subgroups of cases. Rules are
to be written as |
resp_vars |
variable the names of the measurement variables, if
missing or |
id_vars |
variable a vector containing the name/s of the variables
containing ids, to
be used to merge multiple data frames if provided
in |
advanced_options |
list options to set during report computation,
see |
storr_factory |
function |
amend |
logical if there is already data in. |
... |
arguments to be passed through to dq_report or dq_report2 |
output_dir |
character if given, the output is not returned but saved in this directory |
input_dir |
character if given, the study data files that have
no path and that are not URL are searched in
this directory. Also |
also_print |
logical if |
disable_plotly |
logical do not use |
view |
logical open the returned report |
meta_data |
data.frame old name for |
cross_item_level |
data.frame alias for |
segment_level |
data.frame alias for |
dataframe_level |
data.frame alias for |
item_computation_level |
data.frame alias for
|
`cross-item_level` |
data.frame alias for |
Value
invisibly, a named list of named lists of dq_report2 reports or, if output_dir has been specified, invisible(NULL)
See Also
Examples
## Not run: # really long-running example.
prep_load_workbook_like_file("meta_data_v2")
rep <- dq_report_by("study_data", label_col =
LABEL, strata_column = "CENTER_0")
rep <- dq_report_by("study_data",
label_col = LABEL, strata_column = "CENTER_0",
segment_column = NULL
)
unlink("/tmp/testRep/", force = TRUE, recursive = TRUE)
dq_report_by("study_data",
label_col = LABEL, strata_column = "CENTER_0",
segment_column = STUDY_SEGMENT, output_dir = "/tmp/testRep"
)
unlink("/tmp/testRep/", force = TRUE, recursive = TRUE)
dq_report_by("study_data",
label_col = LABEL, strata_column = "CENTER_0",
segment_column = NULL, output_dir = "/tmp/testRep"
)
dq_report_by("study_data",
label_col = LABEL,
segment_column = STUDY_SEGMENT, output_dir = "/tmp/testRep"
)
dq_report_by("study_data",
label_col = LABEL,
segment_column = STUDY_SEGMENT, output_dir = "/tmp/testRep",
also_print = TRUE
)
dq_report_by(study_data = "study_data", meta_data_v2 = "meta_data_v2",
advanced_options = list(dataquieR.study_data_cache_max = 0,
dataquieR.study_data_cache_metrics = TRUE,
dataquieR.study_data_cache_metrics_env = environment()),
cores = NULL, dimensions = "int")
dq_report_by(study_data = "study_data", meta_data_v2 = "meta_data_v2",
advanced_options = list(dataquieR.study_data_cache_max = 0),
cores = NULL, dimensions = "int")
## End(Not run)
HTML Dependency for report headers in clipboard
Description
HTML Dependency for report headers in clipboard
Usage
html_dependency_clipboard()
Value
the dependency
HTML Dependency for dataquieR
Description
generate all dependencies used in static dataquieR
reports
Usage
html_dependency_dataquieR(iframe = FALSE)
Arguments
iframe |
logical |
Value
the dependency
HTML Dependency for report headers in DT::datatable
Description
HTML Dependency for report headers in DT::datatable
Usage
html_dependency_report_dt()
Value
the dependency
HTML Dependency for tippy
Description
HTML Dependency for tippy
Usage
html_dependency_tippy()
Value
the dependency
HTML Dependency for vertical headers in DT::datatable
Description
HTML Dependency for vertical headers in DT::datatable
Usage
html_dependency_vert_dt()
Value
the dependency
Wrapper function to check for studies data structure
Description
This function tests for unexpected elements and records, as well as duplicated identifiers and content. The unexpected element record check can be conducted by providing the number of expected records or an additional table with the expected records. It is possible to conduct the checks by study segments or to consider only selected segments.
Usage
int_all_datastructure_dataframe(
meta_data_dataframe = "dataframe_level",
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
dataframe_level
)
Arguments
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
dataframe_level |
data.frame alias for |
Value
a list with
-
DataframeTable
: data frame with selected check results, used for the data quality report.
Examples
## Not run:
out_dataframe <- int_all_datastructure_dataframe(
meta_data_dataframe = "meta_data_dataframe",
meta_data = "ship_meta"
)
md0 <- prep_get_data_frame("ship_meta")
md0
md0$VAR_NAMES
md0$VAR_NAMES[[1]] <- "Id" # is this mismatch reported -- is the data frame
# also reported, if nothing is wrong with it?
out_dataframe <- int_all_datastructure_dataframe(
meta_data_dataframe = "meta_data_dataframe",
meta_data = md0
)
# This is the "normal" procedure inside the pipeline,
# but outside it, this function's check type is "exact" by default
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "subset_u")
lapply(setNames(nm = prep_get_data_frame("meta_data_dataframe")$DF_NAME),
int_sts_element_dataframe, meta_data = md0)
md0$VAR_NAMES[[1]] <-
"id" # is this mismatch reported -- is the data frame also reported,
# if nothing is wrong with it?
lapply(setNames(nm = prep_get_data_frame("meta_data_dataframe")$DF_NAME),
int_sts_element_dataframe, meta_data = md0)
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "exact")
## End(Not run)
Wrapper function to check for segment data structure
Description
This function tests for unexpected elements and records, as well as duplicated identifiers and content. The unexpected element record check can be conducted by providing the number of expected records or an additional table with the expected records. It is possible to conduct the checks by study segments or to consider only selected segments.
Usage
int_all_datastructure_segment(
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
segment_level,
meta_data_segment = "segment_level"
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
segment_level |
data.frame alias for |
meta_data_segment |
data.frame the data frame that contains the metadata for the segment level, mandatory |
Value
a list with
-
SegmentTable
: data frame with selected check results, used for the data quality report.
Examples
## Not run:
out_segment <- int_all_datastructure_segment(
meta_data_segment = "meta_data_segment",
study_data = "ship",
meta_data = "ship_meta"
)
study_data <- cars
meta_data <- dataquieR::prep_create_meta(VAR_NAMES = c("speedx", "distx"),
DATA_TYPE = c("integer", "integer"), MISSING_LIST = "|", JUMP_LIST = "|",
STUDY_SEGMENT = c("Intro", "Ex"))
out_segment <- int_all_datastructure_segment(
meta_data_segment = "meta_data_segment",
study_data = study_data,
meta_data = meta_data
)
## End(Not run)
Check declared data types of metadata in study data
Description
Checks data types of the study data and for the data type declared in the metadata
Usage
int_datatype_matrix(
resp_vars = NULL,
study_data,
label_col,
item_level = "item_level",
split_segments = FALSE,
max_vars_per_plot = 20,
threshold_value = 0,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable the names of the measurement variables, if
missing or |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
split_segments |
logical return one matrix per study segment |
max_vars_per_plot |
integer from=0. The maximum number of variables per single plot. |
threshold_value |
numeric from=0 to=100. percentage failing conversions allowed to still classify a study variable convertible. |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
This is a preparatory support function that compares study data with associated metadata. A prerequisite of this function is that the number of columns in the study data matches the number of rows in the metadata.
For each study variable, the function searches for its data type declared in static metadata and returns a heatmap like matrix indicating data type mismatches in the study data.
List function.
Value
a list with:
- SummaryTable: data frame containing the data quality check for "data type mismatch" (CLS_int_vfe_type, PCT_int_vfe_type). The following categories are possible: "Non-matching datatype", "Non-matching datatype, convertible", "Matching datatype"
- SummaryData: data frame containing the data quality check for "data type mismatch" for a report
- SummaryPlot: ggplot2::ggplot2 heatmap plot, graphical representation of SummaryTable
- DataTypePlotList: list of plots per (maybe artificial) segment
- ReportSummaryTable: data frame underlying SummaryPlot
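As this page has no example, here is a minimal sketch of a direct call; "study_data" and "meta_data_v2" are placeholder data frame names following the dq_report_by examples in this manual, not shipped data sets, and the threshold value is only illustrative:

```r
## Not run:
# Placeholder names; load your own metadata workbook and study data first.
prep_load_workbook_like_file("meta_data_v2")
res <- int_datatype_matrix(
  study_data = "study_data",
  label_col = LABEL,
  threshold_value = 5 # tolerate up to 5% failing conversions per variable
)
res$SummaryPlot # heatmap of data type mismatches per variable
## End(Not run)
```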
Check for duplicated content
Description
This function tests for duplicated entries in the data set. It is possible to check duplicated entries by study segments or to consider only selected segments.
Usage
int_duplicate_content(
level = c("dataframe", "segment"),
study_data,
item_level = "item_level",
label_col,
meta_data = item_level,
meta_data_v2,
...
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
... |
Depending on |
Value
a list. Depending on level
, see
util_int_duplicate_content_segment or
util_int_duplicate_content_dataframe for a description of the outputs.
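A sketched call at the segment level; "study_data" and "meta_data_v2" are placeholder names as in the dq_report_by examples, and further level-specific arguments can be passed via `...` (see the linked utility functions):

```r
## Not run:
# Placeholder names; check for duplicated rows within each study segment.
dup <- int_duplicate_content(
  level = "segment",
  study_data = "study_data",
  label_col = LABEL,
  meta_data_v2 = "meta_data_v2"
)
## End(Not run)
```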
Check for duplicated IDs
Description
This function tests for duplicated entries in identifiers. It is possible to check duplicated identifiers by study segments or to consider only selected segments.
Usage
int_duplicate_ids(
level = c("dataframe", "segment"),
study_data,
item_level = "item_level",
label_col,
meta_data = item_level,
meta_data_v2,
...
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
... |
Depending on |
Value
a list. Depending on level
, see
util_int_duplicate_ids_segment or
util_int_duplicate_ids_dataframe for a description of the outputs.
Encoding Errors
Description
Detects errors in the character encoding of string variables
Usage
int_encoding_errors(
resp_vars = NULL,
study_data,
label_col,
meta_data_dataframe = "dataframe_level",
item_level = "item_level",
ref_encs,
meta_data = item_level,
meta_data_v2,
dataframe_level
)
Arguments
resp_vars |
variable the names of the measurement variables, if
missing or |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
item_level |
data.frame the data frame that contains metadata attributes of study data |
ref_encs |
reference encodings (names are |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
dataframe_level |
data.frame alias for |
Details
Strings are stored based on code tables, nowadays typically as UTF-8. However, other code systems are still in use, so, sometimes, strings from different systems get mixed in the data. This indicator checks for such problems and returns the count of entries per variable that do not match the reference coding system, which is estimated from the study data (the addition of a metadata field is planned).
If not specified in the metadata (column ENCODING in item- or data-frame-level), the encoding is guessed from the data. Otherwise, it may be any supported encoding as returned by iconvlist().
Value
a list with:
- SummaryTable: data.frame with information on such problems
- SummaryData: data.frame, human-readable version of SummaryTable
- FlaggedStudyData: data.frame that tells for each entry in the study data whether its encoding is OK; has the same dimensions as study_data
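A sketched call; "study_data" and "meta_data_v2" are again placeholder data frame names, and no reference encodings are given, so they are estimated from the data:

```r
## Not run:
# Placeholder names; flag entries whose encoding deviates from the
# reference encoding estimated from (or declared for) each variable.
enc <- int_encoding_errors(
  study_data = "study_data",
  label_col = LABEL,
  meta_data_v2 = "meta_data_v2"
)
enc$SummaryData # human-readable per-variable counts
## End(Not run)
```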
Detect Expected Observations
Description
For each participant, check, if an observation was expected, given the
PART_VARS
from item-level metadata
Usage
int_part_vars_structure(
label_col,
study_data,
item_level = "item_level",
expected_observations = c("HIERARCHY", "SEGMENT"),
disclose_problem_paprt_var_data = FALSE,
meta_data = item_level,
meta_data_v2
)
Arguments
label_col |
character mapping attribute |
study_data |
study_data must have all relevant |
item_level |
meta_data must be complete to avoid false positives on
non-existing |
expected_observations |
enum HIERARCHY | SEGMENT. How should
|
disclose_problem_paprt_var_data |
logical show the problematic data
( |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
Value
empty list, so far – the function only warns.
Determine missing and/or superfluous data elements
Description
Depends on the dataquieR.ELEMENT_MISSMATCH_CHECKTYPE option, see there
Usage
int_sts_element_dataframe(
item_level = "item_level",
meta_data_dataframe = "dataframe_level",
meta_data = item_level,
meta_data_v2,
check_type = getOption("dataquieR.ELEMENT_MISSMATCH_CHECKTYPE",
dataquieR.ELEMENT_MISSMATCH_CHECKTYPE_default),
dataframe_level
)
Arguments
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
check_type |
enum none | exact | subset_u | subset_m. See dataquieR.ELEMENT_MISSMATCH_CHECKTYPE |
dataframe_level |
data.frame alias for |
Details
Value
list with named slots:
- DataframeData: data frame with the unexpected elements check results.
- DataframeTable: data.frame table with all errors, used for the data quality report:
  - PCT_int_sts_element: percentage of element mismatches
  - NUM_int_sts_element: number of element mismatches
  - resp_vars: affected element names
Examples
## Not run:
prep_load_workbook_like_file("~/tmp/df_level_test.xlsx")
meta_data_dataframe <- "dataframe_level"
meta_data <- "item_level"
## End(Not run)
Checks for element set
Description
Depends on the dataquieR.ELEMENT_MISSMATCH_CHECKTYPE option, see there.
Usage
int_sts_element_segment(
study_data,
item_level = "item_level",
label_col,
meta_data = item_level,
meta_data_v2
)
Arguments
study_data |
data.frame the data frame that contains the measurements, mandatory. |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
Value
a list with
- SegmentData: data frame with the unexpected elements check results; Segment: name of the corresponding segment, if applicable, ALL otherwise
- SegmentTable: data frame with the unexpected elements check results, used for the data quality report; Segment: name of the corresponding segment, if applicable, ALL otherwise
Examples
## Not run:
study_data <- cars
meta_data <- dataquieR::prep_create_meta(VAR_NAMES = c("speedx", "distx"),
DATA_TYPE = c("integer", "integer"), MISSING_LIST = "|", JUMP_LIST = "|",
STUDY_SEGMENT = c("Intro", "Ex"))
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "none")
int_sts_element_segment(study_data, meta_data)
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "exact")
int_sts_element_segment(study_data, meta_data)
study_data <- cars
meta_data <- dataquieR::prep_create_meta(VAR_NAMES = c("speedx", "distx"),
DATA_TYPE = c("integer", "integer"), MISSING_LIST = "|", JUMP_LIST = "|",
STUDY_SEGMENT = c("Intro", "Intro"))
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "none")
int_sts_element_segment(study_data, meta_data)
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "exact")
int_sts_element_segment(study_data, meta_data)
study_data <- cars
meta_data <- dataquieR::prep_create_meta(VAR_NAMES = c("speed", "distx"),
DATA_TYPE = c("integer", "integer"), MISSING_LIST = "|", JUMP_LIST = "|",
STUDY_SEGMENT = c("Intro", "Intro"))
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "none")
int_sts_element_segment(study_data, meta_data)
options(dataquieR.ELEMENT_MISSMATCH_CHECKTYPE = "exact")
int_sts_element_segment(study_data, meta_data)
## End(Not run)
Check for unexpected data element count
Description
This function contrasts the expected element number in each study in the metadata with the actual element number in each study data frame.
Usage
int_unexp_elements(
identifier_name_list,
data_element_count,
meta_data_dataframe = "dataframe_level",
meta_data_v2,
dataframe_level
)
Arguments
identifier_name_list |
character a character vector indicating the name of each study data frame, mandatory. |
data_element_count |
integer an integer vector with the number of expected data elements, mandatory. |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
meta_data_v2 |
character path to workbook like metadata file, see
|
dataframe_level |
data.frame alias for |
Value
a list with
-
DataframeData
: data frame with the results of the quality check for unexpected data elements -
DataframeTable
: data frame with selected unexpected data elements check results, used for the data quality report.
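A sketched call; the data frame name and the expected column count are illustrative placeholders, and dataframe-level metadata is assumed to be available under the default name "dataframe_level":

```r
## Not run:
# Placeholder name and count: expect 25 columns in "study_data".
chk <- int_unexp_elements(
  identifier_name_list = c("study_data"),
  data_element_count = c(25)
)
chk$DataframeTable # per-data-frame element count check results
## End(Not run)
```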
Check for unexpected data record count at the data frame level
Description
This function contrasts the expected record number in each study in the metadata with the actual record number in each study data frame.
Usage
int_unexp_records_dataframe(
identifier_name_list,
data_record_count,
meta_data_dataframe = "dataframe_level",
meta_data_v2,
dataframe_level
)
Arguments
identifier_name_list |
character a character vector indicating the name of each study data frame, mandatory. |
data_record_count |
integer an integer vector with the number of expected data records per study data frame, mandatory. |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
meta_data_v2 |
character path to workbook like metadata file, see
|
dataframe_level |
data.frame alias for |
Value
a list with
-
DataframeData
: data frame with the results of the quality check for unexpected data records -
DataframeTable
: data frame with selected unexpected data records check results, used for the data quality report.
Check for unexpected data record count within segments
Description
This function contrasts the expected record number in each study segment in the metadata with the actual record number in each segment data frame.
Usage
int_unexp_records_segment(
study_segment,
study_data,
label_col,
item_level = "item_level",
data_record_count,
meta_data = item_level,
meta_data_segment = "segment_level",
meta_data_v2,
segment_level
)
Arguments
study_segment |
character a character vector indicating the name of each study data frame, mandatory. |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
data_record_count |
integer an integer vector with the number of expected data records, mandatory. |
meta_data |
data.frame old name for |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_v2 |
character path to workbook like metadata file, see
|
segment_level |
data.frame alias for |
Details
The current implementation does not take jump or missing codes into account; rather, the function checks whether NAs are present in the study data.
Value
a list with
-
SegmentData
: data frame with the results of the quality check for unexpected data records -
SegmentTable
: data frame with selected unexpected data records check results, used for the data quality report.
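A sketched call; the segment name "PART_STUDY", the record count, and the data frame names are illustrative placeholders to be replaced by your own metadata:

```r
## Not run:
# Placeholder names and counts: expect 1000 records in segment "PART_STUDY".
seg <- int_unexp_records_segment(
  study_segment = c("PART_STUDY"),
  study_data = "study_data",
  meta_data_v2 = "meta_data_v2",
  data_record_count = c(1000)
)
seg$SegmentTable # per-segment record count check results
## End(Not run)
```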
Check for unexpected data record set
Description
This function tests that the identifiers match a provided record set. It is possible to check for unexpected data record sets by study segments or to consider only selected segments.
Usage
int_unexp_records_set(
level = c("dataframe", "segment"),
study_data,
item_level = "item_level",
label_col,
meta_data = item_level,
meta_data_v2,
...
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
... |
Depending on |
Value
a list. Depending on level
, see
util_int_unexp_records_set_segment or
util_int_unexp_records_set_dataframe for a description of the outputs.
.menu_env
– an environment for HTML menu creation
Description
used by the dq_report2-pipeline
Usage
.menu_env
Format
An object of class environment
of length 3.
Generate the menu for a report
Description
Generate the menu for a report
Arguments
pages |
encapsulated |
Value
the html-taglist
for the menu
Creates a drop-down menu
Description
Creates a drop-down menu
Arguments
title |
name of the entry in the main menu |
menu_description |
description, displayed, if the main menu entry itself is clicked |
... |
the sub-menu-entries |
id |
id for the entry, defaults to modified title |
Value
html div object
Create a single menu entry
Description
Create a single menu entry
Arguments
title |
of the entry |
id |
linked |
... |
additional arguments for the menu link |
Value
html-a-tag object
Data frame with metadata about the study data on variable level
Description
Variable level metadata.
See Also
further details on variable level metadata.
Well known columns on the meta_data_cross-item
sheet
Description
Metadata describing groups of variables, e.g., for their multivariate distribution or for defining contradiction rules.
See Also
Other meta_data_cross:
ASSOCIATION_DIRECTION
,
ASSOCIATION_FORM
,
ASSOCIATION_METRIC
,
ASSOCIATION_RANGE
,
CHECK_ID
,
CHECK_LABEL
,
CONTRADICTION_TERM
,
CONTRADICTION_TYPE
,
DATA_PREPARATION
,
GOLDSTANDARD
,
MULTIVARIATE_OUTLIER_CHECK
,
MULTIVARIATE_OUTLIER_CHECKTYPE
,
N_RULES
,
REL_VAL
,
VARIABLE_LIST
,
util_normalize_cross_item()
Well known columns on the meta_data_dataframe
sheet
Description
Metadata describing data delivered on one data frame/table sheet, e.g., a full questionnaire, not its items.
.meta_data_env
– an environment for easy metadata access
Description
used by the dq_report2-pipeline
Usage
.meta_data_env
Format
An object of class environment
of length 8.
See Also
meta_data_env_id_vars meta_data_env_co_vars meta_data_env_time_vars meta_data_env_group_vars
Extract co-variables for a given item
Description
Extract co-variables for a given item
Arguments
entity |
vector of item-identifiers |
Value
a vector with co-variables for each entity-entry, having the
explode
attribute set to FALSE
See Also
Extract MULTIVARIATE_OUTLIER_CHECK
for variable group
Description
Extract MULTIVARIATE_OUTLIER_CHECK
for variable group
Extract selected outlier criteria for a given item or variable group
Arguments
entity |
vector of item- or variable group identifiers |
Details
In the environment, target_meta_data
should be set either to
item_level
or to cross-item_level
.
Value
a vector with id-variables for each entity-entry, having the
explode
attribute set to FALSE
See Also
Extract group variables for a given item
Description
Extract group variables for a given item
Arguments
entity |
vector of item-identifiers |
Value
a vector with possible group-variables (can be more than one per
item) for each entity-entry, having the explode
attribute
set to TRUE
See Also
Extract id variables for a given item or variable group
Description
Extract id variables for a given item or variable group
Arguments
entity |
vector of item- or variable group identifiers |
Details
In the environment, target_meta_data
should be set either to
item_level
or to cross-item_level
.
Value
a vector with id-variables for each entity-entry, having the
explode
attribute set to FALSE
See Also
Extract outlier rules-number-threshold for a given item or variable group
Description
Extract outlier rules-number-threshold for a given item or variable group
Arguments
entity |
vector of item- or variable group identifiers |
Details
In the environment, target_meta_data
should be set either to
item_level
or to cross-item_level
.
Value
a vector with id-variables for each entity-entry, having the
explode
attribute set to FALSE
See Also
Extract measurement time variable for a given item
Description
Extract measurement time variable for a given item
Arguments
entity |
vector of item-identifiers |
Value
a vector with time-variables (usually one per item) for each
entity-entry, having the explode
attribute set to TRUE
See Also
Well known columns on the meta_data_segment
sheet
Description
Metadata describing study segments, e.g., a full questionnaire, not its items.
return the number of result slots in a report
Description
return the number of result slots in a report
Usage
nres(x)
Arguments
x |
the |
Value
the number of used result slots
Convert a pipeline result data frame to named encapsulated lists
Description
Deprecated
Usage
pipeline_recursive_result(...)
Arguments
... |
Deprecated |
Value
Deprecated
Call (nearly) one "Accuracy" function with many parameterizations at once automatically
Description
Deprecated
Usage
pipeline_vectorized(...)
Arguments
... |
Deprecated |
Value
Deprecated
Plot a dataquieR
summary
Description
Plot a dataquieR
summary
Usage
## S3 method for class 'dataquieR_summary'
plot(x, y, ..., filter, dont_plot = FALSE, stratify_by)
Arguments
x |
the |
y |
not yet used |
... |
not yet used |
filter |
if given, this filters the summary, e.g.,
|
dont_plot |
suppress the actual plotting, just return a printable
object derived from |
stratify_by |
column to stratify the summary, may be one string. |
Value
invisible html object
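A sketched usage, assuming a summary of a dq_report2 report yields the dataquieR summary object expected here; "study_data", "meta_data_v2", and the stratification column are placeholders:

```r
## Not run:
# Placeholder names; create a report, summarize it, and plot the summary
# stratified by a metadata column.
rep <- dq_report2("study_data", meta_data_v2 = "meta_data_v2")
s <- summary(rep)
plot(s, stratify_by = "STUDY_SEGMENT")
## End(Not run)
```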
Utility function to plot a combined figure for distribution checks
Description
Data quality indicator checks "Unexpected location" with histograms and plots of empirical cumulative distributions for the subgroups.
Usage
prep_acc_distributions_with_ecdf(
resp_vars = NULL,
group_vars = NULL,
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot",
dataquieR.max_group_var_levels_in_plot_default),
n_obs_per_group_min = getOption("dataquieR.min_obs_per_group_var_in_plot",
dataquieR.min_obs_per_group_var_in_plot_default)
)
Arguments
resp_vars |
variable list the name of the measurement variable |
group_vars |
variable list the name of the observer, device or reader variable |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
n_group_max |
maximum number of categories to be displayed individually
for the grouping variable ( |
n_obs_per_group_min |
minimum number of data points per group to create
a graph for an individual category of the |
Value
A SummaryPlot
.
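A sketched call; the response and grouping variable labels are hypothetical placeholders, as are the data frame names:

```r
## Not run:
# Placeholder variable names; plot histograms and empirical cumulative
# distributions of a measurement, split by an observer/device variable.
p <- prep_acc_distributions_with_ecdf(
  resp_vars = "SBP_0",
  group_vars = "OBSERVER_0",
  study_data = "study_data",
  label_col = LABEL,
  meta_data_v2 = "meta_data_v2"
)
p
## End(Not run)
```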
Convert missing codes in metadata format v1.0 and a missing-cause-table to v2.0 missing list / jump list assignments
Description
The function has two working modes. If replace_meta_data
is TRUE
(the default if cause_label_df
contains a column named resp_vars
), then the missing/jump codes in
meta_data[, c(MISSING_CODES, JUMP_CODES)]
will be overwritten; otherwise,
they will be labeled using the cause_label_df
.
Usage
prep_add_cause_label_df(
item_level = "item_level",
cause_label_df,
label_col = VAR_NAMES,
assume_consistent_codes = TRUE,
replace_meta_data = ("resp_vars" %in% colnames(cause_label_df)),
meta_data = item_level,
meta_data_v2
)
Arguments
item_level |
data.frame the data frame that contains metadata attributes of study data |
cause_label_df |
data.frame missing code table. If missing codes have labels the respective data frame can be specified here, see cause_label_df |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
assume_consistent_codes |
logical if TRUE and no labels are given and the same missing/jump code is used for more than one variable, the labels assigned for this code will be the same for all variables. |
replace_meta_data |
logical if |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
If a column resp_vars
exists, then rows with a value in resp_vars
will
only be used for the corresponding variable.
Value
data.frame updated metadata including all the code labels in missing/jump lists
See Also
Insert missing codes for NA
s based on rules
Description
Insert missing codes for NA
s based on rules
Usage
prep_add_computed_variables(
study_data,
meta_data,
label_col,
rules,
use_value_labels
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
rules |
data.frame with the columns:
|
use_value_labels |
logical In rules for factors, use the value labels,
not the codes. Defaults to |
Value
a list
with the entry:
-
ModifiedStudyData
: Study data with the new variables
Examples
## Not run:
study_data <- prep_get_data_frame("ship")
prep_load_workbook_like_file("ship_meta_v2")
meta_data <- prep_get_data_frame("item_level")
rules <- tibble::tribble(
~VAR_NAMES, ~RULE,
"BMI", '[BODY_WEIGHT_0]/(([BODY_HEIGHT_0]/100)^2)',
"R", '[WAIST_CIRC_0]/2/[pi]', # in m^3
"VOL_EST", '[pi]*([WAIST_CIRC_0]/2/[pi])^2*[BODY_HEIGHT_0] / 1000', # in l
)
r <- prep_add_computed_variables(study_data, meta_data,
label_col = "LABEL", rules, use_value_labels = FALSE)
## End(Not run)
Add data frames to the pre-loaded / cache data frame environment
Description
These can then be referred to by their names: wherever dataquieR
expects a data.frame, just pass a character instead. If this character is not
found, dataquieR
will additionally look for files with that name and for
URLs
. You can also refer to a specific sheet of a workbook or a specific
object from an RData
file by appending a pipe symbol and its name. A second
pipe symbol allows extracting certain columns from such sheets (but
they will remain data frames).
Usage
prep_add_data_frames(..., data_frame_list = list())
Arguments
... |
data frames, if passed with names, these will be the names of these tables in the data frame environment. If not, then the names in the calling environment will be used. |
data_frame_list |
a named list with data frames. Also these will be
added and names will be handled as for the |
Value
data.frame invisible(the cache environment)
See Also
Other data-frame-cache:
prep_get_data_frame()
,
prep_list_dataframes()
,
prep_load_folder_with_metadata()
,
prep_load_workbook_like_file()
,
prep_purge_data_frame_cache()
,
prep_remove_from_cache()
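The cache mechanism can be sketched with a built-in data set; here, `cars` stands in for real study data:

```r
## Not run:
# Register a data frame in the cache under its own name, then refer to
# it by that name wherever a data.frame is expected.
prep_add_data_frames(cars)
head(prep_get_data_frame("cars"))

# A named list assigns explicit cache names.
prep_add_data_frames(data_frame_list = list(my_cars = cars))
prep_get_data_frame("my_cars")
## End(Not run)
```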
Insert missing codes for NA
s based on rules
Description
Insert missing codes for NA
s based on rules
Usage
prep_add_missing_codes(
resp_vars,
study_data,
meta_data_v2,
item_level = "item_level",
label_col,
rules,
use_value_labels,
overwrite = FALSE,
meta_data = item_level
)
Arguments
resp_vars |
variable list the name of the measurement variables to be
modified, all from |
study_data |
data.frame the data frame that contains the measurements |
meta_data_v2 |
character path to workbook like metadata file, see
|
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
rules |
data.frame with the columns:
|
use_value_labels |
logical In rules for factors, use the value labels,
not the codes. Defaults to |
overwrite |
logical Also insert missing codes, if the values are not
|
meta_data |
data.frame old name for |
Value
a list
with the entries:
- ModifiedStudyData: study data with NAs replaced by the CODE_VALUE
- ModifiedMetaData: metadata having the new codes amended in the columns JUMP_LIST or MISSING_LIST, respectively
Support function to augment metadata during data quality reporting
Description
adds an annotation to static metadata
Usage
prep_add_to_meta(
VAR_NAMES,
DATA_TYPE,
LABEL,
VALUE_LABELS,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
...
)
Arguments
VAR_NAMES |
character Names of the Variables to add |
DATA_TYPE |
character Data type for the added variables |
LABEL |
character Labels for these variables |
VALUE_LABELS |
character Value labels for the values of the variables
as usually pipe separated and assigned with
|
item_level |
data.frame the metadata to extend |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
... |
Further defined variable attributes, see prep_create_meta |
Details
Add metadata, e.g., of a transformed/new variable. This function is not yet considered stable, but we already export it, because it could help. Therefore, we still have some inconsistencies in the formals.
Value
a data frame with amended metadata.
Re-Code labels with their respective codes according to the meta_data
Description
Re-Code labels with their respective codes according to the meta_data
Usage
prep_apply_coding(
study_data,
meta_data_v2,
item_level = "item_level",
meta_data = item_level
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
meta_data_v2 |
character path to workbook like metadata file, see
|
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
Value
data.frame modified study data with labels replaced by the codes
Check for package updates
Description
Check for package updates
Usage
prep_check_for_dataquieR_updates(
beta = FALSE,
deps = TRUE,
ask = interactive()
)
Arguments
beta |
logical check for beta version too |
deps |
logical check for missing (optional) dependencies |
ask |
logical ask for updates |
Value
invisible(NULL)
Verify and normalize metadata on data frame level
Description
If possible, mismatching data types are converted ("true" becomes TRUE).
Usage
prep_check_meta_data_dataframe(
meta_data_dataframe = "dataframe_level",
meta_data_v2,
dataframe_level
)
Arguments
meta_data_dataframe |
data.frame data frame or path/url of a metadata sheet for the data frame level |
meta_data_v2 |
character path to workbook like metadata file, see
|
dataframe_level |
data.frame alias for |
Details
Missing columns are added, filled with NA, if this is valid, i.e., n.a. for DF_NAME as the key column.
Value
standardized metadata sheet as data frame
Examples
## Not run:
mds <- prep_check_meta_data_dataframe("ship_meta_dataframe|dataframe_level") # also converts
print(mds)
prep_check_meta_data_dataframe(mds)
mds1 <- mds
mds1$DF_RECORD_COUNT <- NULL
print(prep_check_meta_data_dataframe(mds1)) # fixes the missing column by NAs
mds1 <- mds
mds1$DF_UNIQUE_ROWS[[2]] <- "xxx" # not convertible
# print(prep_check_meta_data_dataframe(mds1)) # fail
mds1 <- mds
mds1$DF_UNIQUE_ID[[2]] <- 12
# print(prep_check_meta_data_dataframe(mds1)) # fail
## End(Not run)
Verify and normalize metadata on segment level
Description
If possible, mismatching data types are converted ("true" becomes TRUE).
Usage
prep_check_meta_data_segment(
meta_data_segment = "segment_level",
meta_data_v2,
segment_level
)
Arguments
meta_data_segment |
data.frame data frame or path/url of a metadata sheet for the segment level |
meta_data_v2 |
character path to workbook like metadata file, see
|
segment_level |
data.frame alias for |
Details
Missing columns are added, filled with NA, if this is valid, i.e., n.a. for STUDY_SEGMENT as the key column.
Value
standardized metadata sheet as data frame
Examples
## Not run:
mds <- prep_check_meta_data_segment("ship_meta_v2|segment_level") # also converts
print(mds)
prep_check_meta_data_segment(mds)
mds1 <- mds
mds1$SEGMENT_RECORD_COUNT <- NULL
print(prep_check_meta_data_segment(mds1)) # fixes the missing column by NAs
mds1 <- mds
mds1$SEGMENT_UNIQUE_ROWS[[2]] <- "xxx" # not convertible
# print(prep_check_meta_data_segment(mds1)) # fail
## End(Not run)
Checks the validity of metadata w.r.t. the provided column names
Description
This function verifies whether a data frame complies with metadata conventions and provides a given richness of meta information as specified by level.
Usage
prep_check_meta_names(
item_level = "item_level",
level,
character.only = FALSE,
meta_data = item_level,
meta_data_v2
)
Arguments
item_level |
data.frame the data frame that contains metadata attributes of study data |
level |
enum level of requirement (see also VARATT_REQUIRE_LEVELS).
set to |
character.only |
logical a logical indicating whether level can be assumed to be character strings. |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
Note that only the given level is checked, even though levels are somewhat hierarchical.
Value
a logical with:
invisible(TRUE). In case of problems with the metadata, a condition is raised (stop()).
Examples
## Not run:
prep_check_meta_names(data.frame(VAR_NAMES = 1, DATA_TYPE = 2,
MISSING_LIST = 3))
prep_check_meta_names(
data.frame(
VAR_NAMES = 1, DATA_TYPE = 2, MISSING_LIST = 3,
LABEL = "LABEL", VALUE_LABELS = "VALUE_LABELS",
JUMP_LIST = "JUMP_LIST", HARD_LIMITS = "HARD_LIMITS",
GROUP_VAR_OBSERVER = "GROUP_VAR_OBSERVER",
GROUP_VAR_DEVICE = "GROUP_VAR_DEVICE",
TIME_VAR = "TIME_VAR",
PART_VAR = "PART_VAR",
STUDY_SEGMENT = "STUDY_SEGMENT",
LOCATION_RANGE = "LOCATION_RANGE",
LOCATION_METRIC = "LOCATION_METRIC",
PROPORTION_RANGE = "PROPORTION_RANGE",
MISSING_LIST_TABLE = "MISSING_LIST_TABLE",
CO_VARS = "CO_VARS",
LONG_LABEL = "LONG_LABEL"
),
RECOMMENDED
)
prep_check_meta_names(
data.frame(
VAR_NAMES = 1, DATA_TYPE = 2, MISSING_LIST = 3,
LABEL = "LABEL", VALUE_LABELS = "VALUE_LABELS",
JUMP_LIST = "JUMP_LIST", HARD_LIMITS = "HARD_LIMITS",
GROUP_VAR_OBSERVER = "GROUP_VAR_OBSERVER",
GROUP_VAR_DEVICE = "GROUP_VAR_DEVICE",
TIME_VAR = "TIME_VAR",
PART_VAR = "PART_VAR",
STUDY_SEGMENT = "STUDY_SEGMENT",
LOCATION_RANGE = "LOCATION_RANGE",
LOCATION_METRIC = "LOCATION_METRIC",
PROPORTION_RANGE = "PROPORTION_RANGE",
DETECTION_LIMITS = "DETECTION_LIMITS", SOFT_LIMITS = "SOFT_LIMITS",
CONTRADICTIONS = "CONTRADICTIONS", DISTRIBUTION = "DISTRIBUTION",
DECIMALS = "DECIMALS", VARIABLE_ROLE = "VARIABLE_ROLE",
DATA_ENTRY_TYPE = "DATA_ENTRY_TYPE",
CO_VARS = "CO_VARS",
END_DIGIT_CHECK = "END_DIGIT_CHECK",
VARIABLE_ORDER = "VARIABLE_ORDER", LONG_LABEL =
"LONG_LABEL", recode = "recode",
MISSING_LIST_TABLE = "MISSING_LIST_TABLE"
),
OPTIONAL
)
# Next one will fail
try(
prep_check_meta_names(data.frame(VAR_NAMES = 1, DATA_TYPE = 2,
MISSING_LIST = 3), TECHNICAL)
)
## End(Not run)
Support function to scan variable labels for applicability
Description
Adjust labels in meta_data to be valid variable names in formulas for various R functions, such as glm or lme4::lmer.
Usage
prep_clean_labels(
label_col,
item_level = "item_level",
no_dups = FALSE,
meta_data = item_level,
meta_data_v2
)
Arguments
label_col |
character label attribute to adjust or character vector to
adjust, depending on |
item_level |
data.frame metadata data frame: If |
no_dups |
logical disallow duplicates in input or output vectors of
the function, then, prep_clean_labels would call
|
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Details
Hint: The following is still true, but the functions should be capable of doing potentially needed fixes on the fly automatically, so you will likely not need this function anymore.
Currently, labels as given by the label_col argument of most functions are directly used in formulas, so that they become a natural part of the outputs, but different models expect differently strict syntax for such formulas, especially for valid variable names. prep_clean_labels removes all potentially inadmissible characters from variable names (no guarantee that some exotic model still rejects the names, but the number of exotic characters is minimized). However, variable names are modified and may become unreadable or indistinguishable from other variable names. For the latter case, a stop call is possible, controlled by the no_dups argument.
A warning is emitted if modifications were necessary.
Value
a data.frame with:
- if meta_data is set, a list with the modified meta_data[, label_col] column
- if meta_data is not set, adjusted labels that were then directly given in label_col
Examples
## Not run:
meta_data1 <- data.frame(
LABEL =
c(
"syst. Blood pressure (mmHg) 1",
"1st heart frequency in MHz",
"body surface (\\u33A1)"
)
)
print(meta_data1)
print(prep_clean_labels(meta_data1$LABEL))
meta_data1 <- prep_clean_labels("LABEL", meta_data1)
print(meta_data1)
## End(Not run)
Combine two report summaries
Description
Combine two report summaries
Usage
prep_combine_report_summaries(..., summaries_list, amend_segment_names = FALSE)
Arguments
... |
objects returned by prep_extract_summary |
summaries_list |
if given, list of objects returned by prep_extract_summary |
amend_segment_names |
logical use names of the |
Value
combined summaries
See Also
Other summary_functions:
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Verify item-level metadata
Description
Are the provided item-level meta_data plausible given the study_data?
Usage
prep_compare_meta_with_study(
study_data,
label_col,
item_level = "item_level",
meta_data = item_level,
meta_data_v2
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
an invisible() list with the entries:
- pred: data.frame metadata predicted from study_data, reduced to such metadata also available in the provided metadata
- prov: data.frame provided metadata, reduced to such metadata also available in the provided study_data
- ml_error: character VAR_NAMES of variables with potentially wrong MISSING_LIST
- sl_error: character VAR_NAMES of variables with potentially wrong SCALE_LEVEL
- dt_error: character VAR_NAMES of variables with potentially wrong DATA_TYPE
Support function to create data.frames of metadata
Description
Create a metadata data frame and map names.
Generally, this function only creates a data.frame, but using this constructor instead of calling data.frame(..., stringsAsFactors = FALSE) makes it possible to adapt the metadata data.frame in later developments, e.g., if we decide to use classes for the metadata, or if certain standard names of variable attributes change. Also, a validity check could be implemented here.
Usage
prep_create_meta(..., stringsAsFactors = FALSE, level, character.only = FALSE)
Arguments
... |
named column vectors, names will be mapped using WELL_KNOWN_META_VARIABLE_NAMES, if included in WELL_KNOWN_META_VARIABLE_NAMES can also be a data frame, then its column names will be mapped using WELL_KNOWN_META_VARIABLE_NAMES |
stringsAsFactors |
logical if the argument is a list of vectors, a
data frame will be
created. In this case, |
level |
enum level of requirement (see also VARATT_REQUIRE_LEVELS)
set to |
character.only |
logical a logical indicating whether level can be assumed to be character strings. |
Details
For now, this calls data.frame, but it already renames variable attributes if they have a different name assigned in WELL_KNOWN_META_VARIABLE_NAMES, e.g., WELL_KNOWN_META_VARIABLE_NAMES$RECODE maps to recode in lower case.
NB: dataquieR exports all names from WELL_KNOWN_META_VARIABLE_NAMES as symbols, so RECODE also contains "recode".
Value
a data frame with metadata attribute names mapped and metadata checked using prep_check_meta_names, plus some further verification of conventions, such as checking for valid intervals in limits.
See Also
WELL_KNOWN_META_VARIABLE_NAMES
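A minimal sketch of creating a metadata frame with prep_create_meta; the call mirrors the one shown under prep_map_labels in this manual, and the column values are illustrative only.

```r
## Not run:
md <- prep_create_meta(
  VAR_NAMES = c("ID", "AGE"),            # variable names in the study data
  LABEL = c("Pseudo-ID", "Age"),         # human-readable labels
  DATA_TYPE = c(DATA_TYPES$INTEGER, DATA_TYPES$INTEGER),
  MISSING_LIST = ""                      # no missing codes defined here
)
print(md)
## End(Not run)
```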
Instantiate a new metadata file
Description
Instantiate a new metadata file
Usage
prep_create_meta_data_file(
file_name,
study_data,
open = TRUE,
overwrite = FALSE
)
Arguments
file_name |
character file path to write to |
study_data |
data.frame optional, study data to guess metadata from |
open |
logical open the file after creation |
overwrite |
logical overwrite |
Value
invisible(NULL)
Create a factory function for storr objects for backing a dataquieR_resultset2
Description
Create a factory function for storr objects for backing a dataquieR_resultset2
Usage
prep_create_storr_factory(db_dir = tempfile(), namespace = "objects")
Arguments
db_dir |
character path to the directory for the back-end, if one is created on the fly. |
namespace |
character namespace for the report, so that one back-end can back several reports the returned function will try to create a |
Value
storr object or NULL, if package storr is not available
Get data types from data
Description
Get data types from data
Usage
prep_datatype_from_data(
resp_vars = colnames(study_data),
study_data,
.dont_cast_off_cols = FALSE
)
Arguments
resp_vars |
variable names of the variables to fetch the data type from the data |
study_data |
data.frame the data frame that contains the measurements Hint: Only data frames supported, no URL or file names. |
.dont_cast_off_cols |
logical internal use, only |
Value
vector of data types
Examples
## Not run:
dataquieR::prep_datatype_from_data(cars)
## End(Not run)
Convert two vectors from a code-value-table to a key-value list
Description
Convert two vectors from a code-value-table to a key-value list
Usage
prep_deparse_assignments(
codes,
labels = codes,
split_char = SPLIT_CHAR,
mode = c("numeric_codes", "string_codes")
)
Arguments
codes |
codes, numeric or dates (as default, but string codes can be enabled using the option 'mode', see below) |
labels |
character labels, same length as codes |
split_char |
character split character character to split code assignments |
mode |
character one of two options to insist on numeric or datetime codes (default) or to allow for string codes |
Value
a vector with assignment strings for each row of cbind(codes, labels)
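A short sketch of calling prep_deparse_assignments; the codes and labels are invented for illustration, and the exact output strings depend on SPLIT_CHAR and the chosen mode.

```r
## Not run:
# returns one assignment string per code/label pair
# (exact format depends on split_char and mode)
prep_deparse_assignments(
  codes = c(99980, 99981),
  labels = c("refused", "unknown")
)
## End(Not run)
```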
Get the dataquieR DATA_TYPE of x
Description
Get the dataquieR DATA_TYPE of x
Usage
prep_dq_data_type_of(x)
Arguments
x |
object to define the dataquieR data type of |
Value
the dataquieR data type as listed in DATA_TYPES
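An illustrative sketch of querying the dataquieR data type of a few base R objects; the exact strings returned are those defined in DATA_TYPES and are not asserted here.

```r
## Not run:
prep_dq_data_type_of(42L)        # an integer vector
prep_dq_data_type_of(3.14)       # a floating point vector
prep_dq_data_type_of(Sys.time()) # a datetime object
## End(Not run)
```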
Expand code labels across variables
Description
Code labels are copied from other variables, if the code is the same and the label is set only for some variables
Usage
prep_expand_codes(
item_level = "item_level",
suppressWarnings = FALSE,
mix_jumps_and_missings = FALSE,
meta_data_v2,
meta_data = item_level
)
Arguments
item_level |
data.frame the data frame that contains metadata attributes of study data |
suppressWarnings |
logical show warnings, if labels are expanded |
mix_jumps_and_missings |
logical ignore the class of the codes for label expansion, i.e., use missing code labels as jump code labels, if the values are the same. |
meta_data_v2 |
character path to workbook like metadata file, see
|
meta_data |
data.frame old name for |
Value
data.frame an updated metadata data frame.
Examples
## Not run:
meta_data <- prep_get_data_frame("meta_data")
meta_data$JUMP_LIST[meta_data$VAR_NAMES == "v00003"] <- "99980 = NOOP"
md <- prep_expand_codes(meta_data)
md$JUMP_LIST
md$MISSING_LIST
md <- prep_expand_codes(meta_data, mix_jumps_and_missings = TRUE)
md$JUMP_LIST
md$MISSING_LIST
meta_data <- prep_get_data_frame("meta_data")
meta_data$MISSING_LIST[meta_data$VAR_NAMES == "v00003"] <- "99980 = NOOP"
md <- prep_expand_codes(meta_data)
md$JUMP_LIST
md$MISSING_LIST
## End(Not run)
Extract all missing/jump codes from metadata and export a cause-label-data-frame
Description
Extract all missing/jump codes from metadata and export a cause-label-data-frame
Usage
prep_extract_cause_label_df(
item_level = "item_level",
label_col = VAR_NAMES,
meta_data_v2,
meta_data = item_level
)
Arguments
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data_v2 |
character path to workbook like metadata file, see
|
meta_data |
data.frame old name for |
Value
list with the entries:
- meta_data: data.frame a data frame that contains updated metadata – you still need to add a column MISSING_LIST_TABLE and add the cause_label_df as such to the metadata cache using prep_add_data_frames(), manually.
- cause_label_df: data.frame missing code table. If missing codes have labels, the respective data frame is specified here, see cause_label_df.
Extract old function based summary from data quality results
Description
Extract old function based summary from data quality results
Usage
prep_extract_classes_by_functions(r)
Arguments
r |
Value
data.frame long format, compatible with prep_summary_to_classes()
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Extract summary from data quality results
Description
Generic function, currently supports dq_report2 and dataquieR_result
Usage
prep_extract_summary(r, ...)
Arguments
r |
dq_report2 or dataquieR_result object |
... |
further arguments, maybe needed for some implementations |
Value
list with two slots Data and Table with data.frames featuring all metrics columns from the report or result in x, the STUDY_SEGMENT and the VAR_NAMES. In case of Data, the columns are formatted nicely but still with the standardized column names – use util_translate_indicator_metrics() to rename them nicely. In case of Table, they are just as they are.
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Extract report summary from reports
Description
Extract report summary from reports
Usage
## S3 method for class 'dataquieR_result'
prep_extract_summary(r, ...)
Arguments
r |
dataquieR_result a result from a dq_report2 report |
... |
not used |
Value
list with two slots Data and Table with data.frames featuring all metrics columns from the report r, the STUDY_SEGMENT and the VAR_NAMES. In case of Data, the columns are formatted nicely but still with the standardized column names – use util_translate_indicator_metrics() to rename them nicely. In case of Table, they are just as they are.
See Also
prep_combine_report_summaries()
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Extract report summary from reports
Description
Extract report summary from reports
Usage
## S3 method for class 'dataquieR_resultset2'
prep_extract_summary(r, ...)
Arguments
r |
dq_report2 a dq_report2 report |
... |
not used |
Value
list with two slots Data and Table with data.frames featuring all metrics columns from the report r, the STUDY_SEGMENT and the VAR_NAMES. In case of Data, the columns are formatted nicely but still with the standardized column names – use util_translate_indicator_metrics() to rename them nicely. In case of Table, they are just as they are.
See Also
prep_combine_report_summaries()
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Read data from files/URLs
Description
data_frame_name can be a file path or a URL. You can append a pipe and a sheet name for Excel files, or an object name, e.g., for RData files. Numbers may also work. All file formats supported by your rio installation will work.
Usage
prep_get_data_frame(
data_frame_name,
.data_frame_list = .dataframe_environment(),
keep_types = FALSE,
column_names_only = FALSE
)
Arguments
data_frame_name |
character name of the data frame to read, see details |
.data_frame_list |
environment cache for loaded data frames |
keep_types |
logical keep types as possibly defined in a file, if the
data frame is loaded from one. set |
column_names_only |
logical if TRUE imports only headers (column names) of the data frame and no content (an empty data frame) |
Details
The data frames will be cached automatically; you can define an alternative environment for this using the argument .data_frame_list, and you can purge the cache using prep_purge_data_frame_cache.
Use prep_add_data_frames to manually add data frames to the cache, e.g., if you have loaded them from more complex sources before.
Value
data.frame a data frame
See Also
Other data-frame-cache:
prep_add_data_frames()
,
prep_list_dataframes()
,
prep_load_folder_with_metadata()
,
prep_load_workbook_like_file()
,
prep_purge_data_frame_cache()
,
prep_remove_from_cache()
Examples
## Not run:
bl <- as.factor(prep_get_data_frame(
paste0("https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus",
"/Projekte_RKI/COVID-19_Todesfaelle.xlsx?__blob=",
"publicationFile|COVID_Todesfälle_BL|Bundesland"))[[1]])
n <- as.numeric(prep_get_data_frame(paste0(
"https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/",
"Projekte_RKI/COVID-19_Todesfaelle.xlsx?__blob=",
"publicationFile|COVID_Todesfälle_BL|Anzahl verstorbene",
" COVID-19 Fälle"))[[1]])
plot(bl, n)
# Working names would be to date (2022-10-21), e.g.:
#
# https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/ \
# Projekte_RKI/COVID-19_Todesfaelle.xlsx?__blob=publicationFile
# https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/ \
# Projekte_RKI/COVID-19_Todesfaelle.xlsx?__blob=publicationFile|2
# https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/ \
# Projekte_RKI/COVID-19_Todesfaelle.xlsx?__blob=publicationFile|name
# study_data
# ship
# meta_data
# ship_meta
#
prep_get_data_frame("meta_data | meta_data")
## End(Not run)
Fetch a label for a variable based on its purpose
Description
Fetch a label for a variable based on its purpose
Usage
prep_get_labels(
resp_vars,
item_level = "item_level",
label_col,
max_len = MAX_LABEL_LEN,
label_class = c("SHORT", "LONG"),
label_lang = getOption("dataquieR.lang", ""),
resp_vars_are_var_names_only = FALSE,
resp_vars_match_label_col_only = FALSE,
meta_data = item_level,
meta_data_v2,
force_label_col = getOption("dataquieR.force_label_col",
dataquieR.force_label_col_default)
)
Arguments
resp_vars |
variable list the variable names to fetch for |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
max_len |
integer the maximum label length to return, if not possible w/o causing ambiguous labels, the labels may still be longer |
label_class |
enum SHORT | LONG. which sort of label according to the metadata model should be returned |
label_lang |
character optional language suffix, if available in
the metadata. Can be controlled by the option
|
resp_vars_are_var_names_only |
logical If |
resp_vars_match_label_col_only |
logical If |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
force_label_col |
enum auto | FALSE | TRUE. if |
Value
character suitable labels for each resp_vars
, names of this
vector are VAR_NAMES
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2")
prep_get_labels("SEX_0", label_class = "SHORT", max_len = 2)
## End(Not run)
Get data frame for a given segment
Description
Get data frame for a given segment
Usage
prep_get_study_data_segment(
segment,
study_data,
item_level = "item_level",
meta_data = item_level,
meta_data_v2,
segment_level,
meta_data_segment = "segment_level"
)
Arguments
segment |
character name of the segment to return data for |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
segment_level |
data.frame alias for |
meta_data_segment |
data.frame – optional: Segment level metadata |
Value
data.frame the data for the segment
Return the logged-in User's Full Name
Description
If whoami is not installed, the user name from Sys.info() is returned.
Usage
prep_get_user_name()
Details
Can be overridden by options or environment:
options(FULLNAME = "Stephan Struckmann")
Sys.setenv(FULLNAME = "Stephan Struckmann")
Value
character the user's name
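A brief sketch of overriding the detected user name via an option, as described under Details; the name shown is a placeholder.

```r
## Not run:
options(FULLNAME = "Jane Doe")  # placeholder name, overrides auto-detection
prep_get_user_name()
## End(Not run)
```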
Get machine variant for snapshot tests
Description
Get machine variant for snapshot tests
Usage
prep_get_variant()
Value
character the variant
Guess encoding of text or text files
Description
Guess encoding of text or text files
Usage
prep_guess_encoding(x, file)
Arguments
x |
character string to guess encoding for |
file |
character file to guess encoding for |
Value
encoding
Prepare a label as part of a link for RMD files
Description
Prepare a label as part of a link for RMD files
Usage
prep_link_escape(s, html = FALSE)
Arguments
s |
the label |
html |
prepare the label for direct |
Value
the escaped label
List Loaded Data Frames
Description
List Loaded Data Frames
Usage
prep_list_dataframes()
Value
names of all loaded data frames
See Also
Other data-frame-cache:
prep_add_data_frames()
,
prep_get_data_frame()
,
prep_load_folder_with_metadata()
,
prep_load_workbook_like_file()
,
prep_purge_data_frame_cache()
,
prep_remove_from_cache()
All valid voc: vocabularies
Description
All valid voc: vocabularies
Usage
prep_list_voc()
Value
character() all voc: suffixes allowed for prep_get_data_frame().
Examples
## Not run:
prep_list_dataframes()
prep_list_voc()
prep_get_data_frame("<ICD10>")
my_voc <-
tibble::tribble(
~ voc, ~ url,
"test", "data:datasets|iris|Species+Sepal.Length")
prep_add_data_frames(`<>` = my_voc)
prep_list_dataframes()
prep_list_voc()
prep_get_data_frame("<test>")
prep_get_data_frame("<ICD10>")
my_voc <-
tibble::tribble(
~ voc, ~ url,
"ICD10", "data:datasets|iris|Species+Sepal.Length")
prep_add_data_frames(`<>` = my_voc)
prep_list_dataframes()
prep_list_voc()
prep_get_data_frame("<ICD10>")
## End(Not run)
Pre-load a folder with named (usually more than) one table(s)
Description
These can thereafter be referred to by their names only. Such files are, e.g., spreadsheet workbooks or RData files.
Usage
prep_load_folder_with_metadata(folder, keep_types = FALSE, ...)
Arguments
folder |
the folder name to load. |
keep_types |
logical keep types as possibly defined in the file.
set |
... |
arguments passed to [] |
Details
Note that, in contrast to prep_get_data_frame, this function does not support selecting specific sheets/columns from a file.
Value
invisible(the cache environment)
See Also
Other data-frame-cache:
prep_add_data_frames()
,
prep_get_data_frame()
,
prep_list_dataframes()
,
prep_load_workbook_like_file()
,
prep_purge_data_frame_cache()
,
prep_remove_from_cache()
Load a dq_report2
Description
Load a dq_report2
Usage
prep_load_report(file)
Arguments
file |
character the file name to load from |
Value
dataquieR_resultset2 the report
Load a report from a back-end
Description
Load a report from a back-end
Usage
prep_load_report_from_backend(
namespace = "objects",
db_dir,
storr_factory = prep_create_storr_factory(namespace = namespace, db_dir = db_dir)
)
Arguments
namespace |
the namespace to read the report's results from |
db_dir |
character path to the directory for the back-end, if
a |
storr_factory |
a function returning a |
Value
dataquieR_resultset2 the report
Examples
## Not run:
r <- dataquieR::dq_report2("study_data", meta_data_v2 = "meta_data_v2",
dimensions = NULL)
storr_factory <- prep_create_storr_factory()
r_storr <- prep_set_backend(r, storr_factory)
r_restorr <- prep_set_backend(r_storr, NULL)
r_loaded <- prep_load_report_from_backend(storr_factory = storr_factory)
## End(Not run)
Pre-load a file with named (usually more than) one table(s)
Description
These can thereafter be referred to by their names only. Such files are, e.g., spreadsheet workbooks or RData files.
Usage
prep_load_workbook_like_file(file, keep_types = FALSE)
Arguments
file |
the file name to load. |
keep_types |
logical keep types as possibly defined in the file.
set |
Details
Note that, in contrast to prep_get_data_frame, this function does not support selecting specific sheets/columns from a file.
Value
invisible(the cache environment)
See Also
Other data-frame-cache:
prep_add_data_frames()
,
prep_get_data_frame()
,
prep_list_dataframes()
,
prep_load_folder_with_metadata()
,
prep_purge_data_frame_cache()
,
prep_remove_from_cache()
Support function to allocate labels to variables
Description
Map variables to certain attributes, e.g. by default their labels.
Usage
prep_map_labels(
x,
item_level = "item_level",
to = LABEL,
from = VAR_NAMES,
ifnotfound,
warn_ambiguous = FALSE,
meta_data_v2,
meta_data = item_level
)
Arguments
x |
character variable names, character vector, see parameter from |
item_level |
data.frame metadata data frame, if, as a |
to |
character variable attribute to map to |
from |
character variable identifier to map from |
ifnotfound |
list A list of values to be used if the item is not found: it will be coerced to a list if necessary. |
warn_ambiguous |
logical print a warning if mapping variables from
|
meta_data_v2 |
character path to workbook like metadata file, see
|
meta_data |
data.frame old name for |
Details
This function basically calls colnames(study_data) <- meta_data$LABEL, ensuring correct merging/joining of study data columns to the corresponding metadata rows, even if the orders differ. If a variable/study_data-column name is not found in meta_data[[from]] (default from = VAR_NAMES), either stop is called or, if ifnotfound has been assigned a value, that value is returned. See mget, which is used internally by this function.
The function not only maps to the LABEL column; to can be any metadata variable attribute, so the function can also be used to get, e.g., all HARD_LIMITS from the metadata.
Value
a character vector with:
mapped values
Examples
## Not run:
meta_data <- prep_create_meta(
VAR_NAMES = c("ID", "SEX", "AGE", "DOE"),
LABEL = c("Pseudo-ID", "Gender", "Age", "Examination Date"),
DATA_TYPE = c(DATA_TYPES$INTEGER, DATA_TYPES$INTEGER, DATA_TYPES$INTEGER,
DATA_TYPES$DATETIME),
MISSING_LIST = ""
)
stopifnot(all(prep_map_labels(c("AGE", "DOE"), meta_data) == c("Age",
"Examination Date")))
## End(Not run)
Merge a list of study data frames to one (sparse) study data frame
Description
Merge a list of study data frames to one (sparse) study data frame
Usage
prep_merge_study_data(study_data_list)
Arguments
study_data_list |
list the list |
Value
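Conceptually, a sparse merge stacks segment-wise data frames into one wide table, with NA wherever a variable was not measured for an observation. A minimal sketch with invented toy data (this is base R, not the package's actual implementation):

```r
# Two study segments measured on partly disjoint participants/variables
seg1 <- data.frame(ID = 1:3, SBP = c(120, 130, 125))
seg2 <- data.frame(ID = 2:4, LAB = c(5.1, 4.8, 5.6))
# A full outer join keeps all rows and columns; unmeasured cells become NA,
# which is what makes the combined study data frame "sparse".
merged <- merge(seg1, seg2, by = "ID", all = TRUE)
merged
# With dataquieR, the analogous call would presumably be:
# prep_merge_study_data(list(seg1, seg2))
```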
Convert item-level metadata from v1.0 to v2.0
Description
This function is idempotent.
Usage
prep_meta_data_v1_to_item_level_meta_data(
item_level = "item_level",
verbose = TRUE,
label_col = LABEL,
cause_label_df,
meta_data = item_level
)
Arguments
item_level |
data.frame the old item-level-metadata |
verbose |
logical display all estimated decisions, defaults to |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
cause_label_df |
data.frame missing code table, see cause_label_df. Optional. If this argument is given, you can add missing code tables. |
meta_data |
data.frame old name for |
Details
The option options("dataquieR.force_item_specific_missing_codes") (default FALSE) tells the system to always fill in res_vars columns in the MISSING_LIST_TABLE, even if the column already exists but is empty.
Value
data.frame the updated metadata
Support function to identify the levels of a process variable with minimum number of observations
Description
utility function to subset data based on minimum number of observation per level
Usage
prep_min_obs_level(study_data, group_vars, min_obs_in_subgroup)
Arguments
study_data |
data.frame the data frame that contains the measurements |
group_vars |
variable list the name of the grouping variable |
min_obs_in_subgroup |
integer optional argument if a "group_var" is used. This argument specifies the minimum no. of observations that is required to include a subgroup (level) of the "group_var" in the analysis. Subgroups with less observations are excluded. The default is 30. |
Details
This function removes all observations belonging to levels of a group variable with fewer than min_obs_in_subgroup observations, e.g. blood pressure measurements performed by an examiner with fewer than e.g. 50 measurements. It displays a warning if samples/rows are removed and returns the modified study data frame.
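The filtering rule described above can be sketched in plain base R (a conceptual stand-in, not the package's implementation; the examiner data are invented):

```r
# Drop all observations from levels with too few observations
study_data <- data.frame(
  examiner = rep(c("A", "B", "C"), times = c(40, 35, 5)),
  sbp      = rnorm(80, mean = 128, sd = 12)
)
min_obs_in_subgroup <- 30
# levels of the group variable meeting the minimum observation count
keep_levels <- names(which(table(study_data$examiner) >= min_obs_in_subgroup))
subset_data <- study_data[study_data$examiner %in% keep_levels, ]
table(subset_data$examiner)  # examiner "C" (5 observations) has been dropped
```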
Value
a data frame with:
a subsample of original data
Open a data frame in Excel
Description
Open a data frame in Excel
Usage
prep_open_in_excel(dfr)
Arguments
dfr |
the data frame |
Details
If the file cannot be read back in when the function exits, NULL will be returned.
Value
potentially modified data frame after dialog was closed
Support function for a parallel pmap
Description
parallel version of purrr::pmap
Usage
prep_pmap(.l, .f, ..., cores = 0)
Arguments
.l |
data.frame with one call per line and one function argument per column |
.f |
|
... |
additional, static arguments for calling |
cores |
number of cpu cores to use or a (named) list with arguments for parallelMap::parallelStart or NULL, if parallel has already been started by the caller. Set to 0 to run without parallelization. |
Value
list of results of the function calls
Author(s)
S Struckmann
See Also
purrr::pmap
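The calling convention follows from the Arguments table: one row of .l per call, one column per argument of .f. A sketch of that semantics (with cores = 0, i.e. no parallelization, as documented above; the base-R Map line only illustrates the equivalent sequential behavior):

```r
# One call per row, one function argument per column
calls <- data.frame(x = 1:3, y = c(10, 20, 30))
# equivalent to list(f(1, 10), f(2, 20), f(3, 30))
dataquieR::prep_pmap(calls, function(x, y) x + y, cores = 0)
# a base-R stand-in for the same semantics:
Map(function(x, y) x + y, calls$x, calls$y)
```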
Prepare and verify study data with metadata
Description
This function ensures that a data frame ds1 with suitable variable names exists and that study_data and meta_data exist as base data.frames.
Usage
prep_prepare_dataframes(
.study_data,
.meta_data,
.label_col,
.replace_hard_limits,
.replace_missings,
.sm_code = NULL,
.allow_empty = FALSE,
.adjust_data_type = TRUE,
.amend_scale_level = TRUE,
.apply_factor_metadata = FALSE,
.apply_factor_metadata_inadm = FALSE,
.internal = rlang::env_inherits(rlang::caller_env(), parent.env(environment()))
)
Arguments
.study_data |
if provided, use this data set as study_data |
.meta_data |
if provided, use this data set as meta_data |
.label_col |
if provided, use this as label_col |
.replace_hard_limits |
replace |
.replace_missings |
replace missing codes, defaults to |
.sm_code |
missing code for |
.allow_empty |
allow |
.adjust_data_type |
ensure that the data type of variables in the study data corresponds to their data type specified in the metadata |
.amend_scale_level |
ensure that |
.apply_factor_metadata |
logical convert categorical variables to labeled factors. |
.apply_factor_metadata_inadm |
logical convert categorical variables
to labeled factors keeping
inadmissible values. Implies, that
.apply_factor_metadata will be set
to |
.internal |
logical internally called, modify caller's environment. |
Details
This function defines ds1 and modifies study_data and meta_data in the environment of its caller (see eval.parent). It also defines or modifies the object label_col in the calling environment. Almost all functions exported by dataquieR call this function initially, so that aspects common to all functions live here, e.g. testing whether an argument meta_data has been given and really is a data.frame. It verifies the existence of required metadata attributes (VARATT_REQUIRE_LEVELS). It can also replace missing codes by NAs, and calls prep_study2meta to generate a minimum set of metadata from the study data on the fly (this set should be amended, so on-the-fly-calling is not recommended for an instructive use of dataquieR).
The function also detects tibbles, which are then converted to base-R data.frames, which are expected by dataquieR.
If .internal is TRUE, unlike the other utility functions that work in their caller's environment, this function modifies objects in the calling function's environment: it defines a new object ds1, and it modifies study_data and/or meta_data and label_col.
Value
ds1
the study data with mapped column names
See Also
acc_margins
Examples
## Not run:
acc_test1 <- function(resp_variable, aux_variable,
time_variable, co_variables,
group_vars, study_data, meta_data) {
prep_prepare_dataframes()
invisible(ds1)
}
acc_test2 <- function(resp_variable, aux_variable,
time_variable, co_variables,
group_vars, study_data, meta_data, label_col) {
ds1 <- prep_prepare_dataframes(study_data, meta_data)
invisible(ds1)
}
environment(acc_test1) <- asNamespace("dataquieR")
# perform this inside the package (not needed for functions that have been
# integrated with the package already)
environment(acc_test2) <- asNamespace("dataquieR")
# perform this inside the package (not needed for functions that have been
# integrated with the package already)
acc_test3 <- function(resp_variable, aux_variable, time_variable,
co_variables, group_vars, study_data, meta_data,
label_col) {
prep_prepare_dataframes()
invisible(ds1)
}
acc_test4 <- function(resp_variable, aux_variable, time_variable,
co_variables, group_vars, study_data, meta_data,
label_col) {
ds1 <- prep_prepare_dataframes(study_data, meta_data)
invisible(ds1)
}
environment(acc_test3) <- asNamespace("dataquieR")
# perform this inside the package (not needed for functions that have been
# integrated with the package already)
environment(acc_test4) <- asNamespace("dataquieR")
# perform this inside the package (not needed for functions that have been
# integrated with the package already)
meta_data <- prep_get_data_frame("meta_data")
study_data <- prep_get_data_frame("study_data")
try(acc_test1())
try(acc_test2())
acc_test1(study_data = study_data)
try(acc_test1(meta_data = meta_data))
try(acc_test2(study_data = 12, meta_data = meta_data))
print(head(acc_test1(study_data = study_data, meta_data = meta_data)))
print(head(acc_test2(study_data = study_data, meta_data = meta_data)))
print(head(acc_test3(study_data = study_data, meta_data = meta_data)))
print(head(acc_test3(study_data = study_data, meta_data = meta_data,
label_col = LABEL)))
print(head(acc_test4(study_data = study_data, meta_data = meta_data)))
print(head(acc_test4(study_data = study_data, meta_data = meta_data,
label_col = LABEL)))
try(acc_test2(study_data = NULL, meta_data = meta_data))
## End(Not run)
Clear data frame cache
Description
Clear data frame cache
Usage
prep_purge_data_frame_cache()
Value
nothing
See Also
Other data-frame-cache:
prep_add_data_frames(), prep_get_data_frame(), prep_list_dataframes(),
prep_load_folder_with_metadata(), prep_load_workbook_like_file(),
prep_remove_from_cache()
Remove a specified element from the data frame cache
Description
Remove a specified element from the data frame cache
Usage
prep_remove_from_cache(object_to_remove)
Arguments
object_to_remove |
character name of the object to be removed as character string (quoted), or character vector containing the names of the objects to remove from the cache |
Value
nothing
See Also
Other data-frame-cache:
prep_add_data_frames(), prep_get_data_frame(), prep_list_dataframes(),
prep_load_folder_with_metadata(), prep_load_workbook_like_file(),
prep_purge_data_frame_cache()
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2") #load metadata in the cache
ls(.dataframe_environment()) #get the list of dataframes in the cache
#remove cross-item_level from the cache
prep_remove_from_cache("cross-item_level")
#remove dataframe_level and expected_id from the cache
prep_remove_from_cache(c("dataframe_level", "expected_id"))
#remove missing_table and segment_level from the cache
x<- c("missing_table", "segment_level")
prep_remove_from_cache(x)
## End(Not run)
Create a ggplot2 pie chart
Description
Create a ggplot2 pie chart
Usage
prep_render_pie_chart_from_summaryclasses_ggplot2(
data,
meta_data = "item_level"
)
Arguments
data |
data as returned by |
meta_data |
Value
a ggplot2::ggplot2 plot
See Also
Other summary_functions:
prep_combine_report_summaries(), prep_extract_classes_by_functions(),
prep_extract_summary(), prep_extract_summary.dataquieR_result(),
prep_extract_summary.dataquieR_resultset2(),
prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(),
util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(),
util_get_category_for_result(), util_get_colors(),
util_get_labels_grading_class(), util_get_message_for_result(),
util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(),
util_html_table(), util_sort_by_order()
Create a plotly pie chart
Description
Create a plotly pie chart
Usage
prep_render_pie_chart_from_summaryclasses_plotly(
data,
meta_data = "item_level"
)
Arguments
data |
data as returned by |
meta_data |
Value
an htmltools compatible object
See Also
Other summary_functions:
prep_combine_report_summaries(), prep_extract_classes_by_functions(),
prep_extract_summary(), prep_extract_summary.dataquieR_result(),
prep_extract_summary.dataquieR_resultset2(),
prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_summary_to_classes(),
util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(),
util_get_category_for_result(), util_get_colors(),
util_get_labels_grading_class(), util_get_message_for_result(),
util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(),
util_html_table(), util_sort_by_order()
Guess the data type of a vector
Description
Guess the data type of a vector
Usage
prep_robust_guess_data_type(x, k = 50, it = 200)
Arguments
x |
a vector with characters |
k |
numeric sample size, if less than |
it |
integer number of iterations when taking samples |
Value
a guess of the data type of x. An attribute orig_type is also attached to give the more detailed guess returned by readr::guess_parser().
Algorithm
This function takes x and tries to guess the data type of random subsets of this vector using readr::guess_parser(). The RNG is initialized with a constant, so the function stays deterministic. It performs such sub-sample-based checks it times; the majority among the detected data types determines the guessed data type.
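The majority-vote algorithm described above can be sketched as follows. This is a simplified stand-in, not the package's actual implementation: the seed value 42 and the NA handling are assumptions; only the overall scheme (repeated sampling, readr::guess_parser() as per-sample oracle, majority vote) follows the text.

```r
# Sketch: deterministic, sub-sample-based majority vote over guessed types
robust_guess <- function(x, k = 50, it = 200) {
  votes <- withr::with_seed(42, {      # constant seed keeps results deterministic
    replicate(it, {
      smp <- sample(x, size = min(k, length(x)))
      readr::guess_parser(as.character(smp))  # e.g. "double", "character", ...
    })
  })
  names(which.max(table(votes)))       # the most frequent guess wins
}
robust_guess(c("1", "2", "3", "x"))
```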
Save a dq_report2
Description
Save a dq_report2
Usage
prep_save_report(report, file, compression_level = 3)
Arguments
report |
dataquieR_resultset2 the report |
file |
character the file name to write to |
compression_level |
integer from=0 to=9. Compression level. 9 is very slow. |
Value
invisible(NULL)
Heuristics to amend a SCALE_LEVEL column and a UNIT column in the metadata
Description
...if missing
Usage
prep_scalelevel_from_data_and_metadata(
resp_vars = lifecycle::deprecated(),
study_data,
item_level = "item_level",
label_col = LABEL,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list deprecated, the function always addresses all variables. |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
data.frame modified metadata
Examples
## Not run:
prep_load_workbook_like_file("meta_data_v2")
prep_scalelevel_from_data_and_metadata(study_data = "study_data")
## End(Not run)
Change the back-end of a report
Description
With this function, you can move a report from/to a storr storage.
Usage
prep_set_backend(r, storr_factory = NULL, amend = FALSE)
Arguments
r |
dataquieR_resultset2 the report |
storr_factory |
|
amend |
logical if there is already data in. |
Value
dataquieR_resultset2 but now with the desired back-end
Guess a metadata data frame from study data.
Description
Guess a minimum metadata data frame from study data. Minimum required variable attributes are:
Usage
prep_study2meta(
study_data,
level = c(VARATT_REQUIRE_LEVELS$REQUIRED, VARATT_REQUIRE_LEVELS$RECOMMENDED),
cumulative = TRUE,
convert_factors = FALSE,
guess_missing_codes = getOption("dataquieR.guess_missing_codes",
dataquieR.guess_missing_codes_default)
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
level |
enum levels to provide (see also VARATT_REQUIRE_LEVELS) |
cumulative |
logical include attributes of all levels up to level |
convert_factors |
logical convert factor columns to coded integers. if selected, then also the study data will be updated and returned. |
guess_missing_codes |
logical try to guess missing codes from the data |
Details
dataquieR:::util_get_var_att_names_of_level(VARATT_REQUIRE_LEVELS$REQUIRED)
#>            VAR_NAMES            DATA_TYPE   MISSING_LIST_TABLE
#>          "VAR_NAMES"          "DATA_TYPE" "MISSING_LIST_TABLE"
The function also tries to detect missing codes.
Value
a meta_data data frame or a list with study data and metadata, if convert_factors == TRUE.
Examples
## Not run:
dataquieR::prep_study2meta(Orange, convert_factors = FALSE)
## End(Not run)
Classify metrics from a report summary table
Description
Classify metrics from a report summary table
Usage
prep_summary_to_classes(report_summary)
Arguments
report_summary |
|
Value
data.frame classes for the report summary table, long format
See Also
Other summary_functions:
prep_combine_report_summaries(), prep_extract_classes_by_functions(),
prep_extract_summary(), prep_extract_summary.dataquieR_result(),
prep_extract_summary.dataquieR_resultset2(),
prep_render_pie_chart_from_summaryclasses_ggplot2(),
prep_render_pie_chart_from_summaryclasses_plotly(),
util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(),
util_get_category_for_result(), util_get_colors(),
util_get_labels_grading_class(), util_get_message_for_result(),
util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(),
util_html_table(), util_sort_by_order()
Prepare a label as part of a title text for RMD files
Description
Prepare a label as part of a title text for RMD files
Usage
prep_title_escape(s, html = FALSE)
Arguments
s |
the label |
html |
prepare the label for direct |
Value
the escaped label
Remove data disclosing details
Description
new function: no warranty, so far.
Usage
prep_undisclose(x)
Arguments
x |
an object to un-disclose, a |
Value
undisclosed object
Combine all missing and value lists to one big table
Description
Combine all missing and value lists to one big table
Usage
prep_unsplit_val_tabs(meta_data = "item_level", val_tab = NULL)
Arguments
meta_data |
data.frame item level meta data to be used, defaults to
|
val_tab |
character name of the table being created: This table will
be added to the data frame cache (or overwritten). If |
Value
data.frame the combined table
Get value labels from data
Description
Detects factors and converts them to compatible metadata/study data.
Usage
prep_valuelabels_from_data(resp_vars = colnames(study_data), study_data)
Arguments
resp_vars |
variable names of the variables to fetch the value labels from the data |
study_data |
data.frame the data frame that contains the measurements |
Value
a list with:
- VALUE_LABELS: vector of value labels and modified study data
- ModifiedStudyData: study data with factors as integers
Examples
## Not run:
dataquieR::prep_valuelabels_from_data(study_data = iris)
## End(Not run)
Print a DataSlot object
Description
Print a DataSlot object
Usage
## S3 method for class 'DataSlot'
print(x, ...)
Arguments
x |
the object |
... |
not used |
Value
see print
print implementation for the class ReportSummaryTable
Description
Use this function to print results objects of the class ReportSummaryTable.
Usage
## S3 method for class 'ReportSummaryTable'
print(
x,
relative = lifecycle::deprecated(),
dt = FALSE,
fillContainer = FALSE,
displayValues = FALSE,
view = TRUE,
...,
flip_mode = "auto"
)
Arguments
x |
|
relative |
deprecated |
dt |
logical use |
fillContainer |
logical if |
displayValues |
logical if |
view |
logical if |
... |
not used, yet |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped or
auto-flipped. Not all options are always supported.
In general, this can be controlled by
setting the |
Value
the printed object
See Also
base::print
Print a Slot object
Description
Displays all warnings and messages, then it prints x.
Usage
## S3 method for class 'Slot'
print(x, ...)
Arguments
x |
the object |
... |
not used |
Value
calls the next print method
Print a StudyDataSlot object
Description
Print a StudyDataSlot object
Usage
## S3 method for class 'StudyDataSlot'
print(x, ...)
Arguments
x |
the object |
... |
not used |
Value
see print
Print a TableSlot object
Description
Print a TableSlot object
Usage
## S3 method for class 'TableSlot'
print(x, ...)
Arguments
x |
the object |
... |
not used |
Value
see print
Print a dataquieR result returned by dq_report2
Description
Print a dataquieR result returned by dq_report2
Usage
## S3 method for class 'dataquieR_result'
print(x, ...)
Arguments
x |
list a dataquieR result from dq_report2 or util_eval_to_dataquieR_result |
... |
passed to print. Additionally, the argument |
Value
see print
See Also
Generate a RMarkdown-based report from a dataquieR report
Description
Generate a RMarkdown-based report from a dataquieR report
Usage
## S3 method for class 'dataquieR_resultset'
print(...)
Arguments
... |
deprecated |
Value
deprecated
Generate a HTML-based report from a dataquieR report
Description
Generate a HTML-based report from a dataquieR report
Usage
## S3 method for class 'dataquieR_resultset2'
print(
x,
dir,
view = TRUE,
disable_plotly = FALSE,
block_load_factor = 4,
advanced_options = list(),
dashboard = NA,
...
)
Arguments
x |
|
dir |
character directory to store the rendered report's files, a temporary one, if omitted. Directory will be created, if missing, files may be overwritten inside that directory |
view |
logical display the report |
disable_plotly |
logical do not use |
block_load_factor |
numeric multiply size of parallel compute blocks by this factor. |
advanced_options |
list options to set during report computation,
see |
dashboard |
logical dashboard mode: |
... |
additional arguments: |
Value
file names of the generated report's HTML files
Print a dataquieR summary
Description
Print a dataquieR summary
Usage
## S3 method for class 'dataquieR_summary'
print(
x,
...,
grouped_by = c("call_names", "indicator_metric"),
dont_print = FALSE,
folder_of_report = NULL
)
Arguments
x |
the |
... |
not yet used |
grouped_by |
define the columns of the resulting matrix. It can be either "call_names", one column per function, or "indicator_metric", one column per indicator or both c("call_names", "indicator_metric"). The last combination is the default |
dont_print |
suppress the actual printing, just return a printable
object derived from |
folder_of_report |
a named vector with the location of variable and call_names |
Value
invisible html object
print implementation for the class interval
Description
Such objects, for now, only occur in REDCap rules, so this function is meant for internal use, mostly – for now.
Usage
## S3 method for class 'interval'
print(x, ...)
Arguments
x |
|
... |
not used yet |
Value
the printed object
See Also
base::print
Print a list of dataquieR_result objects
Description
Print a list of dataquieR_result objects
Usage
## S3 method for class 'list'
print(x, ...)
Arguments
x |
|
... |
passed to other implementations |
Value
undefined
Print a master_result object
Description
Print a master_result object
Usage
## S3 method for class 'master_result'
print(x, ...)
Arguments
x |
the object |
... |
not used |
Value
invisible(NULL)
Check applicability of DQ functions on study data
Description
Checks applicability of DQ functions based on study data and metadata characteristics
Usage
pro_applicability_matrix(
study_data,
item_level = "item_level",
split_segments = FALSE,
label_col,
max_vars_per_plot = 20,
meta_data_segment,
meta_data_dataframe,
flip_mode = "noflip",
meta_data_v2,
meta_data = item_level,
segment_level,
dataframe_level
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
split_segments |
logical return one matrix per study segment |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
max_vars_per_plot |
integer from=0. The maximum number of variables per single plot. |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional: Data frame level metadata |
flip_mode |
enum default | flip | noflip | auto. Should the plot be
in default orientation, flipped, not flipped or
auto-flipped. Not all options are always supported.
In general, this can be controlled by
setting the |
meta_data_v2 |
character path to workbook like metadata file, see
|
meta_data |
data.frame old name for |
segment_level |
data.frame alias for |
dataframe_level |
data.frame alias for |
Details
This is a preparatory support function that compares study data with associated metadata. A prerequisite of this function is that the no. of columns in the study data complies with the no. of rows in the metadata.
For each existing R-implementation, the function searches for necessary static metadata and returns a heatmap like matrix indicating the applicability of each data quality implementation.
In addition, the data type defined in the metadata is compared with the observed data type in the study data.
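A hedged usage sketch, using the "study_data" and "meta_data" example frames referenced elsewhere in this manual (their availability in the data frame cache is an assumption); argument names follow the Usage section above:

```r
# Applicability check: study data columns must correspond to metadata rows
library(dataquieR)
study_data <- prep_get_data_frame("study_data")
meta_data  <- prep_get_data_frame("meta_data")
appl <- pro_applicability_matrix(
  study_data     = study_data,
  item_level     = meta_data,
  split_segments = TRUE          # one matrix per study segment
)
appl$ApplicabilityPlot           # heatmap-like matrix of applicability classes
```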
Value
a list with:
- SummaryTable: data frame about the applicability of each indicator function (each function in a column). Its integer values can be one of the following five categories: 0. Non-matching datatype + Incomplete metadata, 1. Non-matching datatype + complete metadata, 2. Matching datatype + Incomplete metadata, 3. Matching datatype + complete metadata, 4. Not applicable according to data type
- ApplicabilityPlot: ggplot2::ggplot2 heatmap plot, graphical representation of SummaryTable
- ApplicabilityPlotList: list of plots per (maybe artificial) segment
- ReportSummaryTable: data frame underlying ApplicabilityPlot
Combine ReportSummaryTable outputs
Description
Using this rbind implementation, you can combine different heatmap-like results of the class ReportSummaryTable.
Usage
## S3 method for class 'ReportSummaryTable'
rbind(...)
Arguments
... |
|
See Also
Return names of result slots (e.g., 3rd dimension of dataquieR results)
Description
Return names of result slots (e.g., 3rd dimension of dataquieR results)
Usage
resnames(x)
Arguments
x |
the objects |
Value
character vector with names
Return names of result slots (e.g., 3rd dimension of dataquieR results)
Description
Return names of result slots (e.g., 3rd dimension of dataquieR results)
Usage
## S3 method for class 'dataquieR_resultset2'
resnames(x)
Arguments
x |
the objects |
Value
character vector with names
Data frame with the study data whose quality is being assessed
Description
Study data is expected in wide format. It should contain all variables for all segments in one large table, even if some variables are not measured for all observational units (study participants).
Summarize a dataquieR report
Description
Deprecated
Usage
## S3 method for class 'dataquieR_resultset'
summary(...)
Arguments
... |
Deprecated |
Value
Deprecated
Generate a report summary table
Description
Generate a report summary table
Usage
## S3 method for class 'dataquieR_resultset2'
summary(
object,
aspect = c("applicability", "error", "anamat", "indicator_or_descriptor"),
FUN,
collapse = "\n<br />\n",
...
)
Arguments
object |
a square result set |
aspect |
an aspect/problem category of results |
FUN |
function to apply to the cells of the result table |
collapse |
passed to |
... |
not used |
Value
a summary of a dataquieR
report
Examples
## Not run:
util_html_table(summary(report),
filter = "top", options = list(scrollCollapse = TRUE, scrollY = "75vh"),
is_matrix_table = TRUE, rotate_headers = TRUE, output_format = "HTML"
)
## End(Not run)
Utility function for 3SD deviations rule
Description
This function calculates outliers according to the rule of 3SD deviations.
Usage
util_3SD(x)
Arguments
x |
numeric data to check for outliers |
Value
binary vector
See Also
Other outlier_functions:
util_hubert(), util_sigmagap(), util_tukey()
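The 3SD rule named above can be sketched in a few lines of base R; the internal util_3SD() may differ in details (e.g. NA handling), so treat this as an illustration of the rule, not of the implementation:

```r
# Flag values more than 3 standard deviations away from the mean
three_sd_outliers <- function(x) {
  as.integer(abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE))
}
x <- c(rnorm(100), 25)       # 100 ordinary values plus one gross outlier
res <- three_sd_outliers(x)  # binary vector: 1 = outlier, 0 = inlier
res[101]                     # the gross outlier is flagged
```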
Abbreviate snake_case function names to shortened CamelCase
Description
Abbreviate snake_case function names to shortened CamelCase
Usage
util_abbreviate(x)
Arguments
x |
a vector of indicator function names |
Value
abbreviations
See Also
Other process_functions:
util_all_is_integer(), util_attach_attr(), util_bQuote(),
util_backtickQuote(), util_coord_flip(), util_extract_matches(),
util_par_pmap(), util_setup_rstudio_job(), util_suppress_output()
Abbreviate a vector of strings
Description
Abbreviate a vector of strings
Usage
util_abbreviate_unique(initial, max_value_label_len)
Arguments
initial |
character vector with stuff to abbreviate |
max_value_label_len |
integer maximum length (may not strictly
be met, if not possible keeping a maybe
detected uniqueness of |
Value
character uniquely abbreviated initial
See Also
Other string_functions:
util_filter_names_by_regexps(), util_pretty_vector_string(),
util_set_dQuoteString(), util_set_sQuoteString(),
util_sub_string_left_from_.(), util_sub_string_right_from_.(),
util_translate()
Utility function for smoothed longitudinal trends from logistic regression models
Description
This function is under development. It computes a logistic regression for
binary variables and visualizes smoothed time trends of the residuals by
LOESS or GAM. The function can also be called for non-binary outcome
variables. These will be transformed to binary variables, either using
user-specified groups in the metadata columns RECODE_CASES
and/or
RECODE_CONTROL
(see util_dichotomize
), or it will attempt to recode the
variables automatically. For nominal variables, it will consider the most
frequent category as 'cases' and every other category as 'control', if there
are more than two categories. Nominal variables with only two distinct values
will be transformed by assigning the less frequent category to 'cases' and
the more frequent category to 'control'. For variables of other statistical
data types, values inside the interquartile range are considered as
'control', values outside this range as 'cases'. Variables with few
different values are transformed in a simplified way to obtain two groups.
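The IQR-based automatic recoding described above (values inside the interquartile range become 'control', values outside become 'cases') can be sketched as follows. This is a conceptual stand-in; the package's dichotomization (see util_dichotomize and the RECODE_CASES/RECODE_CONTROL metadata) may differ in details:

```r
# Dichotomize a continuous variable by the interquartile range
dichotomize_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  as.integer(x < q[1] | x > q[2])   # 1 = case (outside IQR), 0 = control
}
table(dichotomize_iqr(rnorm(1000)))  # roughly half cases, half controls
```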
Usage
util_acc_loess_bin(
resp_vars,
label_col = NULL,
study_data,
item_level = "item_level",
group_vars = NULL,
time_vars,
co_vars = NULL,
min_obs_in_subgroup = 30,
resolution = 80,
plot_format = getOption("dataquieR.acc_loess.plot_format",
dataquieR.acc_loess.plot_format_default),
meta_data = item_level,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot",
dataquieR.max_group_var_levels_in_plot_default),
enable_GAM = getOption("dataquieR.GAM_for_LOESS", dataquieR.GAM_for_LOESS.default),
exclude_constant_subgroups =
getOption("dataquieR.acc_loess.exclude_constant_subgroups",
dataquieR.acc_loess.exclude_constant_subgroups.default),
min_bandwidth = getOption("dataquieR.acc_loess.min_bw",
dataquieR.acc_loess.min_bw.default),
min_proportion = getOption("dataquieR.acc_loess.min_proportion",
dataquieR.acc_loess.min_proportion.default)
)
Arguments
resp_vars |
variable the name of the (binary) measurement variable |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
group_vars |
variable the name of the observer, device or reader variable |
time_vars |
variable the name of the variable giving the time of measurement |
co_vars |
variable list a vector of co-variables, e.g. age and sex for adjustment |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
resolution |
integer the maximum number of time points used for plotting the trend lines |
plot_format |
enum AUTO | COMBINED | FACETS | BOTH. Return the plot
as one combined plot for all groups or as
facet plots (one figure per group). |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
n_group_max |
integer maximum number of categories to be displayed
individually for the grouping variable ( |
enable_GAM |
logical Can LOESS computations be replaced by general additive models to reduce memory consumption for large datasets? |
exclude_constant_subgroups |
logical Should subgroups with constant values be excluded? |
min_bandwidth |
numeric lower limit for the LOESS bandwidth, should be greater than 0 and less than or equal to 1. In general, increasing the bandwidth leads to a smoother trend line. |
min_proportion |
numeric lower limit for the proportion of the smaller group (cases or controls) for creating a LOESS figure, should be greater than 0 and less than 0.4. |
Details
Value
a list with:
- SummaryPlotList: a plot.
Utility function that smoothes and plots adjusted longitudinal measurements
Description
The following R implementation executes calculations for quality indicator "Unexpected location" (see here). Local regression (LOESS) is a versatile statistical method to explore an averaged course of time series measurements (Cleveland, Devlin, and Grosse 1988). In the context of epidemiological data, repeated measurements using the same measurement device or by the same examiner can be considered a time series. LOESS allows exploring changes in these measurements over time.
Usage
util_acc_loess_continuous(
resp_vars,
label_col = NULL,
study_data,
item_level = "item_level",
group_vars = NULL,
time_vars,
co_vars = NULL,
min_obs_in_subgroup = 30,
resolution = 80,
comparison_lines = list(type = c("mean/sd", "quartiles"), color = "grey30", linetype =
2, sd_factor = 0.5),
mark_time_points = getOption("dataquieR.acc_loess.mark_time_points",
dataquieR.acc_loess.mark_time_points_default),
plot_observations = getOption("dataquieR.acc_loess.plot_observations",
dataquieR.acc_loess.plot_observations_default),
plot_format = getOption("dataquieR.acc_loess.plot_format",
dataquieR.acc_loess.plot_format_default),
meta_data = item_level,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot",
dataquieR.max_group_var_levels_in_plot_default),
enable_GAM = getOption("dataquieR.GAM_for_LOESS", dataquieR.GAM_for_LOESS.default),
exclude_constant_subgroups =
getOption("dataquieR.acc_loess.exclude_constant_subgroups",
dataquieR.acc_loess.exclude_constant_subgroups.default),
min_bandwidth = getOption("dataquieR.acc_loess.min_bw",
dataquieR.acc_loess.min_bw.default)
)
Arguments
resp_vars |
variable the name of the continuous (or binary) measurement variable |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
group_vars |
variable the name of the observer, device or reader variable |
time_vars |
variable the name of the variable giving the time of measurement |
co_vars |
variable list a vector of co-variables for adjustment, for example age and sex. Can be NULL (default) for no adjustment. |
min_obs_in_subgroup |
integer (optional argument) If |
resolution |
integer the maximum number of time points used for plotting the trend lines |
comparison_lines |
list type and style of lines with which trend
lines are to be compared. Can be mean +/- 0.5
standard deviation (the factor can be specified
differently in |
mark_time_points |
logical mark time points with observations (caution, there may be many marks) |
plot_observations |
logical show observations as scatter plot in the
background. If there are |
plot_format |
enum AUTO | COMBINED | FACETS | BOTH. Return the plot
as one combined plot for all groups or as
facet plots (one figure per group). |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
n_group_max |
integer maximum number of categories to be displayed
individually for the grouping variable ( |
enable_GAM |
logical Can LOESS computations be replaced by generalized additive models to reduce memory consumption for large datasets? |
exclude_constant_subgroups |
logical Should subgroups with constant values be excluded? |
min_bandwidth |
numeric lower limit for the LOESS bandwidth, should be greater than 0 and less than or equal to 1. In general, increasing the bandwidth leads to a smoother trend line. |
Details
If mark_time_points
or plot_observations
is selected, but would result in
plotting more than 400 points, only a sample of the data will be displayed.
Limitations
The application of LOESS requires model fitting, i.e. the smoothness
of a model is subject to a smoothing parameter (span).
Particularly in the presence of interval-based missing data, high
variability of measurements combined with a low number of
observations in one level of the group_vars
may distort the fit.
Since our approach handles data without knowledge
of such underlying characteristics, finding the best fit is complicated if
computational costs should be minimal. The default of
LOESS in R uses a span of 0.75, which in most cases provides reasonable fits.
The function util_acc_loess_continuous
adapts the span for
each level of the group_vars
(with at least as many observations as specified in min_obs_in_subgroup
and with at least three time points) based on the respective
number of observations.
LOESS consumes a lot of memory for larger datasets.
That is why util_acc_loess_continuous
switches to a generalized additive model with integrated smoothness
estimation (gam
by mgcv
) if there are 1000 observations or more for
at least one level of the group_vars
(similar to geom_smooth
from ggplot2
).
Value
a list with:
-
SummaryPlotList
: list with two plots if plot_format = "BOTH"
, otherwise one of the two figures described below: -
Loess_fits_facets
: The plot contains LOESS-smoothed curves for each level of the group_vars
in a separate panel. Added trend lines represent mean and standard deviation or quartiles (specified in comparison_lines
) for moving windows over the whole data. -
Loess_fits_combined
: This plot combines all curves into one panel. Given a low number of levels in the group_vars
, this plot eases comparisons. However, if the number increases, this plot may be too crowded and unclear.
-
See Also
Estimates variance components
Description
Variance-based models and intraclass correlations (ICC) are approaches to examine the impact of so-called process variables on the measurements. This implementation is model-based.
NB: The term ICC is frequently used to describe the agreement between
different observers, examiners or even devices. In respective settings a good
agreement is pursued. ICC-values can vary between [-1;1]
and an ICC close
to 1 is desired (Koo and Li 2016, Müller and Büttner 1994).
However, in multi-level analysis the ICC is interpreted differently (Snijders and Bosker 1999). In this context, the proportion of variance explained by the respective group levels indicates an influence of (at least one) level of the respective group_vars. An ICC close to 0 is desired.
Usage
util_acc_varcomp(
resp_vars = NULL,
label_col = NULL,
study_data,
item_level = "item_level",
group_vars,
co_vars = NULL,
min_obs_in_subgroup = 30,
min_subgroups = 5,
meta_data = item_level,
meta_data_v2
)
Arguments
resp_vars |
variable list the names of the continuous measurement variables |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
study_data |
data.frame the data frame that contains the measurements |
item_level |
data.frame the data frame that contains metadata attributes of study data |
group_vars |
variable list the names of the resp. observer, device or reader variables |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
min_obs_in_subgroup |
integer from=0. optional argument if a "group_var" is used. This argument specifies the minimum no. of observations that is required to include a subgroup (level) of the "group_var" in the analysis. Subgroups with fewer observations are excluded. The default is 30. |
min_subgroups |
integer from=0. optional argument if a "group_var" is used. This argument specifies the minimum no. of subgroups (levels) included in the "group_var". If the variable defined in "group_var" has fewer subgroups it is not used for analysis. The default is 5. |
meta_data |
data.frame old name for |
meta_data_v2 |
character path to workbook like metadata file, see
|
Value
a list with:
-
SummaryTable
: data frame with ICCs per rvs
-
SummaryData
: data frame with ICCs per rvs
-
ScalarValue_max_icc
: maximum variance contribution value by group_vars -
ScalarValue_argmax_icc
: variable with maximum variance contribution by group_vars
ALGORITHM OF THIS IMPLEMENTATION:
This implementation is yet restricted to data of type float.
Missing codes are removed from resp_vars (if defined in the metadata)
Deviations from limits, as defined in the metadata, are removed
A linear mixed-effects model is estimated for resp_vars using co_vars and group_vars for adjustment.
An output data frame is generated for group_vars indicating the ICC.
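The variance-share interpretation of the ICC described above can be sketched numerically. The following Python sketch (illustrative only; the function name `icc_oneway` and the toy data are mine, and dataquieR actually fits a linear mixed-effects model rather than using this plain method-of-moments decomposition) shows the ICC as the between-group share of total variance:

```python
# Illustrative only: the multi-level ICC as the share of between-group
# variance, sigma2_between / (sigma2_between + sigma2_within).
# dataquieR estimates this from a linear mixed-effects model; this is a
# plain method-of-moments sketch on toy, balanced grouped data.
from statistics import mean, pvariance

def icc_oneway(groups):
    """groups: list of lists, one inner list of measurements per group level."""
    group_means = [mean(g) for g in groups]
    grand = mean(x for g in groups for x in g)
    var_between = pvariance(group_means, mu=grand)
    var_within = mean(pvariance(g, mu=mean(g)) for g in groups)
    return var_between / (var_between + var_within)

# identical group means -> no between-group variance -> ICC = 0 (desired here)
print(icc_oneway([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]))  # 0.0
```

An ICC near 1 would instead indicate that group membership (e.g., observer or device) explains most of the variability.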
See Also
Adjust the data types of study data, if needed
Description
Adjust the data types of study data, if needed
Usage
util_adjust_data_type(study_data, meta_data, relevant_vars_for_warnings)
Arguments
study_data |
data.frame the study data |
meta_data |
meta_data |
relevant_vars_for_warnings |
Value
data.frame modified study data
Place all geom_texts also in plotly
right from the x position
Description
Place all geom_texts also in plotly
right from the x position
Usage
util_adjust_geom_text_for_plotly(plotly)
Arguments
plotly |
the |
Value
modified plotly
-built object
Create a caption from an alias name of a dq_report2
result
Description
Create a caption from an alias name of a dq_report2
result
Usage
util_alias2caption(alias, long = FALSE)
Arguments
alias |
alias name |
long |
return result based on |
Value
caption
See Also
Other reporting_functions:
util_copy_all_deps()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
All indicator functions of dataquieR
Description
All indicator functions of dataquieR
Usage
util_all_ind_functions()
Value
character names of all indicator functions
Get all PART_VARS
for a response variable (from item-level metadata)
Description
Get all PART_VARS
for a response variable (from item-level metadata)
Usage
util_all_intro_vars_for_rv(
rv,
study_data,
meta_data,
label_col = LABEL,
expected_observations = c("HIERARCHY", "ALL", "SEGMENT")
)
Arguments
rv |
character the response variable's name |
study_data |
|
meta_data |
|
label_col |
character the metadata attribute to map |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. How should
|
Value
character all PART_VARS
for rv
from item level metadata.
For expected_observations = HIERARCHY
, the more general PART_VARS
(i.e., up, in the hierarchy) are more left in the vector, e.g.:
PART_STUDY, PART_PHYSICAL_EXAMINATIONS, PART_BLOODPRESSURE
See Also
Other missing_functions:
util_count_expected_observations()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_observation_expected()
,
util_remove_empty_rows()
,
util_replace_codes_by_NA()
convenience function to abbreviate all(util_is_integer(...))
Description
convenience function to abbreviate all(util_is_integer(...))
Usage
util_all_is_integer(x)
Arguments
x |
the object to test |
Value
TRUE
, if all entries are integer-like, FALSE
otherwise
See Also
Other process_functions:
util_abbreviate()
,
util_attach_attr()
,
util_bQuote()
,
util_backtickQuote()
,
util_coord_flip()
,
util_extract_matches()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
Test, if package anytime
is installed
Description
Test, if package anytime
is installed
Usage
util_anytime_installed()
Value
TRUE
if anytime
is installed.
See Also
https://forum.posit.co/t/how-can-i-make-testthat-think-i-dont-have-a-package-installed/33441/2
utility function for the applicability of contradiction checks
Description
Test for applicability of contradiction checks
Usage
util_app_cd(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
See Also
utility function for the applicability of contradiction checks
Description
Test for applicability of contradiction checks
Usage
util_app_con_contradictions_redcap(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
See Also
utility function for the applicability of distribution plots
Description
Test for applicability of distribution plots
Usage
util_app_dc(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function to test for applicability of detection limits checks
Description
Test for applicability of detection limits checks
Usage
util_app_dl(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of end digits preferences checks
Description
Test for applicability of end digits preferences checks
Usage
util_app_ed(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function to test for applicability of hard limits checks
Description
Test for applicability of hard limits checks
Usage
util_app_hl(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of categorical admissibility
Description
Test for applicability of categorical admissibility
Usage
util_app_iac(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of numeric admissibility
Description
Test for applicability of numeric admissibility
Usage
util_app_iav(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of item missingness
Description
Test for applicability of item missingness
Usage
util_app_im(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
See Also
utility function for applicability of LOESS smoothed time course plots
Description
Test for applicability of LOESS smoothed time course plots
Usage
util_app_loess(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function to test for applicability of marginal means plots
Description
Test for applicability of marginal means plots
Usage
util_app_mar(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1 = matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of multivariate outlier detection
Description
Test for applicability of multivariate outlier detection
Usage
util_app_mol(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of outlier detection
Description
Test for applicability of univariate outlier detection
Usage
util_app_ol(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function to test for applicability of soft limits checks
Description
Test for applicability of soft limits checks
Usage
util_app_sl(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of segment missingness
Description
Test for applicability of segment missingness
Usage
util_app_sm(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
See Also
utility function for the applicability of distribution function's shape or scale check
Description
Test for applicability of checks for deviation from expected probability distribution shapes/scales
Usage
util_app_sos(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
utility function for the applicability of variance components
Description
Test for applicability of ICC
Usage
util_app_vc(x, dta)
Arguments
x |
data.frame metadata |
dta |
logical vector, 1=matching data type, 0 = non-matching data type |
Value
factor 0-3 for each variable in metadata
0 data type mismatch and not applicable
1 data type mismatches but applicable
2 data type matches but not applicable
3 data type matches and applicable
4 not applicable because of not suitable data type
See Also
Convert a category to an ordered factor (1:5
)
Description
Convert a category to an ordered factor (1:5
)
Usage
util_as_cat(category)
Arguments
category |
vector with categories |
Value
an ordered factor
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Convert a category to a number (1:5
)
Description
Convert a category to a number (1:5
)
Usage
util_as_integer_cat(category)
Arguments
category |
vector with categories |
Value
an integer
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
Convert factors to label-corresponding numeric values
Description
Converts a factor to the numeric values corresponding to its labels, ensuring that the underlying numeric values are not scrambled.
Usage
util_as_numeric(v, warn)
Arguments
v |
the vector |
warn |
if not missing: character with error message stating conversion error |
Value
the converted vector
Return the pre-computed plotly
from a dataquieR
result
Description
Return the pre-computed plotly
from a dataquieR
result
Usage
util_as_plotly_from_res(res, ...)
Arguments
res |
the |
... |
not used |
Value
a plotly
object
Convert x
to valid missing codes
Description
Convert x
to valid missing codes
Usage
util_as_valid_missing_codes(x)
Arguments
x |
character a vector of values |
Value
converted x
See Also
Other robustness_functions:
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
utility function to assign labels to levels
Description
function to assign labels to levels of a variable
Usage
util_assign_levlabs(
variable,
string_of_levlabs,
splitchar,
assignchar,
ordered = TRUE,
variable_name = "",
warn_if_inadmissible = TRUE
)
Arguments
variable |
vector vector with values of a study variable |
string_of_levlabs |
character len=1. value labels,
e.g. |
splitchar |
character len=1. splitting character(s) in
|
assignchar |
character len=1. assignment operator character(s) in
|
ordered |
the function converts |
variable_name |
character the name of the variable being converted for warning messages |
warn_if_inadmissible |
logical warn on con_inadmissible_categorical values |
Details
DEPRECATED from v2.5.0
Value
a factor with labels assigned to categorical variables (if available)
See Also
Other data_management:
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Attach attributes to an object and return it
Description
Attach attributes to an object and return it
Usage
util_attach_attr(x, ...)
Arguments
x |
the object |
... |
named arguments, each becomes an attribute |
Value
x
, having the desired attributes attached
See Also
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_bQuote()
,
util_backtickQuote()
,
util_coord_flip()
,
util_extract_matches()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
Put in back-ticks
Description
also escape potential back-ticks in x
Usage
util_bQuote(x)
Arguments
x |
a string |
Value
x in back-ticks
See Also
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_attach_attr()
,
util_backtickQuote()
,
util_coord_flip()
,
util_extract_matches()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
utility function to set string in backticks
Description
Quote a set of variable names with backticks
Usage
util_backtickQuote(x)
Arguments
x |
variable names |
Value
quoted variable names
See Also
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_attach_attr()
,
util_bQuote()
,
util_coord_flip()
,
util_extract_matches()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
Utility function to create bar plots
Description
A helper function for simple bar plots. The layout is intended for data with positive numbers only (e.g., counts/frequencies).
Usage
util_bar_plot(
plot_data,
cat_var,
num_var,
relative = FALSE,
show_numbers = TRUE,
fill_var = NULL,
colors = "#2166AC",
show_color_legend = FALSE,
flip = FALSE
)
Arguments
plot_data |
the data for the plot. It should consist of one column
specifying the categories, and a second column giving the
respective numbers / counts per category. It may contain
another column to specify the coloring of the bars
( |
cat_var |
column name of the categorical variable in |
num_var |
column name of the numerical variable in |
relative |
if |
show_numbers |
if |
fill_var |
column name of the variable in |
colors |
vector of colors, or a single color |
show_color_legend |
if |
flip |
if |
Value
a bar plot
Data frame leaves haven
Description
if df
is/contains a haven
labelled
or tibble
object, convert it to
a base R data frame
Usage
util_cast_off(df, symb, .dont_cast_off_cols = FALSE)
Arguments
df |
data.frame may have or contain non-standard classes |
symb |
character name of the data frame for error messages |
.dont_cast_off_cols |
logical internal use, only. |
Value
data.frame with all known non-standard classes removed
Verify the data type of a value
Description
Function to verify the data type of a value.
Usage
util_check_data_type(
x,
type,
check_convertible = FALSE,
threshold_value = 0,
return_percentages = FALSE,
check_conversion_stable = FALSE,
robust_na = FALSE
)
Arguments
x |
the value |
type |
expected data type |
check_convertible |
logical also try, if a conversion to the declared data type would work. |
threshold_value |
numeric from=0 to=100. percentage of failing conversions allowed. |
return_percentages |
logical return the percentage of mismatches. |
check_conversion_stable |
logical do not distinguish values that are convertible without issues from those that are convertible but with issues |
robust_na |
logical treat white-space-only-values as |
Value
if return_percentages
: if not check_convertible
, the percentage
of mismatches instead of logical value,
if check_convertible
, return a named
vector with the percentages of all cases
(names of the vector are
match
, convertible_mismatch_stable
,
convertible_mismatch_unstable
,
nonconvertible_mismatch
)
if not return_percentages
: if check_convertible
is FALSE
,
logical whether x
is of the expected type
if check_convertible
is TRUE
integer with the states 0, 1, 2, 3
: 0 = Mismatch, not convertible
1 = Match
2 = Mismatch, but convertible
3 = Mismatch, convertible,
but with issues (e.g.,
loss of decimal places)
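The four state codes above can be sketched as a small classifier. This Python sketch (illustrative only; the function name `type_state` is mine and this is not the package's R implementation) mimics the logic for a declared integer data type:

```python
# Illustrative sketch of the 0-3 state codes above for a declared "integer"
# data type: match, convertible mismatch (with or without issues), or
# non-convertible mismatch. Not the package's R implementation.
def type_state(value):
    if isinstance(value, int) and not isinstance(value, bool):
        return 1                       # match
    try:
        converted = int(float(value))  # attempt conversion
    except (TypeError, ValueError):
        return 0                       # mismatch, not convertible
    if float(value) == converted:
        return 2                       # mismatch, but convertible
    return 3                           # convertible, but loses decimal places

print([type_state(v) for v in [5, "7", "7.25", "abc"]])  # [1, 2, 3, 0]
```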
See Also
Other data_management:
util_assign_levlabs()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Check data for observer levels
Description
Check data for observer levels
Usage
util_check_group_levels(
study_data,
group_vars,
min_obs_in_subgroup = -Inf,
max_obs_in_subgroup = +Inf,
min_subgroups = -Inf,
max_subgroups = +Inf
)
Arguments
study_data |
data.frame the data frame that contains the measurements |
group_vars |
variable the name of the observer, device or reader variable |
min_obs_in_subgroup |
integer from=0. optional argument if
|
max_obs_in_subgroup |
integer from=0. optional argument if
|
min_subgroups |
integer from=0. optional argument if a "group_var" is used. This argument specifies the minimum no. of subgroups (levels) included in the "group_var". If the variable defined in "group_var" has fewer subgroups it is split for analysis. |
max_subgroups |
integer from=0. optional argument if a "group_var" is used. This argument specifies the maximum no. of subgroups (levels) included in the "group_var". If the variable defined in "group_var" has more subgroups it is split for analysis. |
Value
modified study data frame
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Examples
## Not run:
study_data <- prep_get_data_frame("study_data")
meta_data <- prep_get_data_frame("meta_data")
prep_prepare_dataframes(.label_col = LABEL)
util_check_group_levels(ds1, "CENTER_0")
dim(util_check_group_levels(ds1, "USR_BP_0", min_obs_in_subgroup = 400))
## End(Not run)
Check for one value only
Description
utility function to identify variables with one value only.
Usage
util_check_one_unique_value(x)
Arguments
x |
vector with values |
Value
logical(1): TRUE, if (ignoring NA) exactly one value
is observed in x
,
FALSE otherwise
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Get Function called for a Call Name
Description
get aliases from report attributes and then replace them by the actual function name
Usage
util_cll_nm2fkt_nm(cll_names, report)
Arguments
cll_names |
character the systematic function call name for which to fetch the function name |
report |
dataquieR_resultset2 the report |
Value
character the function name
Return hex code colors from color names or STATAReporter
syntax
Description
Return hex code colors from color names or STATAReporter
syntax
Usage
util_col2rgb(colors)
Arguments
colors |
the colors, e.g., "255 0 0" or "red" or "#ff0000" |
Value
character vector with colors using HTML
hexadecimal encoding, e.g.,
"#ff0000" for "red"
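The conversion from the space-separated triplet syntax to an HTML hex string can be sketched as follows (Python, illustrative only; the function name `triplet_to_hex` is mine, and resolving R color names like "red", which util_col2rgb also handles, needs a lookup table and is omitted):

```python
# Illustrative sketch: turn an "R G B" triplet string into an HTML hex
# color, as in the example "255 0 0" -> "#ff0000". Named colors omitted.
def triplet_to_hex(color):
    if color.startswith("#"):
        return color.lower()           # already hex, just normalize case
    r, g, b = (int(part) for part in color.split())
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

print(triplet_to_hex("255 0 0"))  # #ff0000
```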
Get description for a call
Description
Get description for a call
Usage
util_col_description(cn)
Arguments
cn |
the call name |
Value
the description
Collect all errors, warnings, or messages so that they are combined for a combined result
Description
Collect all errors, warnings, or messages so that they are combined for a combined result
Usage
util_collapse_msgs(class, all_of_f)
Create a data frame containing all the results from summaries of reports
Description
Create a data frame containing all the results from summaries of reports
Usage
util_combine_list_report_summaries(
to_combine,
type = c("unique_vars", "repeated_vars")
)
Arguments
to_combine |
vector a list containing the summaries of reports
obtained with |
type |
character if |
Value
a summary of summaries of dataquieR
reports
Combine results for Single Variables
Description
Combines results, e.g., into a data frame with one row per variable or a similar heat-map,
see print.ReportSummaryTable()
.
Usage
util_combine_res(all_of_f)
Arguments
all_of_f |
all results of a function |
Value
row-bound combined results
Combine two value lists
Description
Combine two value lists
Usage
util_combine_value_label_tables(vlt1, vlt2)
Arguments
vlt1 |
|
vlt2 |
Value
Examples
## Not run:
util_combine_value_label_tables(
tibble::tribble(~ CODE_VALUE, ~ CODE_LABEL, 17L, "Test", 19L, "Test", 17L, "TestX"),
tibble::tribble(~ CODE_VALUE, ~ CODE_LABEL, 17L, "Test", 19L, "Test", 17L, "TestX"))
## End(Not run)
Compares study data data types with the ones expected according to the metadata
Description
Utility function to compare data type of study data with those defined in metadata
Usage
util_compare_meta_with_study(
sdf,
mdf,
label_col,
check_convertible = FALSE,
threshold_value = 0,
return_percentages = FALSE,
check_conversion_stable = FALSE
)
Arguments
sdf |
the data.frame of study data |
mdf |
the data.frame of associated static metadata |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
check_convertible |
logical also try, if a conversion to the declared data type would work. |
threshold_value |
numeric from=0 to=100. percentage failing
conversions allowed if |
return_percentages |
logical return the percentage of mismatches. |
check_conversion_stable |
logical do not distinguish "convertible" from "convertible, but with issues" |
Value
for return_percentages == FALSE: if check_convertible is FALSE,
a binary vector (0, 1) stating whether the data type applies;
if check_convertible is TRUE, a vector with the states 0, 1, 2, 3:
0 = Mismatch, not convertible; 1 = Match; 2 = Mismatch, but convertible;
3 = Mismatch, convertible, but with issues (e.g., loss of decimal places).
For return_percentages == TRUE: a data frame with percentages of
non-matching data types; each column is a variable, the rows follow the
vectors returned by util_check_data_type.
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Remove specific classes from a ggplot plot_env
environment
Description
Useful to remove large objects before writing to disk with qs
or rds
.
It also deletes the parent environment of the plot environment and
removes unneeded variables.
Usage
util_compress_ggplots_in_res(r)
Arguments
r |
the object |
Compute SE.Skewness
Description
Compute SE.Skewness
Usage
util_compute_SE_skewness(x, skewness = util_compute_skewness(x))
Arguments
x |
data |
skewness |
if already known |
Value
the standard error of skewness
Compute Kurtosis
Description
Compute Kurtosis
Usage
util_compute_kurtosis(x)
Arguments
x |
data |
Value
the Kurtosis
Compute the Skewness
Description
Compute the Skewness
Usage
util_compute_skewness(x)
Arguments
x |
data |
Value
the Skewness
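The three moment functions above are internal, so their exact estimators are not shown here. A common moment-based sketch (hypothetical helper names; the package's `util_compute_skewness()` and `util_compute_SE_skewness()` may use different estimators) looks like this:

```r
# Hypothetical sketch of moment-based sample skewness and the usual
# closed-form standard error of skewness; NOT the package's internals.
sketch_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)  # second central moment
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / m2^(3 / 2)
}

sketch_se_skewness <- function(x) {
  n <- sum(!is.na(x))
  sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
}

sketch_skewness(c(1, 2, 3, 4, 100))  # strongly right-skewed, > 0
```

Symmetric data yields a skewness of (numerically) zero, since the third central moment vanishes.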
Produce a condition function
Description
Produce a condition function
Usage
util_condition_constructor_factory(
.condition_type = c("error", "warning", "message")
)
Arguments
.condition_type |
character the type of the conditions being created and signaled by the function, "error", "warning", or "message" |
See Also
Other condition_functions:
util_deparse1()
,
util_error()
,
util_find_external_functions_in_stacktrace()
,
util_find_first_externally_called_functions_in_stacktrace()
,
util_find_indicator_function_in_callers()
,
util_message()
,
util_suppress_warnings()
,
util_warning()
Extract condition from try error
Description
Extract condition from try error
Usage
util_condition_from_try_error(x)
Arguments
x |
the try-error object |
Value
condition of the try-error
Can a vector be converted to a defined DATA_TYPE
Description
The function also checks whether the conversion is perfect, whether
something is lost (e.g., decimal places), or whether something is strange
(like arbitrary suffixes in a date; just note that
as.POSIXct("2020-01-01 12:00:00 CET asdf")
does not fail in R, but
util_conversion_stable("2020-01-01 12:00:00 CET asdf", DATA_TYPES$DATETIME)
will).
Usage
util_conversion_stable(vector, data_type, return_percentages = FALSE)
Arguments
vector |
vector input vector, |
data_type |
enum The type, to what the conversion should be tried. |
return_percentages |
logical return the percentage of stable conversions or matches. |
Details
HINT:
util_conversion_stable(.Machine$integer.max + 1, DATA_TYPES$INTEGER)
seems
to work correctly, although is.integer(.Machine$integer.max + 1)
returns FALSE
.
Value
numeric ratio of convertible entries in vector
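The lossy conversions this function guards against can be illustrated in plain base R (a sketch, not the package's implementation):

```r
# Converting decimal strings through as.integer() silently drops the
# fractional part -- "convertible, but with issues":
x <- c("1.5", "2", "3.25")
as.integer(as.numeric(x))

# Base R accepts trailing garbage in date-time strings, as noted above:
as.POSIXct("2020-01-01 12:00:00 CET asdf", tz = "UTC")  # does not fail

# A stability check can compare a round trip with the original input:
round_trip <- as.character(as.integer(as.numeric(x)))
round_trip == x  # FALSE where information was lost
```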
return a flip term for ggplot2
plots, if desired.
Description
return a flip term for ggplot2
plots, if desired.
Usage
util_coord_flip(w, h, p, ref_env, ...)
Arguments
w |
width of the plot to determine its aspect ratio |
h |
height of the plot to determine its aspect ratio |
p |
the |
ref_env |
environment of the actual entry function, so that the correct formals can be detected. |
... |
additional arguments for |
Value
coord_flip
or coord_cartesian
See Also
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_attach_attr()
,
util_bQuote()
,
util_backtickQuote()
,
util_extract_matches()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
Copy default dependencies to the report's lib directory
Description
Copy default dependencies to the report's lib directory
Usage
util_copy_all_deps(dir, pages, ...)
Arguments
dir |
report directory |
pages |
all pages to write |
... |
additional |
Value
invisible(NULL)
See Also
Other reporting_functions:
util_alias2caption()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
Check referred variables
Description
This function operates in the environment of its caller
(using eval.parent, similar to function-like C preprocessor macros).
Different from the other utility functions that work
in the caller's environment (prep_prepare_dataframes), it has no side
effects except that the argument
of the calling function specified in arg_name
is normalized (set to its
default or a general default if missing; variable names consisting only
of white space are replaced by NAs).
It expects two objects in the caller's environment: ds1
and meta_data
.
meta_data
is the metadata data frame and ds1
is produced by a preceding
call of prep_prepare_dataframes using meta_data
and study_data
.
So this function can only be used after calling the function
prep_prepare_dataframes
Usage
util_correct_variable_use(
arg_name,
allow_na,
allow_more_than_one,
allow_null,
allow_all_obs_na,
allow_any_obs_na,
min_distinct_values,
need_type,
need_scale,
role = "",
overwrite = TRUE,
do_not_stop = FALSE,
remove_not_found = TRUE
)
util_correct_variable_use2(
arg_name,
allow_na,
allow_more_than_one,
allow_null,
allow_all_obs_na,
allow_any_obs_na,
min_distinct_values,
need_type,
need_scale,
role = arg_name,
overwrite = TRUE,
do_not_stop = FALSE,
remove_not_found = TRUE
)
Arguments
arg_name |
character Name of a function argument of the caller of util_correct_variable_use |
allow_na |
logical default = FALSE. allow NAs in the variable names
argument given in |
allow_more_than_one |
logical default = FALSE. allow more than one
variable names in |
allow_null |
logical default = FALSE. allow an empty variable name
vector in the argument |
allow_all_obs_na |
logical default = TRUE. check observations for not
being all |
allow_any_obs_na |
logical default = TRUE. check observations for
being complete without any |
min_distinct_values |
integer Minimum number of distinct observed values of a study variable |
need_type |
character if not |
need_scale |
character if not |
role |
character variable-argument role. Set different defaults for
all |
overwrite |
logical overwrite vector of variable names
to match the labels given in |
do_not_stop |
logical do not throw an error, if one of the variables
violates |
remove_not_found |
TODO: Not yet implemented |
Details
util_correct_variable_use and util_correct_variable_use2 differ only in
the default of the argument role
.
util_correct_variable_use and util_correct_variable_use2 put strong
effort into producing comprehensible
error messages for the caller's caller (who is typically an end user of
a dataquieR
function).
The function ensures that a specified argument of its caller that refers to variable names (one or more, as a character vector) matches some expectations.
This function accesses the caller's environment!
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Count Expected Observations
Description
Count participants, if an observation was expected, given the
PART_VARS
from item-level metadata
Usage
util_count_expected_observations(
resp_vars,
study_data,
meta_data,
label_col = LABEL,
expected_observations = c("HIERARCHY", "ALL", "SEGMENT")
)
Arguments
resp_vars |
character the response variables, for that a value may be expected |
study_data |
|
meta_data |
|
label_col |
character mapping attribute |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. How should
|
Value
a vector with the number of expected observations for each
resp_vars
.
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_observation_expected()
,
util_remove_empty_rows()
,
util_replace_codes_by_NA()
Create an HTML file for the dq_report2
Description
Create an HTML file for the dq_report2
Usage
util_create_page_file(
page_nr,
pages,
rendered_pages,
dir,
template_file,
report,
logo,
loading,
packageName,
deps,
progress_msg,
progress,
title,
by_report
)
Arguments
page_nr |
the number of the page being created |
pages |
list with all page-contents named by their desired file names |
rendered_pages |
list with all rendered ( |
dir |
target directory |
template_file |
the report template file to use |
report |
the output of dq_report2 |
logo |
logo |
loading |
loading animation div |
packageName |
the name of the current package |
deps |
dependencies, as pre-processed by
|
progress_msg |
closure to call with progress information |
progress |
closure to call with progress information |
title |
character the web browser's window name |
by_report |
logical this report html is part of a set of reports, add a back-link |
Value
invisible(file_name)
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
Create an overview of the reports created with dq_report_by
Description
Create an overview of the reports created with dq_report_by
Usage
util_create_report_by_overview(
output_dir,
strata_column,
segment_column,
strata_column_label,
subgroup,
mod_label
)
Arguments
output_dir |
character the directory in which all reports are searched and the overview is saved |
strata_column |
character name of a study variable to stratify the report by. It can be null |
segment_column |
character name of a metadata attribute usable to split the report in sections of variables. It can be null |
strata_column_label |
character the label of the variable used as strata_column |
subgroup |
character optional, to define subgroups of cases |
mod_label |
list |
Value
an overview of all dataquieR
reports created with dq_report_by
Create a dashboard-table from a report summary
Description
Create a dashboard-table from a report summary
Usage
util_dashboard_table(repsum)
Arguments
repsum |
a report summary from |
See Also
Other html:
util_extract_all_ids()
,
util_generate_pages_from_report()
,
util_get_hovertext()
Data type conversion
Description
Utility function to convert a study variable to match the data type given in the metadata, if possible.
Usage
util_data_type_conversion(x, type)
Arguments
x |
the value |
type |
expected data type |
Value
the transformed values (if possible)
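The conversion can be pictured as a dispatch on the declared data type (a hypothetical helper; the internal `util_data_type_conversion()` may handle more cases, e.g., value labels and specific date formats):

```r
# Sketch of type-driven conversion; "INTEGER"/"FLOAT"/"STRING"/"DATETIME"
# mirror typical DATA_TYPES entries, but this is NOT the package code.
convert_to_type <- function(x, type) {
  switch(toupper(type),
    INTEGER  = suppressWarnings(as.integer(x)),
    FLOAT    = suppressWarnings(as.numeric(x)),
    STRING   = as.character(x),
    DATETIME = suppressWarnings(as.POSIXct(x, tz = "UTC")),
    stop("unknown data type: ", type)
  )
}

convert_to_type(c("1", "2", "x"), "integer")  # NA where conversion fails
```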
Expression De-Parsing
Description
Turn unevaluated expressions into character strings.
Arguments
expr |
any R expression. |
collapse |
a string, passed to |
width.cutoff |
integer in [20, 500] determining the cutoff (in bytes) at which line-breaking is tried. |
... |
further arguments passed to |
Details
This is a simple utility function for R < 4.0.0 to ensure a string
result (character vector of length one),
typically used in name construction, as util_deparse1(substitute(.))
.
This avoids a dependency on backports
and on R >= 4.0.0.
Value
the deparsed expression
See Also
Other condition_functions:
util_condition_constructor_factory()
,
util_error()
,
util_find_external_functions_in_stacktrace()
,
util_find_first_externally_called_functions_in_stacktrace()
,
util_find_indicator_function_in_callers()
,
util_message()
,
util_suppress_warnings()
,
util_warning()
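On R >= 4.0.0 this backport is equivalent to `base::deparse1()`; the pre-4.0.0 sketch is a one-liner:

```r
# Minimal backport sketch in the spirit of util_deparse1(): collapse
# deparse()'s possibly multi-line output into a single string.
deparse1_sketch <- function(expr, collapse = " ", width.cutoff = 500L, ...) {
  paste(deparse(expr, width.cutoff, ...), collapse = collapse)
}

deparse1_sketch(quote(x + y))  # "x + y"
```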
Detect cores
Description
See parallel::detectCores
for further details.
Usage
util_detect_cores()
Value
number of available CPU cores.
See Also
Other system_functions:
util_user_hint()
,
util_view_file()
Escape characters for HTML in a data frame
Description
Escape characters for HTML in a data frame
Usage
util_df_escape(x)
Arguments
x |
data.frame to be escaped |
Value
data.frame with html escaped content
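Escaping all character columns of a data frame can be sketched as follows (hypothetical helper; `util_df_escape()` may delegate to a dedicated HTML escaper such as `htmltools::htmlEscape`):

```r
# Escape the three HTML metacharacters; '&' must be replaced first so
# that the entities introduced for '<' and '>' are not double-escaped.
escape_html <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;",  s, fixed = TRUE)
  gsub(">", "&gt;", s, fixed = TRUE)
}

df <- data.frame(a = c("1 < 2", "A & B"), stringsAsFactors = FALSE)
df[] <- lapply(df, function(col)
  if (is.character(col)) escape_html(col) else col)
df$a  # "1 &lt; 2" "A &amp; B"
```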
Utility function to dichotomize variables
Description
This function uses the metadata attributes RECODE_CASES
and/or
RECODE_CONTROL
to dichotomize the data. 'Cases' will be recoded to 1,
'controls' to 0. The recoding can be specified by an interval (for metric
variables) or by a list of categories separated by the 'SPLIT_CHAR'. Recoding
will be used for data quality checks that include a regression model.
Usage
util_dichotomize(study_data, meta_data, label_col = VAR_NAMES)
Arguments
study_data |
study data without jump/missing codes as specified in the code conventions |
meta_data |
metadata as specified in the code conventions |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
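The category-based branch of the recoding rule described above can be sketched like this (hypothetical helper; the real function reads `RECODE_CASES`/`RECODE_CONTROL` from the item-level metadata and also supports intervals for metric variables):

```r
# Recode observations matching the 'cases' set to 1 and the 'controls'
# set to 0; everything else becomes NA.
dichotomize_by_categories <- function(x, cases, controls) {
  out <- rep(NA_integer_, length(x))
  out[x %in% cases]    <- 1L
  out[x %in% controls] <- 0L
  out
}

dichotomize_by_categories(c("a", "b", "c"),
                          cases = "a",
                          controls = c("b", "c"))  # 1 0 0
```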
Utility function to characterize study variables
Description
This function summarizes some properties of measurement variables.
Usage
util_dist_selection(study_data, val_lab = lifecycle::deprecated())
Arguments
study_data |
study data, pre-processed with |
val_lab |
deprecated |
Value
data frame with one row for each variable in the study data and the
following columns:
Variables
contains the names of the variables
IsInteger
contains a check whether the variable contains integer values
only (variables coded as factor will be converted to integers)
IsMultCat
contains a check for variables with integer or string values
whether there are more than two categories
NCategory
contains the number of distinct values for variables with
values coded as integers or strings (excluding NA
and
empty entries)
AnyNegative
contains a check whether the variable contains any negative
values
NDistinct
contains the number of distinct values
PropZeroes
reports the proportion of zeroes
See Also
Other metadata_management:
util_find_free_missing_code()
,
util_find_var_by_meta()
,
util_get_var_att_names_of_level()
,
util_get_vars_in_segment()
,
util_looks_like_missing()
,
util_no_value_labels()
,
util_validate_known_meta()
,
util_validate_missing_lists()
Create an environment with several alias names for the study data variables
Description
generates an environment similar to as.environment(ds1)
, but makes
variables available by their VAR_NAME
, LABEL
, and label_col
- names.
Usage
util_ds1_eval_env(study_data, meta_data = "item_level", label_col = LABEL)
Arguments
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata
with labels of variables. If
|
Test, if values of x are empty, i.e. NA or whitespace characters
Description
Test, if values of x are empty, i.e. NA or whitespace characters
Usage
util_empty(x)
Arguments
x |
the vector to test |
Value
a logical vector of the same length as x; TRUE, if the respective element of x is "empty"
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
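The documented behavior (NA or whitespace-only counts as "empty") amounts to a one-liner in base R (a sketch; the internal `util_empty()` may handle more edge cases):

```r
# TRUE for NA and for strings that are empty after trimming whitespace.
empty_sketch <- function(x) is.na(x) | trimws(as.character(x)) == ""

empty_sketch(c("a", " ", "", NA))  # FALSE TRUE TRUE TRUE
```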
convert a value to character
Description
convert a value to character
Usage
util_ensure_character(x, error = FALSE, error_msg, ...)
Arguments
x |
the value |
error |
logical if |
error_msg |
error message to be displayed, if conversion was not possible |
... |
additional arguments passed to util_error or util_warning
respectively in case of an error, and if an |
Value
as.character(x)
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
similar to match.arg
Description
will only warn and return a cleaned x
.
Usage
util_ensure_in(x, set, err_msg, error = FALSE, applicability_problem = NA)
Arguments
x |
character vector of needles |
set |
character vector representing the haystack |
err_msg |
character optional error message. Use %s twice, once for the missing elements and once for proposals |
error |
logical if |
applicability_problem |
logical error indicates unsuitable resp_vars |
Value
character invisible(intersect(x, set))
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
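The lenient matching described above (warn and clean, instead of `match.arg()`'s hard error) can be sketched as follows (hypothetical helper; the real `util_ensure_in()` additionally supports custom messages and did-you-mean proposals):

```r
# Drop entries of x that are not in 'set', warning about what was removed,
# and invisibly return the cleaned intersection.
ensure_in_sketch <- function(x, set) {
  bad <- setdiff(x, set)
  if (length(bad) > 0)
    warning("removed unknown entries: ", paste(bad, collapse = ", "))
  invisible(intersect(x, set))
}

res <- ensure_in_sketch(c("SEX", "AGE", "TYPO"), c("SEX", "AGE", "BMI"))
# warns about "TYPO"; res is c("SEX", "AGE")
```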
Utility function ensuring valid labels and variable names
Description
Valid labels must not be empty, must be unique, and must not exceed a certain length.
Usage
util_ensure_label(meta_data, label_col, max_label_len = MAX_LABEL_LEN)
Arguments
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
max_label_len |
integer maximum length for the labels, defaults to 30. |
Value
a list containing the study data, possibly with adapted column names, the metadata, possibly with adapted labels, and a string and a table informing about the changes
Support function to stop, if an optional package is not installed
Description
This function stops, if a package is not installed but needed for using an
optional feature of dataquieR
.
Usage
util_ensure_suggested(
pkg,
goal = ifelse(is.null(rlang::caller_call()), "work", paste("call",
sQuote(rlang::call_name(rlang::caller_call())))),
err = TRUE,
and_import = c()
)
Arguments
pkg |
needed package |
goal |
feature description for error message. |
err |
logical Should the function throw an error (default) or a warning? |
and_import |
import the listed function to the caller's environment |
Value
TRUE
if all packages in pkg
are available, FALSE
if at least
one of the packages is missing.
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Examples
## Not run: # internal use, only
f <- function() {
util_ensure_suggested <- get("util_ensure_suggested",
asNamespace("dataquieR"))
util_ensure_suggested("ggplot2", "Test",
and_import = "(ggplot|geom_.*|aes)")
print(ggplot(cars, aes(x = speed)) + geom_histogram())
}
f()
## End(Not run)
Produce an error message with a useful short stack trace. Then it stops the execution.
Description
Produce an error message with a useful short stack trace. Then it stops the execution.
Usage
util_error(
m,
...,
applicability_problem = NA,
intrinsic_applicability_problem = NA,
integrity_indicator = "none",
level = 0,
immediate,
title = "",
additional_classes = c()
)
Arguments
m |
error message or a condition |
... |
arguments for sprintf on m, if m is a character |
applicability_problem |
logical |
intrinsic_applicability_problem |
logical |
integrity_indicator |
character if the message concerns an integrity problem, this gives the indicator abbreviation. |
level |
integer level of the error message (defaults to 0). Higher levels are more severe. |
immediate |
logical not used. |
additional_classes |
character additional classes the thrown condition object should inherit from, first. |
Value
nothing, its purpose is to stop.
See Also
Other condition_functions:
util_condition_constructor_factory()
,
util_deparse1()
,
util_find_external_functions_in_stacktrace()
,
util_find_first_externally_called_functions_in_stacktrace()
,
util_find_indicator_function_in_callers()
,
util_message()
,
util_suppress_warnings()
,
util_warning()
Evaluate a parsed redcap rule for given study data
Description
also allows to use VAR_NAMES
in the rules,
if other labels have been selected
Usage
util_eval_rule(
rule,
ds1,
meta_data = "item_level",
use_value_labels,
replace_missing_by = "NA",
replace_limits = TRUE
)
Arguments
rule |
the redcap rule (parsed, already) |
ds1 |
the study data as prepared by |
meta_data |
the metadata |
use_value_labels |
map columns with |
replace_missing_by |
enum LABEL | INTERPRET | NA . Missing codes should
be replaced by the missing labels, the
|
replace_limits |
logical replace hard limit violations by |
Value
the result of the parsed rule
See Also
Other redcap:
util_get_redcap_rule_env()
Evaluate an expression and create a dataquieR_result
object from
its evaluated value
Description
If an error occurs, the function returns a corresponding object representing that error. All conditions are recorded and replayed whenever the result is printed by print.dataquieR_result.
Usage
util_eval_to_dataquieR_result(
expression,
env = parent.frame(),
filter_result_slots,
nm,
function_name,
my_call = expression,
my_storr_object = NULL,
init = FALSE,
called_in_pipeline = TRUE
)
Arguments
expression |
the expression |
env |
the environment to evaluate the expression in |
filter_result_slots |
character regular expressions, only if an indicator function's result's name matches one of these, it'll be used for the report. If of length zero, no filtering is performed. |
nm |
character name for the computed result |
function_name |
character name of the function to be executed |
my_call |
the call being executed (equivalent to |
my_storr_object |
a |
init |
logical is this an initial call to compute dummy results? |
called_in_pipeline |
logical if the evaluation should be considered as part of a pipeline. |
Value
a dataquieR_result
object
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_create_page_file()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
Generate a full DQ report, v2
Description
Generate a full DQ report, v2
Usage
util_evaluate_calls(
all_calls,
study_data,
meta_data,
label_col,
meta_data_segment,
meta_data_dataframe,
meta_data_cross_item,
resp_vars,
filter_result_slots,
cores,
debug_parallel,
mode = c("default", "futures", "queue", "parallel"),
mode_args,
my_storr_object = NULL
)
Arguments
all_calls |
list a list of calls |
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional: Data frame level metadata |
meta_data_cross_item |
data.frame – optional: cross-item level metadata |
resp_vars |
variable list the name of the measurement variables for the report. |
filter_result_slots |
character regular expressions, only if an indicator function's result's name matches one of these, it'll be used for the report. If of length zero, no filtering is performed. |
cores |
integer number of cpu cores to use or a named list with arguments for parallelMap::parallelStart or NULL, if parallel has already been started by the caller. Can also be a cluster. |
debug_parallel |
logical print blocks currently evaluated in parallel |
mode |
character work mode for parallel execution. default is
"default", the values mean:
- default: use |
mode_args |
list of arguments for the selected |
Value
a dataquieR_resultset2. Can be printed creating a RMarkdown-report.
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
Verify, that argument is a data frame
Description
Stops with an error, if not. Adds the columns and returns the resulting
extended data frame, also updating the original data frame in the
calling environment, if x
is empty (data frames easily break to
0 columns in R, if they have no rows, e.g., using some split
/rbind
pattern)
Usage
util_expect_data_frame(
x,
col_names,
convert_if_possible,
custom_errors,
dont_assign,
keep_types = FALSE
)
Arguments
x |
an object that is verified to be a |
col_names |
column names x must contain or named list of predicates to check the columns (e.g., list(AGE=is.numeric, SEX=is.character)) |
convert_if_possible |
if given, for each column, a lambda can be given
similar to |
custom_errors |
list with error messages, specifically per column. names of the list are column names, values are messages (character). |
dont_assign |
set |
keep_types |
logical keep types as possibly defined in a file, if the
data frame is loaded from one. set |
Value
invisible
data frame
check, if a scalar/vector function argument matches expectations
Description
check, if a scalar/vector function argument matches expectations
Usage
util_expect_scalar(
arg_name,
allow_more_than_one = FALSE,
allow_null = FALSE,
allow_na = FALSE,
min_length = -Inf,
max_length = Inf,
check_type,
convert_if_possible,
conversion_may_replace_NA = FALSE,
dont_assign = FALSE,
error_message
)
Arguments
arg_name |
the argument |
allow_more_than_one |
allow vectors |
allow_null |
allow NULL |
allow_na |
allow |
min_length |
minimum length of the argument's value |
max_length |
maximum length of the argument's value |
check_type |
a predicate function, that must return |
convert_if_possible |
if given, a lambda can be given
similar to |
conversion_may_replace_NA |
if set to |
dont_assign |
set |
error_message |
if |
Value
the value of arg_name – but this is updated in the calling frame anyway.
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Examples
## Not run:
f <- function(x) {
util_expect_scalar(x, check_type = is.integer)
}
f(42L)
try(f(42))
g <- function(x) {
util_expect_scalar(x, check_type = is.integer, convert_if_possible =
as.integer)
}
g(42L)
g(42)
## End(Not run)
Extract all ids from a list of htmltools
objects
Description
Extract all ids from a list of htmltools
objects
Usage
util_extract_all_ids(pages)
Arguments
pages |
the list of objects |
Value
a character vector with valid targets
See Also
Other html:
util_dashboard_table()
,
util_generate_pages_from_report()
,
util_get_hovertext()
Extract columns of a SummaryTable
(or Segment, ...)
Description
Extract columns of a SummaryTable
(or Segment, ...)
Usage
util_extract_indicator_metrics(Table)
Arguments
Table |
data.frame, a table |
Value
data.frame columns with indicator metrics from Table
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_html_table()
,
util_sort_by_order()
return all matches of an expression
Description
return all matches of an expression
Usage
util_extract_matches(data, pattern)
Arguments
data |
a character vector |
pattern |
a character string containing a regular expression |
Value
A list with matching elements or NULL (in case of non-matching elements)
Author(s)
Josh O'Brien
See Also
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_attach_attr()
,
util_bQuote()
,
util_backtickQuote()
,
util_coord_flip()
,
util_par_pmap()
,
util_setup_rstudio_job()
,
util_suppress_output()
Examples
## Not run: # not exported, so not tested
dat0 <- list("a sentence with citation (Ref. 12), (Ref. 13), and then (Ref. 14)",
"another sentence without reference")
pat <- "Ref. (\\d+)"
util_extract_matches(dat0, pat)
## End(Not run)
Filter a MISSING_LIST_TABLE
for rows matching the variable rv
Description
In MISSING_LIST_TABLE
, a column resp_vars
may be specified. If so,
and if, for a row, this column is not empty, then that row only affects the
one variable specified in that cell
Usage
util_filter_missing_list_table_for_rv(table, rv, rv2 = rv)
Arguments
table |
cause_label_df a data frame with missing codes and
optionally |
rv |
variable the response variable to filter the missing list for specified by a label. |
rv2 |
variable the response variable to filter the missing list for
specified by a |
Value
data.frame the row-wise bound data frames as one data frame
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_count_expected_observations()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_observation_expected()
,
util_remove_empty_rows()
,
util_replace_codes_by_NA()
Filter collection based on its names()
using regular expressions
Description
Filter collection based on its names()
using regular expressions
Usage
util_filter_names_by_regexps(collection, regexps)
Arguments
collection |
a named collection (list, vector, ...) |
regexps |
character a vector of regular expressions |
Value
collection, reduced to the entries whose names match at least one of the expressions in regexps
See Also
Other string_functions: util_abbreviate_unique(), util_pretty_vector_string(), util_set_dQuoteString(), util_set_sQuoteString(), util_sub_string_left_from_.(), util_sub_string_right_from_.(), util_translate()
Examples
## Not run: # internal function
util_filter_names_by_regexps(iris, c("epa", "eta"))
## End(Not run)
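The assumed semantics can be sketched in base R (the helper name filter_names_by_regexps below is a stand-in for the internal function, which is not exported): keep entries whose names match at least one of the regular expressions.

```r
# Keep collection entries whose names match any of the given regexps.
filter_names_by_regexps <- function(collection, regexps) {
  keep <- vapply(names(collection),
                 function(n) any(vapply(regexps, grepl, logical(1), x = n)),
                 logical(1))
  collection[keep]
}
# "epa" matches the Sepal columns, "eta" the Petal columns; Species drops out
names(filter_names_by_regexps(iris, c("epa", "eta")))
```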
Function that calculates height and width values for script_iframe
Description
Function that calculates height and width values for script_iframe
Usage
util_finalize_sizing_hints(sizing_hints)
Arguments
sizing_hints |
list containing information for setting
the size of the |
Value
a list with figure_type_id, w, and h; sizes are given as CSS, existing elements are kept, and w_in_cm and h_in_cm are estimates of the size in centimeters on a typical computer display (as of 2024)
Find externally called function in the stack trace
Description
intended use: error messages for the user
Usage
util_find_external_functions_in_stacktrace(
sfs = rev(sys.frames()),
cls = rev(sys.calls())
)
Arguments
sfs |
reverse sys.frames to search in |
cls |
reverse sys.calls to search in |
Value
logical vector stating, for each index, whether it was called externally
See Also
Other condition_functions: util_condition_constructor_factory(), util_deparse1(), util_error(), util_find_first_externally_called_functions_in_stacktrace(), util_find_indicator_function_in_callers(), util_message(), util_suppress_warnings(), util_warning()
Find first externally called function in the stack trace
Description
intended use: error messages for the user
Usage
util_find_first_externally_called_functions_in_stacktrace(
sfs = rev(sys.frames()),
cls = rev(sys.calls())
)
Arguments
sfs |
reverse sys.frames to search in |
cls |
reverse sys.calls to search in |
Value
reverse sys.frames index of first non-dataquieR function in this stack
See Also
Other condition_functions: util_condition_constructor_factory(), util_deparse1(), util_error(), util_find_external_functions_in_stacktrace(), util_find_indicator_function_in_callers(), util_message(), util_suppress_warnings(), util_warning()
Find a free missing code
Description
Find a missing code that does not yet occur in x
Usage
util_find_free_missing_code(x)
Arguments
x |
a vector of missing codes |
Value
a missing code not in x
See Also
Other metadata_management: util_dist_selection(), util_find_var_by_meta(), util_get_var_att_names_of_level(), util_get_vars_in_segment(), util_looks_like_missing(), util_no_value_labels(), util_validate_known_meta(), util_validate_missing_lists()
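One plausible way to obtain a code that is guaranteed not to be in x is to go beyond the largest code in use (a hypothetical illustration; the internal function is not exported and may choose differently):

```r
# Return a numeric code that is not yet contained in x.
find_free_missing_code <- function(x) {
  max(x, na.rm = TRUE) + 1
}
find_free_missing_code(c(99980, 99981, 99982))
```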
Search for a formal in the stack trace
Description
Similar to dynGet()
, find a symbol in the closest data quality indicator
function and return its value. Can stop()
, if symbol evaluation causes a
stop.
Usage
util_find_indicator_function_in_callers(symbol = "resp_vars")
Arguments
symbol |
symbol to find |
Value
value of the symbol, if available, NULL
otherwise
See Also
Other condition_functions:
util_condition_constructor_factory()
,
util_deparse1()
,
util_error()
,
util_find_external_functions_in_stacktrace()
,
util_find_first_externally_called_functions_in_stacktrace()
,
util_message()
,
util_suppress_warnings()
,
util_warning()
Try hard to map a variable
Description
Does not warn on ambiguities, nor if a variable is not found (in the latter case, it returns ifnotfound)
Usage
util_find_var_by_meta(
resp_vars,
meta_data = "item_level",
label_col = LABEL,
allowed_sources = c(VAR_NAMES, label_col, LABEL, LONG_LABEL, "ORIGINAL_VAR_NAMES",
"ORIGINAL_LABEL"),
target = VAR_NAMES,
ifnotfound = NA_character_
)
Arguments
resp_vars |
variables to map from |
meta_data |
metadata |
label_col |
label-col to map from, if not |
allowed_sources |
allowed names to map from (as metadata columns) |
target |
metadata attribute to map to |
ifnotfound |
list A list of values to be used if the item is not found: it will be coerced to a list if necessary. |
Value
vector of mapped target names of resp_vars
See Also
Other metadata_management: util_dist_selection(), util_find_free_missing_code(), util_get_var_att_names_of_level(), util_get_vars_in_segment(), util_looks_like_missing(), util_no_value_labels(), util_validate_known_meta(), util_validate_missing_lists()
Move the first row of a data frame to its column names
Description
Move the first row of a data frame to its column names
Usage
util_first_row_to_colnames(dfr)
Arguments
dfr |
Value
data.frame with first row as column names
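The documented behavior can be sketched in base R (assumed, not the package implementation), for a data frame whose first row holds the intended column names:

```r
# Promote the first row of a data frame to its column names.
first_row_to_colnames <- function(dfr) {
  colnames(dfr) <- vapply(dfr[1, , drop = FALSE], as.character, character(1))
  dfr[-1, , drop = FALSE]
}
dfr <- data.frame(V1 = c("id", "1", "2"), V2 = c("sex", "m", "f"))
first_row_to_colnames(dfr)  # columns are now "id" and "sex"
</imports>
```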
Fix results from merge
Description
This function handles the result of merge() calls, if no.dups = TRUE and suffixes = c("", "")
Usage
util_fix_merge_dups(dfr, stop_if_incompatible = TRUE)
Arguments
dfr |
data frame to fix |
stop_if_incompatible |
logical stop, if the data frame cannot be fixed |
See Also
Other data_management: util_assign_levlabs(), util_check_data_type(), util_check_group_levels(), util_compare_meta_with_study(), util_dichotomize(), util_merge_data_frame_list(), util_rbind(), util_remove_na_records(), util_replace_hard_limit_violations(), util_round_to_decimal_places(), util_study_var2factor(), util_table_of_vct()
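With empty suffixes, merge() can yield duplicated column names that downstream code cannot handle. A minimal sketch of such a fixer (hypothetical, not the package implementation): drop duplicated columns that are exact copies, and stop otherwise.

```r
# Collapse duplicated columns after a merge; stop if the copies disagree.
fix_merge_dups <- function(dfr, stop_if_incompatible = TRUE) {
  dup <- duplicated(colnames(dfr))
  for (i in which(dup)) {
    first <- match(colnames(dfr)[i], colnames(dfr))
    if (!identical(dfr[[first]], dfr[[i]]) && stop_if_incompatible)
      stop("duplicated column ", colnames(dfr)[i], " differs between copies")
  }
  dfr[, !dup, drop = FALSE]
}
d <- data.frame(id = 1:2, v = 3:4, v = 3:4, check.names = FALSE)
fix_merge_dups(d)  # one "v" column remains
```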
RStudio crashes on parallel calls in some versions on Darwin-based operating systems with R 4
Description
RStudio crashes on parallel calls in some versions on Darwin-based operating systems with R 4
Usage
util_fix_rstudio_bugs()
Value
invisible null
See Also
Other robustness_functions: util_as_valid_missing_codes(), util_check_one_unique_value(), util_correct_variable_use(), util_empty(), util_ensure_character(), util_ensure_in(), util_ensure_suggested(), util_expect_scalar(), util_is_integer(), util_is_numeric_in(), util_is_valid_missing_codes(), util_match_arg(), util_observations_in_subgroups(), util_stop_if_not(), util_warn_unordered()
Ensure that the sizing hint sticks to the dqr only
Description
Ensure that the sizing hint sticks to the dqr only
Usage
util_fix_sizing_hints(dqr, x)
Arguments
dqr |
a |
x |
a plot object |
Value
a list with dqr and x, but fixed
Fix a storr
object, if it features the factory-attribute
Description
Fix a storr
object, if it features the factory-attribute
Usage
util_fix_storr_object(my_storr_object)
Arguments
my_storr_object |
a |
Value
a (hopefully) working storr_object
Return a single-page navigation menu floating on the right
Description
if displayed in a dq_report2
Usage
util_float_index_menu(index_menu_table, object)
Arguments
index_menu_table |
data.frame columns: links, hovers, texts |
object |
|
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_generate_anchor_link(), util_generate_anchor_tag(), util_generate_calls(), util_generate_calls_for_function(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order(), util_set_size()
Examples
## Not run:
util_float_index_menu(tibble::tribble(
~ links, ~ hovers, ~ texts,
"http://www.google.de/#xxx", "This is Google", "to Google",
"http://www.uni-giessen.de/#xxx", "This is Gießen", "cruising on the A45"
))
## End(Not run)
Plots simple HTML tables with background color scale
Description
Plots simple HTML tables with background color scale
Usage
util_formattable(
tb,
min_val = min(tb, na.rm = TRUE),
max_val = max(tb, na.rm = TRUE),
min_color = c(0, 0, 255),
max_color = c(255, 0, 0),
soften = function(x) stats::plogis(x, location = 0.5, scale = 0.1),
style_header = "font-weight: bold;",
text_color_mode = c("bw", "gs"),
hover_texts = NULL,
escape_all_content = TRUE
)
Arguments
tb |
data.frame the table as data.frame with mostly numbers |
min_val |
numeric minimum value for the numbers in |
max_val |
numeric maximum value for the numbers in |
min_color |
numeric vector with the RGB color values for the minimum color, values between 0 and 255 |
max_color |
numeric vector with the RGB color values for the maximum color, values between 0 and 255 |
soften |
function to be applied to the relative values between 0 and 1 before mapping them to a color |
style_header |
character to be applied to style the HTML header of the table |
text_color_mode |
enum bw | gs. Should the text be displayed in black and white or using a grey scale? In both cases, the color will be adapted to the background. |
hover_texts |
data.frame if not |
escape_all_content |
logical if |
Value
htmltools
compatible object
Examples
## Not run:
tb <- as.data.frame(matrix(ncol = 5, nrow = 5))
tb[] <- sample(1:100, prod(dim(tb)), replace = TRUE)
tb[, 1] <- paste("case", 1:nrow(tb))
htmltools::browsable(util_formattable(tb))
htmltools::browsable(util_formattable(tb[, -1]))
## End(Not run)
Get description for an indicator function
Description
Get description for an indicator function
Usage
util_function_description(fname)
Arguments
fname |
the function name |
Value
the description
Generate a link to a specific result
Description
for dq_report2
Usage
util_generate_anchor_link(
varname,
callname,
order_context = c("variable", "indicator"),
name,
title
)
Arguments
varname |
variable to create a link to |
callname |
function call to create a link to |
order_context |
link created to variable overview or indicator overview page |
name |
replaces |
title |
optional, replaces auto-generated link title |
Value
the htmltools
tag
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_float_index_menu(), util_generate_anchor_tag(), util_generate_calls(), util_generate_calls_for_function(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order(), util_set_size()
Generate a tag for a specific result
Description
for dq_report2
Usage
util_generate_anchor_tag(
varname,
callname,
order_context = c("variable", "indicator"),
name
)
Arguments
varname |
variable to create an anchor for |
callname |
function call to create an anchor for |
order_context |
anchor created on variable overview or indicator overview page |
name |
replaces |
Value
the htmltools
tag
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_float_index_menu(), util_generate_anchor_link(), util_generate_calls(), util_generate_calls_for_function(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order(), util_set_size()
Generate an execution/calling plan for computing a report from the metadata
Description
Generate an execution/calling plan for computing a report from the metadata
Usage
util_generate_calls(
dimensions,
meta_data,
label_col,
meta_data_segment,
meta_data_dataframe,
meta_data_cross_item,
specific_args,
arg_overrides,
resp_vars,
filter_indicator_functions
)
Arguments
dimensions |
dimensions Vector of dimensions to address in the report. Allowed values in the vector are Completeness, Consistency, and Accuracy. The generated report will only cover the listed data quality dimensions. Accuracy is computationally expensive, so this dimension is not enabled by default. Completeness should be included if Consistency is included, and Consistency should be included if Accuracy is included, to avoid misleading detections of, e.g., missing codes as outliers; please refer to the data quality concept for more details. Integrity is always included. |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional: Data frame level metadata |
meta_data_cross_item |
data.frame – optional: Cross-item level metadata |
specific_args |
list named list of arguments specifically for one of the called functions; the names of the list elements correspond to the indicator functions whose calls should be modified. The elements are lists of arguments. |
arg_overrides |
list arguments to be passed to all called indicator functions if applicable. |
resp_vars |
variables to be respected, |
filter_indicator_functions |
character regular expressions; an indicator function is used for the report only if its name matches one of these. If of length zero, no filtering is performed. |
Value
a list of calls
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_float_index_menu(), util_generate_anchor_link(), util_generate_anchor_tag(), util_generate_calls_for_function(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order(), util_set_size()
Generate function calls for a given indicator function
Description
new reporting pipeline v2.0
Usage
util_generate_calls_for_function(
fkt,
meta_data,
label_col,
meta_data_segment,
meta_data_dataframe,
meta_data_cross_item,
specific_args,
arg_overrides,
resp_vars
)
Arguments
fkt |
the indicator function's name |
meta_data |
the item level metadata data frame |
label_col |
the label column |
meta_data_segment |
segment level metadata |
meta_data_dataframe |
data frame level metadata |
meta_data_cross_item |
cross-item level metadata |
specific_args |
argument overrides for specific functions |
arg_overrides |
general argument overrides |
resp_vars |
variables to be respected |
Value
function calls for the given function
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_float_index_menu(), util_generate_anchor_link(), util_generate_anchor_tag(), util_generate_calls(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order(), util_set_size()
Convert a dataquieR report v2 to a named list of web pages
Description
Convert a dataquieR report v2 to a named list of web pages
Usage
util_generate_pages_from_report(
report,
template,
disable_plotly,
progress = progress,
progress_msg = progress_msg,
block_load_factor,
dir,
my_dashboard
)
Arguments
report |
|
template |
character template to use, only the name, not the path |
disable_plotly |
logical do not use |
progress |
|
progress_msg |
|
block_load_factor |
numeric multiply size of parallel compute blocks by this factor. |
dir |
character output directory for potential |
my_dashboard |
list of class |
Value
named list, each entry becomes a file with the name of the entry. The contents are HTML objects as used by htmltools.
See Also
Other html: util_dashboard_table(), util_extract_all_ids(), util_get_hovertext()
Examples
## Not run:
devtools::load_all()
prep_load_workbook_like_file("meta_data_v2")
report <- dq_report2("study_data", dimensions = NULL, label_col = "LABEL");
save(report, file = "report_v2.RData")
report <- dq_report2("study_data", label_col = "LABEL");
save(report, file = "report_v2_short.RData")
## End(Not run)
Create a table summarizing the number of indicators and descriptors in the report
Description
Create a table summarizing the number of indicators and descriptors in the report
Usage
util_generate_table_indicators_descriptors(report)
Arguments
report |
a report |
Value
a table containing the number of indicators and descriptors created in the report, separated by data quality dimension.
Return the category for a result
Description
messages do not cause any category, warnings are cat3, errors are cat5
Usage
util_get_category_for_result(
result,
aspect = c("applicability", "error", "anamat", "indicator_or_descriptor"),
...
)
Arguments
result |
a |
aspect |
an aspect/problem category of results (error, applicability error) |
... |
not used |
Value
a category, see util_as_cat()
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_colors(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table(), util_sort_by_order()
Fetch a missing code list from the metadata
Description
Get missing codes from metadata (e.g., MISSING_LIST or JUMP_LIST)
Usage
util_get_code_list(
x,
code_name,
split_char = SPLIT_CHAR,
mdf,
label_col = VAR_NAMES,
warning_if_no_list = TRUE,
warning_if_unsuitable_list = TRUE
)
Arguments
x |
variable the name of the variable to retrieve code lists for. Only one variable at a time is supported; this function is not vectorized. |
code_name |
variable attribute JUMP_LIST or MISSING_LIST: Which codes to retrieve. |
split_char |
character len = 1. Character(s) used to separate
different codes in the metadata, usually |
mdf |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
warning_if_no_list |
logical len = 1. If |
warning_if_unsuitable_list |
logical len = 1. If |
Value
numeric vector of missing codes.
See Also
Other missing_functions: util_all_intro_vars_for_rv(), util_count_expected_observations(), util_filter_missing_list_table_for_rv(), util_is_na_0_empty_or_false(), util_observation_expected(), util_remove_empty_rows(), util_replace_codes_by_NA()
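The metadata convention this function reads can be illustrated with base R (assumed convention: codes stored as a SPLIT_CHAR-separated string, where SPLIT_CHAR is "|" in dataquieR; the column names below are for illustration):

```r
# Split a "|"-separated MISSING_LIST cell into a numeric code vector.
mdf <- data.frame(VAR_NAMES = "SBP_0",
                  MISSING_LIST = "99980 | 99981 | 99982")
codes <- as.numeric(trimws(
  strsplit(mdf$MISSING_LIST, "|", fixed = TRUE)[[1]]))
codes  # the three missing codes as numbers
```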
Get colors for each russet DQ
category
Description
Get colors for each russet DQ
category
Usage
util_get_colors()
Value
named vector of colors, names are categories (e.g., "1" to "5"), values are colors as HTML RGB hexadecimal strings
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table(), util_sort_by_order()
Read additional concept tables
Description
Read additional concept tables
Usage
util_get_concept_info(filename, ...)
Arguments
filename |
RDS-file name without extension to read from |
... |
passed to subset |
Value
a data frame
Get encoding from metadata or guess it from data
Description
Get encoding from metadata or guess it from data
Usage
util_get_encoding(
resp_vars = colnames(study_data),
study_data,
label_col,
meta_data,
meta_data_dataframe
)
Arguments
resp_vars |
variable the names of the measurement variables, if
missing or |
study_data |
data.frame the data frame that contains the measurements |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
Value
named vector of valid encoding strings matching resp_vars
Find a foreground color for a background
Description
black or white
Usage
util_get_fg_color(cl)
Arguments
cl |
colors |
Value
black or white for each cl
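The usual approach for such a helper can be sketched in base R (assumed behavior, not the exported code): choose black or white text depending on the background's relative luminance.

```r
# Pick a readable foreground color for each background color.
get_fg_color <- function(cl) {
  rgb <- grDevices::col2rgb(cl)
  lum <- (0.299 * rgb["red", ] +
          0.587 * rgb["green", ] +
          0.114 * rgb["blue", ]) / 255
  ifelse(lum > 0.5, "black", "white")
}
get_fg_color(c("#FFFFFF", "#000000"))  # black on white, white on black
```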
Import vector of hover text for tables in the report
Description
Import vector of hover text for tables in the report
Usage
util_get_hovertext(x)
Arguments
x |
name of the tables. They are |
Value
named vector containing the hover text from the file metadata-hovertext.rds
in the inst folder. Names correspond to column names in the metadata
tables
See Also
Other html: util_dashboard_table(), util_extract_all_ids(), util_generate_pages_from_report()
Get labels for each russet DQ
category
Description
Get labels for each russet DQ
category
Usage
util_get_labels_grading_class()
Value
named vector of labels, names are categories (e.g., "1" to "5"), values are labels
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_message_for_result(), util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table(), util_sort_by_order()
Return messages/warnings/notes/error messages for a result
Description
Return messages/warnings/notes/error messages for a result
Usage
util_get_message_for_result(
result,
aspect = c("applicability", "error", "anamat", "indicator_or_descriptor"),
collapse = "\n<br />\n",
...
)
Arguments
result |
a |
aspect |
an aspect/problem category of results |
collapse |
either a lambda function or a separator for combining multiple messages for the same result |
... |
not used |
Value
hover texts for results with data quality issues, run-time errors, warnings or notes (aka messages)
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_labels_grading_class(), util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table(), util_sort_by_order()
An environment with functions available for REDcap rules
Description
An environment with functions available for REDcap rules
Usage
util_get_redcap_rule_env()
Value
environment
See Also
Other redcap: util_eval_rule()
Get rule sets for DQ
grading
Description
Get rule sets for DQ
grading
Usage
util_get_rule_sets()
Value
named list, names are the ruleset names, values are data.frames featuring the columns GRADING_RULESET, dqi_parameterstub, indicator_metric, dqi_catnum, and dqi_cat_1 to dqi_cat_<dqi_catnum>
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table(), util_sort_by_order()
Get formats for DQ
categories
Description
Get formats for DQ
categories
Usage
util_get_ruleset_formats()
Value
data.frame columns: categories (e.g., "1" to "5"), color (e.g., "33 102 172", "67 147 195", "227 186 20", "214 96 77", "178 23 43"), label (e.g., "OK", "unclear", "moderate", "important", "critical")
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_rule_sets(), util_get_thresholds(), util_html_table(), util_sort_by_order()
Get namespace for attributes
Description
Get namespace for attributes
Usage
util_get_storr_att_namespace(my_storr_object)
Arguments
my_storr_object |
the |
Value
the namespace name
Get the storr
object backing a report
Description
Get the storr
object backing a report
Usage
util_get_storr_object_from_report(r)
Arguments
r |
the dataquieR_resultset2 / report |
Value
the storr
object holding the results, or NULL if the report lives in memory only
Get namespace specifically for summary attributes for speed-up
Description
Get namespace specifically for summary attributes for speed-up
Usage
util_get_storr_summ_namespace(my_storr_object)
Arguments
my_storr_object |
the |
Value
the namespace name
Get the thresholds for grading
Description
Get the thresholds for grading
Usage
util_get_thresholds(indicator_metric, meta_data)
Arguments
indicator_metric |
which indicator metric to be classified |
meta_data |
the item level metadata |
Value
named list (names are VAR_NAMES, values are named vectors of intervals, names in the vectors are the category numbers)
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_rule_sets(), util_get_ruleset_formats(), util_html_table(), util_sort_by_order()
Get variable attributes of a certain provision level
Description
This function returns all variable attribute names of a certain metadata provision level or of more than one level.
Usage
util_get_var_att_names_of_level(level, cumulative = TRUE)
Arguments
level |
level(s) of requirement |
cumulative |
include all names from more basic levels |
Value
all matching variable attribute names
See Also
Other metadata_management: util_dist_selection(), util_find_free_missing_code(), util_find_var_by_meta(), util_get_vars_in_segment(), util_looks_like_missing(), util_no_value_labels(), util_validate_known_meta(), util_validate_missing_lists()
Return all variables in the segment segment
Description
Return all variables in the segment segment
Usage
util_get_vars_in_segment(segment, meta_data = "item_level", label_col = LABEL)
Arguments
segment |
character the segment as specified in |
meta_data |
data.frame the metadata |
label_col |
character the metadata attribute used for naming the variables |
Value
vector of variable names
See Also
Other metadata_management: util_dist_selection(), util_find_free_missing_code(), util_find_var_by_meta(), util_get_var_att_names_of_level(), util_looks_like_missing(), util_no_value_labels(), util_validate_known_meta(), util_validate_missing_lists()
Get the Table with Known Vocabularies
Description
Get the Table with Known Vocabularies
Usage
util_get_voc_tab(.data_frame_list = .dataframe_environment())
Arguments
.data_frame_list |
environment cache for loaded data frames |
Value
data.frame the (combined) table with known vocabularies
Add labels to ggplot
Description
EXPERIMENTAL
Usage
util_gg_var_label(
...,
meta_data = get("meta_data", parent.frame()),
label_col = get("label_col", parent.frame())
)
Arguments
... |
EXPERIMENTAL |
meta_data |
the metadata |
label_col |
the label columns |
Value
a modified ggplot
Utility function to check whether a variable has no grouping variable assigned
Description
Utility function to check whether a variable has no grouping variable assigned
Usage
util_has_no_group_vars(resp_vars, label_col = LABEL, meta_data = "item_level")
Arguments
resp_vars |
variable list the name of a measurement variable |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame old name for |
Value
boolean
Utility Function Heatmap with 1 Threshold
Description
Function to create a heatmap-like plot given one threshold – works for percentages for now.
Usage
util_heatmap_1th(
df,
cat_vars,
values,
threshold,
right_intv,
invert,
cols,
strata
)
Arguments
df |
data.frame with data to display as a heatmap. |
cat_vars |
variable list len=1-2. Variables to group by. Up to 2 group levels supported. |
values |
variable the name of the percentage variable |
threshold |
numeric lowest acceptable value |
right_intv |
logical len=1. If |
invert |
logical len=1. If |
cols |
deprecated, ignored. |
strata |
variable optional, the name of a variable
used for stratification
|
Value
a list with:
- SummaryPlot: ggplot2::ggplot object with the heatmap
See Also
Other figure_functions: util_optimize_histogram_bins()
If on Windows, hide a file
Description
If on Windows, hide a file
Usage
util_hide_file_windows(fn)
Arguments
fn |
the file path + name |
Value
invisible(NULL)
Utility function to create histograms
Description
A helper function for simple histograms.
Usage
util_histogram(
plot_data,
num_var = colnames(plot_data)[1],
fill_var = NULL,
facet_var = NULL,
nbins_max = 100,
colors = "#2166AC",
is_datetime = FALSE
)
Arguments
plot_data |
a |
num_var |
column name of the numerical or datetime variable
in |
fill_var |
column name of the categorical variable in |
facet_var |
column name of the categorical variable in |
nbins_max |
the maximum number of bins for the histogram (see
|
colors |
vector of colors, or a single color |
is_datetime |
if |
Value
a histogram
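The nbins_max capping described above can be sketched with base R (assumed behavior; the actual helper draws a ggplot2 histogram, and Sturges' rule is only an assumed default bin suggestion):

```r
# Cap the number of histogram bins at a maximum (the nbins_max idea).
set.seed(1)
x <- rnorm(500)
nbins <- min(100, grDevices::nclass.Sturges(x))
h <- hist(x, breaks = nbins, plot = FALSE)
sum(h$counts)  # all 500 observations fall into the bins
```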
Escape "
Description
Escape "
Usage
util_html_attr_quote_escape(s)
Arguments
s |
haystack |
Value
s with " replaced by &quot;
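The escaping itself is a one-liner with gsub() (a sketch of the assumed behavior; the helper name below is a stand-in for the internal function):

```r
# Replace literal double quotes with the HTML entity &quot;.
html_attr_quote_escape <- function(s) gsub('"', "&quot;", s, fixed = TRUE)
html_attr_quote_escape('say "hi"')  # 'say &quot;hi&quot;'
```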
Create a dynamic dimension related page for the report
Description
Create a dynamic dimension related page for the report
Usage
util_html_for_dims(
report,
use_plot_ly,
template,
block_load_factor,
repsum,
dir
)
Arguments
report |
dataquieR_resultset2 a |
use_plot_ly |
logical use |
template |
character template to use for the |
block_load_factor |
numeric multiply size of parallel compute blocks by this factor. |
repsum |
the |
dir |
character output directory for potential |
Value
list of arguments for append_single_page()
defined locally in
util_generate_pages_from_report()
.
Create a dynamic single variable page for the report
Description
Create a dynamic single variable page for the report
Usage
util_html_for_var(
report,
cur_var,
use_plot_ly,
template,
note_meta = c(),
rendered_repsum,
dir
)
Arguments
report |
dataquieR_resultset2 a |
cur_var |
character variable name for single variable pages |
use_plot_ly |
logical use |
template |
character template to use for the |
note_meta |
character notes on the metadata for a single variable (if needed) |
rendered_repsum |
the |
dir |
character output directory for potential |
Value
list of arguments for append_single_page()
defined locally in
util_generate_pages_from_report()
.
The jack of all trades device for tables
Description
The jack of all trades device for tables
Usage
util_html_table(
tb,
filter = "top",
columnDefs = NULL,
autoWidth = FALSE,
hideCols = character(0),
rowCallback = DT::JS("function(r,d) {$(r).attr('height', '2em')}"),
copy_row_names_to_column = !is.null(tb) && length(rownames(tb)) == nrow(tb) &&
!is.integer(attr(tb, "row.names")) && !all(seq_len(nrow(tb)) == rownames(tb)),
link_variables = TRUE,
tb_rownames = FALSE,
meta_data,
rotate_headers = FALSE,
fillContainer = TRUE,
...,
colnames,
descs,
options = list(),
is_matrix_table = FALSE,
colnames_aliases2acronyms = is_matrix_table && !cols_are_indicatormetrics,
cols_are_indicatormetrics = FALSE,
label_col = LABEL,
output_format = c("RMD", "HTML"),
dl_fn = "*",
rotate_for_one_row = FALSE,
title = dl_fn,
messageTop = NULL,
messageBottom = NULL,
col_tags = NULL,
searchBuilder = FALSE,
initial_col_tag,
init_search,
additional_init_args,
additional_columnDefs
)
Arguments
tb |
the table as data.frame |
filter |
passed to |
columnDefs |
column specifications for the |
autoWidth |
passed to the |
hideCols |
columns to hide (by name) |
rowCallback |
passed to the |
copy_row_names_to_column |
add a column 0 with |
link_variables |
considering row names being variables, convert row names to links to the variable specific reports |
tb_rownames |
number of columns from the left considered as row-names |
meta_data |
the data dictionary for labels and similar stuff |
rotate_headers |
rotate headers by 90 degrees |
fillContainer |
see |
... |
passed to |
colnames |
column names for the table (defaults to |
descs |
character descriptions of the columns for the hover-box shown
for the column names, if not missing, this overrides
the existing description stuff from known column
names. If you have an attribute "description" of the |
options |
individually overwrites defaults in |
is_matrix_table |
create a heat map like table without padding |
colnames_aliases2acronyms |
abbreviate column names, considering them analysis matrix columns, by their acronyms defined in square. |
cols_are_indicatormetrics |
logical cannot be |
label_col |
label col used for mapping labels in case of
|
output_format |
target format |
dl_fn |
file name for downloaded table – see https://datatables.net/reference/button/excel |
rotate_for_one_row |
logical rotate one-row-tables |
title |
character title for download formats, see https://datatables.net/extensions/buttons/examples/html5/titleMessage.html |
messageTop |
character subtitle for download formats, see https://datatables.net/extensions/buttons/examples/html5/titleMessage.html |
messageBottom |
character footer for download formats, see https://datatables.net/extensions/buttons/examples/html5/titleMessage.html |
col_tags |
list if not |
searchBuilder |
logical if |
initial_col_tag |
character |
init_search |
list object to initialize |
additional_init_args |
list if not missing or |
additional_columnDefs |
list additional |
Value
the table to be added to an rmd/html file as
htmlwidgets::htmlwidgets
See Also
Other summary_functions:
prep_combine_report_summaries()
,
prep_extract_classes_by_functions()
,
prep_extract_summary()
,
prep_extract_summary.dataquieR_result()
,
prep_extract_summary.dataquieR_resultset2()
,
prep_render_pie_chart_from_summaryclasses_ggplot2()
,
prep_render_pie_chart_from_summaryclasses_plotly()
,
prep_summary_to_classes()
,
util_as_cat()
,
util_as_integer_cat()
,
util_extract_indicator_metrics()
,
util_get_category_for_result()
,
util_get_colors()
,
util_get_labels_grading_class()
,
util_get_message_for_result()
,
util_get_rule_sets()
,
util_get_ruleset_formats()
,
util_get_thresholds()
,
util_sort_by_order()
Utility function for the outlier rule of Hubert and Vandervieren 2008
Description
Function to calculate outliers according to the rule of Hubert and
Vandervieren (2008). This function requires the package robustbase.
Usage
util_hubert(x)
Arguments
x |
numeric data to check for outliers |
Value
binary vector
See Also
Other outlier_functions:
util_3SD()
,
util_sigmagap()
,
util_tukey()
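A hedged sketch of the Hubert and Vandervieren (2008) rule using robustbase::adjboxStats(), whose skewness-adjusted (medcouple-based) whisker fences implement this idea; the wrapper `hubert_flags` is a hypothetical stand-in, not the package's own code:

```r
## Sketch: values beyond the medcouple-adjusted whisker fences of the
## adjusted boxplot count as outliers (binary result, 1 = outlier).
library(robustbase)
hubert_flags <- function(x) {
  fence <- adjboxStats(x)$fence  # lower and upper adjusted fence
  as.integer(x < fence[1] | x > fence[2])
}
x <- c(1:10, 1000)   # one gross outlier
hubert_flags(x)
```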
Make it scalable, if it is a figure
Description
This function writes figures to helper files and embeds them in a returned
object, which is a scalable iframe. Other objects are passed through
unchanged.
Usage
util_iframe_it_if_needed(it, dir, nm, fkt, sizing_hints, ggthumb)
Arguments
it |
|
dir |
character output directory for potential |
nm |
character name for the |
fkt |
character function name of the indicator function that created
|
sizing_hints |
|
ggthumb |
|
Value
htmltools::tag()
compatible object, maybe now in an iframe
Extract all properties of a ReportSummaryTable
Description
Extract all properties of a ReportSummaryTable
Usage
util_init_respum_tab(x)
Arguments
x |
|
Value
list with all properties
Integer breaks for ggplot2
Description
creates integer-only breaks
Usage
util_int_breaks_rounded(x, n = 5)
Arguments
x |
the values |
n |
integer giving the desired number of intervals. Non-integer values are rounded down. |
Value
breaks suitable for the breaks
argument of scale_*_continuous
Examples
## Not run:
library(ggplot2)
big_numbers1 <- data.frame(x = 1:5, y = c(0:1, 0, 1, 0))
big_numbers2 <- data.frame(x = 1:5, y = c(0:1, 0, 1, 0) + 1000000)
big_numbers_plot1 <- ggplot(big_numbers1, aes(x = x, y = y)) +
geom_point()
big_numbers_plot2 <- ggplot(big_numbers2, aes(x = x, y = y)) +
geom_point()
big_numbers_plot1 + scale_y_continuous()
big_numbers_plot1 + scale_y_continuous(breaks = util_int_breaks_rounded)
big_numbers_plot2 + scale_y_continuous()
big_numbers_plot2 + scale_y_continuous(breaks = util_int_breaks_rounded)
## End(Not run)
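The idea of integer-only breaks can be sketched in base R (the helper `int_breaks_rounded` below is an assumed, simplified re-implementation, not the package's own code):

```r
## Hypothetical sketch of integer-only axis breaks: compute pretty
## breaks for the data range, then round and deduplicate them.
int_breaks_rounded <- function(x, n = 5) {
  unique(round(pretty(x, n = floor(n))))
}
int_breaks_rounded(c(0.2, 4.8))
```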
Check for duplicated content
Description
This function tests for duplicate entries in the data set. It is possible to check for duplicated entries by study segments or to consider only selected segments.
Usage
util_int_duplicate_content_dataframe(
level = c("dataframe"),
identifier_name_list,
id_vars_list,
unique_rows,
meta_data_dataframe = "dataframe_level",
...,
dataframe_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
identifier_name_list |
vector the vector that contains the name of the identifier to be used in the assessment. For the study level, corresponds to the names of the different data frames. For the segment level, indicates the name of the segments. |
id_vars_list |
list the list containing the identifier variables names to be used in the assessment. |
unique_rows |
vector named. For each data frame, either true/false or |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
... |
Not used. |
dataframe_level |
data.frame alias for |
Value
a list with
- SegmentData: data frame with the results of the quality check for duplicated entries
- SegmentTable: data frame with selected duplicated entries check results, used for the data quality report
- Other: vector with row indices of duplicated entries, if any, otherwise NULL
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_segment()
,
util_int_duplicate_ids_dataframe()
,
util_int_duplicate_ids_segment()
,
util_int_unexp_records_set_dataframe()
,
util_int_unexp_records_set_segment()
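The core of the duplicated-content check can be sketched with base R's duplicated(): checking from both directions marks every copy of a repeated row, not only the later ones. This is an illustrative sketch, not the package's implementation:

```r
## Base-R sketch of a whole-row duplicate check.
dat <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
dup <- duplicated(dat) | duplicated(dat, fromLast = TRUE)
which(dup)   # rows 1 and 2 are duplicates of each other
```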
Check for duplicated content
Description
This function tests for duplicate entries in the data set. It is possible to check for duplicated entries by study segments or to consider only selected segments.
Usage
util_int_duplicate_content_segment(
level = c("segment"),
identifier_name_list,
id_vars_list,
unique_rows,
study_data,
meta_data,
meta_data_segment = "segment_level",
segment_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
identifier_name_list |
vector the vector that contains the name of the identifier to be used in the assessment. For the study level, corresponds to the names of the different data frames. For the segment level, indicates the name of the segments. |
id_vars_list |
list the list containing the identifier variables names to be used in the assessment. |
unique_rows |
vector named. For each segment, either true/false or |
study_data |
data.frame the data frame that contains the measurements, mandatory. |
meta_data |
data.frame the data frame that contains metadata attributes of the study data, mandatory. |
meta_data_segment |
data.frame – optional: Segment level metadata |
segment_level |
data.frame alias for |
Value
a list with
- SegmentData: data frame with the results of the quality check for duplicated entries
- SegmentTable: data frame with selected duplicated entries check results, used for the data quality report
- Other: vector with row indices of duplicated entries, if any, otherwise NULL
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_dataframe()
,
util_int_duplicate_ids_dataframe()
,
util_int_duplicate_ids_segment()
,
util_int_unexp_records_set_dataframe()
,
util_int_unexp_records_set_segment()
Check for duplicated IDs
Description
This function tests for duplicate entries in identifiers. It is possible to check for duplicated identifiers by study segments or to consider only selected segments.
Usage
util_int_duplicate_ids_dataframe(
level = c("dataframe"),
id_vars_list,
identifier_name_list,
repetitions,
meta_data_dataframe = "dataframe_level",
...,
dataframe_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
id_vars_list |
list id variable names for each segment or data frame |
identifier_name_list |
vector the segments or data frame names being assessed |
repetitions |
vector an integer vector indicating the number of allowed repetitions in the |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
... |
not used. |
dataframe_level |
data.frame alias for |
Value
a list with
- DataframeData: data frame with the results of the quality check for duplicated identifiers
- DataframeTable: data frame with selected duplicated identifiers check results, used for the data quality report
- Other: named list of inner lists of unique cases, each containing the row indices of duplicated identifiers separated by "|", if any; the outer names are the names of the data frames
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_dataframe()
,
util_int_duplicate_content_segment()
,
util_int_duplicate_ids_segment()
,
util_int_unexp_records_set_dataframe()
,
util_int_unexp_records_set_segment()
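The duplicated-ID idea can be sketched in base R by treating the id variables as a key and flagging every row whose key occurs more than once (an illustrative sketch, not the package's code):

```r
## Base-R sketch of a duplicated-ID check over a combination of id variables.
dat <- data.frame(id = c(1, 2, 2, 3), visit = c("a", "a", "a", "b"))
key <- dat[c("id", "visit")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
which(dup)   # rows 2 and 3 share the key (2, "a")
```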
Check for duplicated IDs
Description
This function tests for duplicate entries in identifiers. It is possible to check for duplicated identifiers by study segments or to consider only selected segments.
Usage
util_int_duplicate_ids_segment(
level = c("segment"),
id_vars_list,
study_segment,
repetitions,
study_data,
meta_data,
meta_data_segment = "segment_level",
segment_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
id_vars_list |
list id variable names for each segment or data frame |
study_segment |
vector the segments or data frame names being assessed |
repetitions |
vector an integer vector indicating the number of allowed repetitions in the id_vars. Currently, no repetitions are supported. |
study_data |
data.frame the data frame that contains the measurements, mandatory. |
meta_data |
data.frame the data frame that contains metadata attributes of the study data, mandatory. |
meta_data_segment |
data.frame – optional: Segment level metadata |
segment_level |
data.frame alias for |
Value
a list with
- SegmentData: data frame with the results of the quality check for duplicated identifiers
- SegmentTable: data frame with selected duplicated identifiers check results, used for the data quality report
- Other: named list of inner lists of unique cases, each containing the row indices of duplicated identifiers separated by "|", if any; the outer names are the names of the segments. Use prep_get_study_data_segment()
to get the data frame the indices refer to.
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_dataframe()
,
util_int_duplicate_content_segment()
,
util_int_duplicate_ids_dataframe()
,
util_int_unexp_records_set_dataframe()
,
util_int_unexp_records_set_segment()
Check for unexpected data record set
Description
This function tests that the identifiers match a provided record set. It is possible to check for unexpected data record sets by study segments or to consider only selected segments.
Usage
util_int_unexp_records_set_dataframe(
level = c("dataframe"),
id_vars_list,
identifier_name_list,
valid_id_table_list,
meta_data_record_check_list,
meta_data_dataframe = "dataframe_level",
...,
dataframe_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
id_vars_list |
list the list containing the identifier variables names to be used in the assessment. |
identifier_name_list |
list the list that contains the name of the identifier to be used in the assessment. For the study level, corresponds to the names of the different data frames. For the segment level, indicates the name of the segments. |
valid_id_table_list |
list the reference list with the identifier variable values. |
meta_data_record_check_list |
character a character vector indicating the type of check to conduct, either "subset" or "exact". |
meta_data_dataframe |
data.frame the data frame that contains the metadata for the data frame level |
... |
not used |
dataframe_level |
data.frame alias for |
Value
a list with
- SegmentData: data frame with the results of the quality check for unexpected data elements
- SegmentTable: data frame with selected unexpected data elements check results, used for the data quality report
- UnexpectedRecords: vector with row indices of unexpected records, if any, otherwise NULL
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_dataframe()
,
util_int_duplicate_content_segment()
,
util_int_duplicate_ids_dataframe()
,
util_int_duplicate_ids_segment()
,
util_int_unexp_records_set_segment()
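The record-set comparison behind the "subset" and "exact" check types can be sketched with base-R set operations (illustrative only; the package's actual logic is richer):

```r
## Sketch: observed ids versus a valid reference set.
observed <- c(1, 2, 3, 9)
valid    <- 1:5
unexpected <- setdiff(observed, valid)  # fails both check types: 9
absent     <- setdiff(valid, observed)  # fails the "exact" check only: 4 5
```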
Check for unexpected data record set
Description
This function tests that the identifiers match a provided record set. It is possible to check for unexpected data record sets by study segments or to consider only selected segments.
Usage
util_int_unexp_records_set_segment(
level = c("segment"),
id_vars_list,
identifier_name_list,
valid_id_table_list,
meta_data_record_check_list,
study_data,
label_col,
meta_data,
item_level,
meta_data_segment = "segment_level",
segment_level
)
Arguments
level |
character a character vector indicating whether the assessment should be conducted at the study level (level = "dataframe") or at the segment level (level = "segment"). |
id_vars_list |
list the list containing the identifier variables names to be used in the assessment. |
identifier_name_list |
list the list that contains the name of the identifier to be used in the assessment. For the study level, corresponds to the names of the different data frames. For the segment level, indicates the name of the segments. |
valid_id_table_list |
list the reference list with the identifier variable values. |
meta_data_record_check_list |
character a character vector indicating the type of check to conduct, either "subset" or "exact". |
study_data |
data.frame the data frame that contains the measurements, mandatory. |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
meta_data |
data.frame the data frame that contains metadata attributes of the study data, mandatory. |
item_level |
data.frame the data frame that contains metadata attributes of study data |
meta_data_segment |
data.frame – optional: Segment level metadata |
segment_level |
data.frame alias for |
Value
a list with
- SegmentData: data frame with the results of the quality check for unexpected data elements
- SegmentTable: data frame with selected unexpected data elements check results, used for the data quality report
- UnexpectedRecords: vector with row indices of unexpected records, if any, otherwise NULL
See Also
Other integrity_indicator_functions:
util_int_duplicate_content_dataframe()
,
util_int_duplicate_content_segment()
,
util_int_duplicate_ids_dataframe()
,
util_int_duplicate_ids_segment()
,
util_int_unexp_records_set_dataframe()
Utility function to interpret mathematical interval notation
Description
Utility function to split limit definitions into interpretable elements
Usage
util_interpret_limits(mdata)
Arguments
mdata |
data.frame the data frame that contains metadata attributes of study data |
Value
augments metadata by interpretable limit columns
See Also
Other parser_functions:
util_parse_assignments()
,
util_parse_interval()
,
util_parse_redcap_rule()
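Splitting a limit definition such as "[0;10)" into interpretable elements can be sketched as follows (the helper `parse_interval` is a hypothetical simplification, assuming the common bracket/semicolon notation; the package's parser handles more cases):

```r
## Sketch of interval-notation parsing: "[" / "]" mean inclusive limits.
parse_interval <- function(s) {
  s <- gsub(" ", "", s, fixed = TRUE)
  n <- nchar(s)
  bounds <- strsplit(substr(s, 2, n - 1), ";", fixed = TRUE)[[1]]
  list(low = as.numeric(bounds[1]), upp = as.numeric(bounds[2]),
       inc_low = substr(s, 1, 1) == "[", inc_upp = substr(s, n, n) == "]")
}
parse_interval("[0;10)")   # low 0 (inclusive), upp 10 (exclusive)
```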
Check for integer values
Description
This function checks if a variable is integer.
Usage
util_is_integer(x, tol = .Machine$double.eps^0.5)
Arguments
x |
the object to test |
tol |
precision of the detection. Values deviating more than |
Value
TRUE
or FALSE
See Also
is.integer
Copied from the documentation of is.integer:
is.integer detects whether the storage mode of an R object is
integer. Usually, users want to know whether the values are integer. As
suggested by is.integer's documentation, is.wholenumber
does so.
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
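The value-based check described above rests on the is.wholenumber idea from ?is.integer, which can be sketched in base R:

```r
## A value is "integer" if it deviates from round(x) by less than tol.
is_wholenumber <- function(x, tol = .Machine$double.eps^0.5) {
  abs(x - round(x)) < tol
}
all(is_wholenumber(c(1, 2, 3)))   # TRUE: the values are whole numbers
is.integer(c(1, 2, 3))            # FALSE: the storage mode is double
```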
Detect falsish values
Description
Detect falsish values
Usage
util_is_na_0_empty_or_false(x)
Arguments
x |
a value/vector of values |
Value
vector of logical values:
TRUE
, wherever x is somehow empty
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_count_expected_observations()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_observation_expected()
,
util_remove_empty_rows()
,
util_replace_codes_by_NA()
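A hypothetical sketch of such a "falsish" test, assuming NA, 0, "", and FALSE all count as somehow empty (the helper `is_falsish` is illustrative, not the package's implementation):

```r
## Element-wise test for "somehow empty" values.
is_falsish <- function(x) {
  vapply(x, function(v) {
    is.na(v) || identical(v, 0) || identical(v, 0L) ||
      identical(v, "") || isFALSE(v)
  }, logical(1))
}
is_falsish(list(NA, 0, "", FALSE, 1, "a"))
## TRUE TRUE TRUE TRUE FALSE FALSE
```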
Create a predicate function to check for certain numeric properties
Description
Useful, e.g., for util_expect_data_frame and util_expect_scalar. The
generated function returns only TRUE
or FALSE
, even if called with a
vector.
Usage
util_is_numeric_in(
min = -Inf,
max = +Inf,
whole_num = FALSE,
finite = FALSE,
set = NULL
)
Arguments
min |
if given, minimum for numeric values |
max |
if given, maximum for numeric values |
whole_num |
if TRUE, expect a whole number |
finite |
Are |
set |
if given, a set, the value must be in (see util_match_arg) |
Value
a function that checks an x
for the properties.
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Examples
## Not run:
util_is_numeric_in(min = 0)(42)
util_is_numeric_in(min = 43)(42)
util_is_numeric_in(max = 3)(42)
util_is_numeric_in(whole_num = TRUE)(42)
util_is_numeric_in(whole_num = TRUE)(42.1)
util_is_numeric_in(set = c(1, 3, 5))(1)
util_is_numeric_in(set = c(1, 3, 5))(2)
## End(Not run)
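The predicate-factory pattern behind this function can be sketched as a closure that captures the constraints and always yields a single TRUE or FALSE (`numeric_in` is an assumed simplification, not the package's code):

```r
## Sketch: build a numeric-property predicate from the given constraints.
numeric_in <- function(min = -Inf, max = Inf, whole_num = FALSE) {
  function(x) {
    ok <- is.numeric(x) && all(x >= min & x <= max)
    if (ok && whole_num)
      ok <- all(abs(x - round(x)) < .Machine$double.eps^0.5)
    isTRUE(ok)
  }
}
numeric_in(min = 0)(42)             # TRUE
numeric_in(whole_num = TRUE)(42.1)  # FALSE
```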
Detect un-disclosed ggplot
Description
Detect un-disclosed ggplot
Usage
util_is_svg_object(x)
Arguments
x |
the object to check |
Value
TRUE
or FALSE
Check if x
is a try-error
Description
Check if x
is a try-error
Usage
util_is_try_error(x)
Arguments
x |
Value
logical()
if it is a try-error
Check if x
contains valid missing codes
Description
Check if x
contains valid missing codes
Usage
util_is_valid_missing_codes(x)
Arguments
x |
a vector of values |
Value
TRUE
or FALSE
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_match_arg()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
Called by the active binding function for .manual
Description
Called by the active binding function for .manual
Usage
util_load_manual(
rebuild = FALSE,
target = "inst/manual.RData",
target2 = "inst/indicator_or_descriptor.RData",
man_hash = ""
)
Arguments
rebuild |
rebuild the cache |
target |
file for |
target2 |
file for |
man_hash |
internal use: hash-sum over the manual to prevent rebuild if not changed. |
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_make_data_slot_from_table_slot()
,
util_order_by_order()
,
util_set_size()
Check for repetitive values using the digits 8 or 9 only
Description
Values that are not finite (see is.finite
) are also reported as missing
codes. Also, all missing codes must be composed of the digits 8 and
9, and they must be the largest values of a variable.
Usage
util_looks_like_missing(x, n_rules = 1)
Arguments
x |
|
n_rules |
|
Value
logical
indicates for each value in x
, if it looks like a
missing code
See Also
Other metadata_management:
util_dist_selection()
,
util_find_free_missing_code()
,
util_find_var_by_meta()
,
util_get_var_att_names_of_level()
,
util_get_vars_in_segment()
,
util_no_value_labels()
,
util_validate_known_meta()
,
util_validate_missing_lists()
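A rough sketch of the 8/9-digit heuristic (the real function additionally requires candidates to be the largest values of the variable, which this simplified `looks_like_missing` omits):

```r
## Flag non-finite values and values written only with the digits 8 and 9
## (e.g. 99, 888, 9999) as possible undeclared missing codes.
looks_like_missing <- function(x) {
  chr <- sub("^-", "", format(x, trim = TRUE, scientific = FALSE))
  !is.finite(x) | grepl("^[89]+$", chr)
}
looks_like_missing(c(12, 99, 888, NA))
## FALSE TRUE TRUE TRUE
```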
Rename columns of a SummaryTable
(or Segment, ...) to look nice
Description
Rename columns of a SummaryTable
(or Segment, ...) to look nice
Usage
util_make_data_slot_from_table_slot(Table)
Arguments
Table |
data.frame, a table |
Value
renamed table
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_order_by_order()
,
util_set_size()
Maps label column metadata onto study data variable names
Description
Maps a certain label column from the metadata to the study data frame.
Usage
util_map_all(label_col = VAR_NAMES, study_data, meta_data)
Arguments
label_col |
the variable of the metadata that contains the variable names of the study data |
study_data |
the name of the data frame that contains the measurements |
meta_data |
the name of the data frame that contains metadata attributes of study data |
Value
list with slot df
with a study data frame with mapped column
names
See Also
Other mapping:
util_map_by_largest_prefix()
,
util_map_labels()
,
util_recode()
Map based on largest common prefix
Description
Map based on largest common prefix
Usage
util_map_by_largest_prefix(
needle,
haystack,
split_char = "_",
remove_var_suffix = TRUE
)
Arguments
needle |
character |
haystack |
character items to find the entry sharing the largest
prefix with |
split_char |
character |
remove_var_suffix |
logical |
Value
character(1)
with the fitting function name or NA_character_
See Also
Other mapping:
util_map_all()
,
util_map_labels()
,
util_recode()
Examples
## Not run: # internal function
util_map_by_largest_prefix(
"acc_distributions_loc_ecdf_observer_time",
names(dataquieR:::.manual$titles)
)
util_map_by_largest_prefix(
"acc_distributions_loc_observer_time",
names(dataquieR:::.manual$titles)
)
util_map_by_largest_prefix(
"acc_distributions_loc_ecdf",
names(dataquieR:::.manual$titles)
)
util_map_by_largest_prefix(
"acc_distributions_loc",
names(dataquieR:::.manual$titles)
)
## End(Not run)
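The prefix-matching idea can be re-implemented as a short sketch: split the needle at the split character and drop trailing parts until some haystack entry matches (an assumed simplification; the package's version also knows about variable suffixes):

```r
## Find the haystack entry sharing the largest "_"-separated prefix
## with the needle, or NA_character_ if none matches.
map_by_largest_prefix <- function(needle, haystack, split_char = "_") {
  parts <- strsplit(needle, split_char, fixed = TRUE)[[1]]
  for (k in rev(seq_along(parts))) {
    cand <- paste(parts[seq_len(k)], collapse = split_char)
    if (cand %in% haystack) return(cand)
  }
  NA_character_
}
map_by_largest_prefix("acc_distributions_loc_ecdf",
                      c("acc_distributions", "con_limits"))
## "acc_distributions"
```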
Support function to allocate labels to variables
Description
Map variables to certain attributes, e.g. by default their labels.
Usage
util_map_labels(
x,
meta_data = "item_level",
to = LABEL,
from = VAR_NAMES,
ifnotfound,
warn_ambiguous = FALSE
)
Arguments
x |
character variable names, character vector, see parameter from |
meta_data |
data.frame old name for |
to |
character variable attribute to map to |
from |
character variable identifier to map from |
ifnotfound |
list A list of values to be used if the item is not found: it will be coerced to a list if necessary. |
warn_ambiguous |
logical print a warning if mapping variables from
|
Details
This function basically calls colnames(study_data) <- meta_data$LABEL
,
ensuring correct merging/joining of study data columns to the corresponding
metadata rows, even if the orders differ. If a variable/study_data-column
name is not found in meta_data[[from]]
(default from = VAR_NAMES
),
either stop is called or, if ifnotfound
has been assigned a value, that
value is returned. See mget
, which is internally used by this function.
The function not only maps to the LABEL
column; to
can be any
metadata variable attribute, so the function can also be used to get, e.g.,
all HARD_LIMITS
from the metadata.
Value
a character vector with:
mapped values
See Also
Other mapping:
util_map_all()
,
util_map_by_largest_prefix()
,
util_recode()
Examples
## Not run:
meta_data <- prep_create_meta(
VAR_NAMES = c("ID", "SEX", "AGE", "DOE"),
LABEL = c("Pseudo-ID", "Gender", "Age", "Examination Date"),
DATA_TYPE = c(DATA_TYPES$INTEGER, DATA_TYPES$INTEGER, DATA_TYPES$INTEGER,
DATA_TYPES$DATETIME),
MISSING_LIST = ""
)
stopifnot(all(prep_map_labels(c("AGE", "DOE"), meta_data) == c("Age",
"Examination Date")))
## End(Not run)
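The mapping step itself can be sketched in base R: a named vector built from the metadata (VAR_NAMES to LABEL) serves as the lookup table. This sketch assumes only the two metadata columns shown; the package's function adds error handling via mget:

```r
## Base-R sketch of mapping variable names to labels.
meta <- data.frame(VAR_NAMES = c("ID", "SEX", "AGE"),
                   LABEL = c("Pseudo-ID", "Gender", "Age"))
lookup <- setNames(meta$LABEL, meta$VAR_NAMES)
unname(lookup[c("AGE", "ID")])
## "Age" "Pseudo-ID"
```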
Utility function to create a margins plot for binary variables
Description
Utility function to create a margins plot for binary variables
Usage
util_margins_bin(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
threshold_type = NULL,
threshold_value,
min_obs_in_subgroup = 5,
min_obs_in_cat = 5,
caption = NULL,
ds1,
label_col,
adjusted_hint = "",
title = "",
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default),
include_numbers_in_figures = getOption("dataquieR.acc_margins_num",
dataquieR.acc_margins_num_default)
)
Arguments
resp_vars |
variable the name of the binary measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
threshold_type |
enum empirical | user | none. See |
threshold_value |
numeric see |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
min_obs_in_cat |
integer This optional argument specifies the minimum
number of observations that is required to include
a category (level) of the outcome ( |
caption |
string a caption for the plot (optional, typically used to report the coding of cases and control group) |
ds1 |
data.frame the data frame that contains the measurements, after
replacing missing value codes by |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
adjusted_hint |
character hint, if adjusted for |
title |
character title for the plot |
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations (in the figure)? |
include_numbers_in_figures |
logical Should the figure report the number of observations for each level of the grouping variable? |
Value
A table and a matching plot.
Utility function to create a margins plot from linear regression models
Description
Utility function to create a margins plot from linear regression models
Usage
util_margins_lm(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
threshold_type = NULL,
threshold_value,
min_obs_in_subgroup = 5,
ds1,
label_col,
levels = NULL,
adjusted_hint = "",
title = "",
n_violin_max = getOption("dataquieR.max_group_var_levels_with_violins",
dataquieR.max_group_var_levels_with_violins_default),
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default),
include_numbers_in_figures = getOption("dataquieR.acc_margins_num",
dataquieR.acc_margins_num_default)
)
Arguments
resp_vars |
variable the name of the measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
threshold_type |
enum empirical | user | none. See |
threshold_value |
numeric see |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
ds1 |
data.frame the data frame that contains the measurements, after
replacing missing value codes by |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
levels |
|
adjusted_hint |
character hint, if adjusted for |
title |
character title for the plot |
n_violin_max |
integer from=0. This optional argument specifies
the maximum number of levels of the |
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations (in the figure)? |
include_numbers_in_figures |
logical Should the figure report the number of observations for each level of the grouping variable? |
Value
A table and a matching plot.
Utility function to create a plot similar to the margins plots for nominal variables
Description
This function is still under development. It uses the nnet
package to
compute multinomial logistic regression models.
Usage
util_margins_nom(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
min_obs_in_subgroup = 5,
min_obs_in_cat = 5,
ds1,
label_col,
adjusted_hint = "",
title = "",
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default)
)
Arguments
resp_vars |
variable the name of the nominal measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
min_obs_in_cat |
integer This optional argument specifies the minimum
number of observations that is required to include
a category (level) of the outcome ( |
ds1 |
data.frame the data frame that contains the measurements, after
replacing missing value codes by |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
adjusted_hint |
character hint, if adjusted for |
title |
character title for the plot |
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations (in the figure)? |
Value
A table and a matching plot.
Utility function to create a plot similar to the margins plots for ordinal variables
Description
This function is still under development. It uses the ordinal
package to
compute ordered regression models.
Usage
util_margins_ord(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
min_obs_in_subgroup = 5,
min_subgroups = 5,
ds1,
label_col,
adjusted_hint = "",
title = "",
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default)
)
Arguments
resp_vars |
variable the name of the ordinal measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
min_subgroups |
integer from=3. The model provided by the |
ds1 |
data.frame the data frame that contains the measurements, after
replacing missing value codes by |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
adjusted_hint |
character hint, if adjusted for |
title |
character title for the plot |
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations (in the figure)? |
Value
A table and a matching plot.
Utility function to create a margins plot from Poisson regression models
Description
Utility function to create a margins plot from Poisson regression models
Usage
util_margins_poi(
resp_vars = NULL,
group_vars = NULL,
co_vars = NULL,
threshold_type = NULL,
threshold_value,
min_obs_in_subgroup = 5,
ds1,
label_col,
adjusted_hint = "",
title = "",
sort_group_var_levels = getOption("dataquieR.acc_margins_sort",
dataquieR.acc_margins_sort_default),
include_numbers_in_figures = getOption("dataquieR.acc_margins_num",
dataquieR.acc_margins_num_default)
)
Arguments
resp_vars |
variable the name of the measurement variable |
group_vars |
variable the name of the observer, device or reader variable |
co_vars |
variable list a vector of covariables, e.g. age and sex for adjustment |
threshold_type |
enum empirical | user | none. See |
threshold_value |
numeric see |
min_obs_in_subgroup |
integer from=0. This optional argument specifies
the minimum number of observations that is required to
include a subgroup (level) of the |
ds1 |
data.frame the data frame that contains the measurements, after
replacing missing value codes by |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
adjusted_hint |
character hint, if adjusted for |
title |
character title for the plot |
sort_group_var_levels |
logical Should the levels of the grouping variable be sorted descending by the number of observations (in the figure)? |
include_numbers_in_figures |
logical Should the figure report the number of observations for each level of the grouping variable? |
Value
A table and a matching plot.
dataquieR
version of match.arg
Description
Unlike match.arg, this version does not support partial matching, but it will display the most likely match in a warning/error.
Usage
util_match_arg(arg, choices, several_ok = FALSE, error = TRUE)
Arguments
arg |
the argument |
choices |
the choices |
several_ok |
allow more than one entry in |
error |
|
Value
"cleaned" arg
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_observations_in_subgroups()
,
util_stop_if_not()
,
util_warn_unordered()
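As a sketch of the behavior described above (the helper name is hypothetical; the actual util_match_arg implementation differs), a strict variant of match.arg could reject partial matches while still suggesting the closest choice:

```r
# Sketch only: a strict match.arg that refuses partial matching but
# reports the closest choice via an edit-distance lookup.
strict_match_arg <- function(arg, choices) {
  if (arg %in% choices) return(arg)
  closest <- choices[which.min(utils::adist(arg, choices))]
  stop(sprintf("%s is not a valid choice; did you mean %s?",
               dQuote(arg), dQuote(closest)))
}
strict_match_arg("median", c("mean", "median"))  # "median"
strict_match_arg("med", c("mean", "median"))     # error with a suggestion
```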
Combine data frames by merging
Description
This is an extension of merge
working for a list of data frames.
Usage
util_merge_data_frame_list(data_frames, id_vars)
Arguments
data_frames |
list of data.frames |
id_vars |
character the variable(s) to merge the data frames by. Each of them must exist in all data frames. |
Value
data.frame combination of data frames
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_rbind()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
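Conceptually, merging a list of data frames by shared ID variables can be expressed with base R's Reduce() and merge(); util_merge_data_frame_list wraps this idea with additional checks:

```r
# Illustrative sketch of merging a list of data frames by an ID column.
dfs <- list(
  data.frame(id = 1:3, x = c("a", "b", "c")),
  data.frame(id = 2:4, y = c(10, 20, 30))
)
merged <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE), dfs)
merged  # ids 1..4, with NA where a frame had no matching row
```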
Produce a condition message with a useful short stack trace.
Description
Produce a condition message with a useful short stack trace.
Usage
util_message(
m,
...,
applicability_problem = NA,
intrinsic_applicability_problem = NA,
integrity_indicator = "none",
level = 0,
immediate,
title = "",
additional_classes = c()
)
Arguments
m |
a message or a condition |
... |
arguments for sprintf on m, if m is a character |
applicability_problem |
logical |
intrinsic_applicability_problem |
logical |
integrity_indicator |
character if the message concerns an integrity problem, this gives the indicator abbreviation. |
level |
integer level of the message (defaults to 0). Higher levels are more severe. |
immediate |
logical not used. |
additional_classes |
character additional classes the thrown condition object should inherit from, first. |
Value
condition the condition object, if the execution is not stopped
See Also
Other condition_functions:
util_condition_constructor_factory()
,
util_deparse1()
,
util_error()
,
util_find_external_functions_in_stacktrace()
,
util_find_first_externally_called_functions_in_stacktrace()
,
util_find_indicator_function_in_callers()
,
util_suppress_warnings()
,
util_warning()
Select really numeric variables
Description
Reduce resp_vars
to those, which are either float
or integer
without
VALUE_LABELS, i.e. likely numeric but not a factor
Usage
util_no_value_labels(resp_vars, meta_data, label_col, warn = TRUE, stop = TRUE)
Arguments
resp_vars |
variable list len=1-2. the name of the continuous measurement variable |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
warn |
logical warn about removed variable names |
stop |
logical stop on no matching |
Value
character vector of matching resp_vars.
See Also
Other metadata_management:
util_dist_selection()
,
util_find_free_missing_code()
,
util_find_var_by_meta()
,
util_get_var_att_names_of_level()
,
util_get_vars_in_segment()
,
util_looks_like_missing()
,
util_validate_known_meta()
,
util_validate_missing_lists()
Distribute CODE_LIST_TABLE
in item level metadata
Description
fills the columns MISSING_LIST_TABLE
and VALUE_LABEL_TABLE
from
CODE_LIST_TABLE
, if applicable
Usage
util_normalize_clt(meta_data)
Arguments
meta_data |
data.frame old name for |
Value
meta_data, but CODE_LIST_TABLE
column is distributed to the
columns VALUE_LABEL_TABLE
and MISSING_LIST_TABLE
, respectively.
Normalize and check cross-item-level metadata
Description
Normalize and check cross-item-level metadata
Usage
util_normalize_cross_item(
meta_data = "item_level",
meta_data_cross_item = "cross-item_level",
label_col = LABEL
)
Arguments
meta_data |
|
meta_data_cross_item |
|
label_col |
character label column to use for variable naming |
Value
normalized and checked cross-item-level metadata
See Also
Other meta_data_cross:
ASSOCIATION_DIRECTION
,
ASSOCIATION_FORM
,
ASSOCIATION_METRIC
,
ASSOCIATION_RANGE
,
CHECK_ID
,
CHECK_LABEL
,
CONTRADICTION_TERM
,
CONTRADICTION_TYPE
,
DATA_PREPARATION
,
GOLDSTANDARD
,
MULTIVARIATE_OUTLIER_CHECK
,
MULTIVARIATE_OUTLIER_CHECKTYPE
,
N_RULES
,
REL_VAL
,
VARIABLE_LIST
,
meta_data_cross
Convert VALUE_LABELS
to separate tables
Description
Convert VALUE_LABELS
to separate tables
Usage
util_normalize_value_labels(
meta_data = "item_level",
max_value_label_len = getOption("dataquieR.MAX_VALUE_LABEL_LEN",
dataquieR.MAX_VALUE_LABEL_LEN_default)
)
Arguments
meta_data |
data.frame old name for |
max_value_label_len |
integer maximum length for value labels |
Value
data.frame metadata with VALUE_LABEL_TABLE
instead of
VALUE_LABELS
(or none of these, if absent)
Examples
## Not run:
prep_purge_data_frame_cache()
prep_load_workbook_like_file("meta_data_v2")
util_normalize_value_labels()
prep_add_data_frames(test_labs =
tibble::tribble(~ CODE_VALUE, ~ CODE_LABEL, 17L, "Test", 19L, "Test",
17L, "TestX"))
il <- prep_get_data_frame("item_level")
if (!VALUE_LABEL_TABLE %in% colnames(il)) {
il$VALUE_LABEL_TABLE <- NA_character_
}
il$VALUE_LABEL_TABLE[[1]] <- "test_labs"
il$VALUE_LABELS[[1]] <- "17 = TestY"
prep_add_data_frames(item_level = il)
util_normalize_value_labels()
## End(Not run)
Detect Expected Observations
Description
For each participant, check whether an observation was expected, given the
PART_VARS
from item-level metadata
Usage
util_observation_expected(
rv,
study_data,
meta_data,
label_col = LABEL,
expected_observations = c("HIERARCHY", "ALL", "SEGMENT")
)
Arguments
rv |
character the response variable, for that a value may be expected |
study_data |
|
meta_data |
|
label_col |
character mapping attribute |
expected_observations |
enum HIERARCHY | ALL | SEGMENT. How should
|
Value
a vector with TRUE
or FALSE
for each row of study_data
, if for
study_data[rv]
a value is expected.
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_count_expected_observations()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_remove_empty_rows()
,
util_replace_codes_by_NA()
Utility function observations in subgroups
Description
This function uses !is.na
to count the non-missing observations in subgroups of
the data and in a set of user-defined response variables. Some applications
require the number of observations per subgroup (e.g., per factor level) to
exceed a user-defined minimum.
Usage
util_observations_in_subgroups(x, rvs)
Arguments
x |
data frame |
rvs |
variable names |
Value
matrix of flags
See Also
Other robustness_functions:
util_as_valid_missing_codes()
,
util_check_one_unique_value()
,
util_correct_variable_use()
,
util_empty()
,
util_ensure_character()
,
util_ensure_in()
,
util_ensure_suggested()
,
util_expect_scalar()
,
util_fix_rstudio_bugs()
,
util_is_integer()
,
util_is_numeric_in()
,
util_is_valid_missing_codes()
,
util_match_arg()
,
util_stop_if_not()
,
util_warn_unordered()
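The counting described above can be illustrated in base R (a simplified stand-in, not the actual implementation):

```r
# Count non-missing observations per subgroup with !is.na(), then flag
# subgroups that reach a required minimum number of observations.
min_obs <- 30
counts <- tapply(!is.na(iris$Sepal.Length), iris$Species, sum)
counts >= min_obs  # TRUE for each species (50 complete observations each)
```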
Creates a Link to our Website
Description
i.e., to a vignette on the website
Usage
util_online_ref(fkt_name)
Arguments
fkt_name |
character function name to generate a link for |
Value
character the link
Utility function to compute and optimize bin breaks for histograms
Description
Utility function to compute and optimize bin breaks for histograms
Usage
util_optimize_histogram_bins(
x,
interval_freedman_diaconis = NULL,
nbins_max = 100,
cuts = NULL
)
Arguments
x |
a vector of data values (numeric or datetime) |
interval_freedman_diaconis |
range of values which should be included to
calculate the Freedman-Diaconis bandwidth (e.g., for
|
nbins_max |
the maximum number of bins for the histogram. Strong
outliers can cause too many narrow bins, which might be
even too narrow to be plotted. This also results in large
files and rendering problems. So it is sensible to limit
the number of bins. The function will produce a message if
it reduces the number of bins in such a case. Reasons
could be unspecified missing value codes, minimum or
maximum values far away from most of the data values, a small
number of unique values, or (for |
cuts |
a vector of values at which breaks between bins should occur |
Value
a list with bin breaks, if needed separated for each segment of the plot
See Also
Other figure_functions:
util_heatmap_1th()
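The Freedman-Diaconis bandwidth mentioned above can be sketched in base R (simplified; the actual function also handles datetimes, user-supplied cuts, and further edge cases):

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3), with the resulting
# number of bins capped at nbins_max.
fd_breaks <- function(x, nbins_max = 100) {
  x <- x[is.finite(x)]
  bw <- 2 * stats::IQR(x) / length(x)^(1/3)
  n_bins <- min(nbins_max, max(1, ceiling(diff(range(x)) / bw)))
  seq(min(x), max(x), length.out = n_bins + 1)
}
hist(faithful$eruptions, breaks = fd_breaks(faithful$eruptions))
```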
Utility function to distribute points across a time variable
Description
Utility function to distribute points across a time variable
Usage
util_optimize_sequence_across_time_var(
time_var_data,
n_points,
prop_grid = 0.5
)
Arguments
time_var_data |
vector of the data points of the time variable |
n_points |
maximum number of points to distribute across the time variable (minimum: 3) |
prop_grid |
proportion of points given in |
Value
a sequence of points in datetime format
Get the order of a vector with general order given in some other vector
Description
Get the order of a vector with general order given in some other vector
Usage
util_order_by_order(x, order, ...)
Arguments
x |
the vector |
order |
the "order vector |
... |
additional arguments passed to |
See Also
Other reporting_functions:
util_alias2caption()
,
util_copy_all_deps()
,
util_create_page_file()
,
util_eval_to_dataquieR_result()
,
util_evaluate_calls()
,
util_float_index_menu()
,
util_generate_anchor_link()
,
util_generate_anchor_tag()
,
util_generate_calls()
,
util_generate_calls_for_function()
,
util_load_manual()
,
util_make_data_slot_from_table_slot()
,
util_set_size()
Examples
## Not run:
util_order_by_order(c("a", "b", "a", "c", "d"), letters)
## End(Not run)
Utility function parallel version of purrr::pmap
Description
Parallel version of purrr::pmap
.
Usage
util_par_pmap(
.l,
.f,
...,
cores = list(mode = "socket", cpus = util_detect_cores(), logging = FALSE,
load.balancing = TRUE),
use_cache = FALSE
)
Arguments
.l |
data.frame with one call per line and one function argument per column |
.f |
|
... |
additional, static arguments for calling |
cores |
number of cpu cores to use or a (named) list with arguments for parallelMap::parallelStart or NULL, if parallel has already been started by the caller. |
use_cache |
logical set to FALSE to omit re-using already distributed study- and metadata on a parallel cluster |
Value
list of results of the function calls
Author(s)
S Struckmann
See Also
purrr::pmap
Other process_functions:
util_abbreviate()
,
util_all_is_integer()
,
util_attach_attr()
,
util_bQuote()
,
util_backtickQuote()
,
util_coord_flip()
,
util_extract_matches()
,
util_setup_rstudio_job()
,
util_suppress_output()
Utility function to parse assignments
Description
This function parses labels & level assignments in the format
1 = male | 2 = female
. The function also handles m = male | f = female
,
but this would not match the metadata concept. The split-character can
be given if the default SPLIT_CHAR should not be used, but this
would also violate the metadata concept.
Usage
util_parse_assignments(
text,
split_char = SPLIT_CHAR,
multi_variate_text = FALSE,
split_on_any_split_char = FALSE
)
Arguments
text |
Text to be parsed |
split_char |
Character separating assignments, may be a vector, then
all will be tried and the most likely matching one will
be returned as attribute |
multi_variate_text |
don't paste text but parse element-wise |
split_on_any_split_char |
split on any split |
Value
the parsed assignments as a named list
See Also
Other parser_functions:
util_interpret_limits()
,
util_parse_interval()
,
util_parse_redcap_rule()
Examples
## Not run:
md <- prep_get_data_frame("meta_data")
vl <- md$VALUE_LABELS
vl[[50]] <- "low<medium < high"
a <- util_parse_assignments(vl, split_char = c(SPLIT_CHAR, "<"),
multi_variate_text = TRUE)
b <- util_parse_assignments(vl, split_char = c(SPLIT_CHAR, "<"),
split_on_any_split_char = TRUE, multi_variate_text = TRUE)
is_ordered <- vapply(a, attr, "split_char", FUN.VALUE = character(1)) == "<"
md$VALUE_LABELS[[50]] <- "low<medium < high"
md$VALUE_LABELS[[51]] <- "1 = low< 2=medium < 3=high"
md$VALUE_LABELS[[49]] <- "2 = medium< 1=low < 3=high" # counter intuitive
with_sl <- prep_scalelevel_from_data_and_metadata(study_data = "study_data",
meta_data = md)
View(with_sl[, union(SCALE_LEVEL, colnames(with_sl))])
## End(Not run)
Utility function to parse intervals
Description
Utility function to parse intervals
Usage
util_parse_interval(int)
Arguments
int |
an interval as string, e.g., "[0;Inf)" |
Value
the parsed interval with elements inc_l
(Is the lower limit
included?), low
(the value of the lower limit), inc_u
(Is the upper
limit included?), upp
(the value of the upper limit)
See Also
Other parser_functions:
util_interpret_limits()
,
util_parse_assignments()
,
util_parse_redcap_rule()
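A simplified stand-in for the parsing described above (the real util_parse_interval is more thorough about validation; the helper name here is hypothetical):

```r
# Parse an interval string like "[0;Inf)" into its four components.
parse_interval_sketch <- function(int) {
  list(
    inc_l = startsWith(int, "["),              # "[" includes the lower limit
    low   = as.numeric(sub("^[][()]?([^;]*);.*$", "\\1", int)),
    inc_u = endsWith(int, "]"),                # "]" includes the upper limit
    upp   = as.numeric(sub("^.*;([^])]*)[])]$", "\\1", int))
  )
}
str(parse_interval_sketch("[0;Inf)"))
# inc_l TRUE, low 0, inc_u FALSE, upp Inf
```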
Interpret a REDcap
-style rule and create an expression, that represents this rule
Description
Interpret a REDcap
-style rule and create an expression, that represents this rule
Usage
util_parse_redcap_rule(
rule,
debug = 0,
entry_pred = "REDcapPred",
must_eof = FALSE
)
Arguments
rule |
character |
debug |
integer debug level (0 = off, 1 = log, 2 = breakpoints) |
entry_pred |
character for debugging reasons: The production rule used entry point for the parser |
must_eof |
logical if |
Value
expression the interpreted rule
REDcap rules 1, REDcap rules 2, REDcap rules 3
For resolving left-recursive rules, StackOverflow helps with understanding the grammar below, in case theoretical computer science is not fresh in your mind.
See Also
Other parser_functions:
util_interpret_limits()
,
util_parse_assignments()
,
util_parse_interval()
Examples
## Not run:
# rules:
# pregnancies <- 9999 ~ SEX == 'm' | is.na(SEX)
# pregnancies <- 9998 ~ AGE < 12 | is.na(AGE)
# pregnancies = 9999 ~ dist > 2 | speed == 0
data.frame(target = "SEX_0",
rule = '[speed] > 5 and [dist] > 42 or 1 = "2"',
CODE = 99999, LABEL = "PREGNANCIES_NOT_ASSESSED FOR MALES",
class = "JUMP")
# ModifiedStudyData <- replace in SEX_0 where SEX_0 is empty, if rule fits
# ModifiedMetaData <- add missing codes with labels and class here
subset(study_data, eval(pregnancies[[3]]))
rule <-
paste0('[con_consentdt] <> "" and [sda_osd1dt] <> "" and',
' datediff([con_consentdt],[sda_osd1dt],"d",true) < 0')
x <- data.frame(con_consentdt = c(as.POSIXct("2020-01-01"),
as.POSIXct("2020-10-20")),
sda_osd1dt = c(as.POSIXct("2020-01-20"),
as.POSIXct("2020-10-01")))
eval(util_parse_redcap_rule(paste0(
'[con_consentdt] <> "" and [sda_osd1dt] <> "" and ',
'datediff([con_consentdt],[sda_osd1dt],"d", "Y-M-D",true) < 10')),
x, util_get_redcap_rule_env())
util_parse_redcap_rule("[a] = 12 or [b] = 13")
cars[eval(util_parse_redcap_rule(
rule = '[speed] > 5 and [dist] > 42 or 1 = "2"'), cars,
util_get_redcap_rule_env()), ]
cars[eval(util_parse_redcap_rule(
rule = '[speed] > 5 and [dist] > 42 or 2 = "2"'), cars,
util_get_redcap_rule_env()), ]
cars[eval(util_parse_redcap_rule(
rule = '[speed] > 5 or [dist] > 42 and 1 = "2"'), cars,
util_get_redcap_rule_env()), ]
cars[eval(util_parse_redcap_rule(
rule = '[speed] > 5 or [dist] > 42 and 2 = "2"'), cars,
util_get_redcap_rule_env()), ]
util_parse_redcap_rule(rule = '(1 = "2" or true) and (false)')
eval(util_parse_redcap_rule(rule =
'[dist] > sum(1, +(2, [dist] + 5), [speed]) + 3 + [dist]'),
cars, util_get_redcap_rule_env())
## End(Not run)
Paste strings but keep NA (paste0
)
Description
Paste strings but keep NA (paste0
)
Usage
util_paste0_with_na(...)
Arguments
... |
other arguments passed to |
Value
character pasted strings
Paste strings but keep NA
Description
Paste strings but keep NA
Usage
util_paste_with_na(...)
Arguments
... |
other arguments passed to |
Value
character pasted strings
Plot to un-disclosed ggplot
object
Description
Plot to un-disclosed ggplot
object
Usage
util_plot2svg_object(expr, w = 21.2, h = 15.9, sizing_hints)
Arguments
expr |
plot expression |
w |
width in cm |
h |
height in cm |
Value
ggplot
object, but rendered (no original data included)
Utility function to create plots for categorical variables
Description
Depending on the required level of complexity, this helper function creates various plots for categorical variables. Next to basic bar plots, it also enables group comparisons (for example for device/examiner effects) and longitudinal views.
Usage
util_plot_categorical_vars(
resp_vars,
group_vars = NULL,
time_vars = NULL,
study_data,
meta_data,
n_cat_max = 6,
n_group_max = getOption("dataquieR.max_group_var_levels_in_plot", 20),
n_data_min = 20
)
Arguments
resp_vars |
name of the categorical variable |
group_vars |
name of the grouping variable |
time_vars |
name of the time variable |
study_data |
the data frame that contains the measurements |
meta_data |
the data frame that contains metadata attributes of study data |
n_cat_max |
maximum number of categories to be displayed individually
for the categorical variable ( |
n_group_max |
maximum number of categories to be displayed individually
for the grouping variable ( |
n_data_min |
minimum number of data points to create a time course plot
for an individual category of the |
Value
a figure
Plot a ggplot2
figure without plotly
Description
Plot a ggplot2
figure without plotly
Usage
util_plot_figure_no_plotly(x, sizing_hints = NULL)
Arguments
x |
ggplot2::ggplot2 object |
sizing_hints |
|
Value
htmltools
compatible object
Plot a ggplot2
figure using plotly
Description
Plot a ggplot2
figure using plotly
Usage
util_plot_figure_plotly(x, sizing_hints = NULL)
Arguments
x |
ggplot2::ggplot2 object |
sizing_hints |
|
Value
htmltools
compatible object
Replacement for htmltools::plotTag
Description
the function is specifically designed for fully scalable SVG
figures.
Usage
util_plot_svg_to_uri(expr, w = 800, h = 600)
Arguments
expr |
plot expression |
w |
width |
h |
height |
Value
htmltools
compatible object
Plotly
to un-disclosed ggplot
object
Description
Plotly
to un-disclosed ggplot
object
Usage
util_plotly2svg_object(plotly, w = 21.2, h = 15.9, sizing_hints)
Arguments
plotly |
the object |
w |
width in cm |
h |
height in cm |
Value
ggplot
object, but rendered (no original data included)
Utility function to prepare the metadata for location checks
Description
Utility function to prepare the metadata for location checks
Usage
util_prep_location_check(
resp_vars,
meta_data,
report_problems = c("error", "warning", "message"),
label_col = VAR_NAMES
)
Arguments
resp_vars |
variable list the names of the measurement variables |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
report_problems |
enum Should missing metadata information be reported as error, warning or message? |
Value
a list with the location metric (mean or median) and expected range for the location check
See Also
Other lookup_functions:
util_prep_proportion_check()
,
util_variable_references()
Utility function to prepare the metadata for proportion checks
Description
Utility function to prepare the metadata for proportion checks
Usage
util_prep_proportion_check(
resp_vars,
meta_data,
ds1,
report_problems = c("error", "warning", "message"),
label_col = attr(ds1, "label_col")
)
Arguments
resp_vars |
variable list the names of the measurement variables |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
ds1 |
data.frame the data frame that contains the measurements
(hint: missing value codes should be excluded,
so the function should be called with |
report_problems |
enum Should missing metadata information be reported as error, warning or message? |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
Value
a list with the expected range for the proportion check
See Also
Other lookup_functions:
util_prep_location_check()
,
util_variable_references()
Convert single dataquieR
result to an htmltools
compatible object
Description
Convert single dataquieR
result to an htmltools
compatible object
Usage
util_pretty_print(
dqr,
nm,
is_single_var,
meta_data,
label_col,
use_plot_ly,
dir,
...
)
Arguments
dqr |
dataquieR_result an output (indicator) from |
nm |
character the name used in the report, the alias name of the function call plus the variable name |
is_single_var |
logical we are creating a single variable overview page or an indicator summary page |
meta_data |
meta_data the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
use_plot_ly |
logical use |
dir |
character output directory for potential |
... |
further arguments passed through, if applicable |
Value
htmltools
compatible object with rendered dqr
Prepare a vector for output
Description
Prepare a vector for output
Usage
util_pretty_vector_string(v, quote = dQuote, n_max = length(v))
Arguments
v |
the vector |
quote |
function, used for quoting – |
n_max |
maximum number of elements of |
Value
the "pretty" collapsed vector as a string.
See Also
Other string_functions:
util_abbreviate_unique()
,
util_filter_names_by_regexps()
,
util_set_dQuoteString()
,
util_set_sQuoteString()
,
util_sub_string_left_from_.()
,
util_sub_string_right_from_.()
,
util_translate()
Bind data frames row-based
Description
if not all data frames share all columns, missing columns will be filled with
NA
s.
Usage
util_rbind(..., data_frames_list = list())
Arguments
... |
data.frame zero or more data frames |
data_frames_list |
list optional, a list of data frames |
Value
data.frame all data frames appended
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_remove_na_records()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Examples
## Not run:
util_rbind(head(cars), head(iris))
util_rbind(head(cars), tail(cars))
util_rbind(head(cars)[, "dist", FALSE], tail(cars)[, "speed", FALSE])
## End(Not run)
Can we really be sure that we are running RStudio
Description
JetBrains' IDEA and other IDEs pretend to be RStudio by putting "RStudio" into
.Platform$GUI
.
Usage
util_really_rstudio()
Value
TRUE
, if we can really be sure to be running RStudio,
FALSE
otherwise.
Map a vector of values based on an assignment table
Description
Map a vector of values based on an assignment table
Usage
util_recode(values, mapping_table, from, to, default = NULL)
Arguments
values |
vector the vector |
mapping_table |
data.frame a table with the mapping table |
from |
character the name of the column with the "old values" |
to |
character the name of the column with the "new values" |
default |
character either one character or one character per value,
used if an entry from |
Value
the mapped values
See Also
Other mapping:
util_map_all()
,
util_map_by_largest_prefix()
,
util_map_labels()
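The mapping above is conceptually a match()-based lookup; a minimal base-R sketch:

```r
# Look up each value in the "from" column and return the matching "to"
# entry; values without a mapping become NA (or the default, in the real
# function).
mapping <- data.frame(code = c(0, 1), label = c("female", "male"))
values  <- c(1, 0, 1, 2)
mapping$label[match(values, mapping$code)]
# "male" "female" "male" NA
```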
For a group of variables (original) the function provides all original plus referred variables in the metadata and a new item_level metadata including information on the original variables and the referred variables
Description
For a group of variables (original) the function provides all original plus referred variables in the metadata and a new item_level metadata including information on the original variables and the referred variables
Usage
util_referred_vars(
resp_vars,
id_vars = character(0),
vars_in_subgroup = character(0),
meta_data,
meta_data_segment = NULL,
meta_data_dataframe = NULL,
meta_data_cross_item = NULL,
meta_data_item_computation = NULL,
strata_column = NULL
)
Arguments
resp_vars |
variable list the name of the original variables. |
id_vars |
variable a vector containing the name/s of the variables containing ids |
vars_in_subgroup |
variable a vector containing the name/s of the variable/s mentioned inside the subgroup rule |
meta_data |
data.frame old name for |
meta_data_segment |
data.frame – optional: Segment level metadata |
meta_data_dataframe |
data.frame – optional if |
meta_data_cross_item |
data.frame – optional: Cross-item level metadata |
meta_data_item_computation |
data.frame – optional: Computed items metadata |
strata_column |
variable name of a study variable used to stratify the report by and to add as referred variable |
Value
a named list containing the referred variables and a new item_level metadata including information on the original variables and the referred variables
removes empty rows from x
Description
removes empty rows from x
Usage
util_remove_empty_rows(x, id_vars = character(0))
Arguments
x |
data.frame a data frame to be cleaned |
id_vars |
character column names that will be treated as empty |
Value
data.frame reduced x
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_count_expected_observations()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_observation_expected()
,
util_replace_codes_by_NA()
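A base-R sketch of the row removal (simplified; here the id_vars are merely excluded when deciding emptiness):

```r
# Drop rows where every non-ID column is NA.
x <- data.frame(id = 1:3, a = c(1, NA, 3), b = c("x", NA, "z"))
check_cols <- setdiff(names(x), "id")
x[rowSums(!is.na(x[check_cols])) > 0, ]
# keeps rows 1 and 3
```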
remove all records that have at least one NA
in any of the given variables
Description
remove all records that have at least one NA
in any of the given variables
Usage
util_remove_na_records(study_data, vars = colnames(study_data))
Arguments
study_data |
the study data frame |
vars |
the variables being checked for |
Value
modified study_data data frame
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_replace_hard_limit_violations()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Examples
## Not run:
dta <- iris
dim(util_remove_na_records(dta))
dta$Species[4:6] <- NA
dim(util_remove_na_records(dta))
dim(util_remove_na_records(dta, c("Sepal.Length", "Petal.Length")))
## End(Not run)
Render a table summarizing dataquieR results
Description
Render a table summarizing dataquieR results
Usage
util_render_table_dataquieR_summary(
x,
grouped_by = c("call_names", "indicator_metric"),
folder_of_report = NULL,
var_uniquenames = NULL
)
Arguments
x |
a report summary ( |
grouped_by |
define the columns of the resulting matrix. It can be either
"call_names", one column per function, or "indicator_metric",
one column per indicator or both
|
folder_of_report |
a named vector with the location of variable and call_names |
var_uniquenames |
a data frame with the original variable names and their unique names, for reports created with dq_report_by that contain the same variable in several reports (e.g., reports stratified by sex) |
Value
something, htmltools
can render
Utility function to replace missing codes by NA
s
Description
Substitute all missing codes in a data.frame by NA
.
Usage
util_replace_codes_by_NA(
study_data,
meta_data = "item_level",
split_char = SPLIT_CHAR,
sm_code = NULL
)
Arguments
study_data |
Study data including jump/missing codes as specified in the code conventions |
meta_data |
Metadata as specified in the code conventions |
split_char |
Character separating missing codes |
sm_code |
missing code for Codes are expected to be numeric. |
Value
a list with a modified data frame and some counts
See Also
Other missing_functions:
util_all_intro_vars_for_rv()
,
util_count_expected_observations()
,
util_filter_missing_list_table_for_rv()
,
util_get_code_list()
,
util_is_na_0_empty_or_false()
,
util_observation_expected()
,
util_remove_empty_rows()
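The substitution itself can be illustrated with a minimal sketch (the packaged function additionally reads the codes from the metadata and returns counts):

```r
# Replace jump/missing codes in a study variable by NA.
study <- data.frame(age = c(42, 9999, 35, 8888))
missing_codes <- c(9999, 8888)  # assumed numeric codes from the metadata
study$age[study$age %in% missing_codes] <- NA
study$age  # 42 NA 35 NA
```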
Replace limit violations (HARD_LIMITS) by NAs
Description
Replace limit violations (HARD_LIMITS) by NAs
Usage
util_replace_hard_limit_violations(study_data, meta_data, label_col)
Arguments
study_data |
|
meta_data |
|
label_col |
variable attribute the name of the column in the metadata with labels of variables |
Value
modified study_data
See Also
Other data_management:
util_assign_levlabs()
,
util_check_data_type()
,
util_check_group_levels()
,
util_compare_meta_with_study()
,
util_dichotomize()
,
util_fix_merge_dups()
,
util_merge_data_frame_list()
,
util_rbind()
,
util_remove_na_records()
,
util_round_to_decimal_places()
,
util_study_var2factor()
,
util_table_of_vct()
Import a data frame
Description
see rio::import
, but with argument keep_types
and modified error
handling.
Usage
util_rio_import(fn, keep_types, ...)
Arguments
fn |
the file name to load. |
keep_types |
logical keep types as possibly defined in the file.
set |
... |
additional arguments for rio::import |
Value
data.frame as in rio::import
Import list of data frames
Description
see rio::import_list
, but with argument keep_types
and modified error
handling.
Usage
util_rio_import_list(fn, keep_types, ...)
Arguments
fn |
the file name to load. |
keep_types |
logical keep types as possibly defined in the file.
set |
... |
additional arguments for rio::import_list |
Value
list as in rio::import_list
Round values to 3 decimal places if all values lie between 0.001 and 9999.999; otherwise (if at least one value of the vector is outside these limits), use scientific notation for all the values in the vector
Description
Round values to 3 decimal places if all values lie between 0.001 and 9999.999; otherwise (if at least one value of the vector is outside these limits), use scientific notation for all the values in the vector
Usage
util_round_to_decimal_places(x, digits = 3)
Arguments
x |
a numeric vector to be rounded |
digits |
a numeric value indicating the number of desired decimal places |
See Also
Other data_management: util_assign_levlabs(), util_check_data_type(), util_check_group_levels(), util_compare_meta_with_study(), util_dichotomize(), util_fix_merge_dups(), util_merge_data_frame_list(), util_rbind(), util_remove_na_records(), util_replace_hard_limit_violations(), util_study_var2factor(), util_table_of_vct()
Utility function to put strings in quotes
Description
This function wraps each element of the character vector in double quotes
Usage
util_set_dQuoteString(string)
Arguments
string |
Character vector |
Value
quoted string
See Also
Other string_functions: util_abbreviate_unique(), util_filter_names_by_regexps(), util_pretty_vector_string(), util_set_sQuoteString(), util_sub_string_left_from_.(), util_sub_string_right_from_.(), util_translate()
Utility function to put strings in single quotes
Description
This function wraps each element of the character vector in single quotes.
Usage
util_set_sQuoteString(string)
Arguments
string |
Character vector |
Value
quoted string
See Also
Other string_functions: util_abbreviate_unique(), util_filter_names_by_regexps(), util_pretty_vector_string(), util_set_dQuoteString(), util_sub_string_left_from_.(), util_sub_string_right_from_.(), util_translate()
Attaches attributes about the recommended minimum absolute sizes to the plot p
Description
Attaches attributes about the recommended minimum absolute sizes to the plot p
Usage
util_set_size(p, width_em = NA_integer_, height_em = NA_integer_)
Arguments
p |
ggplot2::ggplot the plot |
width_em |
numeric len=1. the minimum width hint in |
height_em |
numeric len=1. the minimum height in |
Value
p the modified plot
See Also
Other reporting_functions: util_alias2caption(), util_copy_all_deps(), util_create_page_file(), util_eval_to_dataquieR_result(), util_evaluate_calls(), util_float_index_menu(), util_generate_anchor_link(), util_generate_anchor_tag(), util_generate_calls(), util_generate_calls_for_function(), util_load_manual(), util_make_data_slot_from_table_slot(), util_order_by_order()
Set up an RStudio job
Description
Also defines a progress function and a progress_msg function in the caller's environment.
Usage
util_setup_rstudio_job(job_name = "Job")
Arguments
job_name |
a name for the job |
Details
In RStudio, its job system will be used; for shiny::withProgress based calls, this will require min and max being set to 0 and 1 (the defaults). If cli is available, it will be used; in all other cases, just messages will be created.
Value
list: the progress function and the progress_msg function
See Also
Other process_functions: util_abbreviate(), util_all_is_integer(), util_attach_attr(), util_bQuote(), util_backtickQuote(), util_coord_flip(), util_extract_matches(), util_par_pmap(), util_suppress_output()
Examples
## Not run:
test <- function() {
util_setup_rstudio_job("xx")
Sys.sleep(5)
progress(50)
progress_msg("halfway through")
Sys.sleep(5)
progress(100)
Sys.sleep(1)
}
test()
## End(Not run)
Utility function for outliers according to the rule of Huber et al.
Description
This function calculates outliers according to the rule of Huber et al.
Usage
util_sigmagap(x)
Arguments
x |
numeric data to check for outliers |
Value
binary vector
See Also
Other outlier_functions: util_3SD(), util_hubert(), util_tukey()
Sort a vector by order given in some other vector
Description
Sort a vector by order given in some other vector
Usage
util_sort_by_order(x, order, ...)
Arguments
x |
the vector |
order |
the "order" vector |
... |
additional arguments passed to |
See Also
Other summary_functions: prep_combine_report_summaries(), prep_extract_classes_by_functions(), prep_extract_summary(), prep_extract_summary.dataquieR_result(), prep_extract_summary.dataquieR_resultset2(), prep_render_pie_chart_from_summaryclasses_ggplot2(), prep_render_pie_chart_from_summaryclasses_plotly(), prep_summary_to_classes(), util_as_cat(), util_as_integer_cat(), util_extract_indicator_metrics(), util_get_category_for_result(), util_get_colors(), util_get_labels_grading_class(), util_get_message_for_result(), util_get_rule_sets(), util_get_ruleset_formats(), util_get_thresholds(), util_html_table()
Examples
## Not run:
util_sort_by_order(c("a", "b", "a", "c", "d"), letters)
## End(Not run)
Split table with mixed code/missing lists to single tables
Description
Resulting tables are populated into the data frame cache.
Usage
util_split_val_tab(val_tab = CODE_LIST_TABLE)
Arguments
val_tab |
data.frame tables in one long data frame. |
Value
invisible(NULL)
Compute something comparable from an ordinal variable
Description
interpolates categories of an ordinal variable
Usage
util_standardise_ordinal_codes(codes, maxlevel_old, maxlevel_new)
Arguments
codes |
the codes of the ordinal variable to convert |
maxlevel_old |
the highest category code of the original scale |
maxlevel_new |
the highest category code of the target scale |
Value
integer() of n values in {1, ..., maxlevel_new}
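A plausible reading of this interpolation is a linear rescaling of the code range. The following Python sketch is an assumption about the mapping, not dataquieR's verified algorithm, and it ignores edge cases such as maxlevel_old == 1:

```python
def standardise_ordinal_codes(codes, maxlevel_old, maxlevel_new):
    # Linearly map codes from {1, ..., maxlevel_old} onto {1, ..., maxlevel_new}.
    # Assumes maxlevel_old > 1; ties are resolved by Python's round().
    return [round((c - 1) * (maxlevel_new - 1) / (maxlevel_old - 1)) + 1
            for c in codes]
```

For example, rescaling a 5-point scale to a 3-point scale maps the endpoints to the endpoints and the midpoint to the midpoint.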
String check for results/combined results
Description
Detect if x starts with <prefix>. or equals <prefix>, if results have been combined
Usage
util_startsWith_prefix._or_equals_prefix(x, prefix, sep = ".")
Arguments
x |
character haystack |
prefix |
character needle |
sep |
character separation string |
Value
logical whether entries in x start with prefix followed by a dot, or equal prefix
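The described semantics can be sketched in Python (illustrative only, not the R source):

```python
def starts_with_prefix_dot_or_equals(x, prefix, sep="."):
    # True where an entry equals `prefix` or starts with `prefix` + sep,
    # e.g. results renamed to "<prefix>.<suffix>" after combining reports.
    return [s == prefix or s.startswith(prefix + sep) for s in x]
```

Note that a bare common prefix without the separator does not match, which distinguishes combined-result names from merely similar names.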
Verify assumptions made by the code, that must be TRUE
Description
Verify assumptions made by the code, that must be TRUE
Usage
util_stop_if_not(..., label, label_only)
Arguments
... |
see |
label |
character a label for the assumptions, can be missing |
label_only |
logical if |
Value
invisible(FALSE), if not stopped.
See Also
Other robustness_functions: util_as_valid_missing_codes(), util_check_one_unique_value(), util_correct_variable_use(), util_empty(), util_ensure_character(), util_ensure_in(), util_ensure_suggested(), util_expect_scalar(), util_fix_rstudio_bugs(), util_is_integer(), util_is_numeric_in(), util_is_valid_missing_codes(), util_match_arg(), util_observations_in_subgroups(), util_warn_unordered()
Create a storr object with a storr_factory attribute
Description
also does basic validity checks
Usage
util_storr_factory(my_storr_object, my_storr_factory)
Arguments
my_storr_object |
a |
my_storr_factory |
a function creating the/a |
Value
storr-object with the factory attribute and (hopefully) valid.
Create a storr-object using the factory
Description
also performs checks.
Usage
util_storr_object(
my_storr_factory = function() {
storr::storr_environment()
}
)
Arguments
my_storr_factory |
a function returning a |
Value
a storr object
Utility function for judging whether a character vector does not appear to be a categorical variable
Description
The function considers the following properties:
- the maximum number of characters (to identify free text fields with long entries),
- the relative frequency of punctuation and space characters per element (to identify, e.g., JSON or XML elements, which are structured by those characters),
- the relative frequency of elements (categorical variables would have a low proportion of unique values in comparison to other variables).
Usage
util_string_is_not_categorical(vec)
Arguments
vec |
a character vector |
Value
TRUE or FALSE
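The three properties can be combined as in this Python sketch; the thresholds (max_chars, punct_prop, uniq_prop) are illustrative assumptions, not dataquieR's actual defaults:

```python
import string

def string_is_not_categorical(vec, max_chars=50, punct_prop=0.25, uniq_prop=0.5):
    # Thresholds are illustrative assumptions, not dataquieR's defaults.
    vec = [s for s in vec if s]
    if not vec:
        return False
    # 1) very long entries suggest free text
    if max(len(s) for s in vec) > max_chars:
        return True
    # 2) many punctuation/space characters suggest structured text (JSON/XML)
    punct = set(string.punctuation + " ")
    mean_punct = sum(sum(ch in punct for ch in s) / len(s) for s in vec) / len(vec)
    if mean_punct > punct_prop:
        return True
    # 3) mostly unique values suggest a non-categorical variable
    return len(set(vec)) / len(vec) > uniq_prop
```

A short vector of repeated single letters passes all three checks, while JSON-like strings trip the punctuation heuristic.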
Convert a study variable to a factor
Description
Convert a study variable to a factor
Usage
util_study_var2factor(
resp_vars = NULL,
study_data,
meta_data = "item_level",
label_col = LABEL,
assume_consistent_codes = TRUE,
have_cause_label_df = FALSE,
code_name = c(JUMP_LIST, MISSING_LIST),
include_sysmiss = TRUE
)
Arguments
resp_vars |
variable list the name of the measurement variables |
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
assume_consistent_codes |
logical assume, that missing codes are consistent for all variables |
have_cause_label_df |
logical is a missing-code table available |
code_name |
character all lists from the meta_data to use for the coding. |
include_sysmiss |
logical add also a factor level for data values that were |
Value
study_data converted to factors using the coding provided in code_name
See Also
Other data_management: util_assign_levlabs(), util_check_data_type(), util_check_group_levels(), util_compare_meta_with_study(), util_dichotomize(), util_fix_merge_dups(), util_merge_data_frame_list(), util_rbind(), util_remove_na_records(), util_replace_hard_limit_violations(), util_round_to_decimal_places(), util_table_of_vct()
Get sub-string left from first .
Description
Get sub-string left from first .
Usage
util_sub_string_left_from_.(x)
Arguments
x |
the string with at least one . |
See Also
Other string_functions: util_abbreviate_unique(), util_filter_names_by_regexps(), util_pretty_vector_string(), util_set_dQuoteString(), util_set_sQuoteString(), util_sub_string_right_from_.(), util_translate()
Examples
## Not run:
util_sub_string_left_from_.(c("a.b", "asdf.xyz", "asdf.jkl.zuio"))
## End(Not run)
Get sub-string right from first .
Description
Get sub-string right from first .
Usage
util_sub_string_right_from_.(x)
Arguments
x |
the string with at least one . |
See Also
Other string_functions: util_abbreviate_unique(), util_filter_names_by_regexps(), util_pretty_vector_string(), util_set_dQuoteString(), util_set_sQuoteString(), util_sub_string_left_from_.(), util_translate()
Examples
## Not run:
util_sub_string_right_from_.(c("a.b", "asdf.xyz"))
util_sub_string_right_from_.(c("a.b", "asdf.xy.z"))
util_sub_string_right_from_.(c("ab", "asdxy.z"))
## End(Not run)
Suppress any output to stdout using sink()
Description
Suppress any output to stdout using sink()
Usage
util_suppress_output(expr)
Arguments
expr |
expression to evaluate |
Value
invisible() result of expr
See Also
Other process_functions: util_abbreviate(), util_all_is_integer(), util_attach_attr(), util_bQuote(), util_backtickQuote(), util_coord_flip(), util_extract_matches(), util_par_pmap(), util_setup_rstudio_job()
Suppress warnings conditionally
Description
Suppress warnings conditionally
Usage
util_suppress_warnings(expr, classes = "warning")
Arguments
expr |
expression to evaluate |
classes |
character classes of warning-conditions to suppress |
Value
the result of expr
See Also
Other condition_functions: util_condition_constructor_factory(), util_deparse1(), util_error(), util_find_external_functions_in_stacktrace(), util_find_first_externally_called_functions_in_stacktrace(), util_find_indicator_function_in_callers(), util_message(), util_warning()
Tabulate a vector
Description
Does the same as as.data.frame(table(x)) but guarantees a data frame with two columns is returned
Usage
util_table_of_vct(Var1)
Arguments
Var1 |
vector to tabulate |
Value
a data frame with columns Var1 and Freq
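The "always two columns" guarantee is the point of this helper. A Python illustration of that contract (the column names Var1 and Freq are taken from the text above; the tuple-row representation is a stand-in for a data frame):

```python
from collections import Counter

def table_of_vct(var1):
    # Frequency table as (Var1, Freq) rows; an empty input still yields
    # an empty two-column structure rather than something degenerate.
    return sorted(Counter(var1).items())
```

The empty-input case is exactly where naive tabulation tends to lose its shape, so the sketch returns an empty list of rows rather than failing.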
See Also
Other data_management: util_assign_levlabs(), util_check_data_type(), util_check_group_levels(), util_compare_meta_with_study(), util_dichotomize(), util_fix_merge_dups(), util_merge_data_frame_list(), util_rbind(), util_remove_na_records(), util_replace_hard_limit_violations(), util_round_to_decimal_places(), util_study_var2factor()
Rotate 1-row data frames to key-value data frames
Description
If nrow(tb) > 1, util_table_rotator just returns tb.
Usage
util_table_rotator(tb)
Arguments
tb |
data.frame a data frame |
Value
data.frame but transposed
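The rotation only fires for exactly one row. A Python sketch using a dict-of-columns as a minimal data-frame stand-in (the output column names Variable and Value are an assumption for illustration):

```python
def table_rotator(tb):
    # tb: dict mapping column name -> list of values.
    # A 1-row table becomes a key-value table; anything else is returned as-is.
    n_rows = max((len(v) for v in tb.values()), default=0)
    if n_rows != 1:
        return tb
    return {"Variable": list(tb.keys()),
            "Value": [v[0] for v in tb.values()]}
```

Rotating a wide 1-row table into key-value pairs makes long summary rows readable in a report.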
Get a translation
Description
Get a translation
Usage
util_translate(keys, ns = "general", lang = getOption("dataquieR.lang", ""))
Arguments
keys |
character translation keys |
ns |
character translation namespace |
lang |
character language to translate to |
Value
character translations
See Also
Other string_functions: util_abbreviate_unique(), util_filter_names_by_regexps(), util_pretty_vector_string(), util_set_dQuoteString(), util_set_sQuoteString(), util_sub_string_left_from_.(), util_sub_string_right_from_.()
Translate standard column names to readable ones
Description
TODO: Duplicate of util_make_data_slot_from_table_slot ??
Usage
util_translate_indicator_metrics(
colnames,
short = FALSE,
long = TRUE,
ignore_unknown = FALSE
)
Arguments
colnames |
character the names to translate |
short |
logical include unit letter in output |
long |
logical include unit description in output |
ignore_unknown |
logical do not replace unknown indicator metrics by |
Value
translated names
Utility function for the Tukey outlier rule
Description
This function calculates outliers according to the rule of Tukey.
Usage
util_tukey(x)
Arguments
x |
numeric data to check for outliers |
Value
binary vector
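The classic Tukey fences flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The following Python sketch implements that textbook rule with linear-interpolation quantiles (matching R's default quantile type 7); whether util_tukey uses exactly these quantiles and k = 1.5 is an assumption:

```python
def tukey_outliers(x, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (classic Tukey fences).
    xs = sorted(x)
    def quantile(p):  # simple linear-interpolation quantile (R type 7)
        idx = p * (len(xs) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(xs) - 1)
        return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [int(v < q1 - k * iqr or v > q3 + k * iqr) for v in x]
```

The 0/1 output mirrors the binary vector described above.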
See Also
Other outlier_functions: util_3SD(), util_hubert(), util_sigmagap()
Remove tables referred to by metadata and use SVG for most figures
Description
Remove tables referred to by metadata and use SVG for most figures
Usage
util_undisclose(x, ...)
Arguments
x |
an object to un-disclose |
... |
further arguments, used for pointing to the |
Value
undisclosed object
Detect base unit from composite units
Description
Detect base unit from composite units
Usage
util_unit2baseunit(
unit,
warn_ambiguities = !exists("warn_ambiguities", .unit2baseunitenv),
unique = TRUE
)
Arguments
unit |
character a unit |
warn_ambiguities |
logical warn about all ambiguous units |
unique |
logical choose the more |
Value
character all possible base units, or the preferable one (unique set TRUE). Can be character(0), if unit is invalid or uniqueness was requested, but even the precedence rules of SI-closeness do not help selecting the most suitable unit.
Examples
## Not run:
util_unit2baseunit("%")
util_unit2baseunit("d%")
# Invalid unit
util_unit2baseunit("aa%")
util_unit2baseunit("aa%", unique = FALSE)
util_unit2baseunit("a%")
# Invalid unit
util_unit2baseunit("e%")
util_unit2baseunit("e%", unique = FALSE)
util_unit2baseunit("E%")
util_unit2baseunit("Eg")
# Invalid unit
util_unit2baseunit("E")
util_unit2baseunit("E", unique = FALSE)
util_unit2baseunit("EC")
util_unit2baseunit("EK")
util_unit2baseunit("µg")
util_unit2baseunit("mg")
util_unit2baseunit("°C")
util_unit2baseunit("k°C")
util_unit2baseunit("kK")
util_unit2baseunit("nK")
# Ambiguous units, if used with unique = FALSE
util_unit2baseunit("kg")
util_unit2baseunit("cd")
util_unit2baseunit("Pa")
util_unit2baseunit("kat")
util_unit2baseunit("min")
# atto atom units or astronomical units, both in state "accepted"
util_unit2baseunit("au")
util_unit2baseunit("au", unique = FALSE)
# astronomical units or micro are, both in state "accepted"
util_unit2baseunit("ua")
util_unit2baseunit("ua", unique = FALSE)
util_unit2baseunit("kt")
# parts per trillion or pico US_liquid_pint, both in state "common",
# but in this case, plain count units will be preferred
util_unit2baseunit("ppt")
util_unit2baseunit("ppt", unique = FALSE)
util_unit2baseunit("ft")
util_unit2baseunit("yd")
util_unit2baseunit("pt")
# actually the same, but both only common, and to my knowledge not-so-common
# gram-force vs. kilogram-force (kilo pond)
util_unit2baseunit("kgf")
util_unit2baseunit("kgf", unique = FALSE)
util_unit2baseunit("at")
util_unit2baseunit("ph")
util_unit2baseunit("nt")
## End(Not run)
Save a hint to the user during package load
Description
Save a hint to the user during package load
Usage
util_user_hint(x)
Arguments
x |
character the hint |
Value
invisible(NULL)
See Also
Other system_functions: util_detect_cores(), util_view_file()
Utility function verifying syntax of known metadata columns
Description
This function goes through the metadata columns that dataquieR supports and verifies that they follow its metadata conventions.
Usage
util_validate_known_meta(meta_data)
Arguments
meta_data |
data.frame the data frame that contains metadata attributes of study data |
Value
data.frame possibly modified meta_data, invisible()
See Also
Other metadata_management: util_dist_selection(), util_find_free_missing_code(), util_find_var_by_meta(), util_get_var_att_names_of_level(), util_get_vars_in_segment(), util_looks_like_missing(), util_no_value_labels(), util_validate_missing_lists()
Validate code lists for missing and/or jump codes
Description
will warn/stop on problems
Usage
util_validate_missing_lists(
meta_data,
cause_label_df,
assume_consistent_codes = FALSE,
expand_codes = FALSE,
suppressWarnings = FALSE,
label_col
)
Arguments
meta_data |
data.frame the data frame that contains metadata attributes of study data |
cause_label_df |
data.frame missing code table. If missing codes have labels the respective data frame can be specified here, see cause_label_df |
assume_consistent_codes |
logical if TRUE and no labels are given and the same missing/jump code is used for more than one variable, the labels assigned for this code will be the same for all variables. |
expand_codes |
logical if TRUE, code labels are copied from other variables, if the code is the same and the label is set somewhere |
suppressWarnings |
logical warn about consistency issues with missing and jump lists |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
Value
list with entries:
- cause_label_df: updated data frame with labels for missing codes
See Also
Other metadata_management: util_dist_selection(), util_find_free_missing_code(), util_find_var_by_meta(), util_get_var_att_names_of_level(), util_get_vars_in_segment(), util_looks_like_missing(), util_no_value_labels(), util_validate_known_meta()
Verify the class ReportSummaryTable
Description
Verify the class ReportSummaryTable
Usage
util_validate_report_summary_table(tb, meta_data, label_col)
Arguments
tb |
data.frame object to be a |
meta_data |
data.frame the data frame that contains metadata attributes of study data. Used to translate variable names, if given. |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
Value
data.frame maybe fixed ReportSummaryTable
Utility function to compute the rank intraclass correlation
Description
This implementation uses the package rankICC
to compute the rank
intraclass correlation, a nonparametric version of the ICC (Tu et al., 2023).
In contrast to model-based ICC approaches, it is less sensitive to outliers
and skewed distributions. It can be applied to variables with an ordinal,
interval or ratio scale. However, it is not possible to adjust for
covariables with this approach. The calculated ICC can become negative,
like Fisher's ICC.
Usage
util_varcomp_robust(
resp_vars = NULL,
group_vars = NULL,
study_data = study_data,
meta_data = meta_data,
min_obs_in_subgroup = 10,
min_subgroups = 5,
label_col = NULL
)
Arguments
resp_vars |
the name of the response variable |
group_vars |
the name of the grouping variable |
study_data |
the data frame that contains the measurements |
meta_data |
the data frame that contains metadata attributes of study data |
min_obs_in_subgroup |
the minimum number of observations that is required to include a subgroup (level) of the grouping variable ( |
min_subgroups |
the minimum number of subgroups (levels) of the grouping variable ( |
label_col |
the name of the column in the metadata with labels of variables |
Value
a vector from rankICC::rankICC
Find all columns in item-level metadata that refer to some other variable
Description
Find all columns in item-level metadata that refer to some other variable
Usage
util_variable_references(meta_data = "item_level")
Arguments
meta_data |
data.frame the metadata |
Value
character all column names referring to variables from item-level metadata
See Also
Other lookup_functions: util_prep_location_check(), util_prep_proportion_check()
Verify encoding
Description
Verify encoding
Usage
util_verify_encoding(dt0, ref_encs)
Arguments
dt0 |
data.frame data to verify |
ref_encs |
character names are column names of |
Examples
## Not run:
dt0 <-
prep_get_data_frame(
file.path("~",
"rsync", "nako_mrt_qs$", "exporte", "NAKO_Datensatz_bereinigte_Daten",
"NatCoEdc_Export", "export_mannheim_30.csv"))
util_verify_encoding(dt0)
dt0$mrt_note[[1]] <- iconv("Härbärt", "UTF-8", "cp1252")
util_verify_encoding(dt0)
dt0$mrt_note[[15]] <- iconv("Härbärt", "UTF-8", "cp1252")
util_verify_encoding(dt0)
dt0$mrt_note[[1]] <- "Härbärt"
util_verify_encoding(dt0)
dt0$mrt_note[[17]] <- iconv("Härbärt", "UTF-8", "latin3")
util_verify_encoding(dt0)
## End(Not run)
Test for likely misspelled data frame references
Description
Checks if some data frame names may have typos.
Usage
util_verify_names(name_of_study_data = character(0))
Arguments
name_of_study_data |
character names of study data that are expected |
Value
invisible(NULL), messages / warns only.
View a file in most suitable viewer
Description
View a file in most suitable viewer
Usage
util_view_file(file)
Arguments
file |
the file to view |
Value
invisible(file)
See Also
Other system_functions: util_detect_cores(), util_user_hint()
Warn about a problem in varname, if x has no natural order
Description
Also warns if R does not have a comparison operator for x.
Usage
util_warn_unordered(x, varname)
Arguments
x |
vector of data |
varname |
character len=1. Variable name for warning messages |
Value
invisible(NULL)
See Also
Other robustness_functions: util_as_valid_missing_codes(), util_check_one_unique_value(), util_correct_variable_use(), util_empty(), util_ensure_character(), util_ensure_in(), util_ensure_suggested(), util_expect_scalar(), util_fix_rstudio_bugs(), util_is_integer(), util_is_numeric_in(), util_is_valid_missing_codes(), util_match_arg(), util_observations_in_subgroups(), util_stop_if_not()
Produce a warning message with a useful short stack trace.
Description
Produce a warning message with a useful short stack trace.
Usage
util_warning(
m,
...,
applicability_problem = NA,
intrinsic_applicability_problem = NA,
integrity_indicator = "none",
level = 0,
immediate,
title = "",
additional_classes = c()
)
Arguments
m |
warning message or a condition |
... |
arguments for sprintf on m, if m is a character |
applicability_problem |
logical |
intrinsic_applicability_problem |
logical |
integrity_indicator |
character the warning is an integrity problem, here is the indicator abbreviation. |
level |
integer level of the warning message (defaults to 0). Higher levels are more severe. |
immediate |
logical Display the warning immediately, not only when the interactive session comes back. |
additional_classes |
character additional classes the thrown condition object should inherit from, first. |
Value
condition the condition object, if the execution is not stopped
See Also
Other condition_functions: util_condition_constructor_factory(), util_deparse1(), util_error(), util_find_external_functions_in_stacktrace(), util_find_first_externally_called_functions_in_stacktrace(), util_find_indicator_function_in_callers(), util_message(), util_suppress_warnings()
Data frame with labels for missing- and jump-codes: metadata about value and missing codes
Description
data.frame with the following columns:
- CODE_VALUE: numeric | DATETIME Missing or categorical code (the number or date representing a missing/category)
- CODE_LABEL: character a label for the missing code or category
- CODE_CLASS: enum JUMP | MISSING. For missing lists: Class of the missing code.
- CODE_INTERPRET: enum I | P | PL | R | BO | NC | O | UH | UO | NE. For missing lists: Class of the missing code according to AAPOR.
- resp_vars: character For missing lists: optional, if a missing code is specific for some variables, it is listed for each such variable with one entry in resp_vars. If NA, the code is assumed shared among all variables. For v1.0 metadata, you need to refer to VAR_NAMES here.
See Also
com_qualified_item_missingness(), com_qualified_segment_missingness()