Version: | 1.1.0 |
Type: | Package |
Title: | Robust Nonparametric Two-Sample Tests for Location/Scale |
Author: | Sermad Abbas |
Maintainer: | Sermad Abbas <abbas@statistik.tu-dortmund.de> |
BugReports: | https://github.com/s-abbas/robnptests/issues |
Description: | Implementations of several robust nonparametric two-sample tests for location or scale differences. The test statistics are based on robust location and scale estimators, e.g. the sample median or the Hodges-Lehmann estimators as described in Fried & Dehling (2011) <doi:10.1007/s10260-011-0164-1>. The p-values can be computed via the permutation principle, the randomization principle, or by using the asymptotic distributions of the test statistics under the null hypothesis, which ensures (approximate) distribution independence of the test decision. To test for a difference in scale, we apply the tests for location difference to transformed observations; see Fried (2012) <doi:10.1016/j.csda.2011.02.012>. Random noise on a small range can be added to the original observations in order to hold the significance level on data from discrete distributions. The location tests assume homoscedasticity and the scale tests require the location parameters to be zero. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Depends: | R (≥ 4.0.0) |
URL: | https://github.com/s-abbas/robnptests |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.1 |
Imports: | Rdpack, gtools, robustbase, statmod, stats, utils, checkmate |
RdMacros: | Rdpack |
Suggests: | testthat, knitr, rmarkdown, usethis, covr |
VignetteBuilder: | knitr |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2023-02-13 20:44:58 UTC; abbas |
Repository: | CRAN |
Date/Publication: | 2023-02-13 21:10:02 UTC |
Calculation of permutation p-value
Description
calc_perm_p_value
calculates the permutation p-value following Phipson and Smyth (2010).
Usage
calc_perm_p_value(
statistic,
distribution,
m,
n,
randomization,
n.rep,
alternative
)
Arguments
statistic |
observed value of the test statistic. |
distribution |
a numeric vector with the permutation/randomization distribution. |
m |
an integer value giving size of first sample. |
n |
an integer value giving size of second sample. |
randomization |
a logical value indicating whether the p-value should be
computed from a permutation ( |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
Value
p-value for the specified alternative.
References
Phipson B, Smyth GK (2010). “Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. doi:10.2202/1544-6115.1585.
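The idea of a permutation p-value that can never be zero can be sketched in base R. This is a minimal illustration, assuming the simple estimator (b + 1) / (B + 1) discussed by Phipson and Smyth (2010), where b counts the sampled statistics at least as extreme as the observed one and B is the length of the distribution. The helper name sketch_perm_p_value and the direction conventions for "greater" and "less" are assumptions for illustration, not the code of calc_perm_p_value.

```r
# Hypothetical helper: p-value from a vector of permuted/randomized statistics.
# Assumption: "greater" counts values >= statistic, "less" counts values <= statistic.
sketch_perm_p_value <- function(statistic, distribution,
                                alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  b <- switch(alternative,
    two.sided = sum(abs(distribution) >= abs(statistic)),
    greater   = sum(distribution >= statistic),
    less      = sum(distribution <= statistic)
  )
  # Adding 1 to numerator and denominator keeps the p-value strictly positive
  (b + 1) / (length(distribution) + 1)
}

dist_example <- c(-2, -1, 0, 1, 2)
p_two_sided <- sketch_perm_p_value(3, dist_example)
p_greater   <- sketch_perm_p_value(1, dist_example, "greater")
```

Even when no sampled statistic is as extreme as the observed one, the result stays above zero.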
Checks for input arguments
Description
check_test_input
is a helper function that contains checks for the
input arguments of the two-sample tests.
Usage
check_test_input(
x,
y,
alternative,
delta,
method,
scale,
n.rep,
na.rm,
scale.test,
wobble,
wobble.seed,
gamma = NULL,
psi = NULL,
k = NULL,
test.name
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided", "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or scale parameter, depending on whether the test should be performed for a difference in location or in scale. |
scale |
a character string specifying the scale estimator used for
standardization in the test statistic; must be one of |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value testing whether the samples should be compared for a difference in scale. |
wobble |
a logical value indicating whether the sample should be checked
for duplicated values that can cause the scale estimate to be zero.
If such values are present, uniform noise is added to the sample,
see |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be trimmed/replaced from each end of the sample before calculating the trimmed mean/winsorized variance. |
psi |
kernel used for optimization in the computation of the M-estimates.
Must be one of |
k |
tuning parameter(s) for the respective psi function. |
test.name |
character string specifying the two-sample test for which the helper function is used. |
Details
The two-sample tests in this package share similar arguments. To reduce the amount of repetitive code, this function contains the argument checks so that only check_test_input needs to be called within the functions for the two-sample tests.
The scale estimators "S1" and "S2" can only be used in combination with test.name = "hl1_test" or test.name = "hl2_test".
The estimators "S3" and "S4" can only be used with test.name = "med_test".
Value
An error message if a check fails.
Test decision for asymptotic versions of HL1-, HL2-, and MED-tests
Description
compute_results_asymptotic is a helper function to compute the test decision for the HL1-, HL2-, and MED-test when method = "asymptotic".
Usage
compute_results_asymptotic(x, y, alternative, delta, type)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided", "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or scale parameter, depending on whether the test should be performed for a difference in location or in scale. |
type |
a character string specifying the desired test statistic. It must
be one of |
Value
A named list containing the following components:
statistic |
the value of the test statistic. |
estimates |
the location estimates for both samples in case of the HL1- and the MED-tests. The estimate for the location difference in case of the HL2-tests. |
p.value |
the p-value for the test. |
Finite-sample test decision for HL1-, HL2-, and MED-tests
Description
compute_results_finite is a helper function to compute the test decision for the HL1-, HL2-, and MED-test when method = "randomization" or method = "permutation".
Usage
compute_results_finite(x, y, alternative, delta, method, n.rep, type)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided", "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or scale parameter, depending on whether the test should be performed for a difference in location or in scale. |
method |
a character string specifying how the p-value is computed with
possible values |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
type |
a character string specifying the desired test statistic. It must
be one of |
Value
A named list containing the following components:
statistic |
the value of the test statistic. |
estimates |
the location estimates for both samples in case of the HL1- and the MED-tests. The estimate for the location difference in case of the HL2-tests. |
p.value |
the p-value for the test. |
Two-sample location tests based on one-sample Hodges-Lehmann estimator
Description
hl1_test
performs a two-sample location test based on
the difference of the one-sample Hodges-Lehmann estimators of both samples.
Usage
hl1_test(
x,
y,
alternative = c("two.sided", "greater", "less"),
delta = ifelse(scale.test, 1, 0),
method = c("asymptotic", "permutation", "randomization"),
scale = c("S1", "S2"),
n.rep = 10000,
na.rm = FALSE,
scale.test = FALSE,
wobble = FALSE,
wobble.seed = NULL
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or
scale parameter, depending on whether the test should be performed
for a difference in location or in scale. The default is
|
method |
a character string specifying how the p-value is computed with
possible values |
scale |
a character string specifying the scale estimator used for standardization
of the test statistic; must be one of "S1" or "S2". |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value to specify if the samples should be compared
for a difference in scale. The default is |
wobble |
a logical value indicating whether the sample should be checked for
duplicated values that can cause the scale estimate to be zero.
If such values are present, uniform noise is added to the sample,
see |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
Details
The test statistic for this test is based on the difference of the one-sample Hodges-Lehmann estimators of x and y, see hodges_lehmann. Three versions of the test are implemented: randomization, permutation, and asymptotic.
The test statistic for the permutation and randomization version of the test is standardized using a robust scale estimator, see Fried and Dehling (2011).
With scale = "S1", the scale is estimated by
S = med(|x_i - x_j|: 1 \le i < j \le m, |y_i - y_j|: 1 \le i < j \le n),
whereas scale = "S2" uses
S = med(|z_i - z_j|: 1 \le i < j \le m + n).
Here, z = (z_1, ..., z_{m + n}) = (x_1 - med(x), ..., x_m - med(x), y_1 - med(y), ..., y_n - med(y)) is the median-corrected sample.
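The two scale estimators can be written down directly from these formulas. The following base R sketch uses the hypothetical helper names scale_s1 and scale_s2 and relies on dist() to enumerate the pairwise absolute differences with i < j; it is an illustration of the definitions, not the package's internal code.

```r
# Hypothetical helpers derived from the definitions of S1 and S2.
scale_s1 <- function(x, y) {
  # Pool the within-sample pairwise absolute differences, then take the median
  median(c(as.vector(dist(x)), as.vector(dist(y))))
}

scale_s2 <- function(x, y) {
  # Median-correct each sample, join them, and use all pairwise differences
  z <- c(x - median(x), y - median(y))
  median(as.vector(dist(z)))
}

x <- c(1, 2, 4)
y <- c(10, 13)
s1 <- scale_s1(x, y)
s2 <- scale_s2(x, y)
```

S1 only compares observations within the same sample, whereas S2 also compares observations across samples after removing the sample medians.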
The randomization distribution is based on randomly drawn splits with replacement. The function permp (Phipson and Smyth 2010) is used to calculate the p-value. For the asymptotic test, a transformed version of the difference of the HL1-estimators, which asymptotically follows a normal distribution, is used. For more details on the asymptotic test, see Fried and Dehling (2011).
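How a randomization distribution arises from random splits can be sketched in base R. To keep the example short, it uses the plain difference of sample medians as a stand-in statistic rather than the standardized HL1 statistic, and the names stat_sketch, rand_dist, and n_rep are hypothetical. "With replacement" is interpreted here as drawing the splits independently, so that the same split may occur more than once.

```r
set.seed(108)
x <- rnorm(8)
y <- rnorm(8) + 1

# Stand-in statistic (assumption): difference of the sample medians
stat_sketch <- function(a, b) median(a) - median(b)

pooled <- c(x, y)
m <- length(x)
n_rep <- 500  # illustrative value, smaller than the documented default

# Each replicate assigns m pooled observations to the first sample,
# the remaining ones to the second, and recomputes the statistic
rand_dist <- replicate(n_rep, {
  idx <- sample(length(pooled), m)
  stat_sketch(pooled[idx], pooled[-idx])
})

observed <- stat_sketch(x, y)
```

The observed statistic is then compared with rand_dist to obtain a p-value.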
For scale.test = TRUE, the test compares the two samples for a difference in scale. This is achieved by log-transforming the original squared observations, i.e. x is replaced by log(x^2) and y by log(y^2). A potential scale difference then appears as a location difference between the transformed samples, see Fried (2012).
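A small base R example shows why the transformation works: multiplying the observations by a factor c adds the constant 2 * log(c) to the log-squared values. The variable names and the factor 2 are chosen for illustration only.

```r
x <- c(-1.5, 0.5, 2, -3)
scale_factor <- 2          # illustrative scale ratio between the samples
y <- scale_factor * x

# After the transformation, the scale ratio appears as a constant shift
shift <- log(y^2) - log(x^2)
expected_shift <- 2 * log(scale_factor)  # log of the squared scale ratio
```

Every element of shift equals the logarithm of the squared scale ratio, so a location test on the transformed samples targets the scale difference.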
Note that the samples need to have equal locations. The sample should not contain zeros to prevent problems with the necessary log-transformation. If it contains zeros, uniform noise is added to all variables in order to remove zeros and a message is printed.
If the sample has been modified (either because of zeros if scale.test = TRUE or wobble = TRUE), the modified samples can be retrieved using set.seed(wobble.seed); wobble(x, y).
Both samples need to contain at least 5 non-missing values.
Value
A named list with class "htest" containing the following components:
statistic |
the value of the test statistic. |
p.value |
the p-value for the test. |
estimate |
the one-sample Hodges-Lehmann estimates of |
null.value |
the specified hypothesized value of the mean difference/squared scale ratio. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating how the p-value was computed. |
data.name |
a character string giving the names of the data. |
References
Phipson B, Smyth GK (2010). “Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. doi:10.2202/1544-6115.1585.
Fried R, Dehling H (2011). “Robust nonparametric tests for the two-sample location problem.” Statistical Methods & Applications, 20(4), 409–422. doi:10.1007/s10260-011-0164-1.
Fried R (2012). “On the online estimation of piecewise constant volatilities.” Computational Statistics & Data Analysis, 56(11), 3080–3090. doi:10.1016/j.csda.2011.02.012.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Asymptotic HL1 test
hl1_test(x, y, method = "asymptotic", scale = "S1")
## Not run:
# HL12 test using randomization principle by drawing 1000 random permutations
# with replacement
hl1_test(x, y, method = "randomization", n.rep = 1000, scale = "S2")
## End(Not run)
Two-sample location tests based on two-sample Hodges-Lehmann estimator
Description
hl2_test
performs a two-sample location test based on the two-sample
Hodges-Lehmann estimator for shift.
Usage
hl2_test(
x,
y,
alternative = c("two.sided", "greater", "less"),
delta = ifelse(scale.test, 1, 0),
method = c("asymptotic", "permutation", "randomization"),
scale = c("S1", "S2"),
n.rep = 10000,
na.rm = FALSE,
scale.test = FALSE,
wobble = FALSE,
wobble.seed = NULL
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or
scale parameter, depending on whether the test should be performed
for a difference in location or in scale. The default is
|
method |
a character string specifying how the p-value is computed with
possible values |
scale |
a character string specifying the scale estimator used for standardization
of the test statistic; must be one of "S1" or "S2". |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value to specify if the samples should be compared
for a difference in scale. The default is |
wobble |
a logical value indicating whether the sample should be checked for
duplicated values that can cause the scale estimate to be zero.
If such values are present, uniform noise is added to the sample,
see |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
Details
The test statistic for this test is based on the two-sample Hodges-Lehmann estimator of x and y, see hodges_lehmann_2sample. Three versions of the test are implemented: randomization, permutation, and asymptotic.
The test statistic for the permutation and randomization version of the test is standardized using a robust scale estimator, see Fried and Dehling (2011).
With scale = "S1", the scale is estimated by
S = med(|x_i - x_j|: 1 \le i < j \le m, |y_i - y_j|: 1 \le i < j \le n),
whereas scale = "S2" uses
S = med(|z_i - z_j|: 1 \le i < j \le m + n).
Here, z = (z_1, ..., z_{m + n}) = (x_1 - med(x), ..., x_m - med(x), y_1 - med(y), ..., y_n - med(y)) is the median-corrected sample.
The randomization distribution is based on randomly drawn splits with replacement. The function permp (Phipson and Smyth 2010) is used to calculate the p-value. For the asymptotic test, a transformed version of the HL2-estimator, which asymptotically follows a normal distribution, is used. For more details on the asymptotic test, see Fried and Dehling (2011).
For scale.test = TRUE, the test compares the two samples for a difference in scale. This is achieved by log-transforming the original squared observations, i.e. x is replaced by log(x^2) and y by log(y^2). A potential scale difference then appears as a location difference between the transformed samples, see Fried (2012).
Note that the samples need to have equal locations. The sample should not contain zeros to prevent problems with the necessary log-transformation. If it contains zeros, uniform noise is added to all variables in order to remove zeros and a message is printed.
If the sample has been modified (either because of zeros if scale.test = TRUE or wobble = TRUE), the modified samples can be retrieved using set.seed(wobble.seed); wobble(x, y).
Both samples need to contain at least 5 non-missing values.
Value
A named list with class "htest" containing the following components:
statistic |
the value of the test statistic. |
p.value |
the p-value for the test. |
estimate |
the estimated location difference between |
null.value |
the specified hypothesized value of the mean difference/squared scale ratio. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating how the p-value was computed. |
data.name |
a character string giving the names of the data. |
References
Phipson B, Smyth GK (2010). “Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. doi:10.2202/1544-6115.1585.
Fried R, Dehling H (2011). “Robust nonparametric tests for the two-sample location problem.” Statistical Methods & Applications, 20(4), 409–422. doi:10.1007/s10260-011-0164-1.
Fried R (2012). “On the online estimation of piecewise constant volatilities.” Computational Statistics & Data Analysis, 56(11), 3080–3090. doi:10.1016/j.csda.2011.02.012.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Asymptotic HL2 test
hl2_test(x, y, method = "asymptotic", scale = "S1")
## Not run:
# HL22 test using randomization principle by drawing 1000 random permutations
# with replacement
hl2_test(x, y, method = "randomization", n.rep = 1000, scale = "S2")
## End(Not run)
One-sample Hodges-Lehmann estimator
Description
hodges_lehmann
calculates the one-sample Hodges-Lehmann estimator
of a sample.
Usage
hodges_lehmann(x, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
na.rm |
a logical value indicating whether NA values in |
Details
The one-sample Hodges-Lehmann estimator for a sample of size n
is defined as
med(\frac{X_i + X_j}{2}, 1 \le i < j \le n).
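The definition translates into a few lines of base R. The sketch below uses the hypothetical name hl_one_sample_sketch, enumerates all pairs with i < j via combn(), and assumes the sample has at least two observations; it is meant to illustrate the formula, not to replace hodges_lehmann.

```r
# Hypothetical helper: median of all pairwise means with i < j
hl_one_sample_sketch <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 2)
  median(combn(x, 2, FUN = mean))
}

# Pairwise means of (1, 2, 4) are 1.5, 2.5, and 3
hl_value <- hl_one_sample_sketch(c(1, 2, 4))
```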
Value
The one-sample Hodges-Lehmann estimator.
References
Hodges JL, Lehmann EL (1963). “Estimates of location based on rank tests.” The Annals of Mathematical Statistics, 34(2), 598–611. doi:10.1214/aoms/1177704172.
Examples
# Generate random sample
set.seed(108)
x <- rnorm(10)
# Compute one-sample Hodges-Lehmann estimator
hodges_lehmann(x)
Two-sample Hodges-Lehmann estimator
Description
hodges_lehmann_2sample
calculates the two-sample Hodges-Lehmann
estimator for the location difference of two samples x and y.
Usage
hodges_lehmann_2sample(x, y, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
na.rm |
a logical value indicating whether NA values in |
Details
The two-sample Hodges-Lehmann estimator for two samples x and y of sizes m and n is defined as
med(x_i - y_j, 1 \le i \le m, 1 \le j \le n).
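As a shift estimator, the two-sample Hodges-Lehmann estimator is usually taken as the median of the signed pairwise differences x_i - y_j. Under that assumption, a base R sketch with the hypothetical name hl_two_sample_sketch looks as follows; it is not the code of hodges_lehmann_2sample.

```r
# Hypothetical helper: median of all m * n signed differences x_i - y_j
hl_two_sample_sketch <- function(x, y) {
  median(outer(x, y, "-"))
}

x <- c(1, 2, 3)
y <- c(0, 1)
hl_shift <- hl_two_sample_sketch(x, y)

# Shifting x by a constant shifts the estimate by the same constant
hl_shift_moved <- hl_two_sample_sketch(x + 5, y)
```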
Value
The two-sample Hodges-Lehmann estimator.
References
Hodges JL, Lehmann EL (1963). “Estimates of location based on rank tests.” The Annals of Mathematical Statistics, 34(2), 598–611. doi:10.1214/aoms/1177704172.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(10); y <- rnorm(10)
# Compute two-sample Hodges-Lehmann estimator
hodges_lehmann_2sample(x, y)
M-estimator of location
Description
m_est
calculates an M-estimate of location and its variance
for different psi functions.
Usage
m_est(
x,
psi,
k = robustbase::.Mpsi.tuning.default(psi),
tol = 1e-06,
max.it = 15,
na.rm = FALSE
)
Arguments
x |
a (non-empty) numeric vector of data values. |
psi |
kernel used for optimization.
Must be one of |
k |
tuning parameter(s) for the respective kernel function,
defaults to parameters implemented in |
tol |
tolerance for convergence. The default is 1e-06. |
max.it |
the maximum number of iterations. The default is 15. |
na.rm |
a logical value indicating whether NA values in |
Details
To compute the M-estimate, the iterative algorithm described in Maronna et al. (2019) is used. The variance is estimated as in Huber (1981).
If max.it contains decimal places, it is truncated to an integer value.
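The general idea of an iterative M-estimate can be sketched with iteratively reweighted means. The example below is written under several assumptions: Huber weights with tuning constant k = 1.345, a scale fixed at the MAD, and the median as starting value. The function name huber_location_sketch is hypothetical, and no claim is made that it matches the steps of m_est exactly.

```r
# Hypothetical helper: Huber-type M-estimate of location by iterative reweighting.
# Assumptions: scale fixed at the MAD, start at the median, k = 1.345.
huber_location_sketch <- function(x, k = 1.345, tol = 1e-06, max.it = 15) {
  s <- mad(x)
  stopifnot(s > 0)
  mu <- median(x)
  for (i in seq_len(max.it)) {
    r <- (x - mu) / s
    w <- pmin(1, k / abs(r))  # weight 1 for |r| <= k, k / |r| otherwise
    mu_new <- sum(w * x) / sum(w)
    converged <- abs(mu_new - mu) < tol * s
    mu <- mu_new
    if (converged) break
  }
  mu
}

x_out <- c(1, 2, 3, 4, 100)  # one gross outlier
est <- huber_location_sketch(x_out)
```

The outlier receives a small weight, so the estimate stays close to the bulk of the data instead of following the sample mean.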
Value
A named list containing the components:
est |
estimated mean. |
var |
estimated variance. |
References
Maronna RA, Martin DR, Yohai VJ, Salibián-Barrera M (2019). Robust Statistics: Theory and Methods (with R), Wiley Series in Probability and Statistics, Second edition. Wiley. doi:10.1002/9781119214656.
Huber PJ (1981). Robust Statistics. Wiley, New York. doi:10.1002/0471725250.
Examples
# Generate random sample
set.seed(108)
x <- rnorm(10)
# Compute Huber's M-estimate
m_est(x, psi = "huber")
Permutation distribution for M-statistics
Description
m_est_perm_distribution calculates the permutation distribution for the M-statistics from m_test_statistic.
Usage
m_est_perm_distribution(x, y, psi, k, randomization = FALSE, n.rep = 10000)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
psi |
kernel used for optimization.
Must be one of |
k |
tuning parameter(s) for the respective kernel function,
defaults to parameters implemented in |
randomization |
a logical value indicating whether the p-value should be
computed from a permutation ( |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
Details
Missing values in either x or y are not allowed.
Value
Vector with permutation distribution of the test statistic specified by psi and k.
References
Maechler M, Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibián-Barrera M, Verbeke T, Koller M, Conceicao EL, di Palma MA (2022). robustbase: Basic robust statistics. R package version 0.95-0, https://CRAN.R-project.org/package=robustbase.
Two-sample location test based on M-estimators
Description
m_test
performs a two-sample location test based on an M-estimator.
Usage
m_test(
x,
y,
alternative = c("two.sided", "greater", "less"),
delta = ifelse(scale.test, 1, 0),
method = c("asymptotic", "permutation", "randomization"),
psi = c("huber", "hampel", "bisquare"),
k = robustbase::.Mpsi.tuning.default(psi),
n.rep = 10000,
na.rm = FALSE,
scale.test = FALSE,
wobble.seed = NULL,
...
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or
scale parameter, depending on whether the test should be performed
for a difference in location or in scale. The default is
|
method |
a character string specifying how the p-value is computed with
possible values |
psi |
kernel used for optimization.
Must be one of "huber", "hampel", or "bisquare". |
k |
tuning parameter(s) for the respective kernel function,
defaults to parameters implemented in |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value to specify if the samples should be compared
for a difference in scale. The default is |
wobble.seed |
an integer value used as a seed for the random number
generation in case that |
... |
additional arguments |
Details
The test statistic for this test is based on the difference of the M-estimates of location of x and y, see m_est.
Three different psi-functions can be used: huber, hampel, and bisquare. The corresponding tuning parameter(s) can be set by the argument k of the function.
The estimate for the location difference is scaled by a pooled estimate for the standard deviation. This estimate is based on the tau-estimate of scale and is computed with the default parameter settings of the function scaleTau2. These can be changed by setting c1 and c2.
More details on the construction of the test statistic are given in the vignettes vignette("robnptests") and vignette("m_tests").
Three versions of the test are implemented: randomization, permutation, and asymptotic.
The randomization distribution is based on randomly drawn splits with replacement. The function permp (Phipson and Smyth 2010) is used to calculate the p-value. The psi-function for the M-estimate is computed with the implementations in the package robustbase.
For the asymptotic test, the distribution of the test statistic is approximated by a standard normal distribution. However, this is only justified under the normality assumption. When the observations do not come from a normal distribution, the tests might not keep the desired significance level. Simulations indicate that the level is kept under symmetric distributions if the variance exists. Under skewed distributions, it tends to be anti-conservative, see the vignette vignette("m_tests"). The test statistic can be corrected by a factor which has to be determined individually for a specific distribution in such cases.
For scale.test = TRUE, the test compares the two samples for a difference in scale. This is achieved by log-transforming the original squared observations, i.e. x is replaced by log(x^2) and y by log(y^2). A potential scale difference then appears as a location difference between the transformed samples, see Fried (2012).
Note that the samples need to have equal locations. The sample should not contain zeros to prevent problems with the necessary log-transformation. If it contains zeros, uniform noise is added to all variables in order to remove zeros and a message is printed.
If the sample has been modified because of zeros when scale.test = TRUE, the modified samples can be retrieved using set.seed(wobble.seed); wobble(x, y).
Both samples need to contain at least 5 non-missing values.
Value
A named list with class "htest" containing the following components:
statistic |
the value of the test statistic. |
parameter |
the degrees of freedom for the test statistic. |
p.value |
the p-value for the test. |
estimate |
the M-estimates of |
null.value |
the specified hypothesized value of the mean difference/squared scale ratio. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating how the p-value was computed. |
data.name |
a character string giving the names of the data. |
References
Fried R (2012). “On the online estimation of piecewise constant volatilities.” Computational Statistics & Data Analysis, 56(11), 3080–3090. doi:10.1016/j.csda.2011.02.012.
Maronna RA, Zamar RH (2002). “Robust estimates of location and dispersion of high-dimensional datasets.” Technometrics, 44(4), 307–317. doi:10.1198/004017002188618509.
Phipson B, Smyth GK (2010). “Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. doi:10.2202/1544-6115.1585.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Asymptotic test based on Huber M-estimator
m_test(x, y, method = "asymptotic", psi = "huber")
## Not run:
# Randomization test based on Hampel M-estimator with 1000 random permutations
# drawn with replacement
m_test(x, y, method = "randomization", n.rep = 1000, psi = "hampel")
## End(Not run)
Test statistics for the M-tests
Description
m_test_statistic
calculates the test statistics for
tests based on M-estimators.
Usage
m_test_statistic(x, y, psi, k = robustbase::.Mpsi.tuning.default(psi), ...)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
psi |
kernel used for optimization.
Must be one of |
k |
tuning parameter(s) for the respective kernel function,
defaults to parameters implemented in |
... |
additional arguments |
Details
For details on how the test statistic is constructed, we refer to the vignette vignette("m_tests").
Value
A named list containing the following components:
statistic |
standardized test statistic. |
estimates |
M-estimates of location for both |
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Compute Huber-M-statistic
m_test_statistic(x, y, psi = "huber")
Two-sample location tests based on the sample median
Description
med_test
performs a two-sample location test based on
the difference of the sample medians for both samples.
Usage
med_test(
x,
y,
alternative = c("two.sided", "greater", "less"),
delta = ifelse(scale.test, 1, 0),
method = c("asymptotic", "permutation", "randomization"),
scale = c("S3", "S4"),
n.rep = 10000,
na.rm = FALSE,
scale.test = FALSE,
wobble = FALSE,
wobble.seed = NULL
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
delta |
a numeric value indicating the true difference in the location or
scale parameter, depending on whether the test should be performed
for a difference in location or in scale. The default is
|
method |
a character string specifying how the p-value is computed with
possible values |
scale |
a character string specifying the scale estimator used for standardization
of the test statistic, must be one of "S3" or "S4". |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value to specify if the samples should be compared
for a difference in scale. The default is |
wobble |
a logical value indicating whether the sample should be checked for
duplicated values that can cause the scale estimate to be zero.
If such values are present, uniform noise is added to the sample,
see |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
Details
The test statistic for this test is based on the difference of the sample medians of x and y. Three versions of the test are implemented: randomization, permutation, and asymptotic.
The test statistic for the permutation and randomization version of the test is standardized using a robust scale estimator, see Fried and Dehling (2011).
With scale = "S3", the scale is estimated by
S = 2 * med(|x_1 - med(x)|, ..., |x_m - med(x)|, |y_1 - med(y)|, ..., |y_n - med(y)|),
whereas scale = "S4" uses
S = med(|x_1 - med(x)|, ..., |x_m - med(x)|) + med(|y_1 - med(y)|, ..., |y_n - med(y)|).
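Both estimators are built from absolute deviations from the sample medians. The base R sketch below takes S3 as twice the median of the pooled absolute deviations and S4 as the sum of the two within-sample medians of absolute deviations, following Fried and Dehling (2011). The helper names scale_s3 and scale_s4 are hypothetical and the code is an illustration, not the package's implementation.

```r
# Hypothetical helpers for the MED-test scale estimators.
scale_s3 <- function(x, y) {
  # Twice the median of the pooled absolute deviations from the sample medians
  2 * median(c(abs(x - median(x)), abs(y - median(y))))
}

scale_s4 <- function(x, y) {
  # Sum of the medians of absolute deviations, computed per sample
  median(abs(x - median(x))) + median(abs(y - median(y)))
}

x <- c(1, 2, 3)
y <- c(10, 12, 14)
s3 <- scale_s3(x, y)
s4 <- scale_s4(x, y)
```

S3 pools the deviations before taking the median, while S4 treats the samples separately, so the two can differ when the samples have different spread.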
When computing the randomization distribution based on randomly drawn splits with replacement, the function permp (Phipson and Smyth 2010) is used to calculate the p-value. For the asymptotic test, a transformed version of the difference of the sample medians, which asymptotically follows a normal distribution, is used. For more details on the asymptotic test, see Fried and Dehling (2011).
For scale.test = TRUE, the test compares the two samples for a difference in scale. This is achieved by log-transforming the original squared observations, i.e. x is replaced by log(x^2) and y by log(y^2). A potential scale difference then appears as a location difference between the transformed samples, see Fried (2012).
Note that the samples need to have equal locations. The sample should not contain zeros to prevent problems with the necessary log-transformation. If it contains zeros, uniform noise is added to all variables in order to remove zeros and a message is printed.
If the sample has been modified (either because of zeros for scale.test = TRUE, or wobble = TRUE), the modified samples can be retrieved using set.seed(wobble.seed); wobble(x, y).
Both samples need to contain at least 5 non-missing values.
Value
A named list with class "htest" containing the following components:
statistic |
the value of the test statistic. |
p.value |
the p-value for the test. |
estimate |
the sample medians of |
null.value |
the specified hypothesized value of the mean difference/squared scale ratio. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating how the p-value was computed. |
data.name |
a character string giving the names of the data. |
References
Phipson B, Smyth GK (2010). “Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. doi:10.2202/1544-6115.1585.
Fried R, Dehling H (2011). “Robust nonparametric tests for the two-sample location problem.” Statistical Methods & Applications, 20(4), 409–422. doi:10.1007/s10260-011-0164-1.
Fried R (2012). “On the online estimation of piecewise constant volatilities.” Computational Statistics & Data Analysis, 56(11), 3080–3090. doi:10.1016/j.csda.2011.02.012.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Asymptotic MED test
med_test(x, y, method = "asymptotic", scale = "S3")
## Not run:
# MED2 test using randomization principle by drawing 1000 random permutations
# with replacement
med_test(x, y, method = "randomization", n.rep = 1000, scale = "S4")
## End(Not run)
Permutation distribution for robust test statistics
Description
perm_distribution() calculates the permutation distribution for
several test statistics.
Usage
perm_distribution(x, y, type, randomization = FALSE, n.rep = 10000)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
type |
a character string specifying the desired test statistic. It must
be one of |
randomization |
a logical value indicating whether the p-value should be
computed from a permutation ( |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
Details
Missing values in either x or y are not allowed.
Value
Vector with permutation distribution of the test statistic specified
by type.
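As a rough illustration of what a complete permutation distribution contains, the following base R sketch enumerates every split of the joint sample into groups of the original sizes. perm_dist_sketch() and its default statistic (a plain difference of medians) are hypothetical and stand in for the statistics selected by type.

```r
# Hypothetical helper: evaluate a statistic on all splits of the joint sample
perm_dist_sketch <- function(x, y, stat = function(a, b) median(a) - median(b)) {
  z <- c(x, y)
  m <- length(x)
  # Each column holds the indices assigned to the first sample
  splits <- combn(length(z), m)
  apply(splits, 2, function(idx) stat(z[idx], z[-idx]))
}

x <- c(1.2, 3.4, 2.2, 5.1)
y <- c(0.3, 2.9, 4.4, 1.8)
perm_dist <- perm_dist_sketch(x, y)
length(perm_dist)  # choose(8, 4) = 70 splits
```

The number of splits grows quickly with the sample sizes, which is why a randomization distribution based on n.rep random splits is offered as an alternative.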
Preprocess data for the robust two sample tests
Description
preprocess_data is a helper function that performs several
preprocessing steps on the data before performing the two-sample tests.
Usage
preprocess_data(x, y, delta, na.rm, wobble, wobble.seed, scale.test)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
delta |
a numeric value indicating the true difference in the location or scale parameter, depending on whether the test should be performed for a difference in location or in scale. |
na.rm |
a logical value indicating whether NA values in |
wobble |
a logical value indicating whether the sample should be checked for
duplicated values that can cause the scale estimate to be zero.
If such values are present, uniform noise is added to the sample,
see |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
scale.test |
a logical value to specify if the samples should be compared for a difference in scale. |
Details
The preprocessing steps include the removal of missing values and, if specified, wobbling and a transformation of the observations to test for differences in scale.
Value
A named list containing the following components:
x |
the (possibly transformed) input vector |
y |
the (possibly transformed) input vector |
delta |
the (possibly transformed) input value
|
Robust test statistics based on robust location estimators
Description
rob_perm_statistic calculates test statistics for robust
permutation/randomization tests based on the sample median, the one-sample
Hodges-Lehmann estimator, or the two-sample Hodges-Lehmann estimator.
Usage
rob_perm_statistic(
x,
y,
type = c("HL11", "HL12", "HL21", "HL22", "MED1", "MED2"),
na.rm = FALSE
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
type |
a character string specifying the desired test statistic. It must
be one of |
na.rm |
a logical value indicating whether NA values in |
Details
The test statistics returned by rob_perm_statistic are of the form
D_i/S_j
where the D_i, i = 1,...,3, are different
estimators of location and the S_j, j = 1,...,4, are estimates for
the mutual sample scale. See Fried and Dehling (2011)
or the vignette vignette("robnptests") for details.
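To make the numerators more concrete, here is a base R sketch of the two Hodges-Lehmann estimators named in the Description. The helper names are hypothetical, and two details are assumptions rather than statements about the package: the one-sample estimator uses pairwise averages over i < j, and the two-sample estimator uses differences x_i - y_j.

```r
# One-sample Hodges-Lehmann estimator: median of pairwise averages (i < j assumed)
hl1_sketch <- function(x) {
  pairs <- combn(x, 2)  # 2 x choose(n, 2) matrix of pairs
  median(colMeans(pairs))
}

# Two-sample Hodges-Lehmann estimator: median of all pairwise differences x_i - y_j
hl2_sketch <- function(x, y) {
  median(outer(x, y, "-"))
}

set.seed(108)
x <- rnorm(20); y <- rnorm(20)
hl1_sketch(x) - hl1_sketch(y)  # location difference from one-sample estimators
hl2_sketch(x, y)               # location difference from the two-sample estimator
```

Either difference would then be divided by one of the scale estimates S_j to obtain a statistic of the form D_i/S_j.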
Value
A named list containing the following components:
statistic |
the selected test statistic. |
estimates |
estimate of location for each sample if available. |
References
Fried R, Dehling H (2011). “Robust nonparametric tests for the two-sample location problem.” Statistical Methods & Applications, 20(4), 409–422. doi:10.1007/s10260-011-0164-1.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Compute HL21-statistic
rob_perm_statistic(x, y, type = "HL21")
Robust scale estimators based on median absolute deviation
Description
rob_scale calculates an estimator for the within-sample dispersion
based on two samples.
Usage
rob_scale(
x,
y,
type = c("S1", "S2", "S3", "S4"),
na.rm = FALSE,
check.for.zero = FALSE
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
type |
character that specifies the estimator for the variance, can be
|
na.rm |
a logical value indicating whether NA values in |
check.for.zero |
a logical value indicating whether an error should be raised
if the scale estimate is zero. The default is
|
Details
For definitions of the scale estimators, see Fried and Dehling (2011).
If check.for.zero = TRUE, an error is thrown when the scale estimate
is zero. This argument is only included because the function is used in
rob_perm_statistic to compute values of robust test statistics
where the scale estimate is used for standardization. A scale estimate of zero
leads to an undefined test statistic, so that the corresponding test cannot
be performed.
Value
An estimate of the pooled variance of the two samples.
References
Fried R, Dehling H (2011). “Robust nonparametric tests for the two-sample location problem.” Statistical Methods & Applications, 20(4), 409–422. doi:10.1007/s10260-011-0164-1.
Select principle for computing null distribution
Description
select_method is a helper function that chooses the principle for
computing the null distribution of a two-sample test.
Usage
select_method(x, y, method, test.name, n.rep)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
method |
a character string specifying how the p-value is computed with
possible values |
test.name |
character string specifying the two-sample test for which the helper function is used. |
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if
|
Details
When the principle is specified by the user, i.e. method contains only
one element, the selected method is returned. Otherwise, if the user
does not specify the principle, it depends on the sample size: When both
samples contain more than 30 observations, an asymptotic test is performed.
If one of the samples contains fewer than 30 observations, the null
distribution is computed via the randomization principle. The number of
replications n.rep for the randomization test needs to be specified
outside of this function. Each test function contains the argument
n.rep where this can be done.
If n.rep is larger than the maximum number of splits and
method = "randomization", a permutation test is performed.
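The selection rule can be sketched as follows. choose_principle_sketch() is a hypothetical stand-in that takes the sample sizes directly; the handling of exactly 30 observations and the application of the n.rep rule after an automatic choice are assumptions.

```r
# Hypothetical helper mirroring the rule described above
choose_principle_sketch <- function(m, n,
                                    method = c("asymptotic", "permutation", "randomization"),
                                    n.rep = 10000) {
  if (length(method) > 1) {
    # No explicit choice by the user: decide by sample size
    # (treating exactly 30 observations as "small" is an assumption)
    method <- if (m > 30 && n > 30) "asymptotic" else "randomization"
  }
  # More replications requested than distinct splits exist: enumerate them instead
  if (method == "randomization" && n.rep > choose(m + n, m)) {
    method <- "permutation"
  }
  method
}

choose_principle_sketch(50, 50)                 # both samples large
choose_principle_sketch(20, 20, n.rep = 1000)   # small samples, many possible splits
choose_principle_sketch(5, 5)                   # only choose(10, 5) = 252 splits exist
```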
Value
A character string, which specifies the principle for computing the null distribution.
Trimmed mean
Description
trim_mean calculates a trimmed mean of a sample.
Usage
trim_mean(x, gamma = 0.2, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be trimmed from each end of the sample before calculating the mean. The default value is 0.2. |
na.rm |
a logical value indicating whether NA values in |
Details
This is a wrapper function for the function mean.
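For orientation, base R's mean already supports trimming through its trim argument. The sketch below compares it with a manual computation; that the wrapper passes gamma on as trim is an assumption.

```r
set.seed(108)
x <- rnorm(10)
gamma <- 0.2

# Number of observations removed from each end of the sorted sample
k <- floor(gamma * length(x))
manual <- mean(sort(x)[(k + 1):(length(x) - k)])

# Base R equivalent
builtin <- mean(x, trim = gamma)
all.equal(manual, builtin)
```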
Value
The trimmed mean.
Examples
# Generate random sample
set.seed(108)
x <- rnorm(10)
# Compute 20% trimmed mean
trim_mean(x, gamma = 0.2)
Test statistic for the two-sample trimmed t-test (Yuen's t-test)
Description
trimmed_t calculates the test statistic for the two-sample trimmed t-test.
Usage
trimmed_t(x, y, gamma = 0.2, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be trimmed from each end of the sample before calculating the mean. The default value is 0.2. |
na.rm |
a logical value indicating whether NA values in |
Value
A named list containing the following components:
statistic |
the value of the test statistic. |
estimates |
the trimmed means for both samples. |
df |
the degrees of freedom for the test statistic. |
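A base R sketch of a trimmed t-statistic in the pooled form of Yuen and Dixon (1973) shows how the three components relate. Both helpers are hypothetical, and whether trimmed_t uses exactly this formulation is an assumption.

```r
# Hypothetical helper: replace the k smallest and k largest values by their neighbours
winsorize_sketch <- function(x, gamma) {
  n <- length(x)
  k <- floor(gamma * n)
  xs <- sort(x)
  if (k > 0) {
    xs[seq_len(k)] <- xs[k + 1]
    xs[(n - k + 1):n] <- xs[n - k]
  }
  xs
}

# Hypothetical helper: pooled trimmed t-statistic (assumed Yuen-Dixon form)
yuen_sketch <- function(x, y, gamma = 0.2) {
  # Effective sample sizes after trimming
  h_x <- length(x) - 2 * floor(gamma * length(x))
  h_y <- length(y) - 2 * floor(gamma * length(y))
  # Winsorized sums of squared deviations
  wx <- winsorize_sketch(x, gamma)
  wy <- winsorize_sketch(y, gamma)
  ssd_x <- sum((wx - mean(wx))^2)
  ssd_y <- sum((wy - mean(wy))^2)
  pooled <- (ssd_x + ssd_y) / (h_x + h_y - 2)
  est <- c(mean(x, trim = gamma), mean(y, trim = gamma))
  list(
    statistic = (est[1] - est[2]) / sqrt(pooled * (1 / h_x + 1 / h_y)),
    estimates = est,
    df = h_x + h_y - 2
  )
}

set.seed(108)
x <- rnorm(20); y <- rnorm(20)
yuen_res <- yuen_sketch(x, y, gamma = 0.2)
```

With gamma = 0 nothing is trimmed or winsorized, and this form reduces to the ordinary pooled two-sample t-statistic.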
References
Yuen KK, Dixon WT (1973). “The approximate behaviour and performance of the two-sample trimmed t.” Biometrika, 60(2), 369–374. doi:10.2307/2334550.
Yuen KK (1974). “The two-sample trimmed t for unequal population variances.” Biometrika, 61(1), 165–170. doi:10.2307/2334299.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Compute trimmed t-statistic
trimmed_t(x, y, gamma = 0.2)
Two-sample trimmed t-test (Yuen's t-test)
Description
trimmed_test performs the two-sample trimmed t-test.
Usage
trimmed_test(
x,
y,
gamma = 0.2,
alternative = c("two.sided", "less", "greater"),
method = c("asymptotic", "permutation", "randomization"),
delta = ifelse(scale.test, 1, 0),
n.rep = 1000,
na.rm = FALSE,
scale.test = FALSE,
wobble.seed = NULL
)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be trimmed from each end of the sample before calculating the mean. The default value is 0.2. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater", or "less". |
method |
a character string specifying how the p-value is computed with
possible values |
delta |
a numeric value indicating the true difference in the location or
scale parameter, depending on whether the test should be performed
for a difference in location or in scale. The default is
|
n.rep |
an integer value specifying the number of random splits used to
calculate the randomization distribution if |
na.rm |
a logical value indicating whether NA values in |
scale.test |
a logical value to specify if the samples should be compared
for a difference in scale. The default is |
wobble.seed |
an integer value used as a seed for the random number
generation in case of |
Details
The function performs Yuen's t-test based on the trimmed mean and winsorized
variance (Yuen and Dixon 1973).
The amount of trimming/winsorization is set in gamma and
defaults to 0.2, i.e. 20% of the values at each end of the sample are removed/replaced.
In addition to the asymptotic distribution, a permutation and a
randomization version of the test are implemented.
When computing a randomization distribution based on randomly drawn splits
with replacement, the function permp (Phipson and Smyth 2010)
is used to calculate the p-value.
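The idea behind a randomization p-value can be sketched in base R. The helper below is hypothetical and uses the simple estimate (b + 1) / (n.rep + 1), where b counts random splits with a statistic at least as extreme as the observed one. This is not the permp computation itself, which according to Phipson and Smyth (2010) yields an exact p-value when splits are drawn with replacement. The unstandardized difference of trimmed means is used only to keep the example short.

```r
# Hypothetical helper: two-sided randomization p-value from random splits
rand_p_sketch <- function(x, y, n.rep = 1000,
                          stat = function(a, b) mean(a, trim = 0.2) - mean(b, trim = 0.2)) {
  z <- c(x, y)
  m <- length(x)
  observed <- stat(x, y)
  # Each replication draws a fresh random split, so splits may repeat
  draws <- replicate(n.rep, {
    idx <- sample.int(length(z), m)
    stat(z[idx], z[-idx])
  })
  b <- sum(abs(draws) >= abs(observed))
  # Adding 1 to numerator and denominator keeps the p-value away from zero
  (b + 1) / (n.rep + 1)
}

set.seed(108)
x <- rnorm(20); y <- rnorm(20)
p_rand <- rand_p_sketch(x, y, n.rep = 1000)
```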
For scale.test = TRUE, the test compares the two samples for a difference
in scale. This is achieved by log-transforming the original squared observations,
i.e. x is replaced by log(x^2) and y by log(y^2).
A potential scale difference then appears as a location difference between
the transformed samples, see Fried (2012).
Note that the samples need to have equal locations. The samples should not
contain zeros to prevent problems with the necessary log-transformation. If
they contain zeros, uniform noise is added to all observations in order to remove
the zeros and a message is printed.
If the sample has been modified because of zeros when scale.test = TRUE,
the modified samples can be retrieved using
set.seed(wobble.seed); wobble(x, y)
Both samples need to contain at least 5 non-missing values.
Value
A named list with class "htest" containing the following components:
statistic |
the value of the test statistic. |
parameter |
the degrees of freedom for the test statistic. |
p.value |
the p-value for the test. |
estimate |
the trimmed means of |
null.value |
the specified hypothesized value of the mean difference/squared scale ratio. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating how the p-value was computed. |
data.name |
a character string giving the names of the data. |
References
Yuen KK, Dixon WT (1973). “The approximate behaviour and performance of the two-sample trimmed t.” Biometrika, 60(2), 369–374. doi:10.2307/2334550.
Yuen KK (1974). “The two-sample trimmed t for unequal population variances.” Biometrika, 61(1), 165–170. doi:10.2307/2334299.
Fried R (2012). “On the online estimation of piecewise constant volatilities.” Computational Statistics & Data Analysis, 56(11), 3080–3090. doi:10.1016/j.csda.2011.02.012.
Examples
# Generate random samples
set.seed(108)
x <- rnorm(20); y <- rnorm(20)
# Trimmed t-test
trimmed_test(x, y, gamma = 0.1)
Winsorized mean
Description
win_mean calculates the winsorized mean of a sample.
Usage
win_mean(x, gamma = 0.2, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be replaced at each end of the sample before calculating the mean. The default value is 0.2. |
na.rm |
a logical value indicating whether NA values in |
Value
The winsorized mean.
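A small worked example in base R shows how winsorizing differs from trimming: the extreme values are replaced rather than dropped. win_mean_sketch() is a hypothetical helper, and replacing floor(gamma * n) values at each end is an assumption about the rounding rule.

```r
# Hypothetical helper: winsorized mean
win_mean_sketch <- function(x, gamma = 0.2) {
  n <- length(x)
  k <- floor(gamma * n)  # number of values replaced at each end (assumed rounding)
  xs <- sort(x)
  if (k > 0) {
    xs[seq_len(k)] <- xs[k + 1]      # pull the smallest values up
    xs[(n - k + 1):n] <- xs[n - k]   # pull the largest values down
  }
  mean(xs)
}

# The outlier 100 is replaced by 4 and the value 1 by 2: mean(c(2, 2, 3, 4, 4))
win_mean_sketch(c(1, 2, 3, 4, 100), gamma = 0.2)
```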
Examples
# Generate random sample
set.seed(108)
x <- rnorm(10)
# Compute 20% winsorized mean
win_mean(x, gamma = 0.2)
Winsorized variance
Description
win_var calculates the winsorized variance of a sample.
Usage
win_var(x, gamma = 0, na.rm = FALSE)
Arguments
x |
a (non-empty) numeric vector of data values. |
gamma |
a numeric value in [0, 0.5] specifying the fraction of observations to be replaced at each end of the sample before calculating the variance. The default value is 0. |
na.rm |
a logical value indicating whether NA values in |
Value
A named list containing the following items:
var |
winsorized variance. |
h |
degrees of freedom used for tests based on trimmed means and the winsorized variance. |
Examples
# Generate random sample
set.seed(108)
x <- rnorm(10)
# Compute 20% winsorized variance
win_var(x, gamma = 0.2)
Add random noise to remove ties
Description
wobble adds noise from a continuous uniform distribution to the
observations to remove ties.
Usage
wobble(x, y, check = TRUE)
Arguments
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
check |
a logical value indicating whether the samples should be checked
for ties prior to adding uniform noise or not, defaults to
|
Details
If check = TRUE, the function checks whether all values in the two numeric
input vectors are distinct. If so, it returns the original values, otherwise
the ties are removed by adding noise from a continuous uniform distribution
to all observations. If check = FALSE, it simply determines the number
of digits and adds uniform noise.
More precisely, we determine the minimum number of digits d_min in the sample
and then add random numbers from the U[-0.5 * 10^(-d_min), 0.5 * 10^(-d_min)]
distribution to each of the observations.
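The noise step can be sketched in base R as follows. Both helpers are hypothetical, and counting decimal places from a printed representation with 15 significant digits is an assumption about how the number of digits is determined.

```r
# Hypothetical helper: number of decimal places of each value (printed with 15 digits)
count_decimals <- function(v) {
  vapply(v, function(val) {
    s <- format(val, digits = 15, scientific = FALSE)
    # Drop everything up to and including the decimal point, count what is left
    nchar(sub("^[^.]*\\.?", "", s))
  }, integer(1))
}

# Hypothetical helper: add U[-0.5 * 10^(-d_min), 0.5 * 10^(-d_min)] noise to all values
wobble_sketch <- function(x, y) {
  z <- c(x, y)
  d_min <- min(count_decimals(z))
  half_width <- 0.5 * 10^(-d_min)
  z <- z + runif(length(z), -half_width, half_width)
  list(x = z[seq_along(x)], y = z[length(x) + seq_along(y)])
}

set.seed(108)
x <- round(rnorm(20)); y <- rnorm(20)
wobbled <- wobble_sketch(x, y)
anyDuplicated(c(wobbled$x, wobbled$y))  # 0 when no ties remain
```

Because x was rounded to integers, d_min is 0 here and the noise is drawn from U[-0.5, 0.5].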
Value
A named list of length two containing the modified input samples x and y.
References
Fried R, Gather U (2007). “On rank tests for shift detection in time series.” Computational Statistics & Data Analysis, 52(1), 221–233. doi:10.1016/j.csda.2006.12.017.
Examples
x <- rnorm(20); y <- rnorm(20); x <- round(x)
wobble(x, y)