Help for package eva

Title:

Extreme Value Analysis with Goodness-of-Fit Testing

Date:

2020-11-14

Version:

0.2.6

Description:

Goodness-of-fit tests for selection of r in the r-largest order statistics (GEVr) model. Goodness-of-fit tests for threshold selection in the Generalized Pareto distribution (GPD). Random number generation and density functions for the GEVr distribution. Profile likelihood for return level estimation using the GEVr and Generalized Pareto distributions. P-value adjustments for sequential, multiple testing error control. Non-stationary fitting of GEVr and GPD. Bader, B., Yan, J. & Zhang, X. (2016) <doi:10.1007/s11222-016-9697-3>. Bader, B., Yan, J. & Zhang, X. (2018) <doi:10.1214/17-AOAS1092>.

Imports:

Matrix, parallel, stats, graphics, utils, EnvStats

Depends:

R(≥ 2.10.0)

Suggests:

rmarkdown, knitr, SpatialExtremes

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

Repository:

CRAN

VignetteBuilder:

knitr

URL:

https://github.com/brianbader/eva_package

BugReports:

https://github.com/brianbader/eva_package/issues

LazyData:

true

RoxygenNote:

7.1.1

Encoding:

UTF-8

Author:

Brian Bader [aut, cre], Jun Yan [ctb]

Maintainer:

Brian Bader <bbader.stat@gmail.com>

NeedsCompilation:

Packaged:

2020-11-15 02:25:35 UTC; Brian

Date/Publication:

2020-11-15 17:20:02 UTC

eva: Extreme Value Analysis with Goodness-of-Fit Testing

Description

The focus of this package is to provide much needed automated diagnostic tools (in the form of statistical hypothesis testing) to extreme value models. Other useful functionality is efficient and user-friendly non-stationary model fitting, profile likelihood confidence intervals, data generation in the r-largest order statistics model (GEVr), and ordered p-value multiplicity adjustments. Also, all routines are implemented to efficiently handle the near-zero shape parameter, which may cause numerical issues in other packages. Functions can be roughly assigned to the following topics:

Formal (Automated) Goodness-of-Fit Testing

gevrSeqTests is a wrapper function that performs sequential testing for r in the GEVr distribution, with adjusted p-values. It can implement three tests:

gevrEd: An entropy difference test, which uses an asymptotic normal central limit theorem result.
gevrPbScore: A score test, implemented using parametric bootstrap and can be run in parallel.
gevrMultScore: An asymptotic approximation to the score test (computationally efficient).

gpdSeqTests is a wrapper function that performs sequential testing for thresholds in the Generalized Pareto distribution (GPD), with adjusted p-values. It can implement the following (six) tests:

gpdAd: The Anderson-Darling test, with log-linear interpolated p-values. Can also be bootstrapped (with a parallel option).
gpdCvm: The Cramer-Von Mises test, with log-linear interpolated p-values. Can also be bootstrapped (with a parallel option).
gpdImAsym: An asymptotic information matrix test, with bootstrapped covariance estimates.
gpdImPb: A full bootstrap version of information matrix test, with bootstrapped covariance estimates and critical values.
gpdPbScore: A score test, implemented using parametric bootstrap and can be run in parallel.
gpdMultScore: An asymptotic approximation to the score test (computationally effciient).

pSeqStop A simple function that reads in raw, ordered p-values and returns two sets that adjust for the familywise error rate and false discovery rate.

Data generation and model fitting

All the functions in this section (and package) efficiently handle a near-zero value of the shape parameter, which can cause numerical instability in similar functions from other packages. See the vignette for an example.

Data generation, density, quantile, and distribution functions can handle non-stationarity and vectorized inputs.

gevr Data generation and density function for the GEVr distribution, with distribution function and quantile functions available for GEV1 (block maxima).

gpd Data generation, distribution, quantile, and density functions for the GPD distribution.

gevrFit Non-stationary fitting of the GEVr distribution, with the option of maximum product spacings estimation when r=1. Uses formula statements for user friendliness and automatically centers/scales covariates when appropriate to speed up optimization.

gpdFit Non-stationary fitting of the GP distribution, with same options and implementation as ‘gevrFit’. Allows non-stationary threshold to be used.

gevrProfShape Profile likelihood estimation for the shape parameter of the stationary GEVr distribution.

gpdProfShape Profile likelihood estimation for the shape parameter of the stationary GP distribution.

gevrRl Profile likelihood estimation for return levels of the stationary GEVr distribution.

gpdRl Profile likelihood estimation for return levels of the stationary GP distribution.

Visual Diagnostics

gevrDiag, gpdDiag Diagnostic plots for a fit to the GEVr (GP) distribution. For stationary models, return level, density, quantile, and probability plots are returned. For non-stationary models, residual quantile, residual probability, and residuals versus covariate plots are returned.

mrlPlot Plots the empirical mean residual life, with confidence intervals. Visual diagnostic tool to choose a threshold for exceedances.

Data

fortmax Top ten annual precipitation events (inches) for one rain gauge in Fort Collins, Colorado from 1900 through 1999.

lowestoft Top ten annual sea levels at the LoweStoft station tide gauge from 1964 - 2014.

Top Ten Annual Precipitation: Fort Collins, Colorado

Description

Top ten annual precipitation events (inches) for one rain gauge in Fort Collins, Colorado from 1900 through 1999. See Katz et al. (2002) Sec. 2.3.1 for more information and analyses.

Usage

data(fortmax)

Format

A data frame with 100 observations. Each year is considered an observation, with the top ten annual precipitation events.

Source

Colorado Climate Center, Colorado State University. This is the original data source containing the daily precipitation data.

References

Katz, R. W., Parlange, M. B. and Naveau, P. (2002) Statistics of extremes in hydrology. Advances in Water Resources, 25, 1287-1304.

Examples

data(fortmax)
y <- fortmax[, -1]
gevrSeqTests(y, method = "ed")

The GEVr Distribution

Description

Random number generation (rgevr) and density (dgevr) functions for the GEVr distribution with parameters loc, scale, and shape. Also, quantile function (qgev) and cumulative distribution function (pgev) for the GEV1 distribution.

Usage

dgevr(x, loc = 0, scale = 1, shape = 0, log.d = FALSE)

rgevr(n, r, loc = 0, scale = 1, shape = 0)

qgev(p, loc = 0, scale = 1, shape = 0, lower.tail = TRUE, log.p = FALSE)

pgev(q, loc = 0, scale = 1, shape = 0, lower.tail = TRUE, log.p = FALSE)

Arguments

x

Vector or matrix of observations. If x is a matrix, each row is taken to be a new observation.

loc, scale, shape

Location, scale, and shape parameters. Can be vectors, but the lengths must be appropriate.

log.d

Logical: Whether or not to return the log density. (FALSE by default)

n

Number of observations

r

Number of order statistics for each observation.

p

Vector of probabilities.

lower.tail

Logical: If TRUE (default), probabilities are P[X <= x] otherwise, P[X > x].

log.p

Logical: If TRUE, probabilities p are given as log(p). (FALSE by default)

q

Vector of quantiles.

Details

GEVr data (in matrix x) should be of the form x[i,1] > x[i, 2] > \cdots > x[i, r] for each observation i = 1, \ldots, n. Note that currently the quantile and cdf functions are only for the GEV1 distribution. The GEVr distribution is also known as the r-largest order statistics model and is a generalization of the block maxima model (GEV1). The density function is given by

f_r (x_1, x_2, ..., x_r | \mu, \sigma, \xi) = \sigma^{-r} \exp\Big\{-(1+\xi z_r)^{-\frac{1}{\xi}} - \left(\frac{1}{\xi}+1\right)\sum_{j=1}^{r}\log(1+\xi z_j)\Big\}

for some location parameter \mu, scale parameter \sigma > 0, and shape parameter \xi, where x_1 > \cdots > x_r, z_j = (x_j - \mu) / \sigma, and 1 + \xi z_j > 0 for j=1, \ldots, r. When r = 1, this distribution is exactly the GEV distribution.

References

Coles, S. (2001). An introduction to statistical modeling of extreme values (Vol. 208). London: Springer.

Examples

# Plot the densities of the heavy and bounded upper tail forms of GEVr
set.seed(7)
dat1 <- rgevr(1000, 1, loc = 0, scale = 1, shape = 0.25)
dat2 <- rgevr(1000, 1, loc = 0, scale = 1, shape = -0.25)
hist(dat1, col = rgb(1, 0, 0, 0.5), xlim = c(-5, 10), ylim = c(0, 0.4),
     main = "Histogram of GEVr Densities", xlab = "Value", freq = FALSE)
hist(dat2, col = rgb(0, 0,1, 0.5), add = TRUE, freq = FALSE)
box()

# Generate sample with decreasing trend in location parameter
x <- rgevr(10, 2, loc = 10:1, scale = 1, shape = 0.1)
dgevr(x, loc = 10:1, scale = 10:1, shape = 0.1)

# Incorrect parameter specifications
## Not run: 
rgevr(10, 2, loc = 5:8, scale = 1, shape = 0.1)
rgevr(1, 2, loc = 5:8, scale = 1:2, shape = 0.1)

## End(Not run)

Diagnostic plots for a fit to the GEVr distribution.

Description

Diagnostic plots for a fit to the GEVr distribution.

Usage

gevrDiag(z, conf = 0.95, method = c("delta", "profile"))

Arguments

z

A class object returned from ‘gevrFit’.

conf

Confidence level used in the return level plot.

method

The method to compute the return level confidence interval - either delta method (default) or profile likelihood. Choosing profile likelihood may be quite slow.

Details

In certain cases the quantile plot may fail, because it requires solving a root equation. See the references for details.

Value

For stationary models, provides return level plot and density, probability, and quantile plots for each marginal order statistic. The overlaid density is the ‘true’ marginal density for the estimated parameters. For nonstationary models, provides residual probability and quantile plots. In addition, nonstationary models provide plots of the residuals vs. the parameter covariates.

References

Tawn, J. A. (1988). An extreme-value theory model for dependent observations. Journal of Hydrology, 101(1), 227-250.

Smith, R. L. (1986). Extreme value theory based on the r largest annual events. Journal of Hydrology, 86(1), 27-43.

Examples

## Not run: 
x <- rgevr(500, 2, loc = 0.5, scale = 1, shape = 0.1)
z <- gevrFit(x)
plot(z)

## End(Not run)

GEVr Entropy Difference Test

Description

Goodness-of-fit test for GEVr using the difference in likelihood between GEVr and GEV(r-1). This can be used sequentially to test for the choice of r.

Usage

gevrEd(data, theta = NULL)

Arguments

data

Data should be contain n rows, each a GEVr observation.

theta

Estimate for theta in the vector form (loc, scale, shape). If NULL, uses the MLE from the full data.

Details

GEVr data (in matrix x) should be of the form x[i,1] > x[i, 2] > \cdots > x[i, r] for each observation i = 1, \ldots, n. The test uses an asymptotic normality result based on the expected entropy between the GEVr and GEV(r-1) likelihoods. See reference for detailed information. This test can be used to sequentially test for the choice of r, implemented in the function ‘gevrSeqTests’.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Estimate of theta using the top r order statistics.

References

Bader B., Yan J., & Zhang X. (2015). Automated Selection of r for the r Largest Order Statistics Approach with Adjustment for Sequential Testing. Department of Statistics, University of Connecticut.

Examples

# This will test if the GEV2 distribution fits the data.
x <- rgevr(100, 2, loc = 0.5, scale = 1, shape = 0.5)
result <- gevrEd(x)

Parameter estimation for the GEVr distribution model

Description

This function provides maximum likelihood estimation for the GEVr model, with the option of probability weighted moment and maximum product spacing estimation for block maxima (GEV1) data. It also allows generalized linear modeling of the parameters.

Usage

gevrFit(
  data,
  method = c("mle", "mps", "pwm"),
  information = c("expected", "observed"),
  locvars = NULL,
  scalevars = NULL,
  shapevars = NULL,
  locform = ~1,
  scaleform = ~1,
  shapeform = ~1,
  loclink = identity,
  scalelink = identity,
  shapelink = identity,
  gumbel = FALSE,
  start = NULL,
  opt = "Nelder-Mead",
  maxit = 10000,
  ...
)

Arguments

data

Data should be a matrix from the GEVr distribution.

method

Method of estimation - maximum likelihood (mle), maximum product spacings (mps), and probability weighted moments (pwm). Uses mle by default. For r > 1, only mle can be used.

information

Whether standard errors should be calculated via observed or expected (default) information. For probability weighted moments, only expected information will be used if possible. In the case with covariates, only observed information is available.

locvars, scalevars, shapevars

A dataframe of covariates to use for modeling of the each parameter. Parameter intercepts are automatically handled by the function. Defaults to NULL for the stationary model.

locform, scaleform, shapeform

An object of class ‘formula’ (or one that can be coerced into that class), specifying the model of each parameter. By default, assumes stationary (intercept only) model. See details.

loclink, scalelink, shapelink

A link function specifying the relationship between the covariates and each parameter. Defaults to the identity function. For the stationary model, only the identity link should be used.

gumbel

Whether to fit the Gumbel (type I) extreme value distribution (i.e. shape parameter equals zero). Defaults to FALSE.

start

Option to provide a set of starting parameters to optim; a vector of location, scale, and shape, in that order. Otherwise, the routine attempts to find good starting parameters. See details.

opt

Optimization method to use with optim.

maxit

Number of iterations to use in optimization, passed to optim. Defaults to 10,000.

...

Additional arguments to pass to optim.

Details

In the stationary case (no covariates), starting parameters for mle and mps estimation are the probability weighted moment estimates. In the case where covariates are used, the starting intercept parameters are the probability weighted moment estimates from the stationary case and the parameters based on covariates are initially set to zero. For non-stationary parameters, the first reported estimate refers to the intercept term. Covariates are centered and scaled automatically to speed up optimization, and then transformed back to original scale.
Formulas for generalized linear modeling of the parameters should be given in the form '~ var1 + var2 + \cdots'. Essentially, specification here is the same as would be if using function ‘lm’ for only the right hand side of the equation. Interactions, polynomials, etc. can be handled as in the ‘formula’ class.
Intercept terms are automatically handled by the function. By default, the link functions are the identity function and the covariate dependent scale parameter estimates are forced to be positive. For some link function f(\cdot) and for example, scale parameter \sigma, the link is written as \sigma = f(\sigma_1 x_1 + \sigma_2 x_2 + \cdots + \sigma_k x_k).
Maximum likelihood estimation can be used in all cases. Probability weighted moment estimation can only be used if r = 1 and data is assumed to be stationary. Maximum product spacings estimation can be used in the non-stationary case, but only if r = 1.

Value

A list describing the fit, including parameter estimates and standard errors for the mle and mps methods. Returns as a class object ‘gevrFit’ to be used with diagnostic plots.

Examples

set.seed(7)
x1 <- rgevr(500, 1, loc = 0.5, scale = 1, shape = 0.3)
result1 <- gevrFit(x1, method = "mps")

# A linear trend in the location and scale parameter
n <- 100
r <- 10
x2 <- rgevr(n, r, loc = 100 + 1:n / 50,  scale = 1 + 1:n / 300, shape = 0)

covs <- as.data.frame(seq(1, n, 1))
names(covs) <- c("Trend1")
# Create some unrelated covariates
covs$Trend2 <- rnorm(n)
covs$Trend3 <- 30 * runif(n)
result2 <- gevrFit(data = x2, method = "mle", locvars = covs, locform = ~ Trend1 + Trend2*Trend3,
scalevars = covs, scaleform = ~ Trend1)

# Show summary of estimates
result2

GEVr Multiplier Score Test

Description

Fast weighted bootstrap alternative to the parametric bootstrap procedure for the GEVr score test.

Usage

gevrMultScore(data, bootnum, information = c("expected", "observed"))

Arguments

data

Data should be contain n rows, each a GEVr observation.

bootnum

Number of bootstrap replicates.

information

To use expected (default) or observed information in the test.

Details

GEVr data (in matrix x) should be of the form x[i,1] > x[i, 2] > \cdots > x[i, r] for each observation i = 1, \ldots, n.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Value of theta used in the test.

References

Bader B., Yan J., & Zhang X. (2015). Automated Selection of r for the r Largest Order Statistics Approach with Adjustment for Sequential Testing. Department of Statistics, University of Connecticut.

Examples

x <- rgevr(500, 5, loc = 0.5, scale = 1, shape = 0.3)
result <- gevrMultScore(x, bootnum = 1000)

GEVr Parametric Bootstrap Score Test

Description

Parametric bootstrap score test procedure to assess goodness-of-fit to the GEVr distribution.

Usage

gevrPbScore(
  data,
  bootnum,
  information = c("expected", "observed"),
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Data should be contain n rows, each a GEVr observation.

bootnum

Number of bootstrap replicates.

information

To use expected (default) or observed information in the test.

allowParallel

Should the bootstrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

GEVr data (in matrix x) should be of the form x[i,1] > x[i, 2] > \cdots > x[i, r] for each observation i = 1, \ldots, n.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Initial value of theta used in the test.

effective_bootnum

Effective number of bootstrap replicates (only those that converged are used).

References

Bader B., Yan J., & Zhang X. (2015). Automated Selection of r for the r Largest Order Statistics Approach with Adjustment for Sequential Testing. Department of Statistics, University of Connecticut.

Examples

# Generate some data from GEVr

x <- rgevr(200, 5, loc = 0.5, scale = 1, shape = 0.25)
gevrPbScore(x, bootnum = 99)

GEVr Shape Parameter Profile Likelihood Estimation for Stationary Models

Description

Computes the profile likelihood based confidence interval for the shape parameter of the stationary GEVr model.

Usage

gevrProfShape(z, conf = 0.95, plot = TRUE, opt = c("Nelder-Mead"))

Arguments

z

A class object returned from gevrFit.

conf

Confidence level to use. Defaults to 95 percent.

plot

Plot the profile likelihood and estimate (vertical line)?

opt

Optimization method to maximize the profile likelihood, passed to optim. The default method is Nelder-Mead.

Value

Estimate

Estimated shape parameter.

CI

Profile likelihood based confidence interval for the shape parameter.

ConfLevel

The confidence level used.

Examples

# Compare the length of the shape confidence intervals using GEV1 vs. GEV10
set.seed(7)
x <- rgevr(200, 10, loc = 0.5, scale = 1, shape = -0.3)
z1 <- gevrFit(x[, 1])
z2 <- gevrFit(x)
gevrProfShape(z1)
gevrProfShape(z2)

GEVr Return Level Estimate and Confidence Interval for Stationary Models

Description

Computes stationary m-period return level estimate and interval, using either the delta method or profile likelihood.

Usage

gevrRl(
  z,
  period,
  conf = 0.95,
  method = c("delta", "profile"),
  plot = TRUE,
  opt = c("Nelder-Mead")
)

Arguments

z

A class object returned from gevrFit. Must be a stationary fit.

period

The number of periods to use for the return level.

conf

Confidence level. Defaults to 95 percent.

method

The method to compute the confidence interval - either delta method (default) or profile likelihood.

plot

Plot the profile likelihood and estimate (vertical line)?

opt

Optimization method to maximize the profile likelihood if that is selected. The default method is Nelder-Mead.

Details

It is generally accepted that profile likelihood confidence intervals provide greater accuracy than the delta method, in particular for large return level periods. Also, by their nature, delta method confidence intervals must be symmetric which may be undesirable for return level estimation. If the original fit was Gumbel, then return levels will be for the Gumbel distribution.

Caution: The profile likelihood optimization may be slow (on the order of minutes).

Value

Estimate

Estimated m-period return level.

CI

Confidence interval for the m-period return level.

Period

The period length used.

ConfLevel

The confidence level used.

References

http://www.mas.ncl.ac.uk/~nlf8/teaching/mas8391/background/chapter2.pdf

Coles, S. (2001). An introduction to statistical modeling of extreme values (Vol. 208). London: Springer.

Examples

x <- rgevr(100, 2, loc = 0.5, scale = 1, shape = -0.3)
z <- gevrFit(x)
# Compute 250-period return level.
gevrRl(z, 250, method = "delta")

Sequential Tests for the GEVr Model

Description

Sequentially performs the entropy difference (ED) test or the multiplier or parametric bootstrap score tests for the GEVr model.

Usage

gevrSeqTests(
  data,
  bootnum = NULL,
  method = c("ed", "pbscore", "multscore"),
  information = c("expected", "observed"),
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Data should be contain n rows, each a GEVr observation.

bootnum

If method equals 'pbscore' or 'multscore', the number of bootstrap simulations to use.

method

Which test to run: ED test (ed), multiplier (multscore) or parametric bootstrap (pbscore) score test.

information

To use expected (default) or observed information in the score tests.

allowParallel

If method equals 'pbscore', should the parametric boostrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

GEVr data (in matrix x) should be of the form x[i,1] > x[i, 2] > \cdots > x[i, r] for each observation i = 1, \ldots, n. See function ‘pSeqStop’ for details on transformed p-values.

Value

Function returns a dataframe containing the test statistics, estimates, and p-value results of the sequential tests.

r

Value of r to be tested.

p.values

Raw p-values from the individual tests at each value of r.

ForwardStop

Transformed p-values according to the ForwardStop stopping rule.

StrongStop

Transformed p-values according to the StrongStop stopping rule.

statistic

Returned test statistics of each individual test.

est.loc

Estimated location parameter for the given r.

est.scale

Estimated scale parameter for the given r.

est.shape

Estimated shape parameter for the given r.

Examples

x <- rgevr(200, 5, loc = 0.5, scale = 1, shape = 0.25)
gevrSeqTests(x, method = "ed")

The Generalized Pareto Distribution (GPD)

Description

Density, distribution function, quantile function and random number generation for the Generalized Pareto distribution with location, scale, and shape parameters.

Usage

dgpd(x, loc = 0, scale = 1, shape = 0, log.d = FALSE)

rgpd(n, loc = 0, scale = 1, shape = 0)

qgpd(p, loc = 0, scale = 1, shape = 0, lower.tail = TRUE, log.p = FALSE)

pgpd(q, loc = 0, scale = 1, shape = 0, lower.tail = TRUE, log.p = FALSE)

Arguments

x

Vector of observations.

loc, scale, shape

Location, scale, and shape parameters. Can be vectors, but the lengths must be appropriate.

log.d

Logical; if TRUE, the log density is returned.

n

Number of observations.

p

Vector of probabilities.

lower.tail

Logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x].

log.p

Logical; if TRUE, probabilities p are given as log(p).

q

Vector of quantiles.

Details

The Generalized Pareto distribution function is given (Pickands, 1975) by

H(y) = 1 - \Big[1 + \frac{\xi (y - \mu)}{\sigma}\Big]^{-1/\xi}

defined on \{y : y > 0, (1 + \xi (y - \mu) / \sigma) > 0 \}, with location \mu, scale \sigma > 0, and shape parameter \xi.

References

Pickands III, J. (1975). Statistical inference using extreme order statistics. Annals of Statistics, 119-131.

Examples

dgpd(2:4, 1, 0.5, 0.01)
dgpd(2, -2:1, 0.5, 0.01)
pgpd(2:4, 1, 0.5, 0.01)
qgpd(seq(0.9, 0.6, -0.1), 2, 0.5, 0.01)
rgpd(6, 1, 0.5, 0.01)

# Generate sample with linear trend in location parameter
rgpd(6, 1:6, 0.5, 0.01)

# Generate sample with linear trend in location and scale parameter
rgpd(6, 1:6, seq(0.5, 3, 0.5), 0.01)

p <- (1:9)/10
pgpd(qgpd(p, 1, 2, 0.8), 1, 2, 0.8)

# Incorrect syntax (parameter vectors are of different lengths other than 1)
## Not run: 
rgpd(1, 1:8, 1:5, 0)
rgpd(10, 1:8, 1, 0.01)

## End(Not run)

Generalized Pareto Distribution Anderson-Darling Test

Description

Anderson-Darling goodness-of-fit test for the Generalized Pareto (GPD) distribution.

Usage

gpdAd(
  data,
  bootstrap = FALSE,
  bootnum = NULL,
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Data should be in vector form, assumed to be from the GPD.

bootstrap

Should bootstrap be used to obtain p-values for the test? By default, a table of critical values is used via interpolation. See details.

bootnum

Number of replicates if bootstrap is used.

allowParallel

Should the bootstrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

A table of critical values were generated via Monte Carlo simulation for shape parameters -0.5 to 1.0 by 0.1, which provides p-values via log-linear interpolation from .001 to .999. For p-values below .001, a linear equation exists by regressing -log(p-value) on the critical values for the tail of the distribution (.950 to .999 upper percentiles). This regression provides a method to extrapolate to arbitrarily small p-values.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Estimated value of theta for the initial data.

effective_bootnum

Effective number of bootstrap replicates if bootstrap based p-value is used (only those that converged are used).

References

Choulakian, V., & Stephens, M. A. (2001). Goodness-of-fit tests for the Generalized Pareto distribution. Technometrics, 43(4), 478-484.

Examples

# Generate some data from GPD
x <- rgpd(200, loc = 0, scale = 1, shape = 0.2)
gpdAd(x)

Generalized Pareto Distribution Cramer-von Mises Test

Description

Cramer-von Mises goodness-of-fit test for the Generalized Pareto (GPD) distribution.

Usage

gpdCvm(
  data,
  bootstrap = FALSE,
  bootnum = NULL,
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Data should be in vector form, assumed to be from the GPD.

bootstrap

Should bootstrap be used to obtain p-values for the test? By default, a table of critical values is used via interpolation. See details.

bootnum

Number of bootstrap replicates.

allowParallel

Should the bootstrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Estimated value of theta for the initial data.

effective_bootnum

Effective number of bootstrap replicates if bootstrap based p-value is used (only those that converged are used).

References

Choulakian, V., & Stephens, M. A. (2001). Goodness-of-fit tests for the Generalized Pareto distribution. Technometrics, 43(4), 478-484.

Examples

# Generate some data from GPD
x <- rgpd(200, loc = 0, scale = 1, shape = 0.2)
gpdCvm(x)

Diagnostic plots for a fit to the Generalized Pareto distribution

Description

Diagnostic plots for a fit to the Generalized Pareto distribution

Usage

gpdDiag(z, conf = 0.95, method = c("delta", "profile"))

Arguments

z

A class object returned from ‘gpdFit’.

conf

Confidence level used in the return level plot.

method

The method to compute the return level confidence interval - either delta method (default) or profile likelihood. Choosing profile likelihood may be quite slow.

Details

See the reference for details on how return levels are calculated.

Value

For stationary models, provides return level, density, probability, and quantile plots for the GPD exceedances. The overlaid density is the ‘true’ density for the estimated parameters. For nonstationary models, provides residual probability and quantile plots. In addition, nonstationary models provide plots of the residuals vs. the parameter covariates.

References

Coles, S. (2001). An introduction to statistical modeling of extreme values (Vol. 208). London: Springer.

Examples

## Not run: 
x <- rgpd(10000, loc = 0.5, scale = 1, shape = 0.1)
z <- gpdFit(x, nextremes = 500)
plot(z)

## End(Not run)

Parameter estimation for the Generalized Pareto Distribution (GPD)

Description

Fits exceedances above a chosen threshold to the Generalized Pareto model. Various estimation procedures can be used, including maximum likelihood, probability weighted moments, and maximum product spacing. It also allows generalized linear modeling of the parameters.

Usage

gpdFit(
  data,
  threshold = NA,
  nextremes = NA,
  npp = 365,
  method = c("mle", "mps", "pwm"),
  information = c("expected", "observed"),
  scalevars = NULL,
  shapevars = NULL,
  scaleform = ~1,
  shapeform = ~1,
  scalelink = identity,
  shapelink = identity,
  start = NULL,
  opt = "Nelder-Mead",
  maxit = 10000,
  ...
)

Arguments

data

Data should be a numeric vector from the GPD.

threshold

A threshold value or vector of the same length as the data.

nextremes

Number of upper extremes to be used (either this or the threshold must be given, but not both).

npp

Length of each period (typically year). Is used in return level estimation. Defaults to 365.

method

Method of estimation - maximum likelihood (mle), maximum product spacing (mps), and probability weighted moments (pwm). Uses mle by default. For pwm, only the stationary model can be fit.

information

Whether standard errors should be calculated via observed or expected (default) information. For probability weighted moments, only expected information will be used if possible. For non-stationary models, only observed information is used.

scalevars, shapevars

A dataframe of covariates to use for modeling of the each parameter. Parameter intercepts are automatically handled by the function. Defaults to NULL for the stationary model.

scaleform, shapeform

An object of class ‘formula’ (or one that can be coerced into that class), specifying the model of each parameter. By default, assumes stationary (intercept only) model. See details.

scalelink, shapelink

A link function specifying the relationship between the covariates and each parameter. Defaults to the identity function. For the stationary model, only the identity link should be used.

start

Option to provide a set of starting parameters to optim; a vector of scale and shape, in that order. Otherwise, the routine attempts to find good starting parameters. See details.

opt

Optimization method to use with optim.

maxit

Number of iterations to use in optimization, passed to optim. Defaults to 10,000.

...

Additional arguments to pass to optim.

Details

The base code for finding probability weighted moments is taken from the R package evir. See citation. In the stationary case (no covariates), starting parameters for mle and mps estimation are the probability weighted moment estimates. In the case where covariates are used, the starting intercept parameters are the probability weighted moment estimates from the stationary case and the parameters based on covariates are initially set to zero. For non-stationary parameters, the first reported estimate refers to the intercept term. Covariates are centered and scaled automatically to speed up optimization, and then transformed back to original scale.
Formulas for generalized linear modeling of the parameters should be given in the form '~ var1 + var2 + \cdots'. Essentially, specification here is the same as would be if using function ‘lm’ for only the right hand side of the equation. Interactions, polynomials, etc. can be handled as in the ‘formula’ class.
Intercept terms are automatically handled by the function. By default, the link functions are the identity function and the covariate dependent scale parameter estimates are forced to be positive. For some link function f(\cdot) and for example, scale parameter \sigma, the link is written as \sigma = f(\sigma_1 x_1 + \sigma_2 x_2 + \cdots + \sigma_k x_k).
Maximum likelihood estimation and maximum product spacing estimation can be used in all cases. Probability weighted moments can only be used for stationary models.

Value

A class object ‘gpdFit’ describing the fit, including parameter estimates and standard errors.

References

Pfaff, Bernhard, Alexander McNeil, and A. Stephenson. "evir: Extreme Values in R." R package version (2012): 1-7.

Examples

# Fit data using the three different estimation procedures
set.seed(7)
x <- rgpd(2000, loc = 0, scale = 2, shape = 0.2)

# Set threshold at 4
mle_fit <- gpdFit(x, threshold = 4, method = "mle")
pwm_fit <- gpdFit(x, threshold = 4, method = "pwm")
mps_fit <- gpdFit(x, threshold = 4, method = "mps")

# Look at the difference in parameter estimates and errors
mle_fit$par.ests
pwm_fit$par.ests
mps_fit$par.ests

mle_fit$par.ses
pwm_fit$par.ses
mps_fit$par.ses

# A linear trend in the scale parameter
set.seed(7)
n <- 300
x2 <- rgpd(n, loc = 0, scale = 1 + 1:n / 200, shape = 0)

covs <- as.data.frame(seq(1, n, 1))
names(covs) <- c("Trend1")

result1 <- gpdFit(x2, threshold = 0, scalevars = covs, scaleform = ~ Trend1)

# Show summary of estimates
result1

GPD Asymptotic Adjusted Information Matrix (IM) Test

Description

Runs the IM Test using bootstrap estimated covariance matrix. Asymptotically (in sample size) follows the F(3, bootnum - 3) distribution (see reference for details).

Usage

gpdImAsym(data, bootnum, theta = NULL)

Arguments

data

Data should be in vector form.

bootnum

Number of bootstrap replicates for the covariance estimate.

theta

Estimate for theta in the vector form (scale, shape). If NULL, uses the MLE.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Value of theta used in the test.

effective_bootnum

Effective number of bootstrap replicates used for the covariance estimate. If a replicate fails to converge, it will not be used in the estimation.

References

Dhaene, G., & Hoorelbeke, D. (2004). The information matrix test with bootstrap-based covariance matrix estimation. Economics Letters, 82(3), 341-347.

Examples

# Generate some data from GPD
x <- rgpd(200, loc = 0, scale = 1, shape = 0.2)
gpdImAsym(x, bootnum = 50)

GPD Bootstrapped Information Matrix (IM) Test

Description

Runs the IM Test using a two-step iterative procedure, to boostrap the covariance estimate and critical values. See reference for details.

Usage

gpdImPb(data, inner, outer, allowParallel = FALSE, numCores = 1)

Arguments

data

Data should be in vector form.

inner

Number of bootstrap replicates for the covariance estimate.

outer

Number of bootstrap replicates for critical values.

allowParallel

Should the outer bootstrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

Warning: This test can be very slow, since the covariance estimation is nested within the outer replicates. It would be recommended to use a small number of replicates for the covariance estimate (at most 50).

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Estimate of theta for the initial dataset.

effective_bootnum

Effective number of outer bootstrap replicates used (only those that converged are used).

References

Dhaene, G., & Hoorelbeke, D. (2004). The information matrix test with bootstrap-based covariance matrix estimation. Economics Letters, 82(3), 341-347.

Examples


x <- rgpd(200, loc = 0, scale = 1, shape = 0.2)
gpdImPb(x, inner = 20, outer = 99)

GPD Multiplier Score Test

Description

Fast weighted bootstrap alternative to the parametric bootstrap procedure for the Generalized Pareto score test.

Usage

gpdMultScore(data, bootnum, information = c("expected", "observed"))

Arguments

data

Data should be in vector form.

bootnum

Number of bootstrap replicates.

information

To use expected (default) or observed information in the test.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Value of theta used in the test.

Examples

x <- rgpd(100, loc = 0, scale = 1, shape = 0.25)
gpdMultScore(x, bootnum = 1000)

GPD Parametric Bootstrap Score Test

Description

Parametric bootstrap score test procedure to assess goodness-of-fit to the Generalized Pareto distribution.

Usage

gpdPbScore(
  data,
  bootnum,
  information = c("expected", "observed"),
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Data should be in vector form.

bootnum

Number of bootstrap replicates.

information

To use expected (default) or observed information in the test.

allowParallel

Should the bootstrap procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Value

statistic

Test statistic.

p.value

P-value for the test.

theta

Estimated value of theta for the initial data.

effective_bootnum

Effective number of bootstrap replicates (only those that converged are used).

Examples

# Generate some data from GPD
x <- rgpd(200, loc = 0, scale = 1, shape = 0.2)
gpdPbScore(x, bootnum = 100)

GPD Shape Parameter Profile Likelihood Estimation for Stationary Models

Description

Computes the profile likelihood based confidence interval for the shape parameter of the stationary Generalized Pareto model.

Usage

gpdProfShape(z, conf = 0.95, plot = TRUE)

Arguments

z

A class object returned from gpdFit.

conf

Confidence level to use. Defaults to 95 percent.

plot

Plot the profile likelihood and estimate (vertical line)?

Value

Estimate

Estimated shape parameter.

CI

Profile likelihood based confidence interval for the shape parameter.

ConfLevel

The confidence level used.

Examples

x <- rgpd(200, loc = 0, scale = 1, shape = 0.25)
z <- gpdFit(x, threshold = 0)
gpdProfShape(z)

GPD Return Level Estimate and Confidence Interval for Stationary Models

Description

Computes stationary m-period return level estimate and interval for the Generalized Pareto distribution, using either the delta method or profile likelihood.

Usage

gpdRl(
  z,
  period,
  conf = 0.95,
  method = c("delta", "profile"),
  plot = TRUE,
  opt = c("Nelder-Mead")
)

Arguments

z

An object of class ‘gpdFit’.

period

The number of periods to use for the return level.

conf

Confidence level. Defaults to 95 percent.

method

The method to compute the confidence interval - either delta method (default) or profile likelihood.

plot

Plot the profile likelihood and estimate (vertical line)?

opt

Optimization method to maximize the profile likelihood if that is selected. Argument passed to optim. The default method is Nelder-Mead.

Details

Caution: The profile likelihood optimization may be slow for large datasets.

Value

Estimate

Estimated m-period return level.

CI

Confidence interval for the m-period return level.

Period

The period length used.

ConfLevel

The confidence level used.

References

Coles, S. (2001). An introduction to statistical modeling of extreme values (Vol. 208). London: Springer.

Examples

x <- rgpd(5000, loc = 0, scale = 1, shape = -0.1)
# Compute 50-period return level.
z <- gpdFit(x, nextremes = 200)
gpdRl(z, period = 50, method = "delta")
gpdRl(z, period = 50, method = "profile")

GPD Multiple Threshold Goodness-of-Fit Testing

Description

Wrapper function to test multiple thresholds for goodness-of-fit to the Generalized Pareto model. Can choose which test to run from the available tests in this package.

Usage

gpdSeqTests(
  data,
  thresholds = NULL,
  nextremes = NULL,
  method = c("ad", "cvm", "pbscore", "multscore", "imasym", "impb"),
  nsim = NULL,
  inner = NULL,
  outer = NULL,
  information = c("expected", "observed"),
  allowParallel = FALSE,
  numCores = 1
)

Arguments

data

Original, full dataset in vector form.

thresholds

A set of threshold values (either this or a set of the number of extremes must be given, but not both). Must be provided as a vector.

nextremes

A set of the number of upper extremes to be used, provided as a vector.

method

Which test to run to sequentially test the thresholds. Must be one of ‘ad’, ‘cvm’, ‘pbscore’, ‘multscore’, ‘imasym’, or ‘impb’.

nsim

Number of boostrap replicates for the ‘ad’, ‘cvm’, ‘pbscore’, ‘multscore’, and ‘imasym’ tests.

inner

Number of inner boostrap replicates if ‘impb’ test is chosen.

outer

Number of outer boostrap replicates if ‘impb’ test is chosen.

information

To use observed or expected (default) information for the ‘pbscore’ and ‘multscore’ tests.

allowParallel

If selected, should the ‘cvm’, ‘ad’, ‘pbscore’, or ‘impb’ procedure be run in parallel or not. Defaults to false.

numCores

If allowParallel is true, specify the number of cores to use.

Details

Function returns a matrix containing the thresholds used, the number of observations above each threshold, the corresponding test statistics, p-values (raw and transformed), and parameter estimates at each threshold. The user must provide the data, a vector of thresholds or number of upper extremes to be used, and select the test.

Value

threshold

The threshold used for the test.

num.above

The number of observations above the given threshold.

p.values

Raw p-values for the thresholds tested.

ForwardStop

Transformed p-values according to the ForwardStop stopping rule.

StrongStop

Transformed p-values according to the StrongStop stopping rule.

statistic

Returned test statistics of each individual test.

est.scale

Estimated scale parameter for the given threshold.

est.shape

Estimated shape parameter for the given threshold.

Examples

set.seed(7)
x <- rgpd(10000, loc = 0, scale = 5, shape = 0.2)
## A vector of thresholds to test
threshes <- c(1.5, 2.5, 3.5, 4.5, 5.5)
gpdSeqTests(x, thresholds = threshes, method = "ad")

Top Ten Annual Sea Levels: Lowestoft, UK (1964 - 2014)

Description

Top ten annual sea levels at the LoweStoft Station tide gauge from 1964 - 2014. From 1964 - 1992, raw data is collected in hour intervals; from 1993 - present, raw data is collected in fifteen minute intervals. Data is pre-processed here to account for storm length - see reference for details.

Usage

data(lowestoft)

Format

A data matrix with 51 observations. Each year is considered an observation, with the top ten annual sea level events.

Source

UK Tide Gauge Network (Lowestoft Station): https://www.bodc.ac.uk/data/online_delivery/ntslf/processed/

References

Bader B., Yan J., & Zhang X. (2015). Automated Selection of r for the r Largest Order Statistics Approach with Adjustment for Sequential Testing. Department of Statistics, University of Connecticut.

Examples

data(lowestoft)
gevrSeqTests(lowestoft, method = "ed")
## Not run
## Look at the difference in confidence intervals between r = 1 and r = 10
# z1 <- gevrFit(lowestoft[, 1])
# z2 <- gevrFit(lowestoft)
# gevrRl(z1, 50, method = "profile")
# gevrRl(z2, 50, method = "profile")

Mean Residual Life Plot for the Generalized Pareto Distribution

Description

Plots the empirical mean residual life, with confidence intervals. The mean residual life plot provides a visual diagnostic tool to choose a threshold for exceedances.

Usage

mrlPlot(data, thresholds = NULL, conf = 0.95, npoints = 100)

Arguments

data

Vector of data.

thresholds

A numeric vector of threshold(s) to plot vertically. Defaults to NULL.

conf

The level of the confidence bounds to use. Defaults to 0.95.

npoints

The number of points to interpolate with. Defaults to 100.

Examples


x <- rgpd(500, loc = 0, scale = 1, shape = 0.1)
mrlPlot(x, thresholds = c(2))

P-Value Sequential Adjustment

Description

Given a set of (ordered) p-values, returns p-values adjusted according to the ForwardStop and StrongStop stopping rules.

Usage

pSeqStop(p)

Arguments

p

Vector of ordered p-values.

Details

Roughly speaking, under the assumption of independent but ordered p-values, the StrongStop adjusted p-values control for the familywise error rate, while ForwardStop provides control for the false discovery rate.

Value

StrongStop

Vector of ordered p-values adjusted for the familywise error rate.

ForwardStop

Vector of ordered p-values adjusted for the false discovery rate.

UnAdjusted

Vector of non-transformed p-values.

References

G'Sell, M. G., Wager, S., Chouldechova, A., & Tibshirani, R. (2013). Sequential Selection Procedures and False Discovery Rate Control. arXiv preprint arXiv:1309.5352.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289-300.

Bader B., Yan J., & Zhang X. (2015). Automated Selection of r for the r Largest Order Statistics Approach with Adjustment for Sequential Testing. Department of Statistics, University of Connecticut.

Examples

x <- rgevr(500, 10, loc = 0.5, scale = 1, shape = 0.5)
y <- gevrSeqTests(x, method = "ed")
pSeqStop(rev(y$p.values))