Version: 0.7-17
Date: 2024-05-14
Title: Non-Supervised Regional Frequency Analysis
Depends: R (≥ 3.0.0), stats, graphics, methods
Description: A collection of statistical tools for objective (non-supervised) applications of the Regional Frequency Analysis methods in hydrology. The package refers to the index-value method and, more precisely, helps the hydrologist to: (1) regionalize the index-value; (2) form homogeneous regions with similar growth curves; (3) fit distribution functions to the empirical regional growth curves. Most of the methods are those described in the Flood Estimation Handbook (Centre for Ecology & Hydrology, 1999, ISBN:9781906698003). Homogeneity tests from Hosking and Wallis (1993) <doi:10.1029/92WR01980> and Viglione et al. (2007) <doi:10.1029/2006WR005095> are available.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Copyright: The functions 'MLlaio2004.R', 'GOFlaio2004.R' and 'MSClaio2008.R' are derived from original Francesco Laio Matlab code (unpublished material).
NeedsCompilation: no
Packaged: 2024-05-14 09:56:44 UTC; alviglio
Author: Alberto Viglione [aut, cre], Francesco Laio [ctb], Eric Gaume [ctb], Olivier Payrastre [ctb], Jose Luis Salinas [ctb], Chi Cong N'guyen [ctb], Karine Halbert [ctb]
Maintainer: Alberto Viglione <alberto.viglione@polito.it>
Repository: CRAN
Date/Publication: 2024-05-14 14:40:02 UTC
Non-supervised Regional Frequency Analysis
Description
The estimation of hydrological variables in ungauged basins is an important topic for many purposes, from research to engineering applications and water management (see the PUB project, Sivapalan et al., 2003). Regardless of the method used to perform such estimation, the underlying idea is to transfer the hydrological information from gauged to ungauged sites. When observations of the same variable at different measuring sites are available and many data samples are used together as sources of information, the methods are called regional methods. The well-known regional frequency analysis (e.g. Hosking and Wallis, 1997), where the interest is in the assessment of the frequency of hydrological events, belongs to this class of methods. In the literature, the most studied variable is the flood peak and the most used regional approach is the index-flood method of Dalrymple (1960), in which it is implicitly assumed that the flood frequency distribution for different sites belonging to a homogeneous region is the same except for a site-specific scale factor, the index-flood (see Hosking and Wallis, 1997, for details). Hence, the estimation of the flood frequency distribution for an ungauged site can be divided into two parts: estimation of the index-flood (more generally, the index-value) through linear/non-linear relations with climatic and basin descriptors; and estimation of the dimensionless flood frequency distribution, the growth curve, by assigning the ungauged basin to one homogeneous region.
nsRFA is a collection of statistical tools for objective (non-supervised) applications of the Regional Frequency Analysis methods in hydrology. This does not mean that Regional Frequency Analysis should be non-supervised: these tools are addressed to experts, to help their expert judgement. The functions in nsRFA allow the application of the index-flood method in the following points:
- regionalization of the index-value;
- formation of homogeneous regions for the growth curves;
- fit of a distribution function to the empirical growth curve of each region.
Regionalization of the index-value
The index-value can be either the sample mean (e.g. Hosking and Wallis, 1997), the sample median (e.g. Robson and Reed, 1999) or another scale parameter. Many methodological approaches are available for the index-value estimation, and their differences can be related to the amount of information available. Excluding direct methods, which use information provided by flow data available at the station of interest, regional estimation methods require ancillary hydrological and physical information. Those methods can be divided into two classes: the multiregressive approach and the hydrological simulation approach. For both of them, the best estimator is the one that optimizes some criterion, such as the minimum error, the minimum variance or the maximum efficiency. Due to its simplicity, the most frequently used method is the multiregressive approach (see e.g. Kottegoda & Rosso, 1998; Viglione et al., 2007a), which relates the index-flow to catchment characteristics, such as climatic indices, geologic and morphologic parameters, land cover type, etc., through linear or non-linear equations.
R provides the functions lm and nls for linear and non-linear regressions (package stats). With the package nsRFA, a tool to select the best linear regressions given a set of candidate descriptors, bestlm, is provided. In REGRDIAGNOSTICS several functions are provided to analyze the output of lm, such as:
- the coefficient of determination (classic and adjusted);
- the Student t significance test;
- the variance inflation factor (VIF);
- the root mean squared error (RMSE);
- the mean absolute error (MAE);
- the prediction intervals;
- predicted values by a jackknife (cross-validation) procedure.
The functions in DIAGNOSTICS provide more general diagnostics of model results (which can also be non-linear), comparing estimated values with observed values.
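As a minimal sketch of this multiregressive workflow (using the hydroSIMN data shipped with the package and the diagnostic functions documented later in this manual; the descriptors Am, Hm and Ybar are simply those of the DIAGNOSTICS example, not a recommended choice):
library(nsRFA)
data(hydroSIMN)
regr <- lm(Dm ~ Am + Hm + Ybar, data=parameters)  # index-value vs catchment descriptors
summary(regr)                                     # coefficient of determination, t tests, ...
R2(parameters$Dm, regr$fitted.values)             # R2 from DIAGNOSTICS
RMSE(parameters$Dm, regr$fitted.values)           # root mean squared error from DIAGNOSTICS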
More details are provided in the vignette:
nsRFA_ex01 - How to use the package nsRFA (example 1): Regional frequency analysis of the annual flows in Piemonte and Valle d'Aosta
which can be accessed via vignette("nsRFA_ex01", package="nsRFA").
Formation of homogeneous regions for the growth curves
Different techniques exist, for example those that lead to the formation of fixed regions through cluster analysis (Hosking and Wallis, 1997, Viglione, 2007), or those based on the method of the region of influence (ROI, Burn, 1990).
The regional procedure can be divided into two parts: the formation of regions and the assignment of an ungauged site to one of them.
Concerning the first point, the sites are grouped according to their similarity in terms of those basin descriptors that are assumed to explain the shape of the growth curve.
This shape is usually quantified in a parametric way.
For instance, the coefficient of variation (CV) or the L-CV of the curve can be used for this purpose.
The package nsRFA provides the functions in moments and Lmoments to calculate sample moments and L-moments. Subsequently, the selected parameter is related to basin descriptors through a linear or a more complex model. A regression analysis is performed with different combinations of descriptors, and the descriptors that are strongly related to the parameter are used to group sites into regions.
The same tools used for the regionalization of the index-value, i.e. bestlm, REGRDIAGNOSTICS and DIAGNOSTICS, can be used if the parametric method is chosen. nsRFA also provides a non-parametric approach that considers the dimensionless growth curve as a whole (see Viglione et al., 2006; Viglione, 2007).
The multiregressive approach can still be used if we reason in terms of (dis)similarity between pairs of basins in the following way: (1) for each pair of stations, a dissimilarity index between non-dimensional curves is calculated using a quantitatively predefined metric, for example the Anderson-Darling statistic (A2), and the distances are organised in a matrix with AD.dist; (2) for each basin descriptor, the absolute value (or another metric) of the difference between its measures in two basins is used as the distance between them, using dist of the package stats to obtain distance matrices; (3) a multiregressive approach (bestlm, lm) is applied considering the matrices as variables, and the basin descriptors associated to the best regression are chosen; (4) the significance of the linear relationship between distance matrices is assessed through the Mantel test with mantel.lm. Steps (1) and (2) are sketched below.
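A minimal sketch of steps (1) and (2), using the hydroSIMN data and, as a purely illustrative descriptor, the column Hm of parameters:
library(nsRFA)
data(hydroSIMN)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
Dad <- AD.dist(x, cod)             # step (1): Anderson-Darling distance matrix (slow)
Ddescr <- dist(parameters[,"Hm"])  # step (2): distance matrix of one descriptor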
In the suitable-descriptor space, stations with similar descriptor values can be grouped into disjoint regions through a cluster analysis (using functions in traceWminim), or the ROI method can be used, adapting a region to the ungauged basin (roi). In both cases, the homogeneity of the regions can be assessed with the functions in HOMTESTS, where the Hosking and Wallis heterogeneity measures (HW.tests, see Hosking and Wallis, 1997) and the Anderson-Darling homogeneity test (ADbootstrap.test, see Viglione et al., 2007b) are provided.
More details are provided in the vignettes:
nsRFA_ex01 - How to use the package nsRFA (example 1): Regional frequency analysis of the annual flows in Piemonte and Valle d'Aosta
nsRFA_ex02 - How to use the package nsRFA (example 2): Region-Of-Influence approach, some FEH examples
which can be accessed via vignette("nsRFA_ex01", package="nsRFA") and vignette("nsRFA_ex02", package="nsRFA").
Fit of a distribution function to the empirical growth curve of each region
Once an homogeneous region is defined, the empirical growth curves can be pooled together and a probability distribution can be fitted to the pooled sample.
The choice of the best distribution can be assisted by a Model Selection Criteria with MSClaio2008
(see, Laio et al., 2008).
The parameters of the selected distribution can be estimated using the method of moments (moment_estimation
), L-moments (par.GEV
, par.genpar
, par.gamma
, ...) or maximum-likelihood (MLlaio2004
).
Goodness-of-fit tests are also available: the Anderson-Darling goodness of fit test with GOFlaio2004
(Laio. 2004), and Monte-Carlo based tests with GOFmontecarlo
.
Confidence intervals for the fitted distribution can be calculated with a Markov Chain Monte Carlo algorithm, using BayesianMCMC
.
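A minimal sketch of the L-moment fitting step, using the hydroSIMN data and functions documented later in this manual (the choice of the GEV is arbitrary here; in practice it would be guided by MSClaio2008 and the goodness-of-fit tests):
library(nsRFA)
data(hydroSIMN)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
Rll <- regionalLmoments(x, fac)             # regional L-moments of the pooled sample
pars <- par.GEV(Rll[1], Rll[2], Rll[4])     # GEV parameters of the regional growth curve
invF.GEV(0.99, pars$xi, pars$alfa, pars$k)  # quantile with non-exceedance probability 0.99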
More details are provided in the vignettes:
nsRFA_ex01 - How to use the package nsRFA (example 1): Regional frequency analysis of the annual flows in Piemonte and Valle d'Aosta
MSClaio2008 - Model selection techniques for the frequency analysis of hydrological extremes: the MSClaio2008 R function
which can be accessed via vignette("nsRFA_ex01", package="nsRFA") and vignette("MSClaio2008", package="nsRFA").
Other functions
varLmoments provides distribution-free unbiased estimators of the variances and covariances of sample L-moments, as described in Elamir and Seheult (2004). More details are provided in the vignette:
Fig1ElamirSeheult - Figure 1 in Elamir and Seheult (2004)
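A minimal usage sketch (assuming varLmoments accepts a plain numeric sample vector):
library(nsRFA)
x <- rnorm(30, 10, 2)  # illustrative sample
varLmoments(x)         # variances and covariances of the sample L-moments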
Details
Package: nsRFA
Version: 0.7
The package provides several tools for Regional Frequency Analysis of hydrological variables. The first version dates to 2006 and was developed in Turin at the Politecnico by Alberto Viglione.
For a complete list of the functions, use library(help="nsRFA").
Main changes in version 0.7
0.7-17: removal of the old Fortran code and, therefore, of the functions bestlm (and the vignette nsRFA_ex01) and HW.original;
0.7-12: refinement of the function BayesianMCMC, allowing several thresholds, and new function BayesianMCMCreg;
0.7-1: refinement of the function BayesianMCMC;
0.7-0: plotting positions for historical information in DISTPLOTS;
Main changes in version 0.6
0.6-9: new vignette Fig11GriffisStedinger;
0.6-8: exponential and Gumbel distributions added in GOFmontecarlo;
0.6-6: some plotting-position/probability plots have been added in DISTPLOTS;
0.6-4: refinement of the function BayesianMCMC;
0.6-2: new vignette nsRFA_ex02;
0.6-2: refinement of the function BayesianMCMC;
0.6-0: new vignette nsRFA_ex01;
0.6-0: new function bestlm;
0.6-0: the plotting-position/probability plots in DISTPLOTS have been reshaped;
0.6-0: this list of changes has been added.
Author(s)
Alberto Viglione
Maintainer: Alberto Viglione <viglione@hydro.tuwien.ac.at>
References
All the manual references are listed here:
Beirlant, J., Goegebeur, Y., Teugels, J., Segers, J., 2004. Statistics of Extremes: Theory and Applications. John Wiley and Sons Inc., 490 pp.
Burn, D.H., 1990. Evaluation of regional flood frequency analysis with a region of influence approach. Water Resources Research 26(10), 2257-2265.
Castellarin, A., Burn, D.H., Brath, A., 2001. Assessing the effectiveness of hydrological similarity measures for flood frequency analysis. Journal of Hydrology 241, 270-285.
Chowdhury, J.U., Stedinger, J.R., Lu, L.H., Jul. 1991. Goodness-of-fit tests for regional generalized extreme value flood distributions. Water Resources Research 27(7), 1765-1776.
D'Agostino, R., Stephens, M., 1986. Goodness-of-fit Techniques. Vol. 68 of Statistics: Textbooks and Monographs. Department of Statistics, Southern Methodist University, Dallas, Texas.
Dalrymple, T., 1960. Flood frequency analyses. Vol. 1543-A of Water Supply Paper. U.S. Geological Survey, Reston, Va.
Durbin, J., Knott, M., 1971. Components of Cramer-von Mises statistics. London School of Economics and Political Science, pp. 290-307.
El Adlouni, S., Bobée, B., Ouarda, T.B.M.J., 2008. On the tails of extreme event distributions in hydrology. Journal of Hydrology 355(1-4), 16-33.
Elamir, E.A.H., Seheult, A.H., 2004. Exact variance structure of sample L-moments. Journal of Statistical Planning and Inference 124, 337-359.
Everitt, B., 1974. Cluster Analysis. Social Science Research Council. Halsted Press, New York.
Fill, H., Stedinger, J., 1998. Using regional regression within index flood procedures and an empirical Bayesian estimator. Journal of Hydrology 210(1-4), 128-145.
Greenwood, J., Landwehr, J., Matalas, N., Wallis, J., 1979. Probability weighted moments: Definition and relation to parameters of several distributions expressible in inverse form. Water Resources Research 15, 1049-1054.
Hosking, J., 1986. The theory of probability weighted moments. Tech. Rep. RC12210, IBM Research, Yorktown Heights, NY.
Hosking, J., 1990. L-moments: analysis and estimation of distributions using linear combinations of order statistics. J. Royal Statistical Soc. 52, 105-124.
Hosking, J., Wallis, J., 1993. Some statistics useful in regional frequency analysis. Water Resources Research 29(2), 271-281.
Hosking, J., Wallis, J., 1997. Regional Frequency Analysis: An Approach Based on L-Moments. Cambridge University Press.
Institute of Hydrology, 1999. Flood Estimation Handbook, Institute of Hydrology, Oxford.
Kendall, M., Stuart, A., 1961-1979. The Advanced Theory of Statistics. Charles Griffin & Company Limited, London.
Kottegoda, N.T., Rosso, R., 1997. Statistics, Probability, and Reliability for Civil and Environmental Engineers, international Edition. McGraw-Hill Companies.
Laio, F., 2004. Cramer-von Mises and Anderson-Darling goodness of fit tests for extreme value distributions with unknown parameters, Water Resour. Res., 40, W09308, doi:10.1029/2004WR003204.
Laio, F., Tamea, S., 2007. Verification tools for probabilistic forecast of continuous hydrological variables. Hydrology and Earth System Sciences 11, 1267-1277.
Laio, F., Di Baldassarre G., Montanari A., 2008. Model selection techniques for the frequency analysis of hydrological extremes, Water Resour. Res., Under Revision.
Regione Piemonte, 2004. Piano di tutela delle acque. Direzione Pianificazione Risorse Idriche.
Robson, A., Reed, D., 1999. Statistical procedures for flood frequency estimation. In: Flood Estimation Handbook. Vol. 3. Institute of Hydrology, Crowmarsh Gifford, Wallingford, Oxfordshire.
Sankarasubramanian, A., Srinivasan, K., 1999. Investigation and comparison of sampling properties of l-moments and conventional moments. Journal of Hydrology 218, 13-34.
Scholz, F., Stephens, M., 1987. K-sample Anderson-Darling tests. Journal of American Statistical Association, 82(399), pp. 918-924.
Sivapalan, M., Takeuchi, K., Franks, S.W., Gupta, V.K., Karambiri, H., Lakshmi, V., Liang, X., McDonnell, J.J., Mendiondo, E.M., O'Connell, P.E., Oki, T., Pomeroy, J.W., Schertzer, D., Uhlenbrook, S., Zehe, E., 2003. IAHS decade on Predictions in Ungauged Basins (PUB), 2003-2012: Shaping an exciting future for the hydrological sciences. Hydrological Sciences Journal / Journal des Sciences Hydrologiques, 48(6), 857-880.
Smouse, P.E., Long, J.C., Sokal, R.R., 1986. Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology, 35(4), 627-632.
Stedinger, J., Lu, L., 1995. Appraisal of regional and index flood quantile estimators. Stochastic Hydrology and Hydraulics 9(1), 49-75.
Stedinger, J.R., Vogel, R.M., Foufoula-Georgiou, E., 1993. Frequency analysis of extreme events. In: David R. Maidment (Ed.), Handbook of Hydrology, Chapter 18. McGraw-Hill Companies, international edition.
Viglione, A., Claps, P., Laio, F., 2006. Utilizzo di criteri di prossimità nell'analisi regionale del deflusso annuo. XXX Convegno di Idraulica e Costruzioni Idrauliche - IDRA 2006, Roma, 10-15 Settembre 2006.
Viglione, A., 2007. Metodi statistici non-supervised per la stima di grandezze idrologiche in siti non strumentati, PhD thesis at the Politecnico of Turin, February 2007.
Viglione, A., Claps, P., Laio, F., 2007a. Mean annual runoff estimation in North-Western Italy, In: G. La Loggia (Ed.) Water resources assessment and management under water scarcity scenarios, CDSU Publ. Milano.
Viglione, A., Laio, F., Claps, P., 2007b. A comparison of homogeneity tests for regional frequency analysis. Water Resources Research 43(3).
Vogel, R., Fennessey, N., 1993. L moment diagrams should replace product moment diagrams. Water Resources Research 29(6), 1745-1752.
Vogel, R., Wilson, I., 1996. Probability distribution of annual maximum, mean, and minimum streamflows in the United States. Journal of Hydrologic Engineering 1(2), 69-76.
Ward, J., 1963. Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, pp. 236-244.
Anderson-Darling distance matrix for growth curves
Description
Distance matrix for growth curves. Every element of the matrix is the Anderson-Darling statistic calculated between two series.
Usage
AD.dist (x, cod, index=2)
Arguments
x: vector representing data from many samples defined with cod
cod: array that defines the data subdivision among sites
index: if index=1 the sample mean is used as index-value, if index=2 (default) the sample median is used
Details
The Anderson-Darling statistic used here is the one defined in https://en.wikipedia.org/wiki/Anderson-Darling_test as A^2.
Value
AD.dist returns the distance matrix between growth curves built with the Anderson-Darling statistic.
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
# AD.dist
AD.dist(x,cod) # it takes some time
Data-sample
Description
Systematic flood data and historical flood data for the Ardèche region (France) as described in:
Naulet, R. (2002). Utilisation de l'information des crues historiques pour une meilleure prédétermination du risque d'inondation. Application au bassin de l'Ardèche à Vallon Pont-d'Arc et St-Martin d'Ardèche. PhD thesis at CEMAGREF, Lyon, and at the Université du Québec, pp. 322.
and
Nguyen, C.C. (2012). Amélioration des approches Bayésiennes MCMC pour l'analyse régionale des crues (Improvement of Bayesian MCMC approaches for regional flood frequency analyses). PhD thesis at the University of Nantes, pp. 192.
Usage
data(Ardechedata)
Format
Ardeche_areas, areas (km2) of the gauged and ungauged catchments in the Ardèche region (France);
Ardeche_ungauged_extremes, flood peaks (m3/s) reconstructed in ungauged catchments and number of years for which the peak was not exceeded;
Beauvene_cont, systematic flood peaks (m3/s) recorded at one station;
Chambonas_cont, systematic flood peaks (m3/s) recorded at one station;
SaintLaurent_cont, systematic flood peaks (m3/s) recorded at one station;
SaintMartin_cont, systematic flood peaks (m3/s) recorded at one station;
SaintMartin_hist, values of historical peaks (m3/s) for one station and flood perception thresholds (m3/s) not exceeded in the periods indicated;
Vogue_cont, systematic flood peaks (m3/s) recorded at one station.
Examples
data(Ardechedata)
SaintMartin_cont
SaintMartin_hist
Bayesian MCMC frequency analysis
Description
Bayesian Markov Chain Monte Carlo algorithm for flood frequency analysis with historical and other information. The user can choose between a local and a regional analysis.
Usage
BayesianMCMC (xcont, xhist=NA, infhist=NA, suphist=NA,
nbans=NA, seuil=NA, nbpas=1000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV",
apriori=function(...){1},
parameters0=NA, varparameters0=NA)
BayesianMCMCcont (x, nbpas=NA)
BayesianMCMCreg (xcont, scont, xhist=NA, infhist=NA, suphist=NA, shist=NA,
nbans=NA, seuil=NA, nbpas=1000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV",
apriori=function(...){1},
parameters0=NA, varparameters0=NA)
BayesianMCMCregcont (x, nbpas=NA)
plotBayesianMCMCreg_surf (x, surf, ask=FALSE, ...)
## S3 method for class 'BayesianMCMC'
plot(x, which=1, ask=FALSE, ...)
## S3 method for class 'BayesianMCMC'
print(x, ...)
## S3 method for class 'BayesianMCMCreg'
plot(x, which=1, ask=FALSE, ...)
## S3 method for class 'BayesianMCMCreg'
print(x, ...)
Arguments
x: object of class BayesianMCMC or BayesianMCMCreg, output of the corresponding fitting function
xcont: vector of systematic data
scont: vector of upstream catchment surfaces of systematic data
xhist: vector of historical data and/or extreme discharges at ungauged sites
infhist: vector of inferior limits for historical data and/or extreme discharges at ungauged sites
suphist: vector of superior limits for historical data and/or extreme discharges at ungauged sites
shist: vector of upstream catchment surfaces of extreme discharges at ungauged sites
nbans: period (in years) over which every threshold has not been exceeded, except for the historical data and/or extreme discharges at ungauged sites. If several values of xhist correspond to the same threshold, put the number of years associated with the threshold in the first position and 0 in the others (see Examples)
seuil: threshold not exceeded in the historical period, except for the historical data and/or extreme discharges at ungauged sites (several thresholds allowed)
nbpas: number of iterations for the MCMC algorithm
nbchaines: number of chains for the MCMC algorithm
confint: confidence limits for the flood quantiles
dist: distribution: normal, ..., GEV (the default)
apriori: function of the parameters of the model 'proportional to' their a-priori guessed distribution. The default function always returns 1, i.e. there is no a-priori information
parameters0: initial values of the parameters for the MCMC algorithm
varparameters0: initial values of the parameter variances for the MCMC algorithm
which: a number or a vector of numbers that defines the graphs to plot (see Details)
ask: if TRUE, the interactive mode is run
surf: a particular surface (number or vector), not necessarily a surface included in the scont or shist vectors
...: other arguments
Details
Supported cases
These functions take four cases into account, depending on the type of data provided:
- using only the systematic data: xcont provided, xhist=NA, infhist=NA and suphist=NA;
- using censored information, historical floods known: xcont and xhist provided, infhist=NA and suphist=NA;
- using censored information, historical floods not known precisely but their lower limits known: xcont and infhist provided, xhist=NA and suphist=NA;
- taking into account flood estimation intervals: infhist and suphist (respectively the lower and upper limits) provided, xcont provided, xhist=NA.
Please note that every other case is NOT supported. For example, you cannot have some historical flood values perfectly known as well as others for which you only know a lower limit or an interval.
Regarding the perception thresholds:
- by definition, the number of exceedances of each perception threshold within its application period has to be known precisely, and all the floods exceeding the threshold have to be included in xhist (or infhist or suphist);
- several thresholds are allowed;
- it is possible to include in xhist (or infhist or suphist) historical values that do not exceed the associated perception threshold;
- if for one or several thresholds you only know that they have never been exceeded, and no more information is available on floods that did not exceed them, this case is also supported: in that case, set the historical flood corresponding to the threshold to xhist=-1 (or infhist=-1, or infhist=suphist=-1).
Bayesian inference
Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed, and calculates a numerical estimate of the degree of belief in the hypothesis after evidence has been observed. The name 'Bayesian' comes from the frequent use of Bayes' theorem in the inference process. In our case the problem is: what is the probability that a frequency function P (of the type defined in dist) has parameters \theta, given that we have observed the realizations D (defined in xcont, xhist, infhist, suphist, nbans, seuil)? Bayes' theorem reads
P(\theta|D) = \frac{P(D|\theta) \cdot P(\theta)}{P(D)}
where P(\theta|D) is the conditional probability of \theta given D (also called the posterior probability, because it is derived from, or depends upon, the specified value of D) and is the result we are interested in; P(\theta) is the prior probability or marginal probability of \theta ('prior' in the sense that it does not take into account any information about D), and can be given using the input apriori (it can be used to account for causal information); P(D|\theta) is the conditional probability of D given \theta, defined by choosing dist and depending on the availability of historical data; P(D) is the prior or marginal probability of D, which acts as a normalizing constant. Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about \theta are updated by having observed D. Since complex models cannot be processed in closed form by a Bayesian analysis, namely because of the extreme difficulty in computing the normalization factor P(D), simulation-based Monte Carlo techniques such as the MCMC approaches are used.
MCMC Metropolis algorithm
Markov chain Monte Carlo (MCMC) methods (which include random walk Monte Carlo methods) are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.
The MCMC is performed here through a simple Metropolis algorithm, i.e. a Metropolis-Hastings algorithm with symmetric proposal density. The Metropolis-Hastings algorithm can draw samples from any probability distribution P(x), requiring only that a function proportional to the density can be calculated at x. In Bayesian applications, the normalization factor is often extremely difficult to compute, so the ability to generate a sample without knowing this constant of proportionality is a major virtue of the algorithm. The algorithm generates a Markov chain in which each state x_{t+1} depends only on the previous state x_t. It uses a Gaussian proposal density N(x_t, \sigma_x), which depends on the current state x_t, to generate a new proposed sample x'. This proposal is accepted as the next value (x_{t+1} = x') if \alpha drawn from U(0,1) satisfies
\alpha < \frac{P(x')}{P(x_t)}
If the proposal is not accepted, the current value of x is retained (x_{t+1} = x_t). The Markov chain is started from a random initial value x_0 and the algorithm is run for many iterations until this initial state is forgotten. These initial samples, which are discarded, are known as burn-in. The remaining set of accepted values of x represents a sample from the distribution P(x).
As a Gaussian proposal density (or a lognormal one for positive parameters) is used, the variance parameter \sigma_x^2 has to be tuned during the burn-in period. This is done by calculating the acceptance rate, i.e. the fraction of proposed samples that is accepted in a window of the last N samples. The desired acceptance rate depends on the target distribution; however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is approximately 50%, decreasing to approximately 23% for an N-dimensional Gaussian target distribution. If \sigma_x^2 is too small, the chain will mix slowly (i.e., the acceptance rate will be too high, so the sampling will move around the space slowly and converge slowly to P(x)); if \sigma_x^2 is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density. The desired acceptance rate is fixed here at 34%.
The MCMC algorithm is based on code developed by Eric Gaume in Scilab. It is still unstable and not all the distributions have been tested.
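For illustration only, a minimal one-dimensional Metropolis sampler written in plain R (this is not the package's internal code; the target density and the tuning constants are arbitrary):
set.seed(1)
target <- function(x) dnorm(x, mean=10, sd=2)  # density known up to a constant
nstep <- 5000; sigma <- 3                      # chain length and proposal st. dev.
x <- numeric(nstep); x[1] <- 0; acc <- 0
for (t in 1:(nstep-1)) {
  xp <- rnorm(1, mean=x[t], sd=sigma)          # symmetric Gaussian proposal
  if (runif(1) < target(xp)/target(x[t])) {    # accept with probability min(1, ratio)
    x[t+1] <- xp; acc <- acc + 1
  } else x[t+1] <- x[t]                        # otherwise retain the current state
}
acc/(nstep-1)                                  # acceptance rate, to be tuned towards ~34%
hist(x[-(1:1000)], freq=FALSE)                 # sample after discarding the burn-in
curve(dnorm(x, 10, 2), add=TRUE)               # compare with the target density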
Value
BayesianMCMC and BayesianMCMCcont (which just continues the simulations of BayesianMCMC starting from its output), for local analyses, and BayesianMCMCreg and BayesianMCMCregcont (which does the same for BayesianMCMCreg), for regional analyses, return the following values:
parameters: matrix (nbpas)x(nbchaines) with the sets of parameters simulated with the MCMC algorithm;
parametersML: set of parameters corresponding to the maximum likelihood;
returnperiods: return periods for which quantilesML and intervals are calculated;
quantilesML: quantiles corresponding to returnperiods for the distribution whose parameters are parametersML;
logML: maximum log-likelihood;
intervals: confidence intervals for the quantiles quantilesML for the limits confint;
varparameters: matrix (nbpas)x(nbchaines)x(number of parameters) with the variances simulated for the MCMC algorithm;
vraisdist: likelihoods for the sets parameters;
propsaut: vector showing the evolution of the acceptance rate during the Bayesian MCMC fitting.
plot.BayesianMCMC and plot.BayesianMCMCreg (for a normalized surface of 1 km2) plot the following figures:
1: data as plotting positions (the Cunnane plotting position a = 0.4 is used), fitted distribution (maximum likelihood) and confidence intervals;
2: diagnostic plot of the MCMC simulation (parameters);
3: diagnostic plot of the MCMC simulation (likelihood and MCMC acceptance rate);
4: posterior distribution of the parameters obtained with the MCMC simulation (cloud plots);
5: a-priori distribution of the parameters (contour plots).
plotBayesianMCMCreg_surf plots the same figure as the first one given by plot.BayesianMCMCreg, but for each surface in the argument surf, as well as its mean as a function of the surfaces.
Note
For information on the package and the Author, and for all the references, see nsRFA.
Author(s)
Eric Gaume, Alberto Viglione, Jose Luis Salinas, Olivier Payrastre, Chi Cong N'guyen, Karine Halbert
See Also
Examples
set.seed(2988)
serie <- rand.GEV(120, xi=40, alfa=20, k=-0.4)
serie100 <- serie[1:100]
serie100[serie100 < 250] <- NA
serie20 <- serie[101:120]
serie <- c(serie100, serie20)
plot(serie, type="h", ylim=c(0, 600), xlab="",
ylab="Annual flood peaks [m3/s]", lwd=3)
abline(h=0)
points(serie100, col=2)
## Not run:
# Using only systematic data
only_sist <- BayesianMCMC (xcont=serie20, nbpas=5000, nbchaines=3, varparameters0=c(70, 20, 0.5),
confint=c(0.05, 0.95), dist="GEV")
plot(only_sist, which=c(1:3), ask=TRUE, ylim=c(1,600))
only_sist <- BayesianMCMCcont(only_sist)
plot(only_sist, which=c(1:3), ask=TRUE, ylim=c(1,600))
only_sist <- BayesianMCMCcont(only_sist)
plot(only_sist, which=c(1:3), ask=TRUE, ylim=c(1,600))
# Adding the information that the threshold 250 m3/s was exceeded
# 3 times in the past 100 years
with_hist_thresh <- BayesianMCMC (xcont=serie20, infhist=rep(250,3),
nbans=100, seuil=250,
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV")
plot(with_hist_thresh, which=c(1:3), ask=TRUE, ylim=c(1,600))
# Assuming that the 3 historical events are known with high uncertainty
with_hist_limits <- BayesianMCMC (xcont=serie20,
infhist=c(320,320,250),
suphist=c(360,400,270),
nbans=100, seuil=250,
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV")
plot(with_hist_limits, which=c(1:3), ask=TRUE, ylim=c(1,600))
# Assuming that the 3 historical events are perfectly known
with_hist_known <- BayesianMCMC (xcont=serie20, xhist=serie100[!is.na(serie100)],
nbans=100, seuil=250,
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV")
plot(with_hist_known, which=c(1:3), ask=TRUE, ylim=c(1,600))
# Perception threshold without available information on floods
without_info <- BayesianMCMC (xcont=serie20, xhist=-1,
nbans=100, seuil=2400,
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV")
plot(without_info, which=c(1:3), ask=TRUE, ylim=c(1,600))
# Using one reasonable a-priori distribution
fNORM3 <- function (x) {
# x = vector of values
# mu = vector of means
mu = c(44, 26, -0.40)
# CM = covariance matrix
CM = matrix(c(13, 7.8, -0.055,
7.8, 15, -0.42,
-0.055, -0.42, 0.056), nrow=3, ncol=3)
CMm1 <- solve(CM)
term2 <- exp(-((x - mu) %*% CMm1 %*% (x - mu))/2)
term1 <- 1/(2*pi)^(3/2)/sqrt(det(CM))
term1*term2
}
with_hist_known2 <- BayesianMCMC (xcont=serie20, xhist=serie100[!is.na(serie100)],
nbans=100, seuil=250,
nbpas=5000, nbchaines=3, apriori=fNORM3,
confint=c(0.05, 0.95), dist="GEV")
plot(with_hist_known2, 5)
plot(with_hist_known2, 4)
plot(with_hist_known, 4)
plot(with_hist_known)
plot(with_hist_known2)
# Using one non-reasonable a-priori distribution
fNORM3 <- function (x) {
# x = vector of values
# mu = vector of means
mu = c(30, 50, -0.10)
# CM = covariance matrix
CM = matrix(c(13, 7.8, -0.055,
7.8, 15, -0.42,
-0.055, -0.42, 0.056), nrow=3, ncol=3)
CMm1 <- solve(CM)
term2 <- exp(-((x - mu) %*% CMm1 %*% (x - mu))/2)
term2
}
with_hist_known3 <- BayesianMCMC (xcont=serie20, xhist=serie100[!is.na(serie100)],
nbans=100, seuil=250,
nbpas=5000, nbchaines=3, apriori=fNORM3,
confint=c(0.05, 0.95), dist="GEV")
plot(with_hist_known3, 5)
plot(with_hist_known3, 4)
plot(with_hist_known, 4)
plot(with_hist_known)
plot(with_hist_known3)
## End(Not run)
## Not run:
# Assuming that the historical events are perfectly known and there are 4 different thresholds
# The data file is presenting this way:
# xhist nbans seuil
# 6000 55 6000
# 7400 28 7250
# 6350 8 3050
# 4000 0 3050
# 4550 0 3050
# 3950 0 3050
# 7550 58 2400
# 4650 0 2400
# 3950 0 2400
## Warning: nbans and seuil should have the same length as xhist.
# So when a threshold is exceeded several times, replicate it as many times as it is
# exceeded and split the number of years of exceedance among the repetitions
# (the way you split the nbans vector is not important; what is important is that
# length(nbans)=length(xhist) and that the total of years for one same threshold equals
# the number of years covered by the perception threshold)
xhist_thres <- c(6000, 7400, 6350, 4000, 4550, 3950, 7550, 4650, 3950)
seuil_thres <- c(6000, 7250, rep(3050, 4), rep(2400, 3))
nbans_thres <- c(55, 28, 8, 0, 0, 0, 58, 0, 0)
# The threshold at 6000 has been exceeded for 55 years, the one at 7250 for 28 years,
# the one at 3050 for 8 years and the one at 2400 for 58 years
with_hist_known_several_thresholds <- BayesianMCMC (xcont=serie20,
xhist=xhist_thres,
nbans=nbans_thres, seuil=seuil_thres,
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV",
varparameters0=c(NA, NA, 0.5))
plot(with_hist_known_several_thresholds, which=c(1:3), ask=TRUE)
## REGIONAL:
# Regional analysis, assuming that the 3 historical events are perfectly known and
# there are 2 perception thresholds
regional_with_hist_known <- BayesianMCMCreg (xcont=serie20,
scont=c(rep(507,9),rep(2240,11)),
xhist=serie100[!is.na(serie100)],
shist=c(495, 495, 87),
nbans=c(100, 0, 50), seuil=c(312, 312, 221),
nbpas=5000, nbchaines=3,
confint=c(0.05, 0.95), dist="GEV",
varparameters0=c(NA, NA, NA, 0.5))
plot(regional_with_hist_known, which=1:3, ask=TRUE, ylim=c(1,600))
surf=c(571, 2240)
plotBayesianMCMCreg_surf(regional_with_hist_known, surf)
## End(Not run)
Diagnostics of models
Description
Diagnostics of model results; the functions compare estimated values y with observed values x.
Usage
R2 (x, y, na.rm=FALSE)
RMSE (x, y, na.rm=FALSE)
MAE (x, y, na.rm=FALSE)
RMSEP (x, y, na.rm=FALSE)
MAEP (x, y, na.rm=FALSE)
Arguments
x: observed values
y: estimated values
na.rm: logical. Should missing values be removed?
Details
If x_i are the observed values, y_i the estimated values, with i=1,...,n, and \bar{x} the sample mean of the x_i, then:
R^2 = 1 - \frac{\sum_1^n (x_i-y_i)^2}{\sum_1^n x_i^2 - n \bar{x}^2}
RMSE = \sqrt{\frac{1}{n} \sum_1^n (x_i-y_i)^2}
MAE = \frac{1}{n} \sum_1^n |x_i-y_i|
RMSEP = \sqrt{\frac{1}{n} \sum_1^n ((x_i-y_i)/x_i)^2}
MAEP = \frac{1}{n} \sum_1^n |(x_i-y_i)/x_i|
See https://en.wikipedia.org/wiki/Coefficient_of_determination, https://en.wikipedia.org/wiki/Mean_squared_error and https://en.wikipedia.org/wiki/Mean_absolute_error for other details.
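The definitions above can be checked against base R (a sketch with hypothetical observed/estimated values; the results should match those of R2, RMSE, MAE, RMSEP and MAEP):
x <- c(10, 12, 9, 14); y <- c(11, 11.5, 10, 13)      # observed and estimated values
sqrt(mean((x - y)^2))                                # RMSE by hand
mean(abs(x - y))                                     # MAE by hand
sqrt(mean(((x - y)/x)^2))                            # RMSEP by hand
mean(abs((x - y)/x))                                 # MAEP by hand
1 - sum((x - y)^2)/(sum(x^2) - length(x)*mean(x)^2)  # R2 by hand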
Value
R2 returns the coefficient of determination R^2 of a model.
RMSE returns the root mean squared error of a model.
MAE returns the mean absolute error of a model.
RMSEP returns the percentage root mean squared error of a model.
MAEP returns the percentage mean absolute error of a model.
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
lm, summary.lm, predict.lm, REGRDIAGNOSTICS
Examples
data(hydroSIMN)
datregr <- parameters
regr0 <- lm(Dm ~ .,datregr); summary(regr0)
regr1 <- lm(Dm ~ Am + Hm + Ybar,datregr); summary(regr1)
obs <- parameters[,"Dm"]
est0 <- regr0$fitted.values
est1 <- regr1$fitted.values
R2(obs, est0)
R2(obs, est1)
RMSE(obs, est0)
RMSE(obs, est1)
MAE(obs, est0)
MAE(obs, est1)
RMSEP(obs, est0)
RMSEP(obs, est1)
MAEP(obs, est0)
MAEP(obs, est1)
Empirical distribution plots
Description
Sample values are plotted against their empirical distribution in graphs where points belonging to a particular distribution should lie on a straight line.
Usage
plotpos (x, a=0, orient="xF", ...)
plotposRP (x, a=0, orient="xF", ...)
loglogplot (x, a=0, orient="xF", ...)
unifplot (x, a=0, orient="xF", line=FALSE, ...)
normplot (x, a=0, orient="xF", line=FALSE, ...)
lognormplot (x, a=0, orient="xF", line=FALSE, ...)
studentplot (x, df, a=0, orient="xF", line=FALSE,...)
logisplot (x, a=0, orient="xF", line=FALSE,...)
gammaplot (x, shape, a=0, orient="xF", line=FALSE,...)
expplot (x, a=0, orient="xF", line=FALSE,...)
paretoplot (x, a=0, orient="xF", line=FALSE,...)
gumbelplot (x, a=0, orient="xF", line=FALSE, ...)
frechetplot (x, a=0, orient="xF", line=FALSE,...)
weibullplot (x, a=0, orient="xF", line=FALSE,...)
plotposRPhist (xcont, xhist=NA, infhist=NA, suphist=NA, nbans=NA, seuil=NA,
col12=c(1,1), a=0, orient="xF", ...)
pointspos (x, a=0, orient="xF", ...)
pointsposRP (x, a=0, orient="xF", ...)
loglogpoints (x, a=0, orient="xF", ...)
unifpoints (x, a=0, orient="xF", ...)
normpoints (x, a=0, orient="xF", ...)
studentpoints (x, df, a=0, orient="xF", ...)
logispoints (x, a=0, orient="xF", ...)
gammapoints (x, shape, a=0, orient="xF", ...)
exppoints (x, a=0, orient="xF", ...)
gumbelpoints (x, a=0, orient="xF", ...)
weibullpoints (x, a=0, orient="xF", ...)
regionalplotpos (x, cod, a=0, orient="xF", ...)
regionalnormplot (x, cod, a=0, orient="xF", ...)
regionallognormplot (x, cod, a=0, orient="xF", ...)
regionalexpplot (x, cod, a=0, orient="xF", ...)
regionalparetoplot (x, cod, a=0, orient="xF", ...)
regionalgumbelplot (x, cod, a=0, orient="xF", ...)
regionalfrechetplot (x, cod, a=0, orient="xF", ...)
pointsposRPhist (xcont, xhist=NA, infhist=NA, suphist=NA, nbans=NA, seuil=NA,
col12=c(1,1), a=0, orient="xF", ...)
Arguments
x: vector representing a data sample
xcont: vector of systematic data (see BayesianMCMC)
xhist: vector of historical data (see BayesianMCMC)
infhist: inferior limit for historical data (see BayesianMCMC)
suphist: superior limit for historical data (see BayesianMCMC)
nbans: period (in years) over which the threshold has not been exceeded except for the historical data (see BayesianMCMC)
seuil: threshold not exceeded in the historical period except for the historical data (see BayesianMCMC)
df: degrees of freedom (> 0, maybe non-integer) of the Student t distribution; 'df = Inf' is allowed
shape: shape parameter of the distribution
a: plotting position parameter, normally between 0 and 0.5 (the default here is a=0, corresponding to the Weibull plotting position, see Details)
orient: if orient="xF" (the default) the sample values are plotted on the x axis and the frequencies F on the y axis; if orient="Fx" the axes are exchanged
line: if TRUE, a straight line indicating the normal, lognormal, ..., distribution with parameters estimated from x is added to the plot
cod: array that defines the data subdivision among sites
col12: vector of 2 elements containing the colors for the systematic and historical data respectively
...: other graphical parameters, as xlab, ylab, main, ...
Details
A brief introduction on Probability Plots (or Quantile-Quantile plots) is available on https://en.wikipedia.org/wiki/Q-Q_plot. For plotting positions see https://en.wikipedia.org/wiki/Plotting_position.
For the quantiles of the comparison distribution, typically the Weibull formula k/(n + 1) is used (the default here). Several different formulas have been used or proposed as symmetrical plotting positions. Such formulas have the form (k - a)/(n + 1 - 2a) for some value of a in the range from 0 to 1/2. The above expression k/(n+1) is one example of these, for a=0. The Filliben plotting position has a = 0.3175, while the Cunnane plotting position, a = 0.4, should be nearly quantile-unbiased for a range of distributions. The Hazen plotting position, widely used by engineers, has a = 0.5. Blom's plotting position, a = 3/8, gives nearly unbiased quantiles for the normal distribution, while the Gringorten plotting position, a = 0.44, is optimized for the largest observations from a Gumbel distribution. For the generalized Pareto, the GEV and related distributions of Type I (Gumbel) and Weibull, a = 0.35 is suggested. For large sample size n, there is little difference between these various expressions.
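As a sketch, the generic plotting position above can be computed directly (a=0 gives the Weibull formula, a=0.4 the Cunnane one, a=0.5 the Hazen one); this is essentially what plotpos draws:
x <- rnorm(30, 10, 2)           # illustrative sample
n <- length(x); a <- 0.4
F <- ((1:n) - a)/(n + 1 - 2*a)  # plotting positions (k - a)/(n + 1 - 2a)
plot(sort(x), F)                # compare with plotpos(x, a=0.4)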
Value
Representation of the values of x vs their empirical probability function F in a Cartesian, uniform, normal, lognormal or Gumbel plot. plotpos and unifplot are analogous except for the axis notation: unifplot has the same notation as normplot, lognormplot, etc. plotposRP is analogous to plotpos, but the frequencies F are expressed as return periods T=1/(1-F). With the default settings, F is defined with the Weibull plotting position F=k/(n+1). The straight line (if line=TRUE) indicates the uniform, normal, lognormal or Gumbel distribution with parameters estimated from x. The regional plots draw the samples of a region on the same plot.
pointspos, normpoints, ... are the analogous of points; they can be used to add points or lines to plotpos, normplot, etc. normpoints can be used either on normplot or lognormplot. exppoints can be used either on expplot or paretoplot (since the log-transformed Pareto random variable is exponentially distributed). gumbelpoints can be used either on gumbelplot or frechetplot (since the log-transformed Frechet random variable is distributed as a Gumbel).
loglogplot plots the logarithm of the sample vs the logarithm of the empirical exceedance probability. In the log-log plot, the tail probability is represented by a straight line for power-law distributions (e.g. log-Pearson, log-logistic, Frechet, ..., HEAVY TAIL), but not for the other subexponential or exponential distributions (e.g. Gumbel, gamma, Pearson type III, ..., MODERATE TAIL); see El Adlouni et al. (2008).
plotposRPhist is based on the method in Stedinger et al. (1993, pp. 18.41-42).
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
These functions are analogous to qqnorm; for the distributions, see Normal, Lognormal, LOGNORM, GUMBEL.
Examples
x <- rnorm(30,10,2)
plotpos(x)
normplot(x)
normplot(x,xlab=expression(D[m]),ylab=expression(hat(F)),
main="Normal plot",cex.main=1,font.main=1)
normplot(x,line=FALSE)
x <- rlnorm(30,log(100),log(10))
normplot(x)
lognormplot(x)
x <- rand.gumb(30,1000,100)
normplot(x)
gumbelplot(x)
x <- rnorm(30,10,2)
y <- rnorm(50,10,3)
z <- c(x,y)
codz <- c(rep(1,30),rep(2,50))
regionalplotpos(z,codz)
regionalnormplot(z,codz,xlab="z")
regionallognormplot(z,codz)
regionalgumbelplot(z,codz)
plotpos(x)
pointspos(y,pch=2,col=2)
x <- rnorm(50,10,2)
F <- seq(0.01,0.99,by=0.01)
qq <- qnorm(F,10,2)
plotpos(x)
pointspos(qq,type="l")
normplot(x,line=FALSE)
normpoints(x,type="l",lty=2,col=3)
lognormplot(x)
normpoints(x,type="l",lty=2,col=3)
gumbelplot(x)
gumbelpoints(x,type="l",lty=2,col=3)
# distributions comparison in probabilistic graphs
x <- rnorm(50,10,2)
F <- seq(0.001,0.999,by=0.001)
loglikelihood <- function(param) {-sum(dgamma(x, shape=param[1],
                                       scale=param[2], log=TRUE))}
parameters <- optim(c(1,1),loglikelihood)$par
qq <- qgamma(F,shape=parameters[1],scale=parameters[2])
plotpos(x)
pointspos(qq,type="l")
normplot(x,line=FALSE)
normpoints(qq,type="l")
lognormplot(x,line=FALSE)
normpoints(qq,type="l")
Two parameter exponential distribution and L-moments
Description
EXP provides the link between L-moments of a sample and the two parameter exponential distribution.
Usage
f.exp (x, xi, alfa)
F.exp (x, xi, alfa)
invF.exp (F, xi, alfa)
Lmom.exp (xi, alfa)
par.exp (lambda1, lambda2)
rand.exp (numerosita, xi, alfa)
Arguments
x: vector of quantiles
xi: vector of exp location parameters
alfa: vector of exp scale parameters
F: vector of probabilities
lambda1: vector of sample means
lambda2: vector of L-variances
numerosita: numeric value indicating the length of the vector to be generated
Details
See https://en.wikipedia.org/wiki/Exponential_distribution for a brief introduction on the Exponential distribution.
Definition
Parameters (2): \xi (lower endpoint of the distribution), \alpha (scale).
Range of x: \xi \le x < \infty.
Probability density function:
f(x) = \alpha^{-1} \exp\{-(x-\xi)/\alpha\}
Cumulative distribution function:
F(x) = 1 - \exp\{-(x-\xi)/\alpha\}
Quantile function:
x(F) = \xi - \alpha \log(1-F)
L-moments
\lambda_1 = \xi + \alpha
\lambda_2 = 1/2 \cdot \alpha
\tau_3 = 1/3
\tau_4 = 1/6
Parameters
If \xi is known, \alpha is given by \alpha = \lambda_1 - \xi and the L-moment, moment, and maximum-likelihood estimators are identical. If \xi is unknown, the parameters are given by
\alpha = 2 \lambda_2
\xi = \lambda_1 - \alpha
For estimation based on a single sample these estimates are inefficient, but in regional frequency analysis they can give reasonable estimates of upper-tail quantiles.
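A sketch of the unknown-\xi estimators above, computed by hand (par.exp implements the same relations; the L-moment values are hypothetical):
lambda1 <- 1000; lambda2 <- 200  # hypothetical sample L-moments
alfa <- 2*lambda2                # alpha = 2 lambda_2
xi <- lambda1 - alfa             # xi = lambda_1 - alpha
c(xi=xi, alfa=alfa)              # compare with par.exp(lambda1, lambda2)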
Lmom.exp and par.exp accept input as vectors of equal length. In f.exp, F.exp, invF.exp and rand.exp the parameters (xi, alfa) must be atomic.
Value
f.exp gives the density f, F.exp gives the distribution function F, invF.exp gives the quantile function x, Lmom.exp gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.exp gives the parameters (xi, alfa), and rand.exp generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
rnorm, runif, GENLOGIS, GENPAR, GEV, GUMBEL, KAPPA, LOGNORM, P3; DISTPLOTS, GOFmontecarlo, Lmoments.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.exp(ll[1],ll[2])
f.exp(1800,parameters$xi,parameters$alfa)
F.exp(1800,parameters$xi,parameters$alfa)
invF.exp(0.7870856,parameters$xi,parameters$alfa)
Lmom.exp(parameters$xi,parameters$alfa)
rand.exp(100,parameters$xi,parameters$alfa)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.exp(Rll[1],Rll[2])
Lmom.exp(parameters$xi,parameters$alfa)
Data-sample
Description
Flood Estimation Handbook flood peak data CD-ROM.
Usage
data(FEH1000)
Format
Data.frames:
am is the data.frame of the annual maximum flows with 4 columns:
number, the code of the station;
date, date of the annual maximum;
year, year of the annual maximum (we consider the hydrologic year: 1 October - 30 September);
am, the value of the annual maximum flow [m3/s].
cd is the data.frame of the parameters of 1000 catchments with 24 columns:
number, the code of the station;
nominal_area, catchment drainage area [km2];
nominal_ngr_x, basin outflow coordinates [m];
nominal_ngr_y, basin outflow coordinates [m];
ihdtm_ngr_x, basin outflow coordinates by Institute of Hydrology digital terrain model [m];
ihdtm_ngr_y, basin outflow coordinates by Institute of Hydrology digital terrain model [m];
dtm_area, catchment drainage area [km2] derived by CEH using their DTM (IHDTM);
saar4170, standard average annual rainfall 1941-1970 [mm];
bfihost, baseflow index derived from HOST soils data;
sprhost, standard percentage runoff derived from HOST soils data;
farl, index of flood attenuation due to reservoirs and lakes;
saar, standard average annual rainfall 1961-1990 [mm];
rmed_1d, median annual maximum 1-day rainfall [mm];
rmed_2d, median annual maximum 2-day rainfall [mm];
rmed_1h, median annual maximum 1-hour rainfall [mm];
smdbar, mean SMD (soil moisture deficit) for the period 1961-1990 calculated from MORECS month-end values [mm];
propwet, proportion of time when soil moisture deficit <= 6 mm during 1961-1990, defined using MORECS;
ldp, longest drainage path [km], defined by recording the greatest distance from a catchment node to the defined outlet;
dplbar, mean drainage path length [km];
altbar, mean catchment altitude [m];
dpsbar, mean catchment slope [m/km];
aspbar, index representing the dominant aspect of catchment slopes (its values increase clockwise from zero to 360, starting from the north); mean direction of all inter-nodal slopes with north being zero;
aspvar, index describing the invariability in aspect of catchment slopes; values close to one when all slopes face a similar direction;
urbext1990, extent of urban and suburban land cover in 1990 [fraction].
Note
For information on the package and the Author, and for all the references, see nsRFA.
Source
http://www.environment-agency.gov.uk/hiflowsuk/
Examples
data(FEH1000)
names(cd)
am[1:20,]
Three parameter generalized logistic distribution and L-moments
Description
GENLOGIS provides the link between L-moments of a sample and the three parameter generalized logistic distribution.
Usage
f.genlogis (x, xi, alfa, k)
F.genlogis (x, xi, alfa, k)
invF.genlogis (F, xi, alfa, k)
Lmom.genlogis (xi, alfa, k)
par.genlogis (lambda1, lambda2, tau3)
rand.genlogis (numerosita, xi, alfa, k)
Arguments
x: vector of quantiles
xi: vector of genlogis location parameters
alfa: vector of genlogis scale parameters
k: vector of genlogis shape parameters
F: vector of probabilities
lambda1: vector of sample means
lambda2: vector of L-variances
tau3: vector of L-CA (or L-skewness)
numerosita: numeric value indicating the length of the vector to be generated
Details
See https://en.wikipedia.org/wiki/Logistic_distribution for an introduction to the Logistic Distribution.
Definition
Parameters (3): \xi (location), \alpha (scale), k (shape).
Range of x: -\infty < x \le \xi + \alpha/k if k>0; -\infty < x < \infty if k=0; \xi + \alpha/k \le x < \infty if k<0.
Probability density function:
f(x) = \frac{\alpha^{-1} e^{-(1-k)y}}{(1+e^{-y})^2}
where y = -k^{-1}\log\{1 - k(x - \xi)/\alpha\} if k \ne 0, and y = (x-\xi)/\alpha if k=0.
Cumulative distribution function:
F(x) = 1/(1+e^{-y})
Quantile function:
x(F) = \xi + \alpha[1-\{(1-F)/F\}^k]/k if k \ne 0, and x(F) = \xi - \alpha \log\{(1-F)/F\} if k=0.
k=0 is the logistic distribution.
L-moments
L-moments are defined for -1<k<1.
\lambda_1 = \xi + \alpha[1/k - \pi / \sin (k \pi)]
\lambda_2 = \alpha k \pi / \sin (k \pi)
\tau_3 = -k
\tau_4 = (1+5 k^2)/6
Parameters
k = -\tau_3, \alpha = \frac{\lambda_2 \sin (k \pi)}{k \pi}, \xi = \lambda_1 - \alpha (\frac{1}{k} - \frac{\pi}{\sin (k \pi)}).
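A sketch of these estimators computed by hand (par.genlogis implements the same relations; the L-moment values are hypothetical):
lambda1 <- 1000; lambda2 <- 200; tau3 <- 0.3  # hypothetical sample L-moments
k <- -tau3
alfa <- lambda2*sin(k*pi)/(k*pi)
xi <- lambda1 - alfa*(1/k - pi/sin(k*pi))
c(xi=xi, alfa=alfa, k=k)                      # compare with par.genlogis(lambda1, lambda2, tau3)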
Lmom.genlogis and par.genlogis accept input as vectors of equal length. In f.genlogis, F.genlogis, invF.genlogis and rand.genlogis the parameters (xi, alfa, k) must be atomic.
Value
f.genlogis gives the density f, F.genlogis gives the distribution function F, invF.genlogis gives the quantile function x, Lmom.genlogis gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.genlogis gives the parameters (xi, alfa, k), and rand.genlogis generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
rnorm, runif, EXP, GENPAR, GEV, GUMBEL, KAPPA, LOGNORM, P3; DISTPLOTS, GOFmontecarlo, Lmoments.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.genlogis(ll[1],ll[2],ll[4])
f.genlogis(1800,parameters$xi,parameters$alfa,parameters$k)
F.genlogis(1800,parameters$xi,parameters$alfa,parameters$k)
invF.genlogis(0.7697433,parameters$xi,parameters$alfa,parameters$k)
Lmom.genlogis(parameters$xi,parameters$alfa,parameters$k)
rand.genlogis(100,parameters$xi,parameters$alfa,parameters$k)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.genlogis(Rll[1],Rll[2],Rll[4])
Lmom.genlogis(parameters$xi,parameters$alfa,parameters$k)
Three parameter generalized Pareto distribution and L-moments
Description
GENPAR provides the link between L-moments of a sample and the three parameter generalized Pareto distribution.
Usage
f.genpar (x, xi, alfa, k)
F.genpar (x, xi, alfa, k)
invF.genpar (F, xi, alfa, k)
Lmom.genpar (xi, alfa, k)
par.genpar (lambda1, lambda2, tau3)
rand.genpar (numerosita, xi, alfa, k)
Arguments
x: vector of quantiles
xi: vector of genpar location parameters
alfa: vector of genpar scale parameters
k: vector of genpar shape parameters
F: vector of probabilities
lambda1: vector of sample means
lambda2: vector of L-variances
tau3: vector of L-CA (or L-skewness)
numerosita: numeric value indicating the length of the vector to be generated
Details
See https://en.wikipedia.org/wiki/Pareto_distribution for an introduction to the Pareto distribution.
Definition
Parameters (3): \xi (location), \alpha (scale), k (shape).
Range of x: \xi < x \le \xi + \alpha/k if k>0; \xi \le x < \infty if k \le 0.
Probability density function:
f(x) = \alpha^{-1} e^{-(1-k)y}
where y = -k^{-1}\log\{1 - k(x - \xi)/\alpha\} if k \ne 0, and y = (x-\xi)/\alpha if k=0.
Cumulative distribution function:
F(x) = 1-e^{-y}
Quantile function:
x(F) = \xi + \alpha[1-(1-F)^k]/k if k \ne 0, and x(F) = \xi - \alpha \log(1-F) if k=0.
k=0 is the exponential distribution; k=1 is the uniform distribution on the interval \xi < x \le \xi + \alpha.
L-moments
L-moments are defined for k>-1.
\lambda_1 = \xi + \alpha/(1+k)
\lambda_2 = \alpha/[(1+k)(2+k)]
\tau_3 = (1-k)/(3+k)
\tau_4 = (1-k)(2-k)/[(3+k)(4+k)]
The relation between \tau_3 and \tau_4 is given by
\tau_4 = \frac{\tau_3 (1 + 5 \tau_3)}{5+\tau_3}
Parameters
If \xi is known, k = (\lambda_1 - \xi)/\lambda_2 - 2 and \alpha = (1+k)(\lambda_1 - \xi); if \xi is unknown, k = (1 - 3\tau_3)/(1 + \tau_3), \alpha = (1+k)(2+k)\lambda_2 and \xi = \lambda_1 - (2+k)\lambda_2.
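A sketch of the unknown-\xi estimators computed by hand (par.genpar implements the same relations; the L-moment values are hypothetical):
lambda1 <- 1000; lambda2 <- 200; tau3 <- 0.3  # hypothetical sample L-moments
k <- (1 - 3*tau3)/(1 + tau3)
alfa <- (1 + k)*(2 + k)*lambda2
xi <- lambda1 - (2 + k)*lambda2
c(xi=xi, alfa=alfa, k=k)                      # compare with par.genpar(lambda1, lambda2, tau3)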
Lmom.genpar and par.genpar accept input as vectors of equal length. In f.genpar, F.genpar, invF.genpar and rand.genpar the parameters (xi, alfa, k) must be atomic.
Value
f.genpar gives the density f, F.genpar gives the distribution function F, invF.genpar gives the quantile function x, Lmom.genpar gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.genpar gives the parameters (xi, alfa, k), and rand.genpar generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA.
See Also
rnorm, runif, EXP, GENLOGIS, GEV, GUMBEL, KAPPA, LOGNORM, P3; DISTPLOTS, GOFmontecarlo, Lmoments.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.genpar(ll[1],ll[2],ll[4])
f.genpar(1800,parameters$xi,parameters$alfa,parameters$k)
F.genpar(1800,parameters$xi,parameters$alfa,parameters$k)
invF.genpar(0.7161775,parameters$xi,parameters$alfa,parameters$k)
Lmom.genpar(parameters$xi,parameters$alfa,parameters$k)
rand.genpar(100,parameters$xi,parameters$alfa,parameters$k)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.genpar(Rll[1],Rll[2],Rll[4])
Lmom.genpar(parameters$xi,parameters$alfa,parameters$k)
Three parameter generalized extreme value distribution and L-moments
Description
GEV
provides the link between L-moments of a sample and the three parameter
generalized extreme value distribution.
Usage
f.GEV (x, xi, alfa, k)
F.GEV (x, xi, alfa, k)
invF.GEV (F, xi, alfa, k)
Lmom.GEV (xi, alfa, k)
par.GEV (lambda1, lambda2, tau3)
rand.GEV (numerosita, xi, alfa, k)
Arguments
x |
vector of quantiles |
xi |
vector of GEV location parameters |
alfa |
vector of GEV scale parameters |
k |
vector of GEV shape parameters |
F |
vector of probabilities |
lambda1 |
vector of sample means |
lambda2 |
vector of L-variances |
tau3 |
vector of L-CA (or L-skewness) |
numerosita |
numeric value indicating the length of the vector to be generated |
Details
See https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution for an introduction to the GEV distribution.
Definition
Parameters (3): \xi (location), \alpha (scale), k (shape).
Range of x: -\infty < x \le \xi + \alpha / k if k>0; -\infty < x < \infty if k=0; \xi + \alpha / k \le x < \infty if k<0.
Probability density function:
f(x) = \alpha^{-1} e^{-(1-k)y - e^{-y}}
where y = -k^{-1}\log\{1 - k(x - \xi)/\alpha\} if k \ne 0, y = (x-\xi)/\alpha if k=0.
Cumulative distribution function:
F(x) = e^{-e^{-y}}
Quantile function:
x(F) = \xi + \alpha[1-(-\log F)^k]/k if k \ne 0, x(F) = \xi - \alpha \log(-\log F) if k=0.
k=0 is the Gumbel distribution; k=1 is the reverse exponential distribution.
L-moments
L-moments are defined for k>-1.
\lambda_1 = \xi + \alpha[1 - \Gamma (1+k)]/k
\lambda_2 = \alpha (1-2^{-k}) \Gamma (1+k)/k
\tau_3 = 2(1-3^{-k})/(1-2^{-k})-3
\tau_4 = [5(1-4^{-k})-10(1-3^{-k})+6(1-2^{-k})]/(1-2^{-k})
Here \Gamma denotes the gamma function
\Gamma (x) = \int_0^{\infty} t^{x-1} e^{-t} dt
Parameters
To estimate k, no explicit solution is possible, but the following approximation has accuracy better than 9 \times 10^{-4} for -0.5 \le \tau_3 \le 0.5:
k \approx 7.8590 c + 2.9554 c^2
where
c = \frac{2}{3+\tau_3} - \frac{\log 2}{\log 3}
The other parameters are then given by
\alpha = \frac{\lambda_2 k}{(1-2^{-k})\Gamma(1+k)}
\xi = \lambda_1 - \alpha[1 - \Gamma(1+k)]/k
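A minimal base-R sketch of this approximation, assuming the sample L-moments l1, l2 and t3 are already available (e.g. from Lmoments), is:
# Sketch: GEV parameters from sample L-moments via the approximation above.
cc <- 2/(3 + t3) - log(2)/log(3)
k <- 7.8590*cc + 2.9554*cc^2                 # approximate shape
alfa <- (l2*k)/((1 - 2^(-k))*gamma(1 + k))   # scale
xi <- l1 - alfa*(1 - gamma(1 + k))/k         # location
c(xi=xi, alfa=alfa, k=k)                     # compare with par.GEV(l1, l2, t3)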
Lmom.GEV and par.GEV accept input as vectors of equal length. In f.GEV, F.GEV, invF.GEV and rand.GEV the parameters (xi, alfa, k) must be atomic.
Value
f.GEV gives the density f, F.GEV gives the distribution function F, invF.GEV gives the quantile function x, Lmom.GEV gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.GEV gives the parameters (xi, alfa, k), and rand.GEV generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
rnorm
, runif
, EXP
, GENLOGIS
, GENPAR
, GUMBEL
, KAPPA
, LOGNORM
, P3
; DISTPLOTS
, GOFmontecarlo
, Lmoments
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.GEV(ll[1],ll[2],ll[4])
f.GEV(1800,parameters$xi,parameters$alfa,parameters$k)
F.GEV(1800,parameters$xi,parameters$alfa,parameters$k)
invF.GEV(0.7518357,parameters$xi,parameters$alfa,parameters$k)
Lmom.GEV(parameters$xi,parameters$alfa,parameters$k)
rand.GEV(100,parameters$xi,parameters$alfa,parameters$k)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.GEV(Rll[1],Rll[2],Rll[4])
Lmom.GEV(parameters$xi,parameters$alfa,parameters$k)
Goodness of fit tests
Description
Anderson-Darling goodness of fit tests for extreme-value distributions, from Laio (2004).
Usage
A2_GOFlaio (x, dist="NORM")
A2 (F)
W2 (F)
fw2 (w)
Arguments
x |
data sample |
dist |
distribution: normal |
F |
values of the cumulative distribution function, sorted in increasing order |
w |
Transformed test statistic (Laio, 2004) |
Details
An introduction to the Anderson-Darling test is available on https://en.wikipedia.org/wiki/Anderson-Darling_test and in the GOFmontecarlo help page.
The original paper of Laio (2004) is available on his web site.
Value
A2_GOFlaio tests the goodness of fit of a distribution with the sample x; it returns the value A_2 of the Anderson-Darling statistic and its non-exceedance probability P(A_2).
Note that P is the probability of obtaining a test statistic A_2 lower than the one actually observed, assuming that the null hypothesis is true, i.e., P is one minus the p-value usually employed in statistical testing (see https://en.wikipedia.org/wiki/P-value).
If P(A_2) is, for example, greater than 0.90, the null hypothesis at significance level \alpha=10\% is rejected.
A2 is the Anderson-Darling test statistic; it is used by A2_GOFlaio.
W2 is the Cramer-von Mises test statistic.
fw2 is the approximation of the probability distribution of w (first 2 terms) when H_0 is true (Anderson-Darling, 1952); it is used by A2_GOFlaio.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
sm <- rand.gumb(100, 0, 1)
ml <- ML_estimation (sm, dist="GEV"); ml
F.GEV(sm, ml[1], ml[2], ml[3])
A2(sort(F.GEV(sm, ml[1], ml[2], ml[3])))
A2_GOFlaio(sm, dist="GEV")
ml <- ML_estimation (sm, dist="P3"); ml
A2(sort(F.gamma(sm, ml[1], ml[2], ml[3])))
A2_GOFlaio(sm, dist="P3")
Goodness of fit tests
Description
Anderson-Darling goodness of fit tests for Regional Frequency Analysis: Monte-Carlo method.
Usage
gofNORMtest (x)
gofEXPtest (x, Nsim=1000)
gofGUMBELtest (x, Nsim=1000)
gofGENLOGIStest (x, Nsim=1000)
gofGENPARtest (x, Nsim=1000)
gofGEVtest (x, Nsim=1000)
gofLOGNORMtest (x, Nsim=1000)
gofP3test (x, Nsim=1000)
Arguments
x |
data sample |
Nsim |
number of simulated samples from the hypothetical parent distribution |
Details
An introduction to the Anderson-Darling test, analogous to the following one, is available on https://en.wikipedia.org/wiki/Anderson-Darling_test.
Given a sample x_i \ (i=1,\ldots,m) of data extracted from a distribution F_R(x), the test is used to check the null hypothesis H_0 : F_R(x) = F(x,\theta), where F(x,\theta) is the hypothetical distribution and \theta is an array of parameters estimated from the sample x_i.
The Anderson-Darling goodness of fit test measures the departure between the hypothetical distribution F(x,\theta) and the cumulative frequency function F_m(x) defined as:
F_m(x) = 0 \ , \ x < x_{(1)}
F_m(x) = i/m \ , \ x_{(i)} \leq x < x_{(i+1)}
F_m(x) = 1 \ , \ x_{(m)} \leq x
where x_{(i)} is the i-th element of the ordered sample (in increasing order).
The test statistic is:
Q^2 = m \! \int_x \left[ F_m(x) - F(x,\theta) \right]^2 \Psi(x) \,dF(x)
where \Psi(x), in the case of the Anderson-Darling test (Laio, 2004), is \Psi(x) = [F(x,\theta) (1 - F(x,\theta))]^{-1}.
In practice, the statistic is calculated as:
A^2 = -m -\frac{1}{m} \sum_{i=1}^m \left\{ (2i-1)\ln[F(x_{(i)},\theta)] + (2m+1-2i)\ln[1 - F(x_{(i)},\theta)] \right\}
The statistic A^2, obtained in this way, may be compared with the population of A^2 values that one obtains if the samples effectively belong to the hypothetical distribution F(x,\theta).
In the case of the test of normality, this distribution is defined (see Laio, 2004); in other cases, e.g. the Pearson type III case, it can be derived with a Monte-Carlo procedure.
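As an illustration of the computational formula above, the following base-R sketch evaluates A^2 for a sample under a hypothetical normal parent whose parameters are estimated from the sample itself (a hedged example, not the internal implementation of the gof...test functions):
# Sketch: Anderson-Darling statistic A2 for a normal hypothesis.
x <- rnorm(30, 10, 1)
m <- length(x)
Fx <- pnorm(sort(x), mean(x), sd(x))    # F(x_(i), theta), increasing
i <- 1:m
A2 <- -m - sum((2*i - 1)*log(Fx) + (2*m + 1 - 2*i)*log(1 - Fx))/m
A2                                      # compare with gofNORMtest(x)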
Value
gofNORMtest tests the goodness of fit of a normal (Gauss) distribution with the sample x.
gofEXPtest tests the goodness of fit of an exponential distribution with the sample x.
gofGUMBELtest tests the goodness of fit of a Gumbel (EV1) distribution with the sample x.
gofGENLOGIStest tests the goodness of fit of a Generalized Logistic distribution with the sample x.
gofGENPARtest tests the goodness of fit of a Generalized Pareto distribution with the sample x.
gofGEVtest tests the goodness of fit of a Generalized Extreme Value distribution with the sample x.
gofLOGNORMtest tests the goodness of fit of a three-parameter Lognormal distribution with the sample x.
gofP3test tests the goodness of fit of a Pearson type III (gamma) distribution with the sample x.
They return the value A_2 of the Anderson-Darling statistic and its non-exceedance probability P.
Note that P is the probability of obtaining a test statistic A_2 lower than the one actually observed, assuming that the null hypothesis is true, i.e., P is one minus the p-value usually employed in statistical testing (see https://en.wikipedia.org/wiki/P-value).
If P(A_2) is, for example, greater than 0.90, the null hypothesis at significance level \alpha=10\% is rejected.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
x <- rnorm(30,10,1)
gofNORMtest(x)
x <- rand.gamma(50, 100, 15, 7)
gofP3test(x, Nsim=200)
x <- rand.GEV(50, 0.907, 0.169, 0.0304)
gofGEVtest(x, Nsim=200)
x <- rand.genlogis(50, 0.907, 0.169, 0.0304)
gofGENLOGIStest(x, Nsim=200)
x <- rand.genpar(50, 0.716, 0.418, 0.476)
gofGENPARtest(x, Nsim=200)
x <- rand.lognorm(50, 0.716, 0.418, 0.476)
gofLOGNORMtest(x, Nsim=200)
Two parameter Gumbel distribution and L-moments
Description
GUMBEL
provides the link between L-moments of a sample and the two parameter
Gumbel distribution.
Usage
f.gumb (x, xi, alfa)
F.gumb (x, xi, alfa)
invF.gumb (F, xi, alfa)
Lmom.gumb (xi, alfa)
par.gumb (lambda1, lambda2)
rand.gumb (numerosita, xi, alfa)
Arguments
x |
vector of quantiles |
xi |
vector of gumb location parameters |
alfa |
vector of gumb scale parameters |
F |
vector of probabilities |
lambda1 |
vector of sample means |
lambda2 |
vector of L-variances |
numerosita |
numeric value indicating the length of the vector to be generated |
Details
See https://en.wikipedia.org/wiki/Fisher-Tippett_distribution for an introduction to the Gumbel distribution.
Definition
Parameters (2): \xi (location), \alpha (scale).
Range of x: -\infty < x < \infty.
Probability density function:
f(x) = \alpha^{-1} \exp[-(x-\xi)/\alpha] \exp\{- \exp[-(x-\xi)/\alpha]\}
Cumulative distribution function:
F(x) = \exp[-\exp(-(x-\xi)/\alpha)]
Quantile function:
x(F) = \xi - \alpha \log(-\log F).
L-moments
\lambda_1 = \xi + \alpha \gamma
\lambda_2 = \alpha \log 2
\tau_3 = \log(9/8)/\log 2 = 0.1699
\tau_4 = (16 \log 2 - 10 \log 3)/\log 2 = 0.1504
Here \gamma is Euler's constant, 0.5772...
Parameters
\alpha=\lambda_2 / \log 2
\xi = \lambda_1 - \gamma \alpha
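These closed forms make the Gumbel fit a two-line computation; a minimal sketch, assuming the sample L-moments l1 and l2 are given, is:
# Sketch of the Gumbel method of L-moments; l1 and l2 are assumed given.
alfa <- l2/log(2)                  # scale
xi <- l1 - 0.5772156649*alfa       # location (Euler's constant)
c(xi=xi, alfa=alfa)                # compare with par.gumb(l1, l2)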
Lmom.gumb and par.gumb accept input as vectors of equal length. In f.gumb, F.gumb, invF.gumb and rand.gumb the parameters (xi, alfa) must be atomic.
Value
f.gumb gives the density f, F.gumb gives the distribution function F, invF.gumb gives the quantile function x, Lmom.gumb gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.gumb gives the parameters (xi, alfa), and rand.gumb generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
rnorm
, runif
, EXP
, GENLOGIS
, GENPAR
, GEV
, KAPPA
, LOGNORM
, P3
; DISTPLOTS
, GOFmontecarlo
, Lmoments
.
Examples
data(hydroSIMN)
annualflows[1:10,]
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.gumb(ll[1],ll[2])
f.gumb(1800,parameters$xi,parameters$alfa)
F.gumb(1800,parameters$xi,parameters$alfa)
invF.gumb(0.7686843,parameters$xi,parameters$alfa)
Lmom.gumb(parameters$xi,parameters$alfa)
rand.gumb(100,parameters$xi,parameters$alfa)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.gumb(Rll[1],Rll[2])
Lmom.gumb(parameters$xi,parameters$alfa)
Homogeneity tests
Description
Homogeneity tests for Regional Frequency Analysis.
Usage
ADbootstrap.test (x, cod, Nsim=500, index=2)
HW.tests (x, cod, Nsim=500)
DK.test (x, cod)
discordancy (x, cod)
criticalD ()
Arguments
x |
vector representing data from many samples defined with cod |
cod |
array that defines the data subdivision among sites |
Nsim |
number of regions simulated with the bootstrap of the original region |
index |
if |
Details
The Hosking and Wallis heterogeneity measures
The idea underlying the Hosking and Wallis (1993) heterogeneity statistics is to measure the sample variability of the L-moment ratios and compare it to the variation that would be expected in a homogeneous region. The latter is estimated through repeated simulations of homogeneous regions with samples drawn from a four-parameter kappa distribution (see e.g. Hosking and Wallis, 1997, pp. 202-204).
More in detail, the steps are the following:
with regard to the k samples belonging to the region under analysis, find the sample L-moment ratios (see Hosking and Wallis, 1997) pertaining to the i-th site: these are the L-coefficient of variation (L-CV),
t^{(i)}=\frac{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{2(j-1)}{(n_i-1)}-1\right)Y_{i,j}}{\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{i,j}}
the coefficient of L-skewness,
t_3^{(i)}=\frac{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{6(j-1)(j-2)}{(n_i-1)(n_i-2)}-\frac{6(j-1)}{(n_i-1)}+1\right)Y_{i,j}}{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{2(j-1)}{(n_i-1)}-1\right)Y_{i,j}}
and the coefficient of L-kurtosis,
t_4^{(i)}=\frac{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{20(j-1)(j-2)(j-3)}{(n_i-1)(n_i-2)(n_i-3)}-\frac{30(j-1)(j-2)}{(n_i-1)(n_i-2)}+\frac{12(j-1)}{(n_i-1)}-1\right)Y_{i,j}}{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{2(j-1)}{(n_i-1)}-1\right)Y_{i,j}}
Note that the L-moment ratios are not affected by the normalization by the index value, i.e. it is the same to use X_{i,j} or Y_{i,j} in the equations above.
Define the regional averaged L-CV, L-skewness and L-kurtosis coefficients,
t^R = \frac{\sum_{i=1}^k n_i t^{(i)}}{ \sum_{i=1}^k n_i}
t_3^R =\frac{ \sum_{i=1}^k n_i t_3^{(i)}}{ \sum_{i=1}^k n_i}
t_4^R =\frac{ \sum_{i=1}^k n_i t_4^{(i)}}{\sum_{i=1}^k n_i}
and compute the statistic
V = \left\{ \sum_{i=1}^k n_i (t^{(i)} - t^R )^2 / \sum_{i=1}^k n_i\right\} ^{1/2}
Fit the parameters of a four-parameter kappa distribution to the regional averaged L-moment ratios t^R, t_3^R and t_4^R, and then generate a large number N_{sim} of realizations of sets of k samples. The i-th site sample in each set has a kappa distribution as its parent and record length equal to n_i. For each simulated homogeneous set, calculate the statistic V, obtaining N_{sim} values.
On this vector of V values determine the mean \mu_V and standard deviation \sigma_V that relate to the hypothesis of homogeneity (actually, under the composite hypothesis of homogeneity and kappa parent distribution).
A heterogeneity measure, which is called here HW_1, is finally found as
\theta_{HW_1} = \frac{V - \mu_V}{\sigma_V}
\theta_{HW_1} can be approximated by a normal distribution with zero mean and unit variance: following Hosking and Wallis (1997), the region under analysis can therefore be regarded as ‘acceptably homogeneous’ if \theta_{HW_1}<1, ‘possibly heterogeneous’ if 1 \leq \theta_{HW_1} < 2, and ‘definitely heterogeneous’ if \theta_{HW_1}\geq2. Hosking and Wallis (1997) suggest that these limits should be treated as useful guidelines. Even if the \theta_{HW_1} statistic is constructed like a significance test, significance levels obtained from such a test would in fact be accurate only under special assumptions: data that are independent both serially and between sites, and a true regional distribution that is kappa.
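To make the first steps concrete, the following sketch computes the site L-CVs, the regional average t^R and the statistic V directly from x and cod, using LCV from the package; the kappa simulation step, which HW.tests performs internally, is omitted here:
# Sketch: regional average L-CV and the statistic V (first HW steps only).
data(hydroSIMN)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
ni <- tapply(x, cod, length)               # record lengths n_i
ti <- sapply(split(x, cod), LCV)           # site L-CVs t^(i)
tR <- sum(ni*ti)/sum(ni)                   # regional average t^R
V <- sqrt(sum(ni*(ti - tR)^2)/sum(ni))     # dispersion statistic V
# HW.tests(x, cod) adds the kappa simulations that turn V into
# theta_HW1 = (V - mu_V)/sigma_V.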
Hosking and Wallis (1993) also give alternative heterogeneity measures (that we call HW_2 and HW_3), in which V is replaced by:
V_2 = \sum_{i=1}^k n_i \left\{ (t^{(i)} - t^R)^2 + (t_3^{(i)} - t_3^R)^2\right\}^{1/2} / \sum_{i=1}^k n_i
or
V_3 = \sum_{i=1}^k n_i \left\{ (t_3^{(i)} - t_3^R)^2 + (t_4^{(i)} - t_4^R)^2\right\}^{1/2} / \sum_{i=1}^k n_i
The test statistic in this case becomes
\theta_{HW_2} = \frac{V_2 - \mu_{V_2}}{\sigma_{V_2}}
or
\theta_{HW_3} = \frac{V_3 - \mu_{V_3}}{\sigma_{V_3}}
with acceptability limits similar to those of the HW_1 statistic.
Hosking and Wallis (1997) judge \theta_{HW_2} and \theta_{HW_3} to be inferior to \theta_{HW_1} and note that they rarely yield values larger than 2 even for grossly heterogeneous regions.
The bootstrap Anderson-Darling test
A test that does not make any assumption on the parent distribution is the Anderson-Darling (AD) rank test (Scholz and Stephens, 1987).
The AD test is the generalization of the classical Anderson-Darling goodness of fit test (e.g., D'Agostino and Stephens, 1986), and it is used to test the hypothesis that k independent samples belong to the same population without specifying their common distribution function.
The test is based on the comparison between local and regional empirical distribution functions. The empirical distribution function, or sample distribution function, is defined by
F(x)=\frac{j}{\eta} \ , \ x_{(j)}\leq x < x_{(j+1)}
where \eta is the size of the sample and x_{(j)} are the order statistics, i.e. the observations arranged in ascending order. Denote the empirical distribution function of the i-th sample (local) by \hat{F}_i(x), and that of the pooled sample of all N = n_1 + ... + n_k observations (regional) by H_N (x). The k-sample Anderson-Darling test statistic is then defined as
\theta_{AD} = \sum_{i=1}^k n_i \int _{{\rm all}\ x} \frac{[\hat{F}_i (x) - H_N (x) ]^2}{H_N (x) [ 1 - H_N (x) ] } dH_N (x)
If the pooled ordered sample is Z_1 < ... < Z_N, the computational formula to evaluate \theta_{AD} is:
\theta_{AD} = \frac{1}{N} \sum_{i=1}^k \frac{1}{n_i}\sum_{j=1}^{N-1} \frac{(N M_{ij} - j n_i)^2 }{j (N-j)}
where M_{ij} is the number of observations in the i-th sample that are not greater than Z_j. The homogeneity test can be carried out by comparing the obtained \theta_{AD} value with the tabulated percentage points reported by Scholz and Stephens (1987) for different significance levels.
The statistic \theta_{AD} depends on the sample values only through their ranks. This guarantees that the test statistic remains unchanged when the samples undergo monotonic transformations, an important stability property not possessed by the HW heterogeneity measures. However, problems arise in applying this test within a common index value procedure. In fact, the index value procedure corresponds to dividing each site sample by a different value, thus modifying the ranks in the pooled sample. In particular, this has the effect of making the local empirical distribution functions much more similar to each other, giving an impression of homogeneity even when the samples are highly heterogeneous. The effect is analogous to that encountered when applying goodness-of-fit tests to distributions whose parameters are estimated from the same sample used for the test (e.g., D'Agostino and Stephens, 1986; Laio, 2004). In both cases, the percentage points for the test should be suitably redetermined. This can be done with a nonparametric bootstrap approach comprising the following steps:
build up the pooled sample \cal S of the observed non-dimensional data;
sample with replacement from \cal S and generate k artificial local samples, of size n_1, \dots ,n_k;
divide each sample by its index value, and calculate \theta^{(1)}_{AD};
repeat the procedure N_{sim} times and obtain a sample of \theta^{(j)}_{AD}, j=1,\dots , N_{sim} values, whose empirical distribution function can be used as an approximation of G_{H_0}(\theta_{AD}), the distribution of \theta_{AD} under the null hypothesis of homogeneity.
The acceptance limits for the test, corresponding to any significance level \alpha, are then easily determined as the quantiles of G_{H_0}(\theta_{AD}) corresponding to a probability (1-\alpha).
We will call the test obtained with the above procedure the bootstrap Anderson-Darling test, hereafter referred to as AD.
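A compact sketch of this procedure, built on the computational formula for \theta_{AD} given above (an illustration, not the internal code of ADbootstrap.test; x2 and cod2 are assumed to be defined as in the Examples below), could read:
# Sketch: bootstrap Anderson-Darling homogeneity test.
thetaAD <- function(xs) {                  # computational formula above
  z <- sort(unlist(xs)); N <- length(z); j <- 1:(N - 1)
  sum(sapply(xs, function(xi) {
    Mij <- sapply(z[j], function(zj) sum(xi <= zj))
    sum((N*Mij - j*length(xi))^2/(j*(N - j)))/length(xi)
  }))/N
}
xs <- split(x2/ave(x2, cod2, FUN=median), cod2)   # non-dimensional samples
obs <- thetaAD(xs)
S <- unlist(xs); n <- sapply(xs, length); Nsim <- 200
boot <- replicate(Nsim, {
  ys <- lapply(n, function(ni) sample(S, ni, replace=TRUE))
  thetaAD(lapply(ys, function(y) y/median(y)))    # re-divide by index value
})
mean(boot <= obs)    # non-exceedance probability P, as in ADbootstrap.test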
Durbin and Knott test
The last considered homogeneity test derives from a goodness-of-fit statistic originally proposed by Durbin and Knott (1971). The test is formulated to measure discrepancies in the dispersion of the samples, without accounting for the possible presence of discrepancies in the mean or skewness of the data. Under this aspect the test is similar to the HW_1 test, while it is analogous to the AD test in that it is a rank test.
The original goodness-of-fit test is very simple: suppose we have a sample X_i, i = 1, ..., n, with hypothetical distribution F(x); under the null hypothesis the random variable F(X_i) has a uniform distribution in the (0,1) interval, and the statistic D = \sum_{i=1}^n \cos[2 \pi F(X_i)] is approximately normally distributed with mean 0 and variance 1 (Durbin and Knott, 1971). D serves the purpose of detecting discrepancy in data dispersion: if the variance of X_i is greater than that of the hypothetical distribution F(x), D is significantly greater than 0, while D is significantly below 0 in the reverse case. Differences between the mean (or the median) of X_i and F(x) are instead not detected by D, which guarantees that the normalization by the index value does not affect the test.
The extension to homogeneity testing of the Durbin and Knott (DK) statistic is straightforward: we substitute the empirical distribution function obtained with the pooled observed data, H_N(x), for F(x) in D, obtaining at each site a statistic
D_i = \sum_{j=1}^{n_i} \cos[2 \pi H_N(X_j)]
which is normal under the hypothesis of homogeneity. The statistic
\theta_{DK} = \sum_{i=1}^k D_i^2
then has a chi-squared distribution with k-1 degrees of freedom, which allows one to determine the acceptability limits for the test, corresponding to any significance level \alpha.
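A direct base-R rendering of this extension (a hedged sketch, not necessarily identical to DK.test internally; x and cod as in the Examples below) is:
# Sketch: Durbin-Knott homogeneity statistic.
# Assumption: each D_i is standardized by its null standard deviation
# sqrt(n_i/2) (var[cos(2*pi*U)] = 1/2 for uniform U), so that the sum of
# squares is approximately chi-squared.
xs <- split(x, cod)
HN <- ecdf(unlist(xs))                     # pooled empirical distribution
Di <- sapply(xs, function(xi) sum(cos(2*pi*HN(xi)))/sqrt(length(xi)/2))
thetaDK <- sum(Di^2)
pchisq(thetaDK, df=length(xs) - 1)         # distribution value P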
Comparison among tests
The comparison (Viglione et al., 2007) shows that the Hosking and Wallis heterogeneity measure HW_1 (based on L-CV only) is preferable when skewness is low, while the bootstrap Anderson-Darling test should be used for more skewed regions.
As for HW_2, the Hosking and Wallis heterogeneity measure based on L-CV and L-CA, the comparison shows once more how much it lacks power.
Our suggestion is to guide the choice of the test according to a compromise between power and Type I error of the HW_1 and AD tests.
The L-moment space is divided into two regions:
if the t_3^R coefficient for the region under analysis is lower than 0.23, we propose to use the Hosking and Wallis heterogeneity measure HW_1;
if t_3^R > 0.23, the bootstrap Anderson-Darling test is preferable.
Value
ADbootstrap.test and DK.test give the test statistic and its distribution value P.
If P is, for example, 0.92, the samples should not be considered heterogeneous at significance levels below 8%.
HW.tests gives the two Hosking and Wallis heterogeneity measures H_1 and H_2; following Hosking and Wallis (1997), the region under analysis can therefore be regarded as ‘acceptably homogeneous’ if H_1<1, ‘possibly heterogeneous’ if 1 \leq H_1 < 2, and ‘definitely heterogeneous’ if H_1 \geq 2.
discordancy returns the discordancy measure D of Hosking and Wallis for all sites.
Hosking and Wallis suggest considering a site discordant if D \geq 3 when N \geq 15 (where N is the number of sites considered in the region). For N<15 the critical values of D can be listed with criticalD.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
traceWminim
, roi
, KAPPA
, HW.original
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
split(x,cod)
#ADbootstrap.test(x,cod,Nsim=100) # it takes some time
#HW.tests(x,cod) # it takes some time
DK.test(x,cod)
fac <- factor(annualflows["cod"][,],levels=c(34:38))
x2 <- annualflows[!is.na(fac),"dato"]
cod2 <- annualflows[!is.na(fac),"cod"]
ADbootstrap.test(x2,cod2,Nsim=100)
ADbootstrap.test(x2,cod2,index=1,Nsim=200)
HW.tests(x2,cod2,Nsim=100)
DK.test(x2,cod2)
discordancy(x,cod)
criticalD()
Four parameter kappa distribution and L-moments
Description
KAPPA
provides the link between L-moments of a sample and the four parameter
kappa distribution.
Usage
f.kappa (x, xi, alfa, k, h)
F.kappa (x, xi, alfa, k, h)
invF.kappa (F, xi, alfa, k, h)
Lmom.kappa (xi, alfa, k, h)
par.kappa (lambda1, lambda2, tau3, tau4)
rand.kappa (numerosita, xi, alfa, k, h)
Arguments
x |
vector of quantiles |
xi |
vector of kappa location parameters |
alfa |
vector of kappa scale parameters |
k |
vector of kappa third parameters |
h |
vector of kappa fourth parameters |
F |
vector of probabilities |
lambda1 |
vector of sample means |
lambda2 |
vector of L-variances |
tau3 |
vector of L-CA (or L-skewness) |
tau4 |
vector of L-kurtosis |
numerosita |
numeric value indicating the length of the vector to be generated |
Details
Definition
Parameters (4): \xi (location), \alpha (scale), k, h.
Range of x: the upper bound is \xi + \alpha/k if k>0, \infty if k \le 0; the lower bound is \xi + \alpha(1-h^{-k})/k if h>0, \xi + \alpha/k if h \le 0 and k<0, and -\infty if h \le 0 and k \ge 0.
Probability density function:
f(x)=\alpha^{-1} [1-k(x-\xi)/\alpha]^{1/k-1} [F(x)]^{1-h}
Cumulative distribution function:
F(x)=\{1-h[1-k(x-\xi)/\alpha]^{1/k}\}^{1/h}
Quantile function:
x(F)= \xi + \frac{\alpha}{k} \left[ 1-\left( \frac{1-F^h}{h} \right)^k \right]
h=-1 is the generalized logistic distribution; h=0 is the generalized extreme value distribution; h=1 is the generalized Pareto distribution.
L-moments
L-moments are defined for h \ge 0 and k>-1, or for h<0 and -1<k<-1/h.
\lambda_1 = \xi + \alpha(1-g_1)/k
\lambda_2 = \alpha(g_1 - g_2)/k
\tau_3 = (-g_1 + 3g_2 - 2g_3)/(g_1 - g_2)
\tau_4 = (-g_1 + 6g_2 - 10g_3 + 5g_4)/(g_1 - g_2)
where
g_r = \frac{r\Gamma(1+k)\Gamma(r/h)}{h^{1+k}\Gamma(1+k+r/h)} if h>0;
g_r = \frac{r \Gamma(1+k)\Gamma(-k-r/h)}{(-h)^{1+k}\Gamma(1-r/h)} if h<0.
Here \Gamma denotes the gamma function
\Gamma (x) = \int_0^{\infty} t^{x-1} e^{-t} dt
Parameters
There are no simple expressions for the parameters in terms of the L-moments.
However, they can be obtained with a numerical algorithm considering the formulations of \tau_3 and \tau_4 in terms of k and h.
Here we use the function optim to minimize (t_3-\tau_3)^2 + (t_4-\tau_4)^2, where t_3 and t_4 are the sample L-moment ratios.
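A minimal sketch of that numerical step, restricted to the h>0 branch, could be the following (a hedged illustration of the idea; par.kappa handles both branches and the feasibility constraints):
# Sketch: find (k, h), h > 0, matching given t3 and t4 by minimizing
# (t3 - tau3)^2 + (t4 - tau4)^2 with the g_r expressions above.
tau34 <- function(k, h) {
  g <- function(r) exp(log(r) + lgamma(1 + k) + lgamma(r/h) -
                       (1 + k)*log(h) - lgamma(1 + k + r/h))
  g1 <- g(1); g2 <- g(2); g3 <- g(3); g4 <- g(4)
  c((-g1 + 3*g2 - 2*g3)/(g1 - g2), (-g1 + 6*g2 - 10*g3 + 5*g4)/(g1 - g2))
}
obj <- function(p, t3, t4) {
  v <- tau34(p[1], p[2])
  if (any(!is.finite(v))) return(1e10)     # guard infeasible (k, h)
  sum((v - c(t3, t4))^2)
}
sol <- optim(c(0.1, 0.5), obj, t3=0.2, t4=0.12)   # illustrative targets
sol$par                                    # compare with par.kappa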
Lmom.kappa and par.kappa accept input as vectors of equal length. In f.kappa, F.kappa, invF.kappa and rand.kappa the parameters (xi, alfa, k, h) must be atomic.
Value
f.kappa gives the density f, F.kappa gives the distribution function F, invF.kappa gives the quantile function x, Lmom.kappa gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.kappa gives the parameters (xi, alfa, k, h), and rand.kappa generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
rnorm
, runif
, EXP
, GENLOGIS
, GENPAR
, GEV
, GUMBEL
, LOGNORM
, P3
; optim
, DISTPLOTS
, GOFmontecarlo
, Lmoments
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.kappa(ll[1],ll[2],ll[4],ll[5])
f.kappa(1800,parameters$xi,parameters$alfa,parameters$k,parameters$h)
F.kappa(1800,parameters$xi,parameters$alfa,parameters$k,parameters$h)
invF.kappa(0.771088,parameters$xi,parameters$alfa,parameters$k,parameters$h)
Lmom.kappa(parameters$xi,parameters$alfa,parameters$k,parameters$h)
rand.kappa(100,parameters$xi,parameters$alfa,parameters$k,parameters$h)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.kappa(Rll[1],Rll[2],Rll[4],Rll[5])
Lmom.kappa(parameters$xi,parameters$alfa,parameters$k,parameters$h)
Three parameter lognormal distribution and L-moments
Description
LOGNORM
provides the link between L-moments of a sample and the three parameter
log-normal distribution.
Usage
f.lognorm (x, xi, alfa, k)
F.lognorm (x, xi, alfa, k)
invF.lognorm (F, xi, alfa, k)
Lmom.lognorm (xi, alfa, k)
par.lognorm (lambda1, lambda2, tau3)
rand.lognorm (numerosita, xi, alfa, k)
Arguments
x |
vector of quantiles |
xi |
vector of lognorm location parameters |
alfa |
vector of lognorm scale parameters |
k |
vector of lognorm shape parameters |
F |
vector of probabilities |
lambda1 |
vector of sample means |
lambda2 |
vector of L-variances |
tau3 |
vector of L-CA (or L-skewness) |
numerosita |
numeric value indicating the length of the vector to be generated |
Details
See https://en.wikipedia.org/wiki/Log-normal_distribution for an introduction to the lognormal distribution.
Definition
Parameters (3): \xi (location), \alpha (scale), k (shape).
Range of x: -\infty < x \le \xi + \alpha / k if k>0; -\infty < x < \infty if k=0; \xi + \alpha / k \le x < \infty if k<0.
Probability density function:
f(x) = \frac{e^{ky-y^2/2}}{\alpha \sqrt{2\pi}}
where y = -k^{-1}\log\{1 - k(x - \xi)/\alpha\} if k \ne 0, y = (x-\xi)/\alpha if k=0.
Cumulative distribution function:
F(x) = \Phi(y)
where \Phi(x)=\int_{-\infty}^x \phi(t)dt.
Quantile function: x(F) has no explicit analytical form.
k=0 is the Normal distribution with parameters \xi and \alpha.
L-moments
L-moments are defined for all values of k.
\lambda_1 = \xi + \alpha(1 - e^{k^2/2})/k
\lambda_2 = \frac{\alpha}{k} e^{k^2/2} [1 - 2 \Phi(-k/\sqrt{2})]
There are no simple expressions for the L-moment ratios \tau_r with r \ge 3.
Here we use the rational-function approximation given in Hosking and Wallis (1997, p. 199).
Parameters
The shape parameter k is a function of \tau_3 alone, but no explicit solution is possible.
Here we use the approximation given in Hosking and Wallis (1997, p. 199).
Given k, the other parameters are given by
\alpha = \frac{\lambda_2 k e^{-k^2/2}}{1-2 \Phi(-k/\sqrt{2})}
\xi = \lambda_1 - \frac{\alpha}{k} (1 - e^{k^2/2})
Lmom.lognorm and par.lognorm accept input as vectors of equal length. In f.lognorm, F.lognorm, invF.lognorm and rand.lognorm the parameters (xi, alfa, k) must be atomic.
Value
f.lognorm gives the density f, F.lognorm gives the distribution function F, invF.lognorm gives the quantile function x, Lmom.lognorm gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.lognorm gives the parameters (xi, alfa, k), and rand.lognorm generates random deviates.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
rnorm
, runif
, EXP
, GENLOGIS
, GENPAR
, GEV
, GUMBEL
, KAPPA
, P3
; DISTPLOTS
, GOFmontecarlo
, Lmoments
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.lognorm(ll[1],ll[2],ll[4])
f.lognorm(1800,parameters$xi,parameters$alfa,parameters$k)
F.lognorm(1800,parameters$xi,parameters$alfa,parameters$k)
invF.lognorm(0.7529877,parameters$xi,parameters$alfa,parameters$k)
Lmom.lognorm(parameters$xi,parameters$alfa,parameters$k)
rand.lognorm(100,parameters$xi,parameters$alfa,parameters$k)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.lognorm(Rll[1],Rll[2],Rll[4])
Lmom.lognorm(parameters$xi,parameters$alfa,parameters$k)
Hosking and Wallis sample L-moments
Description
Lmoments
provides the estimate of L-moments of a sample or regional L-moments of a region.
Usage
Lmoments (x)
regionalLmoments (x,cod)
LCV (x)
LCA (x)
Lkur (x)
Arguments
x |
vector representing a data-sample (or data from many samples defined with cod) |
cod |
array that defines the data subdivision among sites |
Details
The estimation of L-moments is based on a sample of size n, arranged in ascending order.
Let x_{1:n} \le x_{2:n} \le \dots \le x_{n:n} be the ordered sample.
An unbiased estimator of the probability weighted moments \beta_r is:
b_r = n^{-1} \sum_{j=r+1}^n \frac{(j-1)(j-2)\dots(j-r)}{(n-1)(n-2)\dots(n-r)} x_{j:n}
The sample L-moments are defined by:
l_1 = b_0
l_2 = 2b_1 - b_0
l_3 = 6b_2 - 6b_1 + b_0
l_4 = 20b_3-30b_2+12b_1-b_0
and in general
l_{r+1} = \sum_{k=0}^r \frac{(-1)^{r-k}(r+k)!}{(k!)^2(r-k)!} b_k
where r=0,1,\dots,n-1.
The sample L-moment ratios are defined by t_r=l_r/l_2 and the sample L-CV by t=l_2/l_1.
Sample regional L-CV, L-skewness and L-kurtosis coefficients are defined as
t^R = \frac{\sum_{i=1}^k n_i t^{(i)}}{ \sum_{i=1}^k n_i}
t_3^R =\frac{ \sum_{i=1}^k n_i t_3^{(i)}}{ \sum_{i=1}^k n_i}
t_4^R =\frac{ \sum_{i=1}^k n_i t_4^{(i)}}{\sum_{i=1}^k n_i}
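For illustration, the following base-R sketch computes b_0, ..., b_3 and the first four sample L-moments directly from the definitions above; Lmoments itself should give the same values:
# Sketch: sample L-moments from the unbiased PWM estimators above.
x <- sort(rnorm(30, 10, 2)); n <- length(x); j <- 1:n
b0 <- mean(x)
b1 <- sum((j - 1)/(n - 1)*x)/n
b2 <- sum((j - 1)*(j - 2)/((n - 1)*(n - 2))*x)/n
b3 <- sum((j - 1)*(j - 2)*(j - 3)/((n - 1)*(n - 2)*(n - 3))*x)/n
l1 <- b0; l2 <- 2*b1 - b0
l3 <- 6*b2 - 6*b1 + b0; l4 <- 20*b3 - 30*b2 + 12*b1 - b0
c(l1=l1, l2=l2, lcv=l2/l1, lca=l3/l2, lkur=l4/l2)   # compare with Lmoments(x)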
Value
Lmoments gives the L-moments (l_1, l_2, t, t_3, t_4), regionalLmoments gives the regional weighted L-moments (l_1^R, l_2^R, t^R, t_3^R, t_4^R), LCV gives the coefficient of L-variation, LCA gives the L-skewness and Lkur gives the L-kurtosis of x.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
x <- rnorm(30,10,2)
Lmoments(x)
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
split(x,cod)
camp <- split(x,cod)$"45"
Lmoments(camp)
sapply(split(x,cod),Lmoments)
regionalLmoments(x,cod)
Maximum likelihood parameters estimation
Description
Maximum Likelihood estimation of parameters for extreme-value distributions, from Laio (2004).
Usage
ML_estimation (x, dist="NORM")
moment_estimation (x, dist="NORM")
Arguments
x |
data sample |
dist |
distribution: normal |
Value
ML_estimation estimates the parameters of the distribution dist from a sample x using the maximum likelihood approach.
moment_estimation estimates the parameters of the distribution dist from a sample x using the method of moments.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
# sample from an EV1 distribution
sm <- rand.gumb(100, 0, 1)
moment_estimation (sm, dist="GEV")
ML_estimation (sm, dist="GEV")
F.GEV(sm, -0.051, 0.97, -0.024)
rand.GEV (100, -0.051, 0.97, -0.024)
moment_estimation (sm, dist="P3")
ML_estimation (sm, dist="P3")
Model Selection Criteria
Description
Model selection criteria for the frequency analysis of hydrological extremes, from Laio et al (2008).
Usage
MSClaio2008 (sample, dist=c("NORM","LN","GUMBEL","EV2","GEV","P3","LP3"),
crit=c("AIC", "AICc", "BIC", "ADC"))
## S3 method for class 'MSClaio2008'
print(x, digits=max(3, getOption("digits") - 3), ...)
## S3 method for class 'MSClaio2008'
summary(object, ...)
## S3 method for class 'MSClaio2008'
plot(x, ...)
Arguments
sample |
data sample |
dist |
distributions: normal |
crit |
Model-selection criteria: Akaike Information Criterion |
x |
object of class MSClaio2008 |
object |
object of class MSClaio2008 |
digits |
minimal number of "significant" digits, see 'print.default' |
... |
other arguments |
Details
The following lines are extracted from Laio et al. (2008). See the paper for more details and references.
Model selection criteria
The problem of model selection can be formalized as follows: a sample of n data, D=(x_1, \dots, x_n), arranged in ascending order is available, sampled from an unknown parent distribution f(x); N_m operating models, M_j, j=1,\dots, N_m, are used to represent the data.
The operating models are in the form of probability distributions, M_j = g_j(x,\hat{\theta}), with parameters \hat{\theta} estimated from the available data sample D.
The aim of model selection is to identify the model M_{opt} which is best suited to represent the data, i.e. the model which is closest, in some sense, to the parent distribution f(x).
Three different model selection criteria are considered here, namely, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Anderson-Darling Criterion (ADC). Of the three methods, the first two belong to the category of classical literature approaches, while the third derives from a heuristic interpretation of the results of a standard goodness-of-fit test (see Laio, 2004).
Akaike Information Criterion
The Akaike Information Criterion (AIC) for the j-th operational model can be computed as
AIC_j = -2 \ln (L_j(\hat{\theta})) + 2 p_j
where
L_j(\hat{\theta}) = \prod_{i=1}^n g_j(x_i, \hat{\theta})
is the likelihood function, evaluated at the point \theta=\hat{\theta} corresponding to the maximum likelihood estimator of the parameter vector \theta, and p_j is the number of estimated parameters of the j-th operational model.
In practice, after computing AIC_j for all of the operating models, one selects the model with the minimum AIC value, AIC_{min}.
When the sample size n is small with respect to the number of estimated parameters p, the AIC may perform inadequately. In those cases a second-order variant of AIC, called AICc, should be used:
AICc_j = -2 \ln (L_j(\hat{\theta})) + 2 p_j (n/(n - p_j - 1))
Indicatively, AICc should be used when n/p < 40.
Bayesian Information Criterion
The Bayesian Information Criterion (BIC) for the j-th operational model reads
BIC_j = -2 \ln (L_j(\hat{\theta})) + \ln(n) p_j
In practical applications, after computing BIC_j for all of the operating models, one selects the model with the minimum BIC value, BIC_{min}.
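As a concrete instance of these formulas, the following sketch computes AIC, AICc and BIC for a normal model fitted by maximum likelihood (a hedged illustration; MSClaio2008 applies the same formulas to each candidate distribution):
# Sketch: AIC, AICc and BIC for an ML-fitted normal model.
x <- rnorm(40, 10, 2); n <- length(x); p <- 2
mu <- mean(x); s <- sqrt(mean((x - mu)^2))   # ML estimates for the normal
logL <- sum(dnorm(x, mu, s, log=TRUE))
AIC <- -2*logL + 2*p
AICc <- -2*logL + 2*p*(n/(n - p - 1))
BIC <- -2*logL + log(n)*p
c(AIC=AIC, AICc=AICc, BIC=BIC)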
Anderson-Darling Criterion
The Anderson-Darling criterion has the form:
ADC_j = 0.0403 + 0.116 ((\Delta_{AD,j} - \epsilon_j)/\beta_j)^{(\eta_j/0.851)}
if 1.2 \epsilon_j < \Delta_{AD,j},
ADC_j = [0.0403 + 0.116 ((0.2 \epsilon_j)/\beta_j)^{(\eta_j/0.851)}] ((\Delta_{AD,j} - 0.2 \epsilon_j) / \epsilon_j)
if 1.2 \epsilon_j \ge \Delta_{AD,j},
where \Delta_{AD,j} is the discrepancy measure characterizing the criterion, i.e. the Anderson-Darling statistic A2 in GOFlaio2004, and \epsilon_j, \beta_j and \eta_j are distribution-dependent coefficients tabled by Laio [2004, Tables 3 and 5] for a set of seven distributions commonly employed for the frequency analysis of extreme events.
In practice, after computing ADC_j for all of the operating models, one selects the model with the minimum ADC value, ADC_{min}.
Value
MSClaio2008 returns the values of the chosen criteria crit (see Details) applied to sample, for every distribution in dist.
plot.MSClaio2008 plots the empirical distribution function of sample (Weibull plotting position) on a log-normal probability plot, plots the candidate distributions dist (whose parameters are evaluated with the maximum likelihood technique, see MLlaio2004), and highlights the ones chosen by the criteria crit.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
data(FEH1000)
sitedata <- am[am[,1]==53004, ] # data of site 53004
serieplot(sitedata[,4], sitedata[,3])
MSC <- MSClaio2008(sitedata[,4])
MSC
summary(MSC)
plot(MSC)
sitedata <- am[am[,1]==69023, ] # data of site 69023
serieplot(sitedata[,4], sitedata[,3])
MSC <- MSClaio2008(sitedata[,4], crit=c("AIC", "ADC"))
MSC
summary(MSC)
plot(MSC)
sitedata <- am[am[,1]==83802, ] # data of site 83802
serieplot(sitedata[,4], sitedata[,3])
MSC <- MSClaio2008(sitedata[,4], dist=c("GEV", "P3", "LP3"))
MSC
summary(MSC)
plot(MSC)
# short sample, high positive L-CA
sitedata <- am[am[,1]==40012, ] # data of site 40012
serieplot(sitedata[,4], sitedata[,3])
MSC <- MSClaio2008(sitedata[,4])
MSC
summary(MSC)
plot(MSC)
# negative L-CA
sitedata <- am[am[,1]==68002, ] # data of site 68002
serieplot(sitedata[,4], sitedata[,3])
MSC <- MSClaio2008(sitedata[,4])
MSC
summary(MSC)
plot(MSC)
Three parameters Pearson type III distribution and L-moments
Description
P3
provides the link between L-moments of a sample and the three parameter
Pearson type III distribution.
Usage
f.gamma (x, xi, beta, alfa)
F.gamma (x, xi, beta, alfa)
invF.gamma (F, xi, beta, alfa)
Lmom.gamma (xi, beta, alfa)
par.gamma (lambda1, lambda2, tau3)
rand.gamma (numerosita, xi, beta, alfa)
mom2par.gamma (mu, sigma, gamm)
par2mom.gamma (alfa, beta, xi)
Arguments
x |
vector of quantiles |
mu |
vector of gamma mean |
sigma |
vector of gamma standard deviation |
gamm |
vector of gamma third moment |
F |
vector of probabilities |
lambda1 |
vector of sample means |
lambda2 |
vector of L-variances |
tau3 |
vector of L-CA (or L-skewness) |
numerosita |
numeric value indicating the length of the vector to be generated |
alfa |
vector of gamma shape parameters |
beta |
vector of gamma scale parameters |
xi |
vector of gamma location parameters |
Details
See https://en.wikipedia.org/wiki/Pearson_distribution for an introduction to the Pearson distribution, and https://en.wikipedia.org/wiki/Gamma_distribution for an introduction to the Gamma distribution (the Pearson type III distribution is, essentially, a Gamma distribution with 3 parameters).
Definition
Parameters (3): \xi (location), \beta (scale), \alpha (shape).
Moments (3): \mu (mean), \sigma (standard deviation), \gamma (skewness).
If \gamma \ne 0, let \alpha=4/\gamma^2, \beta=\frac{1}{2}\sigma |\gamma|, and \xi= \mu - 2 \sigma/\gamma.
If \gamma > 0, then the range of x is \xi \le x < \infty and
f(x) = \frac{(x - \xi)^{\alpha - 1} e^{-(x-\xi)/\beta}}{\beta^{\alpha} \Gamma(\alpha)}
F(x) = G \left(\alpha, \frac{x-\xi}{\beta}\right)/ \Gamma(\alpha)
If \gamma=0, then the distribution is Normal, the range of x is -\infty < x < \infty and
f(x) = \phi \left(\frac{x-\mu}{\sigma}\right)
F(x) = \Phi \left(\frac{x-\mu}{\sigma}\right)
where \phi(x)=(2\pi)^{-1/2}\exp(-x^2/2) and \Phi(x)=\int_{-\infty}^x \phi(t)dt.
If \gamma < 0, then the range of x is -\infty < x \le \xi and
f(x) = \frac{(\xi - x)^{\alpha - 1} e^{-(\xi-x)/\beta}}{\beta^{\alpha} \Gamma(\alpha)}
F(x) = G \left(\alpha, \frac{\xi-x}{\beta}\right)/ \Gamma(\alpha)
In each case, x(F) has no explicit analytical form.
Here \Gamma is the gamma function, defined as
\Gamma (x) = \int_0^{\infty} t^{x-1} e^{-t} dt
and
G(\alpha, x) = \int_0^x t^{\alpha-1} e^{-t} dt
is the incomplete gamma function.
\gamma=2 is the exponential distribution; \gamma=0 is the Normal distribution; \gamma=-2 is the reverse exponential distribution.
The parameters \mu, \sigma and \gamma are the conventional moments of the distribution.
L-moments
Assuming \gamma>0, L-moments are defined for 0<\alpha<\infty.
\lambda_1 = \xi + \alpha \beta
\lambda_2 = \pi^{-1/2} \beta \Gamma(\alpha + 1/2)/\Gamma(\alpha)
\tau_3 = 6 I_{1/3} (\alpha, 2 \alpha)-3
where I_x(p,q) is the incomplete beta function ratio
I_x(p,q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)} \int_0^x t^{p-1} (1-t)^{q-1} dt
There is no simple expression for \tau_4.
Here we use the rational-function approximation given by Hosking and Wallis (1997, pp. 201-202).
The corresponding results for \gamma <0 are obtained by changing the signs of \lambda_1, \tau_3 and \xi wherever they occur above.
Parameters
\alpha is obtained with an approximation.
If 0<|\tau_3|<1/3, let z=3 \pi \tau_3^2 and use
\alpha \approx \frac{1+0.2906 z}{z + 0.1882 z^2 + 0.0442 z^3}
If 1/3<|\tau_3|<1, let z=1-|\tau_3| and use
\alpha \approx \frac{0.36067 z - 0.59567 z^2 + 0.25361 z^3}{1-2.78861 z + 2.56096 z^2 -0.77045 z^3}
Given \alpha, then
\gamma=2 \alpha^{-1/2} \mathrm{sign}(\tau_3), \sigma=\lambda_2 \pi^{1/2} \alpha^{1/2} \Gamma(\alpha)/\Gamma(\alpha+1/2), \mu=\lambda_1.
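A base-R sketch of this approximation, assuming sample L-moments l1, l2 and a nonzero t3 are available, is:
# Sketch: Pearson type III moments (mu, sigma, gamm) from sample L-moments
# via the rational approximations above.
z <- if (abs(t3) < 1/3) 3*pi*t3^2 else 1 - abs(t3)
a <- if (abs(t3) < 1/3) {
  (1 + 0.2906*z)/(z + 0.1882*z^2 + 0.0442*z^3)
} else {
  (0.36067*z - 0.59567*z^2 + 0.25361*z^3)/
    (1 - 2.78861*z + 2.56096*z^2 - 0.77045*z^3)
}
gamm <- 2*sign(t3)/sqrt(a)
sigma <- l2*sqrt(pi*a)*gamma(a)/gamma(a + 0.5)
mu <- l1
# mom2par.gamma(mu, sigma, gamm) converts these moments to (alfa, beta, xi);
# compare with par.gamma(l1, l2, t3).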
Lmom.gamma and par.gamma accept input as vectors of equal length. In f.gamma, F.gamma, invF.gamma and rand.gamma the parameters (xi, beta, alfa) must be atomic.
Value
f.gamma gives the density f, F.gamma gives the distribution function F, invF.gamma gives the quantile function x, Lmom.gamma gives the L-moments (\lambda_1, \lambda_2, \tau_3, \tau_4), par.gamma gives the parameters (xi, beta, alfa), and rand.gamma generates random deviates.
mom2par.gamma returns the parameters \alpha, \beta and \xi given the moments \mu, \sigma, \gamma; par2mom.gamma performs the inverse transformation.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
rnorm
, runif
, EXP
, GENLOGIS
, GENPAR
, GEV
, GUMBEL
, KAPPA
, LOGNORM
; DISTPLOTS
, GOFmontecarlo
, Lmoments
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
fac <- factor(annualflows["cod"][,])
split(x,fac)
camp <- split(x,fac)$"45"
ll <- Lmoments(camp)
parameters <- par.gamma(ll[1],ll[2],ll[4])
f.gamma(1800,parameters$xi,parameters$beta,parameters$alfa)
F.gamma(1800,parameters$xi,parameters$beta,parameters$alfa)
invF.gamma(0.7511627,parameters$xi,parameters$beta,parameters$alfa)
Lmom.gamma(parameters$xi,parameters$beta,parameters$alfa)
rand.gamma(100,parameters$xi,parameters$beta,parameters$alfa)
Rll <- regionalLmoments(x,fac); Rll
parameters <- par.gamma(Rll[1],Rll[2],Rll[4])
Lmom.gamma(parameters$xi,parameters$beta,parameters$alfa)
moments <- par2mom.gamma(parameters$alfa,parameters$beta,parameters$xi); moments
mom2par.gamma(moments$mu,moments$sigma,moments$gamm)
Diagnostics of regressions
Description
Diagnostics of the output of lm
, that is used to fit linear models.
Usage
R2.lm (x)
prt.lm (x)
mantel.lm (x, Nperm = 1000)
vif.lm (x)
RMSE.lm (x)
MAE.lm (x)
predinterval.lm (x, level = 0.95)
jackknife1.lm (x)
RMSEjk.lm (x)
MAEjk.lm (x)
Arguments
x |
object of class “lm” (output of ‘lm’) |
Nperm |
number of permutations |
level |
significance level |
Details
mantel.lm is performed under the assumption that the dependent distance matrix is variable, while the independent distance matrices are fixed and measured without error (they are not related to random variables, see Smouse et al., 1986).
Under this assumption, the significance of the regression between distance matrices can be evaluated by simultaneously permuting the rows and corresponding columns of the dependent distance matrix only, while the others are held constant.
Value
R2.lm returns the coefficient of determination R^2 and the adjusted coefficient of determination R^2_{adj} of the regression.
prt.lm returns the probability Pr(>|t|) of the significance test (Student t) on the independent variables.
If the value is 0.06 for a regressor, its coefficient is not significantly different from 0 at the 5% significance level.
mantel.lm returns the probability P of the Mantel test on every variable conditioned on the others.
It substitutes prt.lm when dealing with distance matrices.
If P is, for example, 0.92, the variable should be considered significant only at significance levels greater than 8%.
vif.lm returns the variance inflation factors (VIF) of the independent variables of the regression. If VIF > 5 (or 10) there is a problem of multicollinearity.
RMSE.lm
returns the root mean squared error of the regression.
MAE.lm
returns the mean absolute error of the regression.
predinterval.lm returns the prediction intervals at a specified level in correspondence with the fitted data.
jackknife1.lm returns the predicted values of a jackknife (cross-validation) procedure.
The procedure (remove one observation, fit the model, predict at the removed point) is repeated for all the points.
RMSEjk.lm returns the root mean squared error of the cross-validation (performed with jackknife1.lm).
MAEjk.lm returns the mean absolute error of the cross-validation (performed with jackknife1.lm).
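The leave-one-out idea behind jackknife1.lm can be sketched in a few lines of base R (a hedged illustration on a generic data frame dat with response column y; the package function works directly on the 'lm' object):
# Sketch: leave-one-out (jackknife) predictions for a linear model.
loo.pred <- function(dat) {
  sapply(seq_len(nrow(dat)), function(i) {
    fit <- lm(y ~ ., data=dat[-i, ])     # fit without observation i
    predict(fit, newdata=dat[i, ])       # predict at the removed point
  })
}
# RMSE of cross-validation, as in RMSEjk.lm:
# sqrt(mean((dat$y - loo.pred(dat))^2))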
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
data(hydroSIMN)
D <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
#Dm <- tapply(D,cod,mean)
#datregr <- cbind(Dm,parameters)
datregr <- parameters
regr0 <- lm(Dm ~ .,datregr); summary(regr0)
regr1 <- lm(Dm ~ Am + Hm + Ybar,datregr); summary(regr1)
R2.lm(regr0)
R2.lm(regr1)
prt.lm(regr0)
prt.lm(regr1)
vif.lm(regr0)
vif.lm(regr1)
RMSE.lm(regr0)
RMSE.lm(regr1)
MAE.lm(regr0)
MAE.lm(regr1)
predinterval.lm(regr0)
jackknife1.lm(regr0)
jackknife1.lm(regr1)
RMSEjk.lm(regr0)
RMSEjk.lm(regr1)
MAEjk.lm(regr0)
MAEjk.lm(regr1)
# mantel test on distance matrices
#Y <- AD.dist(D,cod) # it takes some time
#X <- data.frame(apply(datregr[,c("Hm","Ybar")],2,dist))
#dati <- cbind(X)
#modello <- lm(Y ~ Hm + Ybar, dati)
#mantel.lm(modello, Nperm=100)
Series plots
Description
Plots for time-series investigation.
Usage
serieplot (x, t, lim.x=c(min(x),max(x)), lim.t=c(min(t),max(t)),
...)
consistencyplot (t, cod, cex.axis=.8, mark.code=TRUE, ...)
Arguments
x |
data sample |
t |
vector representing time (e.g. years) of data-samples defined with cod |
lim.x , lim.t |
plot limits |
cod |
array that defines the data subdivision among sites |
cex.axis |
dimensions of points and labels |
mark.code |
if TRUE (default) codes of samples are plotted on y axis |
... |
graphical parameters as |
Value
consistencyplot
displays time-series consistency.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
data(hydroSIMN)
annualflows[c(1:10),]
x <- annualflows["dato"][,]
y <- annualflows["anno"][,]
cod <- annualflows["cod"][,]
consistencyplot(y,cod)
for (i in unique(cod)) {
serieplot(x[cod==i], y[cod==i], c(0,max(x)), c(min(y),max(y)),
xlab="", ylab="D [mm]", main=i)
readline()
}
Static plots
Description
Plots from books and articles.
Usage
Lmoment.ratio.diagram (grid=TRUE, ...)
Lspace.HWvsAD (grid=TRUE, ...)
Lspace.limits (grid=TRUE, ...)
Arguments
grid |
should a grid be plotted? |
... |
other arguments |
Value
Lmoment.ratio.diagram plots points corresponding to two-parameter distributions and lines corresponding to three-parameter distributions on the 'L-CA - L-kur' plane.
The distributions are:
E = exponential,
G = gumbel,
L = logistic,
N = normal,
U = uniform,
GLO = generalized logistic,
GEV = generalized extreme-value,
GPA = generalized Pareto,
LN3 = lognormal,
PE3 = Pearson type III.
Lspace.HWvsAD separates the regions, in the 'L-CA - L-CV' space, where the homogeneity tests of Hosking and Wallis (HW) and Anderson-Darling (AD) are preferable.
Lspace.limits displays the limits for regional L-moments in the 'L-CA - L-CV' space.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
EXP
, GENLOGIS
, GENPAR
, LOGNORM
, GUMBEL
, GEV
, P3
Examples
Lmoment.ratio.diagram()
Lspace.HWvsAD()
Lspace.limits()
data(hydroSIMN)
annualflows[c(1:10),]
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
rlm <- regionalLmoments(x,cod)
Lmoment.ratio.diagram()
points(rlm["lcaR"],rlm["lkurR"],col="red",pch=19)
Lspace.HWvsAD()
points(rlm["lcaR"],rlm["lcvR"],col="red",pch=19)
Data-sample
Description
Functions for inversion calculation.
Usage
data(functionsLaio)
Format
Data.frames:
df.kgev is a data.frame with the skewness coefficient (first column) and the corresponding shape parameter of the GEV (second column);
df.polig represents the polygamma function.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
Data-sample
Description
SIMN (Servizio Idrografico e Mareografico Nazionale) flow data samples and catchment parameters.
Usage
data(hydroSIMN)
Format
Data.frames:
annualflows
is the data.frame of the annual flows with 3 columns:
cod
, the code of the station;
anno
, the year;
dato
, the value of the annual flow [mm].
parameters
is the data.frame of parameters of 47 catchments with 16 columns:
cod
, the code of the station;
Dm
, the mean annual streamflow [mm] as reported in the ‘Pubblicazione n. 17’;
Am
, the mean annual rainfall [mm] as reported in the ‘Pubblicazione n. 17’;
S
, area of the plane projection of the drainage basin [km2];
Hm
, mean elevation of the drainage basin [m a.s.l.];
Pm
, mean slope of the basin [%]:
P_m = \arctan(2(H_{med} - H_{min})/\sqrt{S})
where S is the basin area, H_{med} the median elevation and H_{min} the elevation of the closing section.
Pm is a slope measure of a square equivalent basin, and does not account for basin shape;
LLDP
, length of the longest drainage path [km].
The longest drainage path is the longest path between the basin outlet and the most distant point on the basin border, following drainage directions.
Actually the longest drainage path corresponds to the main stream plus the path on the hillslope that connects the stream source to the basin border;
PLDP
, slope of the longest drainage path [%]. Average of the slope values associated to each pixel in the longest drainage path;
S2000
, area above 2000 m a.s.l. [%];
EST
, ‘easting’, sine of the angle between the north and the segment connecting the center of mass with the outlet of the basin.
EST is 1 if the basin is oriented eastward, -1 if it is oriented westward;
NORD
, ‘northing’, cosine of the angle between the north and the segment connecting the center of mass with the outlet of the basin.
NORD is 1 if the basin is oriented northward, -1 if it is oriented southward;
Rc
, circularity ratio: ratio between the basin area and the area of a circle having the same perimeter:
R_c = \frac{4 \pi S}{P^2}
where P
is the watershed perimeter;
Xbar
, longitude [deg] of the centroid of the plane projection of the drainage basin;
Ybar
, latitude [deg] of the centroid of the plane projection of the drainage basin;
IT
, Thornthwaite index: a global moisture index that can be estimated, in its simplest form, as the ratio
I_T=\frac{A_m - ET_0}{ET_0}
where ET_0
is the mean annual potential evapotranspiration on the basin;
IB
, Budyko index: a radiational aridity index expressed as
I_B=\frac{R_n}{\lambda A_m}
where R_n
is the mean annual net radiation and \lambda
is the latent vaporization heat.
Values assumed by I_B
are lower than 1 for humid regions and greater than 1 in arid regions.
meanmonthlyflows
is the data.frame of the mean monthly streamflows [mm] as reported in the ‘Pubblicazione n. 17’.
It has 13 columns because the first one, cod
, is the code of the station.
monthlyflows
is the data.frame of the monthly streamflows [mm] with 4 columns:
cod
, the code of the station;
anno
, the year;
mese
, the month;
dato
, the value of the annual flow [mm].
Note
For information on the package and the Author, and for all the references, see nsRFA
.
Examples
data(hydroSIMN)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
split(x,cod)
sapply(split(x,cod),mean)
sapply(split(x,cod),median)
sapply(split(x,cod),quantile)
sapply(split(x,cod),Lmoments)
parameters
Sample moments
Description
moments
provides the estimate of the first 4 moment-statistics of a sample.
Usage
moments (x)
CV (x)
skew (x)
kurt (x)
Arguments
x |
vector representing a data-sample |
Details
Skewness and kurtosis are defined as:
skew = n^{-1} \frac{\sum_{i=1}^n \left(x_i - mean(x)\right)^3}{sd(x)^{3}}
kurt = n^{-1} \frac{\sum_{i=1}^n \left(x_i - mean(x)\right)^4}{sd(x)^{4}} - 3
where n is the size of x.
See https://en.wikipedia.org/wiki/Skewness and https://en.wikipedia.org/wiki/Kurtosis for additional information.
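Written out in base R, these definitions read as follows (a sketch; sd is R's usual estimator with the n-1 denominator, as in the formulas above):
# Sketch: skewness and excess kurtosis as defined above.
skew.def <- function(x) mean((x - mean(x))^3)/sd(x)^3
kurt.def <- function(x) mean((x - mean(x))^4)/sd(x)^4 - 3
x <- rnorm(30, 10, 2)
c(skew=skew.def(x), kurt=kurt.def(x))    # compare with skew(x), kurt(x)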
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
Examples
x <- rnorm(30,10,2)
moments(x)
data(hydroSIMN)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
sapply(split(x,cod),moments)
Internal functions
Description
User-level objects which are for ‘internal’ use only:
ksampleA2 provides the Anderson-Darling test statistic;
nonparboot provides a non-parametric bootstrap.
Usage
ksampleA2 (x,cod)
nonparboot (z,n=length(z))
Arguments
x |
vector representing data from many samples defined with cod |
cod |
array that defines the data subdivision among sites |
z |
data sample, used for bootstrap |
n |
length of sample (extracted in |
Note
For information on the package and the Author, and for all the references, see nsRFA
.
Region of influence
Description
Formation of clusters for Regional Frequency Analysis: region of influence (Burn, 1990).
Usage
roi (p.ungauged, p.gauged, cod.p, x=NULL, cod=NULL)
roi.hom (p.ungauged, p.gauged, cod.p, x, cod,
test="HW", limit=2, Nsim=500, index=2)
roi.st.year (p.ungauged, p.gauged, cod.p, x, cod,
test="HW", station.year=500, Nsim=500, index=2)
Arguments
x |
vector representing data from many samples defined with cod |
cod |
array that defines the data subdivision among sites |
index |
if |
p.ungauged |
parameters of the ungauged site (1 row) |
p.gauged |
parameters of gauged sites |
cod.p |
code of gauged sites |
test |
homogeneity test to apply: |
limit |
limit over which regions must be considered heterogeneous: for example 2 for |
Nsim |
number of simulations in |
station.year |
number of station years to form the region |
Details
The Euclidean distance is used.
Given p different classification variables, the distance between two elements i and j is:
d_{i j} = \sqrt{\frac{1}{p} \sum_{h=1}^{p} (x_{h i} - x_{h j})^2}
where x_{h i} is the value of the h-th variable of the i-th element.
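As a small illustration, this distance (with the 1/p factor above) can be computed between a candidate site and each gauged site on standardized descriptors (a sketch; roi additionally orders the sites and can attach sample statistics):
# Sketch: 1/p-scaled Euclidean distances from one site to all the others.
data(hydroSIMN)
P <- scale(parameters[, 3:5])       # standardized classification variables
pu <- P[5, ]; pg <- P[-5, ]         # site 5 as 'ungauged', rest gauged
d <- sqrt(rowMeans(sweep(pg, 2, pu)^2))
sort(d)[1:5]                        # the five nearest sites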
Value
roi returns the ‘region of influence’ for the site defined with p.ungauged.
It returns the gauged sites ordered according to the Euclidean distance from the site of interest (the distance is evaluated in the space defined by the parameters p.ungauged and p.gauged).
If x=NULL and cod=NULL (default), a data.frame with the ordered sites and the distances from the site of interest is returned.
If x and cod are provided, the data.frame will also contain statistics of the samples (number of data n and L-moments).
roi.hom returns the ‘region of influence’ for the site defined with p.ungauged.
It returns the codes of the gauged sites that form a homogeneous region according to the Hosking and Wallis "HW" or Anderson-Darling "AD" tests.
The region is formed using distances in the space defined by the parameters p.ungauged and p.gauged.
roi.st.year returns the ‘region of influence’ for the site defined with p.ungauged.
It returns the codes of the gauged sites that form a region, and the result of the homogeneity tests, according to the station-year criterion.
It also returns the similarity ranking factor S_i, the weights w_i and the regional L-moments as evaluated in the Flood Estimation Handbook (Robson and Reed, 1999).
The region is formed using distances in the space defined by the parameters p.ungauged and p.gauged.
Note
For information on the package and the Author, and for all the references, see nsRFA
.
See Also
traceWminim, AD.dist, HOMTESTS for the definition of the Hosking and Wallis "HW" and Anderson-Darling "AD" tests.
Examples
data(hydroSIMN)
parameters
summary(parameters)
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
roi(parameters[5,3:5],parameters[-5,3:5],parameters[-5,1])
roi(parameters[5,3:5],parameters[-5,3:5],parameters[-5,1],x,cod)
# roi.hom
#roi.hom(parameters[5,3:5],parameters[-5,3:5],parameters[-5,1],x,cod)
# it takes some time
#roi.hom(parameters[5,3:5],parameters[-5,3:5],parameters[-5,1],x,cod,
# test="AD",limit=.95) # it takes some time
#roi.hom(parameters[8,3:5],parameters[-8,3:5],
# parameters[-8,1],x,cod) # it takes some time
# roi.st.year
roi.st.year(parameters[5,3:5],parameters[-5,3:5],
parameters[-5,1],x,cod)
roi.st.year(parameters[5,3:5],parameters[-5,3:5],parameters[-5,1],
x,cod,test="AD",station.year=100)
Cluster analysis: disjoint regions
Description
Formation of disjoint regions for Regional Frequency Analysis.
Usage
traceWminim (X, centers)
sumtraceW (clusters, X)
nearest (clusters, X)
Arguments
X |
a numeric matrix of characteristics, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns) |
centers |
the number of clusters |
clusters |
a numeric vector containing the subdivision of |
Details
The Euclidean distance is used.
Given p different classification variables, the distance between two elements i and j is:
d_{i j} = \sqrt{\frac{1}{p} \sum_{h=1}^{p} (x_{h i} - x_{h j})^2}
where x_{h i} is the value of the h-th variable of the i-th element.
The function traceWminim is a composition of a hierarchical algorithm, Ward's (1963) one, and an optimisation procedure consisting of the minimisation of:
W = \sum_{i=1}^k \left( \sum_{j=1}^{n_i} \delta_{i j}^2 \right)
where k is the number of clusters (obtained initially with Ward's algorithm), n_i is the number of sites in the i-th cluster and \delta_{i j} is the Euclidean distance between the j-th element of the i-th group and the centre of mass of the i-th cluster.
W is calculated with sumtraceW.
The optimisation consists of moving a site from one cluster to another whenever the move makes W decrease.
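A minimal sketch of how W can be computed from a given partition (illustration only, under the definitions above; in practice the package's sumtraceW should be used):
X <- matrix(rnorm(40), nrow=10)    # hypothetical: 10 sites, 4 variables
cl <- rep(1:2, each=5)             # a hypothetical partition in 2 clusters
W <- sum(sapply(split(seq_len(nrow(X)), cl), function(idx) {
   centre <- colMeans(X[idx, , drop=FALSE])        # centre of mass of the cluster
   sum(sweep(X[idx, , drop=FALSE], 2, centre)^2)   # sum of squared distances
}))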
Value
traceWminim
gives a vector defining the subdivision of the elements characterized by X into n=centers clusters;
sumtraceW
gives W (it is used by traceWminim);
nearest
gives, for each cluster, the site nearest to the centre of mass of the cluster (it is used by traceWminim).
Note
For information on the package and the Author, and for all the references, see nsRFA.
Examples
data(hydroSIMN)
parameters
summary(parameters)
# traceWminim
param <- parameters[c("Hm","Ybar")]
n <- dim(param)[1]; k <- dim(param)[2]
param.norm <- (param - matrix(apply(param,2,mean),nrow=n,ncol=k,
byrow=TRUE))/matrix(apply(param,2,sd),
nrow=n,ncol=k,byrow=TRUE)
clusters <- traceWminim(param.norm,4)
names(clusters) <- parameters["cod"][,]
clusters
annualflows
summary(annualflows)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
fac <- factor(annualflows["cod"][,],
levels=names(clusters[clusters==1]))
x1 <- annualflows[!is.na(fac),"dato"]
cod1 <- annualflows[!is.na(fac),"cod"]
#HW.tests(x1,cod1) # it takes some time
fac <- factor(annualflows["cod"][,],
levels=names(clusters[clusters==3]))
x3 <- annualflows[!is.na(fac),"dato"]
cod3 <- annualflows[!is.na(fac),"cod"]
#HW.tests(x3,cod3) # it takes some time
Exact variance structure of sample L-moments
Description
varLmoments
provides distribution-free unbiased estimators of the variances and covariances of sample L-moments.
Usage
varLmoments (x, matrix=TRUE)
varLCV (x)
varLCA (x)
varLkur (x)
Arguments
x |
vector representing a data-sample |
matrix |
if TRUE (default) the whole covariance matrix of the sample L-moments is returned, if FALSE only their variances |
Details
The estimation of the exact variance structure of sample L-moments is based on Elamir and Seheult (2004).
Value
varLmoments
gives the matrix of unbiased estimates of the variance structure of sample L-moments: this is a 4x4 matrix containing var(l_1), var(l_2), var(l_3), var(l_4) on the main diagonal, and the corresponding covariances elsewhere (cov(l_1,l_2), cov(l_1,l_3), etc.);
varLCV
gives the unbiased estimate of the variance of the sample coefficient of L-variation of x;
varLCA
gives the unbiased estimate of the variance of the sample L-skewness of x;
varLkur
gives the unbiased estimate of the variance of the sample L-kurtosis of x.
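For example, these variance estimates can be turned into rough standard errors (a sketch assuming approximate normality of the estimators, and that Lmoments returns the sample ratios under the names used in its documentation, e.g. "lcv"):
x <- rnorm(50, 10, 2)
lcv <- Lmoments(x)["lcv"]    # sample coefficient of L-variation
se <- sqrt(varLCV(x))        # its estimated standard error
lcv + c(-2, 2)*se            # rough 95% confidence interval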
Note
For information on the package and the Author, and for all the references, see nsRFA.
Examples
x <- rnorm(30,10,2)
varLmoments(x)
varLmoments(x, FALSE)
varLCV(x)
varLCA(x)
varLkur(x)
data(hydroSIMN)
x <- annualflows["dato"][,]
cod <- annualflows["cod"][,]
dvarLmom <- function(x) {diag(varLmoments(x))}
sapply(split(x,cod),dvarLmom)