Type: | Package |
Title: | P-Values for Classification |
Version: | 1.4.1 |
Date: | 2025-04-30 |
Imports: | Matrix |
Description: | Computes nonparametric p-values for the potential class memberships of new observations as well as cross-validated p-values for the training data. The p-values are based on permutation tests applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the Gaussian model, 'k nearest neighbors', 'weighted nearest neighbors' or 'penalized logistic regression'. Additionally, it provides graphical displays and quantitative analyses of the p-values. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
LazyLoad: | yes |
NeedsCompilation: | no |
Packaged: | 2025-04-30 06:57:23 UTC; znn1 |
Author: | Niki Zumbrunnen [aut, cre], Lutz Duembgen [aut] |
Maintainer: | Niki Zumbrunnen <niki.zumbrunnen@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-04-30 07:30:01 UTC |
P-Values for Classification
Description
Computes nonparametric p-values for the potential class memberships of new observations as well as cross-validated p-values for the training data. The p-values are based on permutation tests applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the Gaussian model, 'k nearest neighbors', 'weighted nearest neighbors' or 'penalized logistic regression'.
Additionally, it provides graphical displays and quantitative analyses of the p-values.
Details
Use cvpvs to compute cross-validated p-values, pvs to classify new observations and analyze.pvs to analyze the p-values.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
cv <- cvpvs(X,Y)
analyze.pvs(cv,Y)
pv <- pvs(NewX, X, Y, method = 'k', k = 10)
analyze.pvs(pv)
Analyze P-Values
Description
Graphical displays and quantitative analyses of a matrix of p-values.
Usage
analyze.pvs(pv, Y = NULL, alpha = 0.05, roc = TRUE, pvplot = TRUE, cex = 1)
Arguments
pv |
matrix of p-values, e.g. as returned by cvpvs or pvs. |
Y |
optional. Vector indicating the classes which the observations belong to. |
alpha |
test level, i.e. 1 - confidence level. |
roc |
logical. If |
pvplot |
logical. If |
cex |
A numerical value giving the amount by which plotting text should be magnified relative to the default. |
Details
Displays the p-values graphically, i.e. it plots a rectangle for each p-value. The area of this rectangle is proportional to the p-value. The rectangle is drawn blue if the p-value is greater than alpha and red otherwise.
If Y is not NULL, i.e. the class memberships of the observations are known (e.g. cross-validated p-values), then it additionally plots the empirical ROC curves and prints some empirical conditional inclusion probabilities I(b,\theta) and/or pattern probabilities P(b,S). Precisely, I(b,\theta) is the proportion of training observations of class b whose p-value for class \theta is greater than \alpha, while P(b,S) is the proportion of training observations of class b such that the (1 - \alpha)-prediction region equals S.
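The two quantities defined above can be illustrated in base R. The following is a minimal sketch on a made-up p-value matrix, not the code used by analyze.pvs; the values in PV, the class names and the object names incl and patt are assumptions chosen for illustration.

```r
# Toy p-value matrix (hypothetical values): rows are observations, columns are classes.
PV <- matrix(c(0.80, 0.03, 0.01,
               0.40, 0.20, 0.02,
               0.04, 0.90, 0.10,
               0.02, 0.60, 0.30,
               0.01, 0.07, 0.70,
               0.03, 0.02, 0.50),
             ncol = 3, byrow = TRUE,
             dimnames = list(NULL, c("a", "b", "c")))
Y <- factor(c("a", "a", "b", "b", "c", "c"))
alpha <- 0.05

# Class theta belongs to the (1 - alpha)-prediction region if its p-value exceeds alpha.
inregion <- PV > alpha

# I(b, theta): proportion of class-b observations whose p-value for theta exceeds alpha.
# Rows index the true class b, columns index theta.
incl <- rowsum(inregion * 1, group = Y) / as.vector(table(Y))

# P(b, S): proportion of class-b observations whose prediction region equals S.
region <- apply(inregion, 1, function(r) paste(colnames(PV)[r], collapse = ","))
patt <- prop.table(table(Y, region), margin = 1)

print(incl)
print(patt)
```

Each row of incl and of patt refers to one true class, so the diagonal of incl shows how often the true class is contained in the prediction region.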
Value
T |
Table containing empirical conditional inclusion and/or pattern probabilities for each class |
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
cv <- cvpvs(X,Y)
analyze.pvs(cv,Y)
pv <- pvs(NewX, X, Y, method = 'k', k = 10)
analyze.pvs(pv)
Medical Dataset
Description
This data set, collected by Dr. Bürk at the university hospital in Lübeck, contains data on 21556 surgeries performed in the late 1990s. Besides mortality and morbidity, it contains 21 variables describing the condition of the patient and the surgery.
Usage
data(buerk)
Format
A data frame with 21556 observations on the following 23 variables.
age
Age in years
sex
Sex (1 = female, 0 = male)
asa
ASA-Score (American Society of Anesthesiologists), describes the physical condition on an ordinal scale:
1 = A normal healthy patient
2 = A patient with mild systemic disease
3 = A patient with severe systemic disease
4 = A patient with severe systemic disease that is a constant threat to life
5 = A moribund patient who is not expected to survive without the operation
6 = A declared brain-dead patient whose organs are being removed for donor purposes
rf_cer
Risk factor: cerebral (1 = yes, 0 = no)
rf_car
Risk factor: cardiovascular (1 = yes, 0 = no)
rf_pul
Risk factor: pulmonary (1 = yes, 0 = no)
rf_ren
Risk factor: renal (1 = yes, 0 = no)
rf_hep
Risk factor: hepatic (1 = yes, 0 = no)
rf_imu
Risk factor: immunological (1 = yes, 0 = no)
rf_metab
Risk factor: metabolic (1 = yes, 0 = no)
rf_noc
Risk factor: uncooperative, unreliable (1 = yes, 0 = no)
e_malig
Etiology: malignant (1 = yes, 0 = no)
e_vascu
Etiology: vascular (1 = yes, 0 = no)
antibio
Antibiotics therapy (1 = yes, 0 = no)
op
Surgery indicated (1 = yes, 0 = no)
opacute
Emergency operation (1 = yes, 0 = no)
optime
Surgery time in minutes
opsepsis
Septic surgery (1 = yes, 0 = no)
opskill
Experienced surgeon, i.e. senior physician (1 = yes, 0 = no)
blood
Blood transfusion necessary (1 = yes, 0 = no)
icu
Intensive care necessary (1 = yes, 0 = no)
mortal
Mortality (1 = yes, 0 = no)
morb
Morbidity (1 = yes, 0 = no)
Source
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
Cross-Validated P-Values
Description
Computes cross-validated nonparametric p-values for the potential class memberships of the training data.
Usage
cvpvs(X, Y, method = c('gaussian','knn','wnn', 'logreg'), ...)
Arguments
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
method |
one of the following methods: |
... |
further arguments depending on the method (see |
Details
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the Gaussian model, 'k nearest neighbors', 'weighted nearest neighbors' or multicategory logistic regression with l1-penalization (see cvpvs.gaussian, cvpvs.knn, cvpvs.wnn, cvpvs.logreg) with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
Value
PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
cvpvs.gaussian, cvpvs.knn, cvpvs.wnn, cvpvs.logreg, pvs, analyze.pvs
Examples
X <- iris[,1:4]
Y <- iris[,5]
cvpvs(X,Y,method='k',k=10,distance='d')
Cross-Validated P-Values (Gaussian)
Description
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on a plug-in statistic for the standard Gaussian model. The latter means that the conditional distribution of X, given Y=y, is Gaussian with mean depending on y and a global covariance matrix.
Usage
cvpvs.gaussian(X, Y, cova = c('standard', 'M', 'sym'))
Arguments
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
cova |
estimator for the covariance matrix: |
Details
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the standard Gaussian model with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
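The ingredients of the standard Gaussian model named above (estimated priors N(b)/n, class-wise means and one global covariance matrix) can be computed in base R. This is a minimal sketch assuming the usual pooled within-class covariance estimator; it is not the package's internal code, the names prior, mu and Sigma are chosen here for illustration, and the 'M' and 'sym' estimators are not shown.

```r
X <- as.matrix(iris[, 1:4])
Y <- iris[, 5]
n <- length(Y)

# Estimated prior probabilities N(b)/n.
prior <- table(Y) / n

# Class means: one row per class, one column per feature.
mu <- rowsum(X, Y) / as.vector(table(Y))

# Global covariance matrix: centre each observation at its own class mean
# and pool the residuals over all classes (assumed estimator).
R <- X - mu[as.character(Y), ]
Sigma <- crossprod(R) / (n - nlevels(Y))

print(prior)
print(mu)
print(Sigma)
```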
Value
PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
cvpvs, cvpvs.knn, cvpvs.wnn, cvpvs.logreg
Examples
X <- iris[, 1:4]
Y <- iris[, 5]
cvpvs.gaussian(X, Y, cova = 'standard')
Cross-Validated P-Values (k Nearest Neighbors)
Description
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on 'k nearest neighbors'.
Usage
cvpvs.knn(X, Y, k = NULL, distance = c('euclidean', 'ddeuclidean',
'mahalanobis'), cova = c('standard', 'M', 'sym'))
Arguments
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
k |
number of nearest neighbors. If |
distance |
the distance measure: |
cova |
estimator for the covariance matrix: |
Details
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'k nearest neighbors' with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
If k is a vector, the program searches for the best k. To determine the best k for the p-value PV[i,b], the class label of the training observation X[i,] is set temporarily to b and then for all training observations with Y[j] != b the proportion of the k nearest neighbors of X[j,] belonging to class b is computed. Then the k which minimizes the sum of these values is chosen.
If k = NULL, it is set to 2:ceiling(length(Y)/2).
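The selection rule for k described above can be sketched in base R for a single pair (i, b). This is an illustration under stated assumptions, not the package's implementation: Euclidean distances are used, each observation counts as its own nearest neighbor, Y[j] != b is evaluated on the temporarily relabeled classes, and the helper criterion is a hypothetical name.

```r
X <- as.matrix(iris[, 1:4])
Y <- iris[, 5]

# Default grid used when k = NULL.
k.default <- 2:ceiling(length(Y) / 2)

i <- 1              # observation whose p-value is being computed
b <- "versicolor"   # hypothesized class for observation i

# Temporarily set the class label of observation i to b.
Ytmp <- as.character(Y)
Ytmp[i] <- b

# Euclidean distances between all training observations.
D <- as.matrix(dist(X))
others <- which(Ytmp != b)

# Sum, over all observations not in class b, of the proportion of their
# k nearest neighbors that belong to class b.
criterion <- function(k) {
  sum(vapply(others, function(j) {
    nn <- order(D[j, ])[seq_len(k)]   # includes j itself (assumption)
    mean(Ytmp[nn] == b)
  }, numeric(1)))
}

k.grid <- c(5, 10, 15)
crit <- vapply(k.grid, criterion, numeric(1))
k.best <- k.grid[which.min(crit)]
print(data.frame(k = k.grid, criterion = crit))
```

A small criterion value means that few observations from other classes have class-b observations among their neighbors, which is why the minimizing k is preferred.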
Value
PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If k is a vector or NULL, PV has an attribute "opt.k", which is a matrix and opt.k[i,b] is the best k for observation X[i,] and class b (see section 'Details'). opt.k[i,b] is used to compute the p-value for observation X[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
cvpvs, cvpvs.gaussian, cvpvs.wnn, cvpvs.logreg
Examples
X <- iris[, 1:4]
Y <- iris[, 5]
cvpvs.knn(X, Y, k = c(5, 10, 15))
Cross-Validated P-Values (Penalized Multicategory Logistic Regression)
Description
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on 'penalized logistic regression'.
Usage
cvpvs.logreg(X, Y, tau.o=10, find.tau=FALSE, delta=2, tau.max=80, tau.min=1,
pen.method = c("vectors", "simple", "none"), progress = TRUE)
Arguments
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
tau.o |
the penalty parameter (see section 'Details' below). |
find.tau |
logical. If TRUE the program searches for the best |
delta |
factor for the penalty parameter. Should be greater than 1. Only needed if |
tau.max |
maximal penalty parameter considered. Only needed if |
tau.min |
minimal penalty parameter considered. Only needed if |
pen.method |
the method of penalization (see section 'Details' below). |
progress |
optional parameter for reporting the status of the computations. |
Details
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] equals b, based on the remaining training observations.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'penalized logistic regression'. This means that the conditional probability of Y = y, given X = x, is assumed to be proportional to exp(a_y + b_y^T x). The parameters a_y, b_y are estimated via penalized maximum log-likelihood. The penalization is either a weighted sum of the Euclidean norms of the vectors (b_1[j],b_2[j],\ldots,b_L[j]) (pen.method=='vectors') or a weighted sum of all moduli |b_y[j]| (pen.method=='simple'). The weights are given by tau.o times the sample standard deviation (within groups) of the j-th components of the feature vectors.
In case of pen.method=='none', no penalization is used, but this option may be unstable.
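To make the two penalties concrete, the sketch below evaluates both for a made-up coefficient matrix. It is an illustration only: the matrix B is hypothetical, the within-group standard deviation is assumed to be the pooled estimate with denominator n minus the number of classes, and none of this is taken from the package's source.

```r
X <- as.matrix(iris[, 1:4])
Y <- iris[, 5]
tau.o <- 10

# Within-group sample standard deviation of each feature (assumed pooled estimate).
mu <- rowsum(X, Y) / as.vector(table(Y))
R <- X - mu[as.character(Y), ]
s <- sqrt(colSums(R^2) / (nrow(X) - nlevels(Y)))
w <- tau.o * s   # one weight per feature j

# Hypothetical slope coefficients: row y holds b_y, column j holds component j.
B <- rbind(setosa     = c( 1, 0, -2,  0.0),
           versicolor = c( 0, 0,  1,  0.5),
           virginica  = c(-1, 0,  1, -0.5))

# 'vectors': weighted sum of the Euclidean norms of (b_1[j], ..., b_L[j]).
pen.vectors <- sum(w * sqrt(colSums(B^2)))

# 'simple': weighted sum of all moduli |b_y[j]|.
pen.simple <- sum(w * colSums(abs(B)))

print(c(vectors = pen.vectors, simple = pen.simple))
```

Because the Euclidean norm of a vector never exceeds the sum of its moduli, the 'vectors' penalty is at most as large as the 'simple' penalty for the same coefficients and weights.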
If find.tau == TRUE, the program searches for the best penalty parameter. To determine the best parameter tau for the p-value PV[i,b], the class label of the training observation X[i,] is set temporarily to b and then for all training observations with Y[j] != b the estimated probability of X[j,] belonging to class b is computed. Then the tau which minimizes the sum of these values is chosen. First, tau.o is compared with tau.o*delta. If tau.o*delta is better, it is compared with tau.o*delta^2, etc. The maximal parameter considered is tau.max. If tau.o is better than tau.o*delta, it is compared with tau.o*delta^-1, etc. The minimal parameter considered is tau.min.
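The stepwise search over penalty parameters can be sketched as follows. This is a simplified reading of the description above, not the package's code: search_tau and crit are hypothetical names, crit stands for the sum of estimated class-b probabilities as a function of tau, the search is assumed to stop at the first candidate that does not improve, and ties are resolved by searching downwards.

```r
search_tau <- function(crit, tau.o = 10, delta = 2, tau.max = 80, tau.min = 1) {
  # Walk from tau.o in one direction while the criterion keeps decreasing
  # and the candidate stays inside the allowed range.
  walk <- function(step, in.range) {
    best <- tau.o
    best.val <- crit(best)
    cand <- best * step
    while (in.range(cand) && (val <- crit(cand)) < best.val) {
      best <- cand
      best.val <- val
      cand <- cand * step
    }
    best
  }
  # First try larger penalties: tau.o*delta, tau.o*delta^2, ...
  up <- walk(delta, function(t) t <= tau.max)
  # If tau.o*delta was not better, try tau.o/delta, tau.o/delta^2, ...
  if (up > tau.o) up else walk(1 / delta, function(t) t >= tau.min)
}

# Toy criteria with a known minimum, for illustration only.
tau.up <- search_tau(function(tau) (log(tau) - log(40))^2)
tau.down <- search_tau(function(tau) (tau - 2.5)^2)
print(c(tau.up, tau.down))
```

With the default arguments, the candidates visited are 10, 20, 40, 80 in the upward direction and 10, 5, 2.5, 1.25 in the downward direction.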
Value
PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b, based on the remaining training observations.
If find.tau == TRUE, PV has an attribute "tau.opt", which is a matrix and tau.opt[i,b] is the best tau for observation X[i,] and class b (see section 'Details'). tau.opt[i,b] is used to compute the p-value for observation X[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
cvpvs, cvpvs.gaussian, cvpvs.knn, cvpvs.wnn
Examples
## Not run:
X <- iris[, 1:4]
Y <- iris[, 5]
cvpvs.logreg(X, Y, tau.o=1, pen.method="vectors",progress=TRUE)
## End(Not run)
# A bigger data example: Buerk's hospital data.
## Not run:
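data(buerk)
# Columns 1-21 are the covariates; column 22 is the mortality indicator.
X.raw <- as.matrix(buerk[,1:21])
Y.raw <- buerk[,22]
n0.raw <- sum(1 - Y.raw)   # number of survivors
n1 <- sum(Y.raw)           # number of deaths
n0 <- 3*n1                 # keep three survivors per death
X0 <- X.raw[Y.raw==0,]
X1 <- X.raw[Y.raw==1,]
# Random subsample of the survivors; tmpi1 is a permutation that is not used below.
tmpi0 <- sample(1:n0.raw,size=n0,replace=FALSE)
tmpi1 <- sample(1:n1 ,size=n1,replace=FALSE)
X <- rbind(X0[tmpi0,],X1)
Y <- c(rep(1,n0),rep(2,n1))   # class 1 = survived, class 2 = died
str(X)
str(Y)
PV <- cvpvs.logreg(X,Y,
tau.o=5,pen.method="v",progress=TRUE)
analyze.pvs(Y=Y,pv=PV,pvplot=FALSE)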
## End(Not run)
Cross-Validated P-Values (Weighted Nearest Neighbors)
Description
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on 'weighted nearest neighbors'.
Usage
cvpvs.wnn(X, Y, wtype = c('linear', 'exponential'), W = NULL,
tau = 0.3, distance = c('euclidean', 'ddeuclidean',
'mahalanobis'), cova = c('standard', 'M', 'sym'))
Arguments
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
wtype |
type of the weight function (see section 'Details' below). |
W |
vector of the (decreasing) weights (see section 'Details' below). |
tau |
parameter of the weight function. If |
distance |
the distance measure: |
cova |
estimator for the covariance matrix: |
Details
Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] equals b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'weighted nearest neighbors' with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
The (decreasing) weights for the observations can either be given as an n-dimensional vector W or (if W = NULL) one of the following weight functions can be used:
linear:
W_i = \max(1 - \frac{i/n}{\tau}, 0),
exponential:
W_i = (1 - \frac{i}{n})^\tau.
If tau is a vector, the program searches for the best tau. To determine the best tau for the p-value PV[i,b], the class label of the training observation X[i,] is set temporarily to b and then for all training observations with Y[j] != b the sum of the weights of the observations belonging to class b is computed. Then the tau which minimizes the sum of these values is chosen.
If W = NULL and tau = NULL, tau is set to seq(0.1,0.9,0.1) if wtype = "l" and to c(1,5,10,20) if wtype = "e".
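The two weight functions from this section can be evaluated directly in base R. The sketch below is illustrative; the helper names w.linear and w.exponential and the choice n = 10 are assumptions, not part of the package.

```r
# Linear weights: W_i = max(1 - (i/n)/tau, 0).
w.linear <- function(i, n, tau) pmax(1 - (i / n) / tau, 0)

# Exponential weights: W_i = (1 - i/n)^tau.
w.exponential <- function(i, n, tau) (1 - i / n)^tau

n <- 10
i <- seq_len(n)   # rank of the neighbor, 1 = nearest

W.lin <- w.linear(i, n, tau = 0.3)
W.exp <- w.exponential(i, n, tau = 5)
print(round(rbind(linear = W.lin, exponential = W.exp), 3))
```

With the linear weights, only neighbors with rank i below tau*n receive positive weight, so tau acts as the fraction of the sample that is taken into account. With the exponential weights, larger values of tau concentrate the weight on the nearest neighbors.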
Value
PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If tau is a vector or NULL (and W = NULL), PV has an attribute "opt.tau", which is a matrix and opt.tau[i,b] is the best tau for observation X[i,] and class b (see section 'Details'). "opt.tau" is used to compute the p-values.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
cvpvs, cvpvs.gaussian, cvpvs.knn, cvpvs.logreg
Examples
X <- iris[, 1:4]
Y <- iris[, 5]
cvpvs.wnn(X, Y, wtype = 'l', tau = 0.5)
P-Values to Classify New Observations
Description
Computes nonparametric p-values for the potential class memberships of new observations.
Usage
pvs(NewX, X, Y, method = c('gaussian', 'knn', 'wnn', 'logreg'), ...)
Arguments
NewX |
data matrix consisting of one or several new observations (row vectors) to be classified. |
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
method |
one of the following methods: |
... |
further arguments depending on the method (see |
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the Gaussian model, 'k nearest neighbors', 'weighted nearest neighbors' or multicategory logistic regression with l1-penalization (see pvs.gaussian, pvs.knn, pvs.wnn, pvs.logreg) with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
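The general shape of such a p-value for one new observation and one hypothesized class b can be sketched in base R. This is a deliberately simplified illustration and not the statistic used by pvs: instead of the estimated likelihood ratio it uses the squared Mahalanobis distance to the class-b mean as a stand-in score, and the names x.new, score and p.value are chosen here for illustration.

```r
X <- as.matrix(iris[c(1:49, 51:99, 101:149), 1:4])
Y <- iris[c(1:49, 51:99, 101:149), 5]
x.new <- as.matrix(iris[50, 1:4])
b <- "setosa"   # hypothesized class of the new observation

# Add the new observation to the training data with class label b.
Xa <- rbind(X, x.new)
Ya <- c(as.character(Y), b)

# Stand-in score (assumption): squared Mahalanobis distance to the class-b mean,
# based on the pooled covariance of the augmented data. Large values speak against b.
mu <- rowsum(Xa, Ya) / as.vector(table(Ya))
Sigma <- crossprod(Xa - mu[Ya, ]) / (nrow(Xa) - nrow(mu))
score <- mahalanobis(Xa, center = mu[b, ], cov = Sigma)

# p-value: proportion of class-b observations, the new one included,
# whose score is at least as large as that of the new observation.
in.b <- Ya == b
p.value <- mean(score[in.b] >= score[length(score)])
print(p.value)
```

Since the new observation is counted among the class-b observations, the p-value cannot fall below 1/(N(b) + 1).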
Value
PV is a matrix containing the p-values. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs.gaussian, pvs.knn, pvs.wnn, pvs.logreg, cvpvs, analyze.pvs
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs(NewX, X, Y, method = 'k', k = 10)
P-Values to Classify New Observations (Gaussian)
Description
Computes nonparametric p-values for the potential class memberships of new observations. The p-values are based on a plug-in statistic for the standard Gaussian model. The latter means that the conditional distribution of X, given Y=y, is Gaussian with mean depending on y and a global covariance matrix.
Usage
pvs.gaussian(NewX, X, Y, cova = c('standard', 'M', 'sym'))
Arguments
NewX |
data matrix consisting of one or several new observations (row vectors) to be classified. |
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
cova |
estimator for the covariance matrix: |
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using a plug-in statistic for the standard Gaussian model with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
Value
PV is a matrix containing the p-values. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs, pvs.knn, pvs.wnn, pvs.logreg
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs.gaussian(NewX, X, Y, cova = 'standard')
P-Values to Classify New Observations (k Nearest Neighbors)
Description
Computes nonparametric p-values for the potential class memberships of new observations. The p-values are based on 'k nearest neighbors'.
Usage
pvs.knn(NewX, X, Y, k = NULL, distance = c('euclidean', 'ddeuclidean',
'mahalanobis'), cova = c('standard', 'M', 'sym'))
Arguments
NewX |
data matrix consisting of one or several new observations (row vectors) to be classified. |
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
k |
number of nearest neighbors. If |
distance |
the distance measure: |
cova |
estimator for the covariance matrix: |
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'k nearest neighbors' with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
If k is a vector, the program searches for the best k. To determine the best k for the p-value PV[i,b], the new observation NewX[i,] is added to the training data with class label b and then for all training observations with Y[j] != b the proportion of the k nearest neighbors of X[j,] belonging to class b is computed. Then the k which minimizes the sum of these values is chosen.
If k = NULL, it is set to 2:ceiling(length(Y)/2).
Value
PV is a matrix containing the p-values. Precisely, for each new observation NewX[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If k is a vector or NULL, PV has an attribute "opt.k", which is a matrix and opt.k[i,b] is the best k for observation NewX[i,] and class b (see section 'Details'). opt.k[i,b] is used to compute the p-value for observation NewX[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs, pvs.gaussian, pvs.wnn, pvs.logreg
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs.knn(NewX, X, Y, k = c(5, 10, 15))
P-Values to Classify New Observations (Penalized Multicategory Logistic Regression)
Description
Computes nonparametric p-values for the potential class memberships of new observations. The p-values are based on 'penalized logistic regression'.
Usage
pvs.logreg(NewX, X, Y, tau.o = 10, find.tau=FALSE, delta=2, tau.max=80, tau.min=1,
a0 = NULL, b0 = NULL,
pen.method = c('vectors', 'simple', 'none'),
progress = FALSE)
Arguments
NewX |
data matrix consisting of one or several new observations (row vectors) to be classified. |
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
tau.o |
the penalty parameter (see section 'Details' below). |
find.tau |
logical. If TRUE the program searches for the best |
delta |
factor for the penalty parameter. Should be greater than 1. Only needed if |
tau.max |
maximal penalty parameter considered. Only needed if |
tau.min |
minimal penalty parameter considered. Only needed if |
a0 , b0 |
optional starting values for logistic regression. |
pen.method |
the method of penalization (see section 'Details' below). |
progress |
optional parameter for reporting the status of the computations. |
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,]
and each class b
the number PV[i,b]
is a p-value for the null hypothesis that Y[i]
equals b
.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'penalized logistic regression'. This means, the conditional probability of Y = y
, given X = x
, is assumed to be proportional to exp(a_y + b_y^T x)
. The parameters a_y
, b_y
are estimated via penalized maximum log-likelihood. The penalization is either a weighted sum of the euclidean norms of the vectors (b_1[j],b_2[j],\ldots,b_L[j])
(pen.method=='vectors'
) or a weighted sum of all moduli |b_{\theta}[j]|
(pen.method=='simple'
). The weights are given by tau.o
times the sample standard deviation (within groups) of the j
-th components of the feature vectors.
In case of pen.method=='none', no penalization is used, but this option may be unstable.
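The model and the two penalties described above can be illustrated with a minimal base R sketch. It does not call the package and is not its implementation: the parameter values a and B, the helper names class_probs, penalty_vectors and penalty_simple, and the scale vector s (a stand-in for the within-group standard deviations) are all assumptions made for illustration.

```r
# Hypothetical toy parameters: L = 3 classes, d = 2 features.
a <- c(0.0, 0.5, -0.5)                 # intercepts a_y, one per class
B <- matrix(c(1.0, -1.0,  0.0,         # B[y, j] = b_y[j]; column j holds
              0.5,  0.0, -0.5),        # the vector (b_1[j], ..., b_L[j])
            nrow = 3, ncol = 2)

# Class probabilities proportional to exp(a_y + b_y^T x).
class_probs <- function(x, a, B) {
  eta <- a + as.vector(B %*% x)        # linear scores, one per class
  w <- exp(eta - max(eta))             # subtract the maximum for stability
  w / sum(w)
}

# Penalties with weights tau * s[j]; s is an assumed per-feature scale.
penalty_vectors <- function(B, tau, s) sum(tau * s * sqrt(colSums(B^2)))
penalty_simple  <- function(B, tau, s) sum(tau * s * colSums(abs(B)))

x <- c(1, 2)
s <- c(1, 1)
p  <- class_probs(x, a, B)
pv <- penalty_vectors(B, tau = 10, s = s)
ps <- penalty_simple(B, tau = 10, s = s)
```

In this sketch the 'vectors' penalty is never larger than the 'simple' penalty, because the Euclidean norm of a vector is bounded by the sum of its absolute entries.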
If find.tau == TRUE, the program searches for the best penalty parameter. To determine the best parameter tau for the p-value PV[i,b], the new observation NewX[i,] is added to the training data with class label b, and then, for all training observations with Y[j] != b, the estimated probability of X[j,] belonging to class b is computed. The tau that minimizes the sum of these values is chosen. First, tau.o is compared with tau.o*delta. If tau.o*delta is better, it is compared with tau.o*delta^2, etc. The maximal parameter considered is tau.max. If tau.o is better than tau.o*delta, it is compared with tau.o*delta^-1, etc. The minimal parameter considered is tau.min.
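The stepwise search over tau described above can be sketched in base R as follows. This is an illustration under assumptions, not the package code: search_tau and the criterion function score (smaller is better) are hypothetical names, ties are treated as "not better", and the downward search is only started when the first upward step brings no improvement.

```r
# Geometric search: tau.o, tau.o*delta, tau.o*delta^2, ... or
# tau.o, tau.o/delta, tau.o/delta^2, ..., bounded by tau.max and tau.min.
search_tau <- function(score, tau.o = 10, delta = 2, tau.max = 80, tau.min = 1) {
  best <- tau.o
  best_score <- score(best)
  moved_up <- FALSE
  cand <- best * delta
  while (cand <= tau.max) {            # try larger penalties first
    s <- score(cand)
    if (s >= best_score) break         # no improvement: stop going up
    best <- cand; best_score <- s; moved_up <- TRUE
    cand <- best * delta
  }
  if (!moved_up) {                     # tau.o beat tau.o*delta: go down
    cand <- best / delta
    while (cand >= tau.min) {
      s <- score(cand)
      if (s >= best_score) break
      best <- cand; best_score <- s
      cand <- best / delta
    }
  }
  best
}

# Toy criteria with known minimizers on the search grid.
up   <- search_tau(function(tau) (log2(tau) - 5)^2)   # visits 10, 20, 40, 80
down <- search_tau(function(tau) (tau - 2)^2)         # visits 10, 20, 5, 2.5, 1.25
```

In the package, the criterion is the sum of estimated class-b probabilities described above; here a simple toy function stands in for it.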
Value
PV is a matrix containing the p-values. Precisely, for each new observation NewX[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If find.tau == TRUE, PV has an attribute "tau.opt", which is a matrix; tau.opt[i,b] is the best tau for observation NewX[i,] and class b (see section 'Details'). tau.opt[i,b] is used to compute the p-value for observation NewX[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs, pvs.gaussian, pvs.knn, pvs.wnn
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs.logreg(NewX, X, Y, tau.o=1, pen.method="vectors", progress=TRUE)
# A bigger data example: Buerk's hospital data.
## Not run:
data(buerk)
X.raw <- as.matrix(buerk[,1:21])
Y.raw <- buerk[,22]
# Class sizes in the raw data (Y.raw is coded 0/1)
n0.raw <- sum(1 - Y.raw)
n1 <- sum(Y.raw)
# Use three times as many class-0 as class-1 observations
n0 <- 3*n1
X0 <- X.raw[Y.raw==0,]
X1 <- X.raw[Y.raw==1,]
# Random index sets for the two classes
tmpi0 <- sample(1:n0.raw,size=3*n1,replace=FALSE)
tmpi1 <- sample(1:n1 ,size= n1,replace=FALSE)
# Hold out 100 observations per class as test data
Xtrain <- rbind(X0[tmpi0[1:(n0-100)],],X1[1:(n1-100),])
Ytrain <- c(rep(1,n0-100),rep(2,n1-100))
Xtest <- rbind(X0[tmpi0[(n0-99):n0],],X1[(n1-99):n1,])
Ytest <- c(rep(1,100),rep(2,100))
# Compute p-values for the test data and summarize them
PV <- pvs.logreg(Xtest,Xtrain,Ytrain,tau.o=2,progress=TRUE)
analyze.pvs(Y=Ytest,pv=PV,pvplot=FALSE)
## End(Not run)
P-Values to Classify New Observations (Weighted Nearest Neighbors)
Description
Computes nonparametric p-values for the potential class memberships of new observations. The p-values are based on 'weighted nearest neighbors'.
Usage
pvs.wnn(NewX, X, Y, wtype = c('linear', 'exponential'), W = NULL,
tau = 0.3, distance = c('euclidean', 'ddeuclidean',
'mahalanobis'), cova = c('standard', 'M', 'sym'))
Arguments
NewX |
data matrix consisting of one or several new observations (row vectors) to be classified. |
X |
matrix containing training observations, where each observation is a row vector. |
Y |
vector indicating the classes which the training observations belong to. |
wtype |
type of the weight function (see section 'Details' below). |
W |
vector of the (decreasing) weights (see section 'Details' below). |
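tau |
parameter of the weight function (see section 'Details' below). |
distance |
the distance measure: one of 'euclidean', 'ddeuclidean' or 'mahalanobis'. |
cova |
estimator for the covariance matrix: one of 'standard', 'M' or 'sym'. |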
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'weighted nearest neighbors' with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
The (decreasing) weights for the observations can either be given as an n-dimensional vector W or (if W = NULL) be generated by one of the following weight functions:
linear:
W_i = \max\left(1 - \frac{i/n}{\tau},\, 0\right),
exponential:
W_i = \left(1 - \frac{i}{n}\right)^{\tau}.
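The two weight functions above can be evaluated directly in base R. The helper name wnn_weights is an assumption made for illustration and is not part of the package; the sketch only reproduces the formulas for i = 1, ..., n.

```r
# Weights W_1, ..., W_n for the i-th nearest neighbor (hypothetical helper).
wnn_weights <- function(n, tau, wtype = c("linear", "exponential")) {
  wtype <- match.arg(wtype)
  i <- seq_len(n)
  if (wtype == "linear") {
    pmax(1 - (i / n) / tau, 0)         # zero for all ranks with i/n >= tau
  } else {
    (1 - i / n)^tau                    # decays faster for larger tau
  }
}

w_lin <- wnn_weights(10, tau = 0.5, wtype = "linear")
w_exp <- wnn_weights(10, tau = 5,   wtype = "exponential")
```

With the linear weights, tau acts as the fraction of nearest neighbors that receive a positive weight; with the exponential weights, every neighbor except the farthest one keeps a positive weight.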
If tau is a vector, the program searches for the best tau. To determine the best tau for the p-value PV[i,b], the new observation NewX[i,] is added to the training data with class label b, and then, for all training observations with Y[j] != b, the sum of the weights of the observations belonging to class b is computed. The tau that minimizes the sum of these values is chosen.
If tau = NULL, it is set to seq(0.1,0.9,0.1) if wtype = "l" and to c(1,5,10,20) if wtype = "e".
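The quantity entering the search for tau, the sum of the weights of the class-b observations as seen from one point, can be sketched in base R. This is a simplified illustration and not the package code: class_weight_sum is a hypothetical helper, plain Euclidean distance and linear weights are assumed, and the step of adding the new observation to the training data is omitted.

```r
X <- as.matrix(iris[-1, 1:4])          # training data without the first flower
Y <- iris[-1, 5]
x <- unlist(iris[1, 1:4])              # query point (a setosa flower)
n <- nrow(X)

# Sum of the weights of the class-b observations, where the i-th nearest
# neighbor of x receives weight W[i].
class_weight_sum <- function(x, X, Y, b, W) {
  d <- sqrt(rowSums(sweep(X, 2, x)^2)) # Euclidean distances to x
  sum(W[Y[order(d)] == b])
}

tau_grid <- seq(0.1, 0.9, 0.1)         # default grid for linear weights
sums <- sapply(tau_grid, function(tau) {
  W <- pmax(1 - (seq_len(n) / n) / tau, 0)
  class_weight_sum(x, X, Y, b = "virginica", W = W)
})
```

For this setosa flower, small values of tau give the class 'virginica' no weight at all, because only the closest neighbors carry positive weights.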
Value
PV is a matrix containing the p-values. Precisely, for each new observation NewX[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If tau is a vector or NULL (and W = NULL), PV has an attribute "opt.tau", which is a matrix; opt.tau[i,b] is the best tau for observation NewX[i,] and class b (see section 'Details'). opt.tau[i,b] is used to compute the p-value for observation NewX[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
https://www.imsv.unibe.ch/about_us/staff/prof_dr_duembgen_lutz/index_eng.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs, pvs.gaussian, pvs.knn, pvs.logreg
Examples
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs.wnn(NewX, X, Y, wtype = 'l', tau = 0.5)