Version: 1.0-2
Author: Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
Title: Data for 'GAMs: An Introduction with R'
Description: Data sets and scripts used in the book 'Generalized Additive Models: An Introduction with R', Wood (2006, 2017), CRC.
Depends: R (≥ 2.10)
Suggests: mgcv, lattice, MASS, nlme, lme4, geoR, survival
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
NeedsCompilation: no
Packaged: 2019-08-23 12:14:27 UTC; sw283
Repository: CRAN
Date/Publication: 2019-08-23 12:40:02 UTC
Data and scripts for ‘Generalized Additive Models: An Introduction with R’
Description
This package contains the data sets used in the book Generalized Additive Models: An Introduction with R, which covers linear and generalized linear models, GAMs as implemented in package mgcv, and mixed model extensions of these.
For the second edition of the book there are help files containing the R code for each chapter and its exercise solutions.
The script files for the first edition of the book can be found in the 'scripts' folder of the 'inst' folder of the source package. They have been modified slightly to work with recent versions of mgcv (e.g. >= 1.7-0).
Details
Each dataset has its own help page, which describes the dataset and gives the original source and associated references. All datasets have been reformatted into standard R data frames. Some smaller datasets from the book have not been included. Datasets from other R packages have not been included, with the exception of a distillation of one set from the NMMAPSdata package.
Index:
aral              Aral sea chlorophyll
aral.bnd          Aral sea boundary
bird              Bird distribution data from Portugal
blowfly           Nicholson's Blowfly data
bone              Bone marrow treatment survival data
brain             Brain scan data
cairo             Daily temperature data for Cairo
CanWeather        Canadian annual temperature curves
chicago           Chicago air pollution and death rate data
chl               Chlorophyll data
co2s              Atmospheric CO2 at South Pole
coast             European coastline from -11 to 0 East and from 43 to 59 North
engine            Engine wear versus size data
german.polys.rda  Polygons defining German local regions
gamair            Generalized Additive Models: An Introduction With R
harrier           Hen Harriers Eating Grouse
hubble            Hubble Space Telescope Data
ipo               Initial Public Offering Data
larynx            Cancer of the larynx in Germany
mack              Egg data from 1992 mackerel survey
mackp             Prediction grid data for 1992 mackerel egg model
med               2010 mackerel egg survey data
meh               2010 horse mackerel egg survey data
prostate          Protein mass spectra for prostate diagnosis
sitka             Sitka spruce growth and ozone data
sole              Sole Eggs in the Bristol Channel
sperm.comp1       Sperm competition data I
sperm.comp2       Sperm competition data II
swer              Swiss extreme rainfall data
stomata           Artificial stomatal area data
wesdr             Diabetic retinopathy data
wine              Bordeaux Wines
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(help=gamair)
Canadian Weather data
Description
Data on temperature throughout the year at 35 Canadian locations, originally from the fda package.
Usage
data(canWeather)
Format
The CanWeather data frame has the following 5 columns:
- time
Day of year from 1 to 365.
- T
Mean temperature for that day in centigrade.
- region
A four level factor classifying locations as Arctic, Atlantic, Continental or Pacific.
- latitude
Degrees north of the equator.
- place
A factor with 35 levels: the names of each location.
Details
The data provide quite a nice application of function on scalar regression. Note that the data are for a single year, so they will not generally be cyclic.
Source
Data are from the fda package: https://cran.r-project.org/package=fda
References
Ramsay J.O. and B.W. Silverman (2006) Functional data analysis (2nd ed). Springer
Examples
require(gamair);require(mgcv)
data(canWeather)
reg <- unique(CanWeather$region)
place <- unique(CanWeather$place)
col <- 1:4;names(col) <- reg
for (k in 1:35) {
if (k==1) plot(1:365,CanWeather$T[CanWeather$place==place[k]],
ylim=range(CanWeather$T),type="l",
col=col[CanWeather$region[CanWeather$place==place[k]]],
xlab="day",ylab="temperature") else
lines(1:365,CanWeather$T[CanWeather$place==place[k]],
col=col[CanWeather$region[CanWeather$place==place[k]]])
}
## Function on scalar regression.
## T(t) = f_r(t) + f(t)*latitude + e(t)
## where e(t) is AR1 Gaussian and f_r is
## a smooth for region r.
## 'rho' chosen to minimize AIC or (-ve) REML score.
b <- bam(T ~ region + s(time,k=20,bs="cr",by=region) +
s(time,k=40,bs="cr",by=latitude),
data=CanWeather,AR.start=time==1,rho=.97)
## Note: the 'discrete=TRUE' option is even faster.
par(mfrow=c(2,3))
plot(b,scale=0,scheme=1)
acf(b$std.rsd)
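## Sketch (an assumption, not book code; requires an mgcv version in
## which 'bam' supports discretized covariates together with the AR1
## 'rho' option): the same model using the faster discrete methods...
b2 <- bam(T ~ region + s(time,k=20,bs="cr",by=region) +
s(time,k=40,bs="cr",by=latitude),
data=CanWeather,AR.start=time==1,rho=.97,discrete=TRUE)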
Cancer of the larynx in Germany
Description
The data give counts of deaths from cancer of the larynx by region of Germany from 1986 to 1990, along with the expected count according to the population of the region and the total deaths for the whole of Germany. A list of polygons defining the boundaries of the districts is also provided.
Usage
data(larynx)
data(german.polys)
Format
The Larynx data frame has the following columns:
- region
A factor with 544 levels identifying the health reporting region.
- E
Expected number of deaths according to population of region and pan-German total.
- Y
Number of deaths from cancer of the Larynx in the region.
- x
A measure of level of smoking in the region.
german.polys is a list with one item per health reporting region in Larynx. The name of each item identifies the region using the same labels as Larynx$region. Each item is a two column matrix defining a polygon approximating the outline of the region it relates to. Each row of the matrix defines a polygon vertex. NA rows separate geographically disjoint areas which are part of the same region.
Details
Note that the polygons are set up to exactly share vertices with their neighbours, which facilitates the auto-identification of neighbourhood structures.
Source
Data are from the INLA website.
Examples
require(gamair);require(mgcv)
data(larynx);data(german.polys)
## plot raw deaths over expected deaths by region...
polys.plot(german.polys,Larynx$Y/Larynx$E)
## Fit additive model with Gauss MRF for space and smooth of
## smoking level. k somewhat low to reduce computational time
b <- gam(Y~s(region,k=60,bs="mrf",xt=list(polys=german.polys)) +
offset(log(E))+s(x,k=10),family=poisson,data=Larynx,method="REML")
summary(b)
plot(b,scheme=c(0,1),pages=1)
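## A minimal sketch (not book code) illustrating the Details note that
## exactly shared vertices allow neighbourhood structure to be
## identified automatically: treat two regions as neighbours if their
## polygons share at least one vertex.
vkey <- lapply(german.polys, function(p)
apply(na.omit(p),1,paste,collapse=",")) ## one string key per vertex
neighbours <- function(i,keys) {
## indices of regions sharing a vertex with region i (including i)
which(sapply(keys,function(k) any(k %in% keys[[i]])))
}
neighbours(1,vkey)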
Aral sea remote sensed chlorophyll data
Description
SeaWiFS satellite chlorophyll measurements for the 38th 8-day observation period of the year in the Aral sea, averaged over 1998-2002, along with an Aral sea boundary file.
Usage
data(aral)
data(aral.bnd)
Format
The aral data frame has the following columns:
- lon
longitude of pixel or boundary vertex.
- lat
latitude of pixel or boundary vertex.
- chl
chlorophyll measurement.
Details
Trying to smooth the data with a conventional smoother, such as a thin plate spline, leads to linkage between the two arms of the sea, which is clearly an artefact. A soap film smoother avoids this problem.
Source
https://seawifs.gsfc.nasa.gov/
Examples
require(gamair);require(mgcv)
data(aral); data(aral.bnd)
## define some knots...
knt <- list(lon=c(58.55,59.09,59.36,59.64,59.91,60.18,58.27,58.55,59.09,
59.36,59.64,59.91,60.18,60.45,58.27,58.55,58.82,59.09,59.36,59.64,59.91,
60.18,60.45,58.27,58.55,59.36,59.64,59.91,60.18,58.55,59.36,59.64,59.91,
60.18,58.55,58.82,59.36,59.64,59.91,60.18,60.45,58.82,59.09,59.64,59.91,
60.18,59.64),
lat=c(44.27,44.27,44.27,44.27,44.27,44.27,44.55,44.55,44.55,44.55,44.55,
44.55,44.55,44.55,44.82,44.82,44.82,44.82,44.82,44.82,44.82,44.82,44.82,
45.09,45.09,45.09,45.09,45.09,45.09,45.36,45.36,45.36,45.36,45.36,45.64,
45.64,45.64,45.64,45.64,45.64,45.64,45.91,45.91,45.91,45.91,45.91,46.18))
## fit soap film...
b <- gam(chl~s(lon,lat,k=30,bs="so",xt=list(bnd=list(aral.bnd),
nmax=150)),knots=knt,data=aral)
## plot results...
plot(b)
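## For contrast, a sketch (not book code) of the artefact described in
## Details: an ordinary thin plate spline ignores the boundary and can
## 'link' the two arms of the sea...
b0 <- gam(chl~s(lon,lat,k=100),data=aral)
vis.gam(b0,plot.type="contour",too.far=0.05)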
Bird distribution data from Portugal
Description
Data from the compilation of the Portuguese Atlas of Breeding Birds.
Usage
data(bird)
Format
A data frame with 6 columns and 25100 rows. Each row refers to one 2km by 2km square (tetrad). The columns are:
- QUADRICULA
An identifier for the 10km by 10km square that this tetrad belongs to.
- TET
Which tetrad the observation is from.
- crestlark
Were crested lark (or possibly thekla lark!) found (1), not found (0) breeding in this tetrad, or was the tetrad not visited (NA).
- linnet
As crestlark, but for linnet.
- x
location of tetrad (km east of an origin).
- y
location of tetrad (km north of an origin).
Details
At least 6 tetrads from each 10km square were visited, to establish whether each species was breeding there or not. Each tetrad was visited twice, for one hour each visit. These data are not definitive: at the time of writing the fieldwork was not quite complete.
The data were kindly supplied by Jose Pedro Granadeiro.
Source
The Atlas of the Portuguese Breeding Birds.
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R
Examples
data(bird)
species <- "crestlark"
op<-par(bg="white",mfrow=c(1,1),mar=c(5,5,1,1))
ind <- bird[[species]]==0&!is.na(bird[[species]])
plot(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white",
ylab="km west",xlab="km north",cex.lab=1.4,cex.axis=1.3,type="n")
polygon(c(4000,4700,4700,4000),c(250,250,600,600),col="grey",border="black")
points(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white")
ind <- bird[[species]]==1&!is.na(bird[[species]])
with(bird,points(y[ind]/1000,1000-x[ind]/1000,pch=19,cex=.3))
par(op)
Nicholson's Blowfly data
Description
Data on a laboratory population of blowflies, from the classic ecological studies of Nicholson.
Usage
data(blowfly)
Format
A data frame with 2 columns and 180 rows. The columns are:
- pop
Counts (!) of the population of adult Blowflies in one of the experiments.
- day
Day of experiment.
Details
The population counts are actually obtained by counting dead blowflies and back calculating.
References
Nicholson, A.J. (1954a) Compensatory reactions of populations to stresses and their evolutionary significance. Australian Journal of Zoology 2, 1-8.
Nicholson, A.J. (1954b) An outline of the dynamics of animal populations. Australian Journal of Zoology 2, 9-65.
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R
Examples
data(blowfly)
with(blowfly,plot(day,pop,type="l"))
Bone marrow treatment survival data
Description
Data from Klein and Moeschberger (2003), for 23 patients with non-Hodgkin's lymphoma.
Usage
data(bone)
Format
A data frame with 3 columns and 23 rows. Each row refers to one patient. The columns are:
- t
Time of death, relapse or last follow up after treatment, in days.
- d
1 for death or relapse. 0 otherwise.
- trt
Two level factor, allo or auto, depending on treatment received.
Details
The data were collected at the Ohio State University bone marrow transplant unit. The allo treatment is bone marrow transplant from a matched sibling donor. The auto treatment consists of bone marrow removal and replacement after chemotherapy.
Source
Klein and Moeschberger (2003).
References
Klein, J.P. and M.L. Moeschberger (2003) Survival Analysis: Techniques for Censored and Truncated Data. Springer.
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R
Examples
## example of fitting a Cox PH model as a Poisson GLM...
## First a function to convert data frame of raw data
## to data frame of artificial data...
psurv <- function(surv,time="t",censor="d",event="z") {
## create data frame to fit Cox PH as Poisson model.
## surv[[censor]] should be 1 for event or zero for censored.
if (event %in% names(surv)) warning("event name already in use in data frame")
surv <- as.data.frame(surv)[order(surv[[time]]),] ## ascending time order
et <- unique(surv[[time]][surv[[censor]]==1]) ## unique times requiring record
es <- match(et,surv[[time]]) ## starts of risk sets in surv
n <- nrow(surv); t <- rep(et,1+n-es) ## times for risk sets
st <- cbind(0,surv[unlist(apply(matrix(es),1,function(x,n) x:n,n=n)),])
st[st[[time]]==t&st[[censor]]!=0,1] <- 1 ## signal events
st[[time]] <- t ## reset time field to time for this risk set
names(st)[1] <- event
st
} ## psurv
## Now fit the model...
require(gamair)
data(bone);bone$id <- 1:nrow(bone)
pb <- psurv(bone); pb$tf <- factor(pb$t)
## Note that time factor tf should go first to ensure
## it has no contrasts applied...
b <- glm(z ~ tf + trt - 1,poisson,pb)
drop1(b,test="Chisq") ## test for effect - no evidence
## martingale and deviance residuals
chaz <- tapply(fitted(b),pb$id,sum) ## cum haz by subject
mrsd <- bone$d - chaz
drsd <- sign(mrsd)*sqrt(-2*(mrsd + bone$d*log(chaz)))
## Estimate and plot survivor functions and standard
## errors for the two groups...
te <- sort(unique(bone$t[bone$d==1])) ## event times
## predict survivor function for "allo"...
pd <- data.frame(tf=factor(te),trt=bone$trt[1])
fv <- predict(b,pd)
H <- cumsum(exp(fv)) ## cumulative hazard
plot(stepfun(te,c(1,exp(-H))),do.points=FALSE,ylim=c(0,1),xlim=c(0,550),
main="black allo, grey auto",ylab="S(t)",xlab="t (days)")
## add s.e. bands...
X <- model.matrix(~tf+trt-1,pd)
J <- apply(exp(fv)*X,2,cumsum)
se <- diag(J%*%vcov(b)%*%t(J))^.5
lines(stepfun(te,c(1,exp(-H+se))),do.points=FALSE,lty=2)
lines(stepfun(te,c(1,exp(-H-se))),do.points=FALSE,lty=2)
## same for "auto"...
pd <- data.frame(tf=factor(te),trt=bone$trt[23])
fv <- predict(b,pd); H <- cumsum(exp(fv))
lines(stepfun(te,c(1,exp(-H))),col="grey",lwd=2,do.points=FALSE)
X <- model.matrix(~tf+trt-1,pd)
J <- apply(exp(fv)*X,2,cumsum)
se <- diag(J%*%vcov(b)%*%t(J))^.5
lines(stepfun(te,c(1,exp(-H+se))),do.points=FALSE,lty=2,col="grey",lwd=2)
lines(stepfun(te,c(1,exp(-H-se))),do.points=FALSE,lty=2,col="grey",lwd=2)
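## Cross-check, as a sketch (assumes the 'survival' package, listed in
## Suggests): the Poisson GLM trick corresponds to a Cox fit with
## Breslow handling of ties, so treatment effects should agree closely.
library(survival)
cm <- coxph(Surv(t,d)~trt,data=bone,ties="breslow")
summary(cm) ## compare with the 'trt' coefficient of the glm fit 'b'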
Brain scan data
Description
Functional magnetic resonance imaging measurements for a human brain subject to a particular experimental stimulus. One slice of the image is provided, described as a near-axial slice through the dorsal cerebral cortex.
Usage
data(brain)
Format
A data frame with 5 columns and 1567 rows. Each row refers to one ‘voxel’ of the image. The columns are:
- X
voxel position on horizontal axis.
- Y
voxel position on vertical axis.
- medFPQ
median of three replicate 'Fundamental Power Quotient' values at the voxel: this is the main measurement of brain activity.
- region
code indicating which of several regions of the brain the voxel belongs to. The regions are defined by the experimenters: 0 is the base region; 1 is the region of interest; 2 is the region activated by the experimental stimulus; NA denotes a voxel with no allocation.
- meanTheta
mean phase shift at the voxel, over three measurements.
Details
See the source article for fuller details.
Source
S. Landau et al (2003) ‘Tests for a difference in timing of physiological response between two brain regions measured by using functional magnetic resonance imaging’. Journal of the Royal Statistical Society, Series C, Applied Statistics, 53(1):63-82
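Examples
## A minimal sketch (not book code): grey-scale image of median FPQ
## over the scanned slice, capping extreme values for display...
require(gamair); data(brain)
with(brain,plot(X,Y,pch=15,cex=.6,col=grey(1-pmin(medFPQ,5)/5)))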
Daily temperature data for Cairo
Description
The average air temperature (F) in Cairo from Jan 1st 1995.
Usage
data(cairo)
Format
A data frame with 6 columns and 3780 rows. The columns are:
- month
month of year from 1 to 12.
- day.of.month
day of month, from 1 to 31.
- year
Year, starting 1995.
- temp
Average temperature (F).
- day.of.year
Day of year from 1 to 366.
- time
Number of days since 1st Jan 1995.
Source
http://academic.udayton.edu/kissock/http/Weather/citylistWorld.htm
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R
Examples
data(cairo)
with(cairo,plot(time,temp,type="l"))
Code for Chapter 1: Linear Models
Description
R code from Chapter 1 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## 1.1.2
data(hubble)
hub.mod <- lm(y ~ x - 1, data=hubble)
summary(hub.mod)
plot(fitted(hub.mod),residuals(hub.mod),xlab="fitted values",
ylab="residuals")
hub.mod1 <- lm(y ~ x - 1,data=hubble[-c(3,15),])
summary(hub.mod1)
plot(fitted(hub.mod1),residuals(hub.mod1),
xlab="fitted values",ylab="residuals")
hubble.const <- c(coef(hub.mod),coef(hub.mod1))/3.09e19 # convert to s^-1
age <- 1/hubble.const # age of universe in seconds
age/(60^2*24*365) # age in years
## 1.1.3
cs.hubble <- 163000000
t.stat <- (coef(hub.mod1)-cs.hubble)/vcov(hub.mod1)[1,1]^0.5
pt(t.stat,df=21)*2 # 2 because of |T| in p-value defn.
sigb <- summary(hub.mod1)$coefficients[2]
h.ci <- coef(hub.mod1)+qt(c(0.025,0.975),df=21)*sigb
h.ci
h.ci <- h.ci*60^2*24*365.25/3.09e19 # convert to 1/years
sort(1/h.ci)
## 1.5.1
data(sperm.comp1)
pairs(sperm.comp1[,-1])
sc.mod1 <- lm(count~time.ipc+prop.partner,sperm.comp1)
model.matrix(sc.mod1)
par(mfrow=c(2,2)) # split the graphics device into 4 panels
plot(sc.mod1) # (uses plot.lm as sc.mod1 is class `lm')
sperm.comp1[9,]
sc.mod1
sc.mod2 <- lm(count~time.ipc+I(prop.partner*time.ipc),
sperm.comp1)
## 1.5.2
summary(sc.mod1)
## 1.5.3
sc.mod3 <- lm(count~prop.partner,sperm.comp1)
summary(sc.mod3)
sc.mod4 <- lm(count~1,sperm.comp1) # null model
AIC(sc.mod1,sc.mod3,sc.mod4)
## 1.5.4
data(sperm.comp2)
sc2.mod1 <- lm(count~f.age+f.height+f.weight+m.age+m.height+
m.weight+m.vol,sperm.comp2)
plot(sc2.mod1)
summary(sc2.mod1)
sc2.mod2 <- lm(count~f.age+f.height+f.weight+m.height+
m.weight+m.vol,sperm.comp2)
summary(sc2.mod2)
sc2.mod7 <- lm(count~f.weight,sperm.comp2)
summary(sc2.mod7)
sc <- sperm.comp2[-19,]
sc3.mod1 <- lm(count~f.age+f.height+f.weight+m.age+m.height+
m.weight+m.vol,sc)
summary(sc3.mod1)
sperm.comp1$m.vol <-
sperm.comp2$m.vol[sperm.comp2$pair%in%sperm.comp1$subject]
sc1.mod1 <- lm(count~m.vol,sperm.comp1)
summary(sc1.mod1)
## 1.5.5
sc.c <- summary(sc1.mod1)$coefficients
sc.c # check info extracted from summary
sc.c[2,1]+qt(c(.025,.975),6)*sc.c[2,2]
## 1.5.6
df <- data.frame(m.vol=c(10,15,20,25))
predict(sc1.mod1,df,se=TRUE)
## 1.5.7
set.seed(1); n <- 100; x <- runif(n)
z <- x + rnorm(n)*.05
y <- 2 + 3 * x + rnorm(n)
summary(lm(y~z))
summary(lm(y~x+z))
## 1.6.4
z <- c(1,1,1,2,2,1,3,3,3,3,4)
z
z <- as.factor(z)
z
x <- c("A","A","C","C","C","er","er")
x
x <- factor(x)
x
PlantGrowth$group
# PlantGrowth$group <- as.factor(PlantGrowth$group)
pgm.1 <- lm(weight ~ group,data=PlantGrowth)
plot(pgm.1)
summary(pgm.1)
pgm.0 <- lm(weight~1,data=PlantGrowth)
anova(pgm.0,pgm.1)
Solution code for Chapter 1: Linear Models
Description
R code for Chapter 1 exercise solutions.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## Q.8 Rubber
## a)
library(MASS)
m1 <- lm(loss~hard+tens+I(hard*tens)+I(hard^2)+I(tens^2)+
I(hard^2*tens)+I(tens^2*hard)+I(tens^3)+I(hard^3),Rubber)
plot(m1) ## residuals OK
summary(m1) ## p-values => drop I(tens^2*hard)
m2 <- update(m1,.~.-I(tens^2*hard))
summary(m2)
m3 <- update(m2,.~.-hard)
summary(m3)
m4 <- update(m3,.~.-1)
summary(m4)
m5 <- update(m4,.~.-I(hard^2))
summary(m5) ## p-values => keep all remaining
plot(m5) ## residuals OK
## b)
AIC(m1,m2,m3,m4,m5)
m6 <- step(m1)
## c)
m <- 40;attach(Rubber)
mt <- seq(min(tens),max(tens),length=m)
mh <- seq(min(hard),max(hard),length=m)
lp <- predict(m6,data.frame(hard=rep(mh,rep(m,m)),
tens=rep(mt,m)))
contour(mt,mh,matrix(lp,m,m),xlab="tens",ylab="hard")
points(tens,hard)
detach(Rubber)
## Q.9 warpbreaks
wm <- lm(breaks~wool*tension,warpbreaks)
par(mfrow=c(2,2))
plot(wm) # residuals OK
anova(wm)
## ... so there is evidence for a wool:tension interaction.
par(mfrow=c(1,1))
with(warpbreaks,interaction.plot(tension,wool,breaks))
## Q.10 cars
## a)
cm1 <- lm(dist ~ speed + I(speed^2),cars)
summary(cm1)
## Intercept has very high p-value, so drop it
cm2 <- lm(dist ~ speed + I(speed^2)-1,cars)
summary(cm2)
## both terms now significant, but try the alternative of
## dropping `speed'
cm3 <- lm(dist ~ I(speed^2),cars)
AIC(cm1,cm2,cm3)
plot(cm2)
# Clearly cm2, with speed and speed squared terms, is to be preferred,
# but note that variance seems to be increasing with mean a little:
# perhaps a GLM would be better?
## b)
# In seconds, the answer is obtained as follows..
b <- coef(cm2)
5280/(b[1]*60^2)
# This is a long time, but would have a rather wide associated confidence
# interval.
## Q.11 QR
# The following is a version of the function that you should end up with.
fitlm <- function(y,X)
{ qrx <- qr(X) ## get QR decomposition
y <- qr.qty(qrx,y) ## form Q'y efficiently
R <- qr.R(qrx) ## extract R
p <- ncol(R);n <- length(y) ## get dimensions
f <- y[1:p]; r <- y[(p+1):n]## partition Q'y
beta <- backsolve(R,f) ## parameter estimates (a)
sig2 <- sum(r^2)/(n-p) ## resid variance estimate (c)
Ri <- backsolve(R,diag(ncol(R))) ## inverse of R matrix
Vb <- Ri%*%t(Ri)*sig2 ## covariance matrix
se <- diag(Vb)^.5 ## standard errors (c)
F.ratio <- f^2/sig2 ## sequential F-ratios
seq.p.val <- 1-pf(F.ratio,1,n-p) ## seq. p-values (e)
list(beta=beta,se=se,sig2=sig2,seq.p.val=seq.p.val,df=n-p)
} ## fitlm
# The following code uses the function to answer some of the question parts.
## get example X ...
X <- model.matrix(dist ~ speed + I(speed^2),cars)
cm <- fitlm(cars$dist,X) # used fitting function
cm$beta;cm$se # print estimates and s.e.s (a,c)
cm1<-lm(dist ~ speed + I(speed^2),cars) # equiv. lm call
summary(cm1) # check estimates and s.e.s (b,c)
t.ratio <- cm$beta/cm$se # form t-ratios
p.val <- pt(-abs(t.ratio),df=cm$df)*2
p.val # print evaluated p-values (d)
## print sequential ANOVA p-values, and check them (e)
cm$seq.p.val
anova(cm1)
## Q.12 InsectSprays
X <- model.matrix(~spray-1,InsectSprays)
X <- cbind(rep(1,nrow(X)),X) # redundant model matrix
C <- matrix(c(0,rep(1,6)),1,7) # constraints
qrc <- qr(t(C)) # QR decomp. of C'
## use fact that Q=[D:Z] and XQ = (Q'X')' to form XZ ...
XZ <- t(qr.qty(qrc,t(X)))[,2:7]
m1 <- lm(InsectSprays$count~XZ-1) # fit model
bz <- coef(m1) # estimates in constrained parameterization
## form b = Z b_z, using fact that Q=[D:Z], again
b <- c(0,bz)
b <- qr.qy(qrc,b)
sum(b[2:7])
## Q.13 trees
## a)
EV.func <- function(b,g,h)
{ mu <- b[1]*g^b[2]*h^b[3]
J <- cbind(g^b[2]*h^b[3],mu*log(g),mu*log(h))
list(mu=mu,J=J)
}
## b)
attach(trees)
b <- c(.002,2,1);b.old <- 100*b+100
while (sum(abs(b-b.old))>1e-7*sum(abs(b.old))) {
EV <- EV.func(b,Girth,Height)
z <- (Volume-EV$mu) + EV$J%*%b
b.old <- b
b <- coef(lm(z~EV$J-1))
}
b
## c)
sig2 <- sum((Volume - EV$mu)^2)/(nrow(trees)-3)
Vb <- solve(t(EV$J)%*%EV$J)*sig2
se <- diag(Vb)^.5;se
Code for Chapter 2: Linear Mixed Models
Description
R code from Chapter 2 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## 2.1.1
data(stomata)
m1 <- lm(area ~ CO2 + tree,stomata)
m0 <- lm(area ~ CO2,stomata)
anova(m0,m1)
m2 <- lm(area ~ tree,stomata)
anova(m2,m1)
st <- aggregate(data.matrix(stomata),
by=list(tree=stomata$tree),mean)
st$CO2 <- as.factor(st$CO2);st
m3 <- lm(area~CO2,st)
anova(m3)
summary(m3)$sigma^2 - summary(m1)$sigma^2/4
## 2.1.3
library(nlme) # load nlme `library', which contains data
data(Rail) # load data
Rail
m1 <- lm(travel ~ Rail,Rail)
anova(m1)
rt <- aggregate(data.matrix(Rail),by=list(Rail$Rail),mean)
rt
m0 <- lm(travel ~ 1,rt) # fit model to aggregated data
sigb <- (summary(m0)$sigma^2-summary(m1)$sigma^2/3)^0.5
# sigb^2 is variance component for rail
sig <- summary(m1)$sigma # sig^2 is resid. var. component
sigb
sig
summary(m0)
## 2.1.4
library(nlme)
data(Machines)
names(Machines)
attach(Machines) # make data available without `Machines$'
interaction.plot(Machine,Worker,score)
m1 <- lm(score ~ Worker*Machine,Machines)
m0 <- lm(score ~ Worker + Machine,Machines)
anova(m0,m1)
summary(m1)$sigma^2
Mach <- aggregate(data.matrix(Machines),by=
list(Machines$Worker,Machines$Machine),mean)
Mach$Worker <- as.factor(Mach$Worker)
Mach$Machine <- as.factor(Mach$Machine)
m0 <- lm(score ~ Worker + Machine,Mach)
anova(m0)
summary(m0)$sigma^2 - summary(m1)$sigma^2/3
M <- aggregate(data.matrix(Mach),by=list(Mach$Worker),mean)
m00 <- lm(score ~ 1,M)
summary(m00)$sigma^2 - (summary(m0)$sigma^2)/3
## 2.4.4
llm <- function(theta,X,Z,y) {
## untransform parameters...
sigma.b <- exp(theta[1])
sigma <- exp(theta[2])
## extract dimensions...
n <- length(y); pr <- ncol(Z); pf <- ncol(X)
## obtain \hat \beta, \hat b...
X1 <- cbind(X,Z)
ipsi <- c(rep(0,pf),rep(1/sigma.b^2,pr))
b1 <- solve(crossprod(X1)/sigma^2+diag(ipsi),
t(X1)%*%y/sigma^2)
## compute log|Z'Z/sigma^2 + I/sigma.b^2|...
ldet <- sum(log(diag(chol(crossprod(Z)/sigma^2 +
diag(ipsi[-(1:pf)])))))
## compute log profile likelihood...
l <- (-sum((y-X1%*%b1)^2)/sigma^2 - sum(b1^2*ipsi) -
n*log(sigma^2) - pr*log(sigma.b^2) - 2*ldet - n*log(2*pi))/2
attr(l,"b") <- as.numeric(b1) ## return \hat beta and \hat b
-l
}
library(nlme) ## for Rail data
options(contrasts=c("contr.treatment","contr.treatment"))
Z <- model.matrix(~Rail$Rail-1) ## r.e. model matrix
X <- matrix(1,18,1) ## fixed model matrix
## fit the model...
rail.mod <- optim(c(0,0),llm,hessian=TRUE,
X=X,Z=Z,y=Rail$travel)
exp(rail.mod$par) ## variance components
solve(rail.mod$hessian) ## approx cov matrix for theta
attr(llm(rail.mod$par,X,Z,Rail$travel),"b")
## 2.5.1
library(nlme)
lme(travel~1,Rail,list(Rail=~1))
## 2.5.2
Loblolly$age <- Loblolly$age - mean(Loblolly$age)
lmc <- lmeControl(niterEM=500,msMaxIter=100)
m0 <- lme(height ~ age + I(age^2) + I(age^3),Loblolly,
random=list(Seed=~age+I(age^2)+I(age^3)),
correlation=corAR1(form=~age|Seed),control=lmc)
plot(m0)
m1 <- lme(height ~ age+I(age^2)+I(age^3)+I(age^4),Loblolly,
list(Seed=~age+I(age^2)+I(age^3)),
cor=corAR1(form=~age|Seed),control=lmc)
plot(m1)
m2 <- lme(height~age+I(age^2)+I(age^3)+I(age^4)+I(age^5),
Loblolly,list(Seed=~age+I(age^2)+I(age^3)),
cor=corAR1(form=~age|Seed),control=lmc)
plot(m2)
plot(m2,Seed~resid(.))
qqnorm(m2,~resid(.))
qqnorm(m2,~ranef(.))
m3 <- lme(height~age+I(age^2)+I(age^3)+I(age^4)+I(age^5),
Loblolly,list(Seed=~age+I(age^2)+I(age^3)),control=lmc)
anova(m3,m2)
m4 <- lme(height~age+I(age^2)+I(age^3)+I(age^4)+I(age^5),
Loblolly,list(Seed=~age+I(age^2)),
correlation=corAR1(form=~age|Seed),control=lmc)
anova(m4,m2)
m5 <- lme(height~age+I(age^2)+I(age^3)+I(age^4)+I(age^5),
Loblolly,list(Seed=pdDiag(~age+I(age^2)+I(age^3))),
correlation=corAR1(form=~age|Seed),control=lmc)
anova(m2,m5)
plot(augPred(m2))
## 2.5.3
lme(score~Machine,Machines,list(Worker=~1,Machine=~1))
## 2.5.4
library(lme4)
a1 <- lmer(score~Machine+(1|Worker)+(1|Worker:Machine),
data=Machines)
a1
a2 <- lmer(score~Machine+(1|Worker)+(Machine-1|Worker),
data=Machines)
AIC(a1,a2)
anova(a1,a2)
## 2.5.5
library(mgcv)
b1 <- gam(score~ Machine + s(Worker,bs="re") +
s(Machine,Worker,bs="re"),data=Machines,method="REML")
gam.vcomp(b1)
b2 <- gam(score~ Machine + s(Worker,bs="re") +
s(Worker,bs="re",by=Machine),data=Machines,method="REML")
gam.vcomp(b2)
Solution code for Chapter 2: Linear Mixed Models
Description
R code for Chapter 2 exercise solutions.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## Q.6
## c)
library(nlme)
options(contrasts=c("contr.treatment",
"contr.treatment"))
m1 <- lme(Thickness~Site+Source,Oxide,~1|Lot/Wafer)
plot(m1) # check resids vs. fitted vals
qqnorm(residuals(m1)) # check resids for normality
abline(0,sd(resid(m1)))# adding a "reference line"
qqnorm(m1,~ranef(.,level=1)) # check normality of b_k
qqnorm(m1,~ranef(.,level=2)) # check normality of c_(k)l
m2 <- lme(Thickness~Site+Source,Oxide,~1|Lot)
anova(m1,m2)
anova(m1)
intervals(m1)
## Q.7
library(nlme)
attach(Machines)
interaction.plot(Machine,Worker,score) # note worker 6, machine B
## base model
m1<-lme(score~Machine,Machines,~1|Worker/Machine)
## check it...
plot(m1)
plot(m1,Machine~resid(.),abline=0)
plot(m1,Worker~resid(.),abline=0)
qqnorm(m1,~resid(.))
qqnorm(m1,~ranef(.,level=1))
qqnorm(m1,~ranef(.,level=2)) ## note outlier
## try more general r.e. structure
m2<-lme(score~Machine,Machines,~Machine|Worker)
## check it...
qqnorm(m2,~resid(.))
qqnorm(m2,~ranef(.,level=1)) ## still an outlier
## simplified model...
m0 <- lme(score~Machine,Machines,~1|Worker)
## formal comparison
anova(m0,m1,m2) ## m1 most appropriate
anova(m1) ## significant Machine effect
intervals(m1) ## Machines B and C better than A
## remove problematic worker 6, machine B
Machines <- Machines[-(34:36),]
## re-running improves plots, but conclusions same.
## It seems that (2.6) is the most appropriate model of those tried,
## and broadly the same conclusions are reached with or without
## worker 6, on Machine B, which causes outliers on several checking
## plots. See next question for comparison of machines B and C.
## Q.8
# Using the data without worker 6 machine B:
intervals(m1,level=1-0.05/3,which="fixed")
levels(Machines$Machine)
Machines$Machine <- relevel(Machines$Machine,"B")
m1a <- lme(score ~ Machine, Machines, ~ 1|Worker/Machine)
intervals(m1a,level=1-0.05/3,which="fixed")
# So, there is evidence for differences between machine A and the
# other 2, but not between B and C, at the 5% level. However,
# depending on how economically significant the point estimate of
# the B-C difference is, it might be worth conducting a study with
# more workers in order to check whether the possible small
# difference might actually be real.
## Q.9
## c)
library(nlme);data(Gun)
options(contrasts=c("contr.treatment","contr.treatment"))
with(Gun,plot(Method,rounds))
with(Gun,plot(Physique,rounds))
m1 <- lme(rounds~Method+Physique,Gun,~1|Team)
plot(m1) # fitted vs. resid plot
qqnorm(residuals(m1))
abline(0,m1$sigma) # add line of "perfect Normality"
anova(m1) # balanced data: don't need type="marginal"
m2 <- lme(rounds~Method,Gun,~1|Team)
intervals(m2)
Code for Chapter 3: Generalized Linear Models
Description
R code from Chapter 3 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## 3.2.2
x <- c(.6,1.5); y <- c(.02,.9)
ms <- exp(-x*4) # set initial values at lower left
glm(y ~ I(-x)-1,family=gaussian(link=log),mustart=ms)
ms <- exp(-x*0.1) # set initial values at upper right
glm(y ~ I(-x)-1,family=gaussian(link=log),mustart=ms)
## 3.3.1
heart <- data.frame(ck = 0:11*40+20,
ha=c(2,13,30,30,21,19,18,13,19,15,7,8),
ok=c(88,26,8,5,0,1,1,1,1,0,0,0))
p <- heart$ha/(heart$ha+heart$ok)
plot(heart$ck,p,xlab="Creatinine kinase level",
ylab="Proportion Heart Attack")
mod.0 <- glm(cbind(ha,ok) ~ ck, family=binomial(link=logit),
data=heart)
mod.0 <- glm(cbind(ha,ok) ~ ck, family=binomial, data=heart)
mod.0
(271.7-36.93)/271.7
1-pchisq(36.93,10)
par(mfrow=c(2,2))
plot(mod.0)
plot(heart$ck, p, xlab="Creatinine kinase level",
ylab="Proportion Heart Attack")
lines(heart$ck, fitted(mod.0))
mod.2 <- glm(cbind(ha,ok)~ck+I(ck^2)+I(ck^3),family=binomial,
data=heart)
mod.2
par(mfrow=c(2,2))
plot(mod.2)
par(mfrow=c(1,1))
plot(heart$ck,p,xlab="Creatinine kinase level",
ylab="Proportion Heart Attack")
lines(heart$ck,fitted(mod.2))
anova(mod.0,mod.2,test="Chisq")
## 3.3.2
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240)
t <- 1:13
plot(t+1980,y,xlab="Year",ylab="New AIDS cases",ylim=c(0,280))
m0 <- glm(y~t,poisson)
m0
par(mfrow=c(2,2))
plot(m0)
m1 <- glm(y~t+I(t^2),poisson)
plot(m1)
summary(m1)
anova(m0,m1,test="Chisq")
beta.1 <- summary(m1)$coefficients[2,]
ci <- c(beta.1[1]-1.96*beta.1[2],beta.1[1]+1.96*beta.1[2])
ci ## print 95% CI for beta_1
new.t <- seq(1,13,length=100)
fv <- predict(m1,data.frame(t=new.t),se=TRUE)
par(mfrow=c(1,1))
plot(t+1980,y,xlab="Year",ylab="New AIDS cases",ylim=c(0,280))
lines(new.t+1980,exp(fv$fit))
lines(new.t+1980,exp(fv$fit+2*fv$se.fit),lty=2)
lines(new.t+1980,exp(fv$fit-2*fv$se.fit),lty=2)
## 3.3.3
psurv <- function(surv,time="t",censor="d",event="z") {
## create data frame to fit Cox PH as Poisson model.
## surv[[censor]] should be 1 for event or zero for censored.
if (event %in% names(surv)) warning("event name clashes")
surv <- as.data.frame(surv)[order(surv[[time]]),] # t order
et <- unique(surv[[time]][surv[[censor]]==1]) # unique times
es <- match(et,surv[[time]]) # starts of risk sets in surv
n <- nrow(surv); t <- rep(et,1+n-es) # times for risk sets
st <- cbind(0,
surv[unlist(apply(matrix(es),1,function(x,n) x:n,n=n)),])
st[st[[time]]==t&st[[censor]]!=0,1] <- 1 # signal events
st[[time]] <- t ## reset event time to risk set time
names(st)[1] <- event
st
} ## psurv
require(gamair); data(bone); bone$id <- 1:nrow(bone)
pb <- psurv(bone); pb$tf <- factor(pb$t)
b <- glm(z ~ tf + trt - 1,poisson,pb)
chaz <- tapply(fitted(b),pb$id,sum) ## by subject cum. hazard
mrsd <- bone$d - chaz ## Martingale residuals
drop1(b,test="Chisq") ## test for effect - no evidence
te <- sort(unique(bone$t[bone$d==1])) ## event times
## predict survivor function for "allo"...
pd <- data.frame(tf=factor(te),trt=bone$trt[1])
fv <- predict(b,pd)
H <- cumsum(exp(fv)) ## cumulative hazard
plot(stepfun(te,c(1,exp(-H))),do.points=FALSE,ylim=c(0,1),
xlim=c(0,550),main="",ylab="S(t)",xlab="t (days)")
## add s.e. bands...
X <- model.matrix(~tf+trt-1,pd)
J <- apply(exp(fv)*X,2,cumsum)
se <- diag(J%*%vcov(b)%*%t(J))^.5
lines(stepfun(te,c(1,exp(-H+se))),do.points=FALSE,lty=2)
lines(stepfun(te,c(1,exp(-H-se))),do.points=FALSE,lty=2)
## 3.3.4
al <- data.frame(y=c(435,147,375,134),gender=
as.factor(c("F","F","M","M")),faith=as.factor(c(1,0,1,0)))
al
mod.0 <- glm(y ~ gender + faith, data=al, family=poisson)
model.matrix(mod.0)
mod.0
fitted(mod.0)
mod.1 <- glm(y~gender*faith,data=al,family=poisson)
model.matrix(mod.1)
mod.1
anova(mod.0,mod.1,test="Chisq")
## 3.3.5
data(sole)
sole$off <- log(sole$a.1-sole$a.0)# model offset term
sole$a<-(sole$a.1+sole$a.0)/2 # mean stage age
solr<-sole # make copy for rescaling
solr$t<-solr$t-mean(sole$t)
solr$t<-solr$t/var(sole$t)^0.5
solr$la<-solr$la-mean(sole$la)
solr$lo<-solr$lo-mean(sole$lo)
b <- glm(eggs ~ offset(off)+lo+la+t+I(lo*la)+I(lo^2)+I(la^2)
+I(t^2)+I(lo*t)+I(la*t)+I(lo^3)+I(la^3)+I(t^3)+
I(lo*la*t)+I(lo^2*la)+I(lo*la^2)+I(lo^2*t)+
I(la^2*t)+I(la*t^2)+I(lo*t^2)+ a +I(a*t)+I(t^2*a),
family=quasi(link=log,variance="mu"),data=solr)
summary(b)
b1 <- update(b, ~ . - I(lo*t))
b4 <- update(b1, ~ . - I(lo*la*t) - I(lo*t^2) - I(lo^2*t))
anova(b,b4,test="F")
par(mfrow=c(1,2)) # split graph window into 2 panels
plot(fitted(b4)^0.5,solr$eggs^0.5) # fitted vs. data plot
plot(fitted(b4)^0.5,residuals(b4)) # resids vs. sqrt(fitted)
## 3.5.1
rf <- residuals(b4,type="d") # extract deviance residuals
## create an identifier for each sampling station
solr$station <- factor(with(solr,paste(-la,-lo,-t,sep="")))
## is there evidence of a station effect in the residuals?
solr$rf <-rf
rm <- lme(rf~1,solr,random=~1|station)
rm0 <- lm(rf~1,solr)
anova(rm,rm0)
## following is slow...
## Not run:
library(MASS)
form <- eggs ~ offset(off)+lo+la+t+I(lo*la)+I(lo^2)+
I(la^2)+I(t^2)+I(lo*t)+I(la*t)+I(lo^3)+I(la^3)+
I(t^3)+I(lo*la*t)+I(lo^2*la)+I(lo*la^2)+I(lo^2*t)+
I(la^2*t)+I(la*t^2)+I(lo*t^2)+ # end log spawn
a +I(a*t)+I(t^2*a)
b <- glmmPQL(form,random=list(station=~1),
family=quasi(link=log,variance="mu"),data=solr)
summary(b)
form4 <- eggs ~ offset(off)+lo+la+t+I(lo*la)+I(lo^2)+
I(la^2)+I(t^2)+I(lo*t)+I(la*t)+I(lo^3)+I(la^3)+
I(t^3)+I(lo^2*la)+I(lo*la^2)+
I(la^2*t)+I(lo*t^2)+ # end log spawn
a +I(a*t)+I(t^2*a)
b4 <- glmmPQL(form4,random=list(station=~1),
family=quasi(link=log,variance="mu"),data=solr)
fv <- exp(fitted(b4)+solr$off) # note need to add offset
resid <- solr$eggs-fv # raw residuals
plot(fv^.5,solr$eggs^.5)
abline(0,1,lwd=2)
plot(fv^.5,resid/fv^.5)
plot(fv^.5,resid)
fl<-sort(fv^.5)
## add 1 s.d. and 2 s.d. reference lines
lines(fl,fl);lines(fl,-fl);lines(fl,2*fl,lty=2)
lines(fl,-2*fl,lty=2)
intervals(b4,which="var-cov")
## 3.5.2
form5 <- eggs ~ offset(off)+lo+la+t+I(lo*la)+I(lo^2)+
I(la^2)+I(t^2)+I(lo*t)+I(la*t)+I(lo^3)+I(la^3)+
I(t^3)+I(lo^2*la)+I(lo*la^2)+
I(la^2*t)+I(lo*t^2)+ # end log spawn
a +I(a*t)+I(t^2*a) + s(station,bs="re")
b <- gam(form5,family=quasi(link=log,variance="mu"),data=solr,
method="REML")
## 3.5.3
library(lme4)
solr$egg1 <- round(solr$eggs * 5)
form <- egg1 ~ offset(off)+lo+la+t+I(lo*la)+I(lo^2)+
I(la^2)+I(t^2)+I(lo*t)+I(la*t)+I(lo^3)+I(la^3)+
I(t^3)+I(lo*la*t)+I(lo^2*la)+I(lo*la^2)+I(lo^2*t)+
I(la^2*t)+I(la*t^2)+I(lo*t^2)+ # end log spawn
a +I(a*t)+I(t^2*a) + (1|station)
glmer(form,family=poisson,data=solr)
## End(Not run)
Solution code for Chapter 3: Generalized Linear Models
Description
R code for Chapter 3 exercise solutions.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## Q.2 Residuals
n <- 100; m <- 10
x <- runif(n)
lp <- 3*x-1
mu <- binomial()$linkinv(lp)
y <- rbinom(1:n,m,mu)
par(mfrow=c(2,2))
plot(glm(y/m ~ x,family=binomial,weights=rep(m,n)))
## example glm fit...
b <- glm(y/m ~ x,family=binomial,weights=rep(m,n))
reps <- 200;mu <- fitted(b)
rsd <- matrix(0,reps,n) # array for simulated resids
runs <- rep(0,reps) # array for simulated run counts
for (i in 1:reps) { # simulation loop
ys <- rbinom(1:n,m,mu) # simulate from fitted model
## refit model to simulated data
br <- glm(ys/m ~ x,family=binomial,weights=rep(m,n))
rs <- residuals(br) # simulated resids (meet assumptions)
rsd[i,] <- sort(rs) # store sorted residuals
fv.sort <- sort(fitted(br),index.return=TRUE)
rs <- rs[fv.sort$ix] # order resids by sorted fit values
rs <- rs > 0 # check runs of +ve, -ve resids
runs[i] <- sum(rs[1:(n-1)]!=rs[2:n])
}
# plot original ordered residuals, and simulation envelope
for (i in 1:n) rsd[,i] <- sort(rsd[,i])
par(mfrow=c(1,1))
plot(sort(residuals(b)),(1:n-.5)/n) # original
## plot 95% envelope ....
lines(rsd[5,],(1:n-.5)/n);lines(rsd[reps-5,],(1:n-.5)/n)
# compare original runs to distribution under independence
rs <- residuals(b)
fv.sort <- sort(fitted(b),index.return=TRUE)
rs <- rs[fv.sort$ix]
rs <- rs > 0
obs.runs <- sum(rs[1:(n-1)]!=rs[2:n])
sum(runs>obs.runs)
## Q.3 Death penalty
## read in data...
count <- c(53,414,11,37,0,16,4,139)
death <- factor(c(1,0,1,0,1,0,1,0))
defendant <- factor(c(0,0,1,1,0,0,1,1))
victim <- factor(c(0,0,0,0,1,1,1,1))
levels(death) <- c("no","yes")
levels(defendant) <- c("white","black")
levels(victim) <- c("white","black")
## a)
sum(count[death=="yes"&defendant=="black"])/
sum(count[defendant=="black"])
sum(count[death=="yes"&defendant=="white"])/
sum(count[defendant=="white"])
## b)
dm <- glm(count~death*victim+death*defendant+
victim*defendant,family=poisson(link=log))
summary(dm)
dm0 <- glm(count~death*victim+victim*defendant,
family=poisson(link=log))
anova(dm0,dm,test="Chisq")
## Q.7 IRLS
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240)
t <-1:13
X <- cbind(rep(1,13),t,t^2) # model matrix
mu <- y;eta <- log(mu) # initial values
ok <- TRUE
while (ok) {
## evaluate pseudodata and weights
z <- (y-mu)/mu + eta
w <- as.numeric(mu)
## fit weighted working linear model
z <- sqrt(w)*z; WX <- sqrt(w)*X
beta <- coef(lm(z~WX-1))
## evaluate new eta and mu
eta.old <- eta
eta <- X%*%beta
mu <- exp(eta)
## test for convergence...
if (max(abs(eta-eta.old))<1e-7*max(abs(eta))) ok <- FALSE
}
plot(t,y);lines(t,mu) # plot fit
## Q.8
## b)
data(harrier)
m <- 1
b <- glm(Consumption.Rate~I(1/Grouse.Density^m),
family=quasi(link=inverse,variance=mu),data=harrier)
## c)
plot(harrier$Grouse.Density,residuals(b))
## clear pattern if m=1, and the parameter estimates lead to a rather odd curve.
## d)
## search leads to...
m <- 3.25
b <- glm(Consumption.Rate~I(1/Grouse.Density^m),
family=quasi(link=inverse,variance=mu),data=harrier)
## e)
pd <- data.frame(Grouse.Density = seq(0,130,length=200))
pr <- predict(b,newdata=pd,se=TRUE)
with(harrier,plot(Grouse.Density,Consumption.Rate))
lines(pd$Grouse.Density,1/pr$fit,col=2)
lines(pd$Grouse.Density,1/(pr$fit-pr$se*2),col=3)
lines(pd$Grouse.Density,1/(pr$fit+pr$se*2),col=3)
## f)
ll <- function(b,cr,d)
## evaluates -ve quasi-log likelihood of model
## b is parameters, cr is consumption, d is density
{ ## get expected consumption...
dm <- d^b[3]
Ec <- exp(b[1])*dm/(1+exp(b[1])*exp(b[2])*dm)
## appropriate quasi-likelihood...
ind <- cr>0 ## have to deal with cr==0 case
ql <- cr - Ec
ql[ind] <- ql[ind] + cr[ind]*log(Ec[ind]/cr[ind])
-sum(ql)
}
## Now fit model ...
fit <- optim(c(log(.4),log(10),3),ll,method="L-BFGS-B",
hessian=TRUE,cr=harrier$Consumption.Rate,
d=harrier$Grouse.Density)
## and plot results ...
b <- fit$par
d <- seq(0,130,length=200); dm <- d^b[3]
Ec <- exp(b[1])*dm/(1+exp(b[1])*exp(b[2])*dm)
with(harrier,plot(Grouse.Density,Consumption.Rate))
lines(d,Ec,col=2)
## Q.9
death <- as.numeric(ldeaths)
month <- rep(1:12,6)
time <- 1:72
ldm <- glm(death ~ sin(month/12*2*pi)+cos(month/12*2*pi),
family=poisson(link=identity))
plot(time,death,type="l");lines(time,fitted(ldm),col=2)
summary(ldm)
plot(ldm)
## Q.10
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240)
t <- 1:13
b <- glm(y ~ t + I(t^2),family=poisson)
log.lik <- b1 <- seq(.4,.7,length=100)
for (i in 1:100)
{ log.lik[i] <- logLik(glm(y~offset(b1[i]*t)+I(t^2),
family=poisson))
}
plot(b1,log.lik,type="l")
points(coef(b)[2],logLik(b),pch=19)
abline(logLik(b)[1]-qchisq(.95,df=1)/2,0,lty=2)
## Q.11 Soybean
## a)
library(nlme)
attach(Soybean)
lmc <- lmeControl(niterEM=300) ## needed for convergence
m1<-lme(weight~Variety*Time+Variety*I(Time^2)+
Variety*I(Time^3),Soybean,~Time|Plot,control=lmc)
plot(m1) ## clear increasing variance with mean
## b)
library(MASS)
m2<-glmmPQL(weight~Variety*Time+Variety*I(Time^2)+
Variety*I(Time^3),data=Soybean,random=~Time|Plot,
family=Gamma(link=log),control=lmc)
plot(m2) ## much better
## c)
m0<-glmmPQL(weight~Variety*Time+Variety*I(Time^2)+
Variety*I(Time^3),data=Soybean,random=~1|Plot,
family=Gamma(link=log),control=lmc) # simpler r.e.'s
m3<-glmmPQL(weight~Variety*Time+Variety*I(Time^2)+
Variety*I(Time^3),data=Soybean,random=~Time+
I(Time^2)|Plot,family=Gamma(link=log),control=lmc)
## ... m3 has more complex r.e. structure
## Following not strictly valid, but gives a rough
## guide. Suggests m2 is best...
AIC(m0,m2,m3)
summary(m2) ## drop Variety:Time
m4<-glmmPQL(weight~Variety+Time+Variety*I(Time^2)+
Variety*I(Time^3),data=Soybean,random=~Time|Plot,
family=Gamma(link=log),control=lmc)
summary(m4) ## perhaps drop Variety:I(Time^3)?
m5<-glmmPQL(weight~Variety+Time+Variety*I(Time^2)+
I(Time^3),data=Soybean,random=~Time|Plot,
family=Gamma(link=log),control=lmc)
summary(m5) ## don't drop any more
AIC(m2,m4,m5) ## supports m4
intervals(m5,which="fixed")
## So m4 or m5 are probably the best models to use, and
## both suggest that variety P has a higher weight on average.
Code for Chapter 4: Introducing GAMs
Description
R code from Chapter 4 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## 4.2.1
data(engine); attach(engine)
plot(size,wear,xlab="Engine capacity",ylab="Wear index")
tf <- function(x,xj,j) {
## generate jth tent function from set defined by knots xj
dj <- xj*0;dj[j] <- 1
approx(xj,dj,x)$y
}
tf.X <- function(x,xj) {
## tent function basis matrix given data x
## and knot sequence xj
nk <- length(xj); n <- length(x)
X <- matrix(NA,n,nk)
for (j in 1:nk) X[,j] <- tf(x,xj,j)
X
}
sj <- seq(min(size),max(size),length=6) ## generate knots
X <- tf.X(size,sj) ## get model matrix
b <- lm(wear~X-1) ## fit model
s <- seq(min(size),max(size),length=200)## prediction data
Xp <- tf.X(s,sj) ## prediction matrix
plot(size,wear) ## plot data
lines(s,Xp%*%coef(b)) ## overlay estimated f
## 4.2.2
prs.fit <- function(y,x,xj,sp) {
X <- tf.X(x,xj) ## model matrix
D <- diff(diag(length(xj)),differences=2) ## sqrt penalty
X <- rbind(X,sqrt(sp)*D) ## augmented model matrix
y <- c(y,rep(0,nrow(D))) ## augmented data
lm(y~X-1) ## penalized least squares fit
}
sj <- seq(min(size),max(size),length=20) ## knots
b <- prs.fit(wear,size,sj,2) ## penalized fit
plot(size,wear) ## plot data
Xp <- tf.X(s,sj) ## prediction matrix
lines(s,Xp%*%coef(b)) ## plot the smooth
## 4.2.3
rho = seq(-9,11,length=90)
n <- length(wear)
V <- rep(NA,90)
for (i in 1:90) { ## loop through smoothing params
b <- prs.fit(wear,size,sj,exp(rho[i])) ## fit model
trF <- sum(influence(b)$hat[1:n]) ## extract EDF
rss <- sum((wear-fitted(b)[1:n])^2) ## residual SS
V[i] <- n*rss/(n-trF)^2 ## GCV score
}
plot(rho,V,type="l",xlab=expression(log(lambda)),
main="GCV score")
sp <- exp(rho[V==min(V)]) ## extract optimal sp
b <- prs.fit(wear,size,sj,sp) ## re-fit
plot(size,wear,main="GCV optimal fit")
lines(s,Xp%*%coef(b))
## 4.2.3 mixed model connection
## copy of llm from 2.2.4...
llm <- function(theta,X,Z,y) {
## untransform parameters...
sigma.b <- exp(theta[1])
sigma <- exp(theta[2])
## extract dimensions...
n <- length(y); pr <- ncol(Z); pf <- ncol(X)
## obtain \hat \beta, \hat b...
X1 <- cbind(X,Z)
ipsi <- c(rep(0,pf),rep(1/sigma.b^2,pr))
b1 <- solve(crossprod(X1)/sigma^2+diag(ipsi),
t(X1)%*%y/sigma^2)
## compute log|Z'Z/sigma^2 + I/sigma.b^2|...
ldet <- sum(log(diag(chol(crossprod(Z)/sigma^2 +
diag(ipsi[-(1:pf)])))))
## compute log profile likelihood...
l <- (-sum((y-X1%*%b1)^2)/sigma^2 - sum(b1^2*ipsi) -
n*log(sigma^2) - pr*log(sigma.b^2) - 2*ldet - n*log(2*pi))/2
attr(l,"b") <- as.numeric(b1) ## return \hat beta and \hat b
-l
}
X0 <- tf.X(size,sj) ## X in original parameterization
D <- rbind(0,0,diff(diag(20),difference=2))
diag(D) <- 1 ## augmented D
X <- t(backsolve(t(D),t(X0))) ## re-parameterized X
Z <- X[,-c(1,2)]; X <- X[,1:2] ## mixed model matrices
## estimate smoothing and variance parameters...
m <- optim(c(0,0),llm,method="BFGS",X=X,Z=Z,y=wear)
b <- attr(llm(m$par,X,Z,wear),"b") ## extract coefficients
## plot results...
plot(size,wear)
Xp1 <- t(backsolve(t(D),t(Xp))) ## re-parameterized pred. mat.
lines(s,Xp1%*%as.numeric(b),col="grey",lwd=2)
library(nlme)
g <- factor(rep(1,nrow(X))) ## dummy factor
m <- lme(wear~X-1,random=list(g=pdIdent(~Z-1)))
lines(s,Xp1%*%as.numeric(coef(m))) ## and to plot
## 4.3.1 Additive
tf.XD <- function(x,xk,cmx=NULL,m=2) {
## get X and D subject to constraint
nk <- length(xk)
X <- tf.X(x,xk)[,-nk] ## basis matrix
D <- diff(diag(nk),differences=m)[,-nk] ## root penalty
if (is.null(cmx)) cmx <- colMeans(X)
X <- sweep(X,2,cmx) ## subtract cmx from columns
list(X=X,D=D,cmx=cmx)
} ## tf.XD
am.fit <- function(y,x,v,sp,k=10) {
## setup bases and penalties...
xk <- seq(min(x),max(x),length=k)
xdx <- tf.XD(x,xk)
vk <- seq(min(v),max(v),length=k)
xdv <- tf.XD(v,vk)
## create augmented model matrix and response...
nD <- nrow(xdx$D)*2
sp <- sqrt(sp)
X <- cbind(c(rep(1,nrow(xdx$X)),rep(0,nD)),
rbind(xdx$X,sp[1]*xdx$D,xdv$D*0),
rbind(xdv$X,xdx$D*0,sp[2]*xdv$D))
y1 <- c(y,rep(0,nD))
## fit model..
b <- lm(y1~X-1)
## compute some useful quantities...
n <- length(y)
trA <- sum(influence(b)$hat[1:n]) ## EDF
rsd <- y-fitted(b)[1:n] ## residuals
rss <- sum(rsd^2) ## residual SS
sig.hat <- rss/(n-trA) ## residual variance
gcv <- sig.hat*n/(n-trA) ## GCV score
Vb <- vcov(b)*sig.hat/summary(b)$sigma^2 ## coeff cov matrix
## return fitted model...
list(b=coef(b),Vb=Vb,edf=trA,gcv=gcv,fitted=fitted(b)[1:n],
rsd=rsd,xk=list(xk,vk),cmx=list(xdx$cmx,xdv$cmx))
} ## am.fit
am.gcv <- function(lsp,y,x,v,k) {
## function suitable for GCV optimization by optim
am.fit(y,x,v,exp(lsp),k)$gcv
}
## find GCV optimal smoothing parameters...
fit <- optim(c(0,0), am.gcv, y=trees$Volume, x=trees$Girth,
v=trees$Height,k=10)
sp <- exp(fit$par) ## best fit smoothing parameters
## Get fit at GCV optimal smoothing parameters...
fit <- am.fit(trees$Volume,trees$Girth,trees$Height,sp,k=10)
am.plot <- function(fit,xlab,ylab) {
## produces effect plots for simple 2 term
## additive model
start <- 2 ## where smooth coeffs start in beta
for (i in 1:2) {
## sequence of values at which to predict...
x <- seq(min(fit$xk[[i]]),max(fit$xk[[i]]),length=200)
## get prediction matrix for this smooth...
Xp <- tf.XD(x,fit$xk[[i]],fit$cmx[[i]])$X
## extract coefficients and cov matrix for this smooth
stop <- start + ncol(Xp)-1; ind <- start:stop
b <- fit$b[ind];Vb <- fit$Vb[ind,ind]
## values for smooth at x...
fv <- Xp%*%b
## standard errors of smooth at x....
se <- rowSums((Xp%*%Vb)*Xp)^.5
## 2 s.e. limits for smooth...
ul <- fv + 2*se;ll <- fv - 2 * se
## plot smooth and limits...
plot(x,fv,type="l",ylim=range(c(ul,ll)),xlab=xlab[i],
ylab=ylab[i])
lines(x,ul,lty=2);lines(x,ll,lty=2)
start <- stop + 1
}
} ## am.plot
par(mfrow=c(1,3))
plot(fit$fitted,trees$Vol,xlab="fitted volume ",
ylab="observed volume")
am.plot(fit,xlab=c("Girth","Height"),
ylab=c("s(Girth)","s(Height)"))
## 4.4 Generalized additive
gam.fit <- function(y,x,v,sp,k=10) {
## gamma error log link 2 term gam fit...
eta <- log(y) ## get initial eta
not.converged <- TRUE
old.gcv <- -100 ## don't converge immediately
while (not.converged) {
mu <- exp(eta) ## current mu estimate
z <- (y - mu)/mu + eta ## pseudodata
fit <- am.fit(z,x,v,sp,k) ## penalized least squares
if (abs(fit$gcv-old.gcv)<1e-5*fit$gcv) {
not.converged <- FALSE
}
old.gcv <- fit$gcv
eta <- fit$fitted ## updated eta
}
fit$fitted <- exp(fit$fitted) ## mu
fit
} ## gam.fit
gam.gcv <- function(lsp,y,x,v,k=10) {
gam.fit(y,x,v,exp(lsp),k=k)$gcv
}
fit <- optim(c(0,0),gam.gcv,y=trees$Volume,x=trees$Girth,
v=trees$Height,k=10)
sp <- exp(fit$par)
fit <- gam.fit(trees$Volume,trees$Girth,trees$Height,sp)
par(mfrow=c(1,3))
plot(fit$fitted,trees$Vol,xlab="fitted volume ",
ylab="observed volume")
am.plot(fit,xlab=c("Girth","Height"),
ylab=c("s(Girth)","s(Height)"))
## 4.6 mgcv
library(mgcv) ## load the package
library(gamair) ## load the data package
data(trees)
ct1 <- gam(Volume~s(Height)+s(Girth),
family=Gamma(link=log),data=trees)
ct1
plot(ct1,residuals=TRUE)
## 4.6.1
ct2 <- gam(Volume~s(Height,bs="cr")+s(Girth,bs="cr"),
family=Gamma(link=log),data=trees)
ct2
ct3 <- gam(Volume ~ s(Height) + s(Girth,bs="cr",k=20),
family=Gamma(link=log),data=trees)
ct3
ct4 <- gam(Volume ~ s(Height) + s(Girth),
family=Gamma(link=log),data=trees,gamma=1.4)
ct4
plot(ct4,residuals=TRUE)
## 4.6.2
ct5 <- gam(Volume ~ s(Height,Girth,k=25),
family=Gamma(link=log),data=trees)
ct5
plot(ct5,too.far=0.15)
ct6 <- gam(Volume ~ te(Height,Girth,k=5),
family=Gamma(link=log),data=trees)
ct6
plot(ct6,too.far=0.15)
## 4.6.3
gam(Volume~Height+s(Girth),family=Gamma(link=log),data=trees)
trees$Hclass <- factor(floor(trees$Height/10)-5,
labels=c("small","medium","large"))
ct7 <- gam(Volume ~ Hclass+s(Girth),
family=Gamma(link=log),data=trees)
par(mfrow=c(1,2))
plot(ct7,all.terms=TRUE)
anova(ct7)
AIC(ct7)
summary(ct7)
Solution code for Chapter 4: Introducing GAMs
Description
R code for Chapter 4 exercise solutions.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## Q.1
set.seed(1)
x <- sort(runif(40)*10)^.5
y <- sort(runif(40))^0.1
## polynomial fits ...
xx <- seq(min(x),max(x),length=200)
plot(x,y)
b<-lm(y~poly(x,5))
lines(xx,predict(b,data.frame(x=xx)))
b<-lm(y~poly(x,10))
lines(xx,predict(b,data.frame(x=xx)),col=2)
## spline fits ...
sb <- function(x,xk) { abs(x-xk)^3}
q<-11
xk<-((1:(q-2)/(q-1))*10)^.5
## lazy person's formula construction ...
form<-paste("sb(x,xk[",1:(q-2),"])",sep="",collapse="+")
form <- paste("y~x+",form)
b<-lm(formula(form))
lines(xx,predict(b,data.frame(x=xx)),col=3)
## Q.2
## x,y, and xx from previous question
b1 <- lm(form)
plot(x,y)
lines(xx,predict(b1,data.frame(x=xx)),col=4)
X <- model.matrix(b1) # extract model matrix
beta <- solve(t(X)%*%X,t(X)%*%y,tol=0)
b1$coefficients <- beta # trick for simple prediction
lines(xx,predict(b1,data.frame(x=xx)),col=5)
## ... upping the basis dimension to 11 makes the
## normal equations estimates perform very badly.
## Q.8 Additive model as a mixed model
## from 4.2.1 and 4.3.1...
tf <- function(x,xj,j) {
## generate jth tent function from set defined by knots xj
dj <- xj*0;dj[j] <- 1
approx(xj,dj,x)$y
}
tf.X <- function(x,xj) {
## tent function basis matrix given data x
## and knot sequence xj
nk <- length(xj); n <- length(x)
X <- matrix(NA,n,nk)
for (j in 1:nk) X[,j] <- tf(x,xj,j)
X
}
tf.XD <- function(x,xk,cmx=NULL,m=2) {
## get X and D subject to constraint
nk <- length(xk)
X <- tf.X(x,xk)[,-nk] ## basis matrix
D <- diff(diag(nk),differences=m)[,-nk] ## root penalty
if (is.null(cmx)) cmx <- colMeans(X)
X <- sweep(X,2,cmx) ## subtract cmx from columns
list(X=X,D=D,cmx=cmx)
} ## tf.XD
## Solution code...
## a)
XZmixed <- function(x,xk=NULL,k=10,sep=TRUE) {
## Get re-parameterized model matrix/matrices...
if (is.null(xk)) xk <- seq(min(x),max(x),length=k)
xd <- tf.XD(x,xk)
D <- rbind(0,xd$D); D[1,1] <- 1
X <- t(solve(t(D),t(xd$X)))
if (sep) list(X=X[,1,drop=FALSE],Z=X[,-1],xk=xk)
else list(X=X,xk=xk)
} ## XZmixed
## b)
## get components of smooths for Height and Girth...
xh <- XZmixed(trees$Height)
xg <- XZmixed(trees$Girth)
## Fit as mixed model...
X <- cbind(1,xh$X,xg$X)
Zg <- xg$Z; Zh <- xh$Z
g1 <- g <- factor(rep(1,nrow(X)))
vol <- trees$Volume
b <- lme(vol~X-1,random=list(g=pdIdent(~Zh-1),
g1=pdIdent(~Zg-1)))
## c)
## raw vs. fitted and residual plot
par(mfrow=c(1,2))
plot(fitted(b),vol)
rsd <- vol - fitted(b)
plot(fitted(b),rsd)
## extract coefs for each smooth...
bh <- as.numeric(coef(b)[c(2,4:11)]) ## coefs for s(Height)
bg <- as.numeric(coef(b)[c(3,12:19)]) ## coefs for s(Girth)
## get smooth specific prediction matrices...
Xh <- XZmixed(trees$Height,xk=xh$xk,sep=FALSE)$X
Xg <- XZmixed(trees$Girth,xk=xg$xk,sep=FALSE)$X
## d)
## plot smooths over partial residuals...
sh <- Xh%*%bh
sg <- Xg%*%bg
par(mfrow=c(1,2))
plot(trees$Girth,sg+rsd,pch=19,col="grey",
xlab="Girth",ylab="s(Girth)")
lines(trees$Girth,sg)
plot(trees$Height,sh+rsd,pch=19,col="grey",
xlab="Height",ylab="s(Height)")
lines(trees$Height,sh)
## Q.9 Generalized version of 8 by PQL
## a)
gamm.fit <- function(y,X,Zh,Zg) {
## gamma error log link 2 term gam fit via PQL...
eta <- log(y) ## get initial eta
g <- g1 <- factor(rep(1,nrow(X)))
not.converged <- TRUE
old.reml <- 1e100 ## don't converge immediately
while (not.converged) {
mu <- exp(eta) ## current mu estimate
z <- (y - mu)/mu + eta ## pseudodata
fit <- lme(z~X-1,random=list(g=pdIdent(~Zh-1),g1=pdIdent(~Zg-1)))
if (abs(logLik(fit)-old.reml)<1e-5*abs(logLik(fit))) {
not.converged <- FALSE
}
old.reml <- logLik(fit)
eta <- fitted(fit) ## updated eta
}
fit
} ## gamm.fit
## b) re-using arguments from Q.8...
m <- gamm.fit(vol,X,Zh,Zg)
## c)
rsd <- residuals(m)
par(mfrow=c(1,2))
plot(exp(fitted(m)),vol);abline(0,1)
plot(fitted(m),rsd)
## d)
bh <- as.numeric(coef(m)[c(2,4:11)]) ## coefs for s(Height)
bg <- as.numeric(coef(m)[c(3,12:19)]) ## coefs for s(Girth)
sh <- Xh%*%bh
sg <- Xg%*%bg
par(mfrow=c(1,2))
plot(trees$Girth,sg+rsd,pch=19,col="grey",
xlab="Girth",ylab="s(Girth)")
lines(trees$Girth,sg)
plot(trees$Height,sh+rsd,pch=19,col="grey",
xlab="Height",ylab="s(Height)")
lines(trees$Height,sh)
Code for Chapter 5: Smoothers
Description
R code from Chapter 5 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon.wood@r-project.org>
Maintainer: Simon Wood <simon.wood@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
Examples
library(gamair); library(mgcv)
## 5.3.3 P-splines
bspline <- function(x,k,i,m=2)
# evaluate ith b-spline basis function of order m at the values
# in x, given knot locations in k
{ if (m==-1) # base of recursion
{ res <- as.numeric(x<k[i+1]&x>=k[i])
} else # construct from call to lower order basis
{ z0 <- (x-k[i])/(k[i+m+1]-k[i])
z1 <- (k[i+m+2]-x)/(k[i+m+2]-k[i+1])
res <- z0*bspline(x,k,i,m-1)+ z1*bspline(x,k,i+1,m-1)
}
res
} ## bspline
k<-6 # example basis dimension
P <- diff(diag(k),differences=1) # sqrt of penalty matrix
S <- t(P)%*%P
## 5.3.6 SCOP-splines
x <- 0:200/200
set.seed(32)
y <- binomial()$linkinv((x-.5)*10) + rnorm(length(x))*.1
plot(x,y)
k <- 7
ssp <- s(x,bs="ps",k=k); ssp$mono <- 1
sm <- smoothCon(ssp,data.frame(x))[[1]]
X <- sm$X; XX <- crossprod(X); sp <- .005
gamma <- rep(0,k); S <- sm$S[[1]]
for (i in 1:20) {
gt <- c(gamma[1],exp(gamma[2:k]))
dg <- c(1,gt[2:k])
g <- -dg*(t(X)%*%(y-X%*%gt)) + sp*S%*%gamma
H <- dg*t(dg*XX)
gamma <- gamma - solve(H+sp*S,g)
}
lines(x,X%*%gt)
Solution code for Chapter 5: Smoothers
Description
R code for Chapter 5 exercise solutions.
Author(s)
Simon Wood <simon@r-project.org>
Maintainer: Simon Wood <simon@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
See Also
Examples
library(gamair); library(mgcv)
## Q.4 P-spline
## a)
library(splines)
pspline.XB <- function(x,q=10,m=2,p.m=2)
# Get model matrix and sqrt Penalty matrix for P-spline
{ # first make knot sequence, k
k <- seq(min(x),max(x),length=q-m)
dk <- k[2]-k[1]
k <- c(k[1]-dk*((m+1):1),k,k[q-m]+dk*(1:(m+1)))
# now get model matrix and root penalty
X <- splineDesign(k,x,ord=m+2)
B <- diff(diag(q),difference=p.m)
list(X=X,B=B)
} ## pspline.XB
## b)
n<-100
x <- sort(runif(n))
ps <- pspline.XB(x,q=9,m=2,p.m=2)
par(mfrow=c(3,3)) # plot the original basis functions
for (i in 1:9) plot(x,ps$X[,i],type="l")
## c)
S <- t(ps$B)%*%ps$B
es <- eigen(S);U <- es$vectors
XU <- ps$X%*%U # last p.m cols are penalty null space
par(mfrow=c(3,3)) # plot penalty eigenbasis functions
for (i in 1:9) plot(x,XU[,i],type="l")
## d)
qrx <- qr(ps$X) # QR of X
R <- qr.R(qrx)
RSR <- solve(t(R),S);RSR <- t(solve(t(R),t(RSR)))
ersr <- eigen(RSR)
U <- ersr$vectors
Q <- qr.Q(qrx)
QU <- Q%*%U
par(mfrow=c(3,3)) # plot the natural basis functions
for (i in 1:9) plot(x,QU[,i],type="l")
## Q.5
test1<-function(x,z,sx=0.3,sz=0.4)
{ 1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2)
}
n <- 200
x <- matrix(runif(2*n),n,2)
f <- test1(x[,1],x[,2])
y <- f + rnorm(n)*.1
eta <- function(r)
{ # thin plate spline basis functions
ind <- r<=0
eta <- r
eta[!ind] <- r[!ind]^2*log(r[!ind])/(8*pi)
eta[ind] <- 0
eta
} ## eta
XSC <- function(x,xk=x)
{ # set up t.p.s., given covariates, x, and knots, xk
n <- nrow(x);k <- nrow(xk)
X <- matrix(1,n,k+3) # tps model matrix
for (j in 1:k) {
r <- sqrt((x[,1]-xk[j,1])^2+(x[,2]-xk[j,2])^2)
X[,j] <- eta(r)
}
X[,j+2] <- x[,1];X[,j+3] <- x[,2]
C <- matrix(0,3,k+3) # tps constraint matrix
S <- matrix(0,k+3,k+3)# tps penalty matrix
for (i in 1:k) {
C[1,i]<-1;C[2,i] <- xk[i,1];C[3,i] <- xk[i,2]
for (j in i:k) S[j,i]<-S[i,j] <-
eta(sqrt(sum((xk[i,]-xk[j,])^2)))
}
list(X=X,S=S,C=C)
} ## XSC
absorb.con <- function(X,S,C)
{ # get constraint null space, Z...
qrc <- qr(t(C)) # QR=C', Q=[Y,Z]
m <- nrow(C);k <- ncol(X)
X <- t(qr.qty(qrc,t(X)))[,(m+1):k] # form XZ
# now form Z'SZ ...
S <- qr.qty(qrc,t(qr.qty(qrc,t(S))))[(m+1):k,(m+1):k]
list(X=X,S=S,qrc=qrc)
} ## absorb.con
fit.tps <- function(y,x,xk=x,lambda=0)
{ tp <- XSC(x,xk) # get tps matrices
tp <- absorb.con(tp$X,tp$S,tp$C) # make unconstrained
ev <- eigen(tp$S,symmetric=TRUE) # get sqrt penalty, rS
rS <- ev$vectors%*%(ev$values^.5*t(ev$vectors))
X <- rbind(tp$X,rS*sqrt(lambda)) # augmented model matrix
z <- c(y,rep(0,ncol(rS))) # augmented data
beta <- coef(lm(z~X-1)) # fit model
beta <- qr.qy(tp$qrc,c(0,0,0,beta)) # backtransform beta
} ## fit.tps
eval.tps <- function(x,beta,xk)
{ # evaluate tps at x, given parameters, beta, and knots, xk.
k <- nrow(xk);n <- nrow(x)
f <- rep(beta[k+1],n)
for (i in 1:k) {
r <- sqrt((x[,1]-xk[i,1])^2+(x[,2]-xk[i,2])^2)
f <- f + beta[i]*eta(r)
}
f <- f + beta[k+2]*x[,1] + beta[k+3]*x[,2]
} ## eval.tps
## select some `knots', xk ...
ind <- sample(1:n,100,replace=FALSE)
xk <- x[ind,]
## fit model ...
beta <- fit.tps(y,x,xk=xk,lambda=.01)
## contour truth and fit
par(mfrow=c(1,2))
xp <- matrix(0,900,2)
x1<-seq(0,1,length=30);x2<-seq(0,1,length=30)
xp[,1]<-rep(x1,30);xp[,2]<-rep(x2,rep(30,30))
truth<-matrix(test1(xp[,1],xp[,2]),30,30)
contour(x1,x2,truth)
fit <- matrix(eval.tps(xp,beta,xk),30,30)
contour(x1,x2,fit)
## Q.6 smooth.construct
tf <- function(x,xj,j) {
## generate jth tent function from set defined by knots xj
dj <- xj*0;dj[j] <- 1
approx(xj,dj,x)$y
}
tf.X <- function(x,xj) {
## tent function basis matrix given data x
## and knot sequence xj
nk <- length(xj); n <- length(x)
X <- matrix(NA,n,nk)
for (j in 1:nk) X[,j] <- tf(x,xj,j)
X
}
smooth.construct.pl.smooth.spec<-function(object,data,knots) {
## a piecewise linear smooth constructor method function
m <- object$p.order[1]
if (is.na(m)) m <- 2 ## default
if (m<1) stop("silly m supplied")
if (object$bs.dim<0) object$bs.dim <- 20 ## default
x <- data[[object$term]] ## the data
k <- knots[[object$term]] ## will be NULL if none supplied
if (is.null(k)) { # space knots through data
k <- seq(min(x),max(x),length=object$bs.dim)
} else {
if (length(k)!=object$bs.dim) # right number of knots?
k <- seq(min(k),max(k),length=object$bs.dim)
}
object$X <- tf.X(x,k)
if (!object$fixed) { # create the penalty matrix
object$S[[1]] <- crossprod(diff(diag(object$bs.dim),difference=m))
}
object$rank <- object$bs.dim - m # penalty rank
object$null.space.dim <- m # dim. of unpenalized space
## store "tr" specific stuff ...
object$knots <- k
object$df <- ncol(object$X) # maximum DoF (if unconstrained)
class(object) <- "pl.smooth" # Give object a class
object
}
Predict.matrix.pl.smooth<-function(object,data)
## prediction method function for the `pl' smooth class
{ x <- data[[object$term]]
X <- tf.X(x,object$knots)
X # return the prediction matrix
}
# an example, using the new class....
require(mgcv)
set.seed(10)
dat <- gamSim(1,n=400,scale=2)
b <- gam(y~s(x0,bs="pl",m=2)+s(x1,bs="pl",m=2) +
s(x2,bs="pl",m=3)+s(x3,bs="pl",m=2),
data=dat,method="REML")
plot(b,pages=1)
Code for Chapter 6: GAM Theory
Description
R code from Chapter 6 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon@r-project.org>
Maintainer: Simon Wood <simon@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
See Also
Examples
library(gamair); library(mgcv)
## 6.13.2 backfitting
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
edf <- c(3,3,8,3)
y <- dat$y
x <- cbind(dat$x0,dat$x1,dat$x2,dat$x3)
f <- x*0; alpha <- mean(y); ok <- TRUE; rss0 <- 0
while (ok) { # backfitting loop
for (i in 1:ncol(x)) { # loop through the smooth terms
ep <- y - rowSums(f[,-i]) - alpha
b <- smooth.spline(x[,i],ep,df=edf[i])
f[,i] <- predict(b,x[,i])$y
}
rss <- sum((y-rowSums(f))^2)
if (abs(rss-rss0)<1e-6*rss) ok <- FALSE
rss0 <- rss
}
par(mfrow=c(2,2))
for (i in 1:ncol(x)) {
plot(x[,i],y-mean(y),col="grey",pch=19,cex=.3)
ii <- order(x[,i])
lines(x[ii,i],f[ii,i],col=2,lwd=2)
}
Solution code for Chapter 6: GAM Theory
Description
R code for Chapter 6 exercise solutions.
Author(s)
Simon Wood <simon@r-project.org>
Maintainer: Simon Wood <simon@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
See Also
Examples
library(gamair); library(mgcv)
## code from Chapter 5 solutions...
## Q.3
pspline.XB <- function(x,q=10,m=2,p.m=2)
# Get model matrix and sqrt Penalty matrix for P-spline
{ # first make knot sequence, k
k <- seq(min(x),max(x),length=q-m)
dk <- k[2]-k[1]
k <- c(k[1]-dk*((m+1):1),k,k[q-m]+dk*(1:(m+1)))
# now get model matrix and root penalty
X <- splineDesign(k,x,ord=m+2)
B <- diff(diag(q),difference=p.m)
list(X=X,B=B)
} ## pspline.XB
## a) and b)
fit.wPs <- function(y,X,B,lambda=0,w=rep(1,length(y)))
# fit to y by weighted penalized least squares, X is
# model matrix, B is sqrt penalty, lambda is smoothing p.
{ w <- as.numeric(w^.5)
n <- nrow(X)
X<-rbind(w*X,sqrt(lambda)*B)
y<-c(w*y,rep(0,nrow(B)))
b <- lm(y~X-1) # actually estimate model
trA <- sum(influence(b)$hat[1:n])
rss <- sum((y-fitted(b))[1:n]^2) ## not really needed here
list(trA=trA,rss=rss,b=coef(b))
}
fitPoiPs <- function(y,X,B,lambda=0)
# Fit Poisson model with log-link by P-IRLS
{ mu <- y;mu[mu==0] <- .1
eta <- log(mu)
converged <- FALSE
dev <- ll.sat <- sum(dpois(y,y,log=TRUE))
while (!converged) {
z <- (y-mu)/mu + eta
w <- mu
fPs <- fit.wPs(z,X,B,lambda,w)
eta <- X%*%fPs$b
mu=exp(eta)
old.dev <- dev
dev <- 2*(ll.sat-sum(dpois(y,mu,log=TRUE)))
if (abs(dev-old.dev)<1e-6*dev) converged <- TRUE
}
list(dev=dev,rss=fPs$rss,trA=fPs$trA,b=fPs$b,fv=mu)
}
## c)
## simulate data as in question...
set.seed(1)
f <- function(x) .04*x^11*(10*(1-x))^6+2*(10*x)^3*(1-x)^10
n <- 100;x <- sort(runif(n))
y <- rpois(rep(1,n),exp(f(x)))
## fitting...
library(splines)
ps <- pspline.XB(x,q=10,m=2,p.m=2)
lambda <- 1e-4;reps <- 60
sp <- trA <- gcv <- rep(0,reps)
for (i in 1:reps) { # loop through trial s.p.s
fps <- fitPoiPs(y,ps$X,ps$B,lambda=lambda)
trA[i] <- fps$trA;sp[i] <- lambda
gcv[i] <- n*fps$dev/(n-trA[i])^2
lambda <- lambda*1.3
}
plot(trA,gcv,type="l")
fps1 <- fitPoiPs(y,ps$X,ps$B,lambda=sp[gcv==min(gcv)])
plot(x,y);lines(x,fps1$fv)
## Q.6 Fellner-Schall for GCV and AIC...
## b)
library(mgcv);library(MASS)
sm <- smoothCon(s(times,k=20),data=mcycle)[[1]]
X <- sm$X; S <- sm$S[[1]]; y <- mcycle$accel
lambda <- 1; n <- length(y)
XX <- crossprod(X);
with(mcycle,plot(times,accel))
for (i in 1:20) {
R <- chol(XX+lambda*S)
b <- backsolve(R,forwardsolve(t(R),t(X) %*% y))
f <- X %*% b
lines(mcycle$times,f,col="grey")
HiS <- backsolve(R,forwardsolve(t(R),S))
HiH <- backsolve(R,forwardsolve(t(R),XX))
tau <- sum(diag(HiH))
if (i>1) { ## convergence test
if (abs(tau-tau0)<1e-5*tau) break
}
tau0 <- tau
dt.dl <- -sum(t(HiH)*HiS)
db.dl <- -HiS %*% b
dD.db <- 2*t(X) %*% (f - y)
lambda <- -sum(2*(y-f)^2)/(n-tau)*dt.dl/sum(db.dl*dD.db) * lambda
}
lines(mcycle$times,f)
## c)
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240)
t <- 1:13
sm <- smoothCon(s(t),data=data.frame(t=t,y=y))[[1]]
X <- sm$X; S <- sm$S[[1]]; lambda <- .001; n <- length(y)
plot(t,y)
mu <- y; eta <- log(mu)
for (i in 1:50) {
w <- mu; z <- (y-mu)/mu + eta
XWX <- crossprod(sqrt(w)*X)
R <- chol(XWX+lambda*S)
b <- backsolve(R,forwardsolve(t(R),t(X) %*% (w*z)))
eta <- drop(X %*% b);mu <- exp(eta)
lines(t,mu,col="grey")
HiS <- backsolve(R,forwardsolve(t(R),S))
HiH <- backsolve(R,forwardsolve(t(R),XWX))
tau <- sum(diag(HiH))
if (i>1) { ## convergence test
if (abs(tau-tau0)<1e-5*tau) break
}
tau0 <- tau
dt.dl <- -sum(t(HiH)*HiS)
db.dl <- -HiS %*% b
dl.db <- t(X) %*% (y-mu) ## especially simple for this case
lambda <- dt.dl/sum(db.dl*dl.db) * lambda
}
i;tau;lines(t,mu)
## Q.8 log det stability (or lack of it)
set.seed(1);lam <- 1
A1 <- crossprod(diff(diag(3),diff=1))
A2 <- crossprod(matrix(runif(9),3,3))
A <- matrix(0,5,5);A[1:3,1:3] <- A1
A[3:5,3:5] <- A[3:5,3:5] + lam * A2
ldetA.qr <- ldetA.ev <- ldetA.svd <- ldetA <-
rho <- seq(-40,-25,length=100)
for (i in 1:length(rho)) {
lam <- exp(rho[i])
A <- matrix(0,5,5);A[1:3,1:3] <- A1
A[3:5,3:5] <- A[3:5,3:5] + lam * A2
ea1 <- eigen(A1)
Q <- diag(5);Q[1:3,1:3] <- ea1$vectors
At <- matrix(0,5,5)
At[3:5,3:5] <- At[3:5,3:5] + lam * A2
At <- t(Q)%*%At%*%Q
diag(At)[1:2] <- diag(At)[1:2]+ea1$values[1:2]
ldetA[i] <- sum(log(abs(diag(qr.R(qr(At))))))
ldetA.qr[i] <- sum(log(abs(diag(qr.R(qr(A))))))
ldetA.ev[i] <- sum(log(abs(eigen(A)$values)))
ldetA.svd[i] <- sum(log(abs(svd(A)$d)))
}
plot(rho,ldetA,type="l") ## nice and stable
## not...
lines(rho,ldetA.qr,lty=2)
lines(rho,ldetA.ev,lty=3)
lines(rho,ldetA.svd,lty=4)
Code for Chapter 7: GAMs in Practice: mgcv
Description
R code from Chapter 7 of the second edition of ‘Generalized Additive Models: An Introduction with R’ is in the examples section below.
Author(s)
Simon Wood <simon@r-project.org>
Maintainer: Simon Wood <simon@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
See Also
Examples
library(gamair); library(mgcv)
## NOTE: Examples are marked 'Not run' to save CRAN check time
## 7.1.1 using smooth constructors
library(mgcv); library(MASS) ## load for mcycle data.
## set up a smoother...
sm <- smoothCon(s(times,k=10),data=mcycle,knots=NULL)[[1]]
## use it to fit a regression spline model...
beta <- coef(lm(mcycle$accel~sm$X-1))
with(mcycle,plot(times,accel)) ## plot data
times <- seq(0,60,length=200) ## create prediction times
## Get matrix mapping beta to spline prediction at 'times'
Xp <- PredictMat(sm,data.frame(times=times))
lines(times,Xp%*%beta) ## add smooth to plot
## Not run:
## 7.2 Brain scan
## 7.2.1 preliminary modelling
require(gamair); require(mgcv); data(brain)
brain <- brain[brain$medFPQ>5e-3,] # exclude 2 outliers
m0 <- gam(medFPQ~s(Y,X,k=100),data=brain)
gam.check(m0)
e <- residuals(m0); fv <- fitted(m0)
lm(log(e^2)~log(fv))
m1<-gam(medFPQ^.25~s(Y,X,k=100),data=brain)
gam.check(m1)
m2<-gam(medFPQ~s(Y,X,k=100),data=brain,family=Gamma(link=log))
mean(fitted(m1)^4);mean(fitted(m2));mean(brain$medFPQ)
m2
vis.gam(m2,plot.type="contour",too.far=0.03,
color="gray",n.grid=60,zlim=c(-1,2))
## 7.2.2 additive?
m3 <- gam(medFPQ~s(Y,k=30)+s(X,k=30),data=brain,
family=Gamma(link=log))
m3
AIC(m2,m3)
## 7.2.3 isotropic or tensor
tm <- gam(medFPQ~te(Y,X,k=10),data=brain,family=Gamma(link=log))
tm1 <- gam(medFPQ ~ s(Y,k=10,bs="cr") + s(X,bs="cr",k=10) +
ti(X,Y,k=10), data=brain, family=Gamma(link=log))
AIC(m2,tm,tm1)
anova(tm1)
## 7.2.4 Detecting symmetry
brain$Xc <- abs(brain$X - 64.5)
brain$right <- as.numeric(brain$X<64.5)
m.sy <- gam(medFPQ~s(Y,Xc,k=100),data=brain,
family=Gamma(link=log))
m.as <- gam(medFPQ~s(Y,Xc,k=100)+s(Y,Xc,k=100,by=right),
data=brain,family=Gamma(link=log))
m.sy
m.as
anova(m.as)
vis.gam(m.sy,plot.type="contour",view=c("Xc","Y"),too.far=.03,
color="gray",n.grid=60,zlim=c(-1,2),main="both sides")
vis.gam(m.as,plot.type="contour",view=c("Xc","Y"),
cond=list(right=0),too.far=.03,color="gray",n.grid=60,
zlim=c(-1,2),main="left side")
vis.gam(m.as,plot.type="contour",view=c("Xc","Y"),
cond=list(right=1),too.far=.03,color="gray",n.grid=60,
zlim=c(-1,2),main="right side")
## 7.2.5 Comparing surfaces
brain1 <- brain
mu <- fitted(m2)
n<-length(mu)
ind <- brain1$X<60 & brain1$Y<20
mu[ind] <- mu[ind]/3
set.seed(1)
brain1$medFPQ <- rgamma(rep(1,n),mu/m2$sig2,scale=m2$sig2)
brain2=rbind(brain,brain1)
brain2$sample1 <- c(rep(1,n),rep(0,n))
brain2$sample0 <- 1 - brain2$sample1
m.same<-gam(medFPQ~s(Y,X,k=100),data=brain2,
family=Gamma(link=log))
m.diff<-gam(medFPQ~s(Y,X,k=100)+s(Y,X,by=sample1,k=100),
data=brain2,family=Gamma(link=log))
AIC(m.same,m.diff)
anova(m.diff)
## 7.2.6 Prediction
predict(m2)[1:5]
pv <- predict(m2,se=TRUE)
pv$fit[1:5]
pv$se[1:5]
predict(m2,type="response")[1:5]
pv <- predict(m2,type="response",se=TRUE)
pv$se[1:5]
pd <- data.frame(X=c(80.1,68.3),Y=c(41.8,41.8))
predict(m2,newdata=pd)
predict(m2,newdata=pd,type="response",se=TRUE)
predict(m3,newdata=pd,type="terms",se=TRUE)
Xp <- predict(m2,newdata=pd,type="lpmatrix")
fv <- Xp%*%coef(m2)
fv
d <- t(c(1,-1))
d%*%fv
d%*%Xp%*%m2$Vp%*%t(Xp)%*%t(d)
## 7.2.7 Variance of non-linear function
ind <- brain$region==1& ! is.na(brain$region)
Xp <- predict(m2,newdata=brain[ind,],type="lpmatrix")
set.seed(8) ## for repeatability
br <- rmvn(n=1000,coef(m2),vcov(m2)) # simulate from posterior
mean.FPQ<-rep(0,1000)
for (i in 1:1000)
{ lp <- Xp%*%br[i,] # replicate linear predictor
mean.FPQ[i] <- mean(exp(lp)) # replicate region 1 mean FPQ
}
mean.FPQ <- colMeans(exp(Xp%*%t(br))) ## the same computation, vectorized
## 7.3 Retinopathy
require(gamair); require(mgcv); data(wesdr)
k <- 7
b <- gam(ret ~ s(dur,k=k) + s(gly,k=k) + s(bmi,k=k) +
ti(dur,gly,k=k) + ti(dur,bmi,k=k) + ti(gly,bmi,k=k),
select=TRUE, data=wesdr, family=binomial(), method="ML")
b
## 7.4 Air pollution
data(chicago)
ap0 <- gam(death~s(time,bs="cr",k=200)+pm10median+so2median+
o3median+tmpd,data=chicago,family=poisson)
gam.check(ap0)
par(mfrow=c(2,1))
plot(ap0,n=1000) # n increased to make plot smooth
plot(ap0,residuals=TRUE,n=1000)
chicago$death[3111:3125]
ap1<-gam(death~s(time,bs="cr",k=200)+s(pm10median,bs="cr")+
s(so2median,bs="cr")+s(o3median,bs="cr")+s(tmpd,bs="cr"),
data=chicago,family=poisson)
## 7.4.1 single index
lagard <- function(x,n.lag=6) {
n <- length(x); X <- matrix(NA,n,n.lag)
for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] ## note: i:n-i+1 is (i:n)-i+1
X
}
dat <- list(lag=matrix(0:5,nrow(chicago),6,byrow=TRUE),
death=chicago$death,time=chicago$time)
dat$pm10 <- lagard(chicago$pm10median)
dat$tmp <- lagard(chicago$tmpd)
dat$o3 <- lagard(chicago$o3median)
si <- function(theta,dat,opt=TRUE) {
## Return ML if opt==TRUE or fitted gam otherwise.
alpha <- c(1,theta) ## alpha defined via unconstrained theta
kk <- sqrt(sum(alpha^2)); alpha <- alpha/kk ## ||alpha||=1
o3 <- dat$o3%*%alpha; tmp <- dat$tmp%*%alpha
pm10 <- dat$pm10%*%alpha ## re-weight lagged covariates
b<- bam(dat$death~s(dat$time,k=200,bs="cr")+s(pm10,bs="cr")+
te(o3,tmp,k=8),family=poisson) ## fit model
cat(".") ## give user something to watch
if (opt) return(b$gcv.ubre) else {
b$alpha <- alpha ## add alpha to model object
b$J <- outer(alpha,-theta/kk^2) ## get dalpha_i/dtheta_j
for (j in 1:length(theta)) b$J[j+1,j] <- b$J[j+1,j] + 1/kk
return(b)
}
} ## si
## WARNING: the next line takes around half an hour to run
f1 <- optim(rep(1,5),si,method="BFGS",hessian=TRUE,dat=dat)
apsi <- si(f1$par,dat,opt=FALSE)
apsi$alpha
## 7.4.2 distributed lag...
apl <- bam(death~s(time,bs="cr",k=200)+te(pm10,lag,k=c(10,5))+
te(o3,tmp,lag,k=c(8,8,5)),family=poisson,data=dat)
## 7.5 Egg survey - less than a minute
## 7.5.1 Model development
data(mack)
mack$log.net.area <- log(mack$net.area)
gmtw <- gam(egg.count ~ s(lon,lat,k=100) + s(I(b.depth^.5))+
s(c.dist) + s(salinity) + s(temp.surf) + s(temp.20m)+
offset(log.net.area),data=mack,family=tw,method="REML",
select=TRUE)
gm2 <- gam(egg.count ~ s(lon,lat,k=100) + s(I(b.depth^.5)) +
s(c.dist) + s(temp.20m) + offset(log.net.area),
data=mack,family=tw,method="REML")
gm2
## 7.5.2 model predictions
par(mfrow=c(1,3))
data(mackp); data(coast)
mackp$log.net.area <- rep(0,nrow(mackp))
lon <- seq(-15,-1,1/4); lat <- seq(44,58,1/4)
zz<-array(NA,57*57); zz[mackp$area.index]<-predict(gm2,mackp)
image(lon,lat,matrix(zz,57,57),col=gray(0:32/32),
cex.lab=1.5,cex.axis=1.4)
contour(lon,lat,matrix(zz,57,57),add=TRUE)
lines(coast$lon,coast$lat,col=1)
set.seed(4) ## make reproducible
br1 <- rmvn(n=1000,coef(gm2),vcov(gm2))
Xp <- predict(gm2,newdata=mackp,type="lpmatrix")
mean.eggs1 <- colMeans(exp(Xp%*%t(br1)))
hist(mean.eggs1)
## 7.5.3 alternative
gmgr <- gam(egg.count ~s(lon,lat,k=100)+s(lon,lat,by=temp.20m)
+s(lon,lat,by=I(b.depth^.5)) +offset(log.net.area),
data=mack,family=tw,method="REML")
## 7.6 Larks - about a minute
library(gamair); data(bird)
bird$n <- bird$y/1000;bird$e <- bird$x/1000
m1 <- gam(crestlark~s(e,n,k=100),data=bird,family=binomial,
method="REML")
m1
m2 <- gam(crestlark ~ s(e,n,bs="ds",m=c(1,.5),k=100),data=bird,family=binomial,
method="REML")
REML <- r <- 1:10*10
for (i in 1:length(r)) {
mt <- gam(crestlark ~ s(e,n,bs="gp",m=c(3,r[i]),k=100),
data=bird,family=binomial,method="REML")
REML[i] <- mt$gcv.ubre
if (i==1||REML[i]<REML0) { m3 <- mt; REML0 <- REML[i]} ## keep best fit
}
AIC(m1,m2,m3)
bird$tet.n <- bird$N <- rep(1,nrow(bird))
bird$N[is.na(as.vector(bird$crestlark))] <- NA
ba <- aggregate(data.matrix(bird), by=list(bird$QUADRICULA),
FUN=sum, na.rm=TRUE)
ba$e <- ba$e/ba$tet.n; ba$n <- ba$n/ba$tet.n
m10 <- gam(cbind(crestlark,N-crestlark) ~ s(e,n,k=100),
data=ba, family=binomial, method="REML")
library(geoR)
coords<-matrix(0,nrow(ba),2);coords[,1]<-ba$e;coords[,2]<-ba$n
gb<-list(data=residuals(m10,type="d"),coords=coords)
plot(variog(gb,max.dist=100))
plot(fitted(m10),residuals(m10))
## 7.7.1 Sole egg GAMM
## Chapter 3 preliminaries...
data(sole)
sole$off <- log(sole$a.1-sole$a.0)# model offset term
sole$a<-(sole$a.1+sole$a.0)/2 # mean stage age
solr<-sole # make copy for rescaling
solr$t<-solr$t-mean(sole$t)
solr$t<-solr$t/var(sole$t)^0.5
solr$la<-solr$la-mean(sole$la)
solr$lo<-solr$lo-mean(sole$lo)
## GAMM fit...
solr$station <- factor(with(solr,paste(-la,-lo,-t,sep="")))
som <- gamm(eggs~te(lo,la,t,bs=c("tp","tp"),k=c(25,5),d=c(2,1))
+s(t,k=5,by=a)+offset(off), family=quasipoisson,
data=solr,random=list(station=~1))
som$gam
som1 <- bam(eggs~te(lo,la,t,bs=c("tp","tp"),k=c(25,5),d=c(2,1))
+ s(t,k=5,by=a)+offset(off)+s(station,bs="re"),
family=quasipoisson,data=solr)
gam.vcomp(som1)
som$lme
## boundary and knots for soap...
bnd <- list(list(lo=c(-6.74,-5.72,-5.7 ,-5.52,-5.37,-5.21,-5.09,-5.02,
-4.92,-4.76,-4.64,-4.56,-4.53,-4.3,-4.16,-3.8 ,-3.8,-5.04,-6.76,
-6.74),
la=c(50.01,50.02,50.13,50.21,50.24,50.32,50.41,50.54,50.59,50.64,
50.74,50.86,51.01,51 ,51.2,51.22,51.61,51.7,51.7,50.01)))
knt <- list(lo=c(-4.643,-5.172,-5.638,-6.159,-6.665,-6.158,-5.656,-5.149,
-4.652,-4.154,-3.901,-4.146,-4.381,-4.9,-5.149,-5.37,-5.866,-6.36,-6.635,
-6.12,-5.626,-5.117,-4.622,-4.695,-4.875,-5.102,-5.609,-5.652,-5.141,
-5.354,-5.843,-6.35,-6.628,-6.127,-5.63,-5.154,-5.356,-5.652,-5.853,
-6.123),
la=c(51.626,51.61,51.639,51.638,51.376,51.377,51.373,51.374,51.374,
51.376,51.379,51.226,51.129,51.194,51.083,51.147,51.129,51.151,50.901,
50.891,50.959,50.958,50.942,50.728,50.676,50.818,50.825,50.684,50.693,
50.568,50.564,50.626,50.397,50.451,50.443,50.457,50.325,50.193,50.322,
50.177))
sole$station <- solr$station ## station to sole data
som2 <- bam(eggs ~ te(lo,la,t,bs=c("sw","cr"),k=c(40,5),
d=c(2,1),xt=list(list(bnd=bnd),NULL)) +
s(t,k=5,by=a) + offset(off) + s(station,bs="re"),
knots=knt, family=quasipoisson, data=sole)
## 7.7.2 Cairo temperature
data(cairo)
ctamm <- gamm(temp~s(day.of.year,bs="cc",k=20)+s(time,bs="cr"),
data=cairo,correlation=corAR1(form=~1|year))
summary(ctamm$gam)
intervals(ctamm$lme,which="var-cov")
ctamm$gam$sig2/ctamm$gam$sp
plot(ctamm$gam,scale=0,pages=1)
REML <- rho <- 0.6+0:20/100
for (i in 1:length(rho)) {
ctbam <- bam(temp~s(day.of.year,bs="cc",k=20)+s(time,bs="cr"),
data=cairo,rho=rho[i])
REML[i] <- ctbam$gcv.ubre
}
rho[REML==min(REML)]
## 7.7.3 Fully Bayesian
## Not currently included (requires editing of JAGS file)
## 7.7.4 Random wiggly curves
data(sitka)
sitka$id.num <- as.factor(sitka$id.num)
b <- gamm(log.size~s(days) + ozone + ozone:days +
s(days,id.num,bs="fs",k=5),data=sitka)
plot(b$gam,pages=1)
## 7.8 survival
require(survival)
data(pbc) ## loads pbcseq also
pbc$status1 <- as.numeric(pbc$status==2)
pbc$stage <- factor(pbc$stage)
b0 <- gam(time ~ trt+sex+stage+s(sqrt(protime))+s(platelet)+
s(age)+s(bili)+s(albumin)+s(sqrt(ast))+s(alk.phos),
weights=status1,family=cox.ph,data=pbc)
b <- gam(time ~ trt+sex+s(sqrt(protime))+s(platelet)+
s(age)+s(bili)+s(albumin),
weights=status1,family=cox.ph,data=pbc)
anova(b)
par(mfrow=c(2,3))
plot(b); plot(b$linear.predictors,residuals(b))
par(mfrow=c(1,1))
## create prediction data frame...
np <- 300
newd <- data.frame(matrix(0,np,0))
for (n in names(pbc)) newd[[n]] <- rep(pbc[[n]][25],np)
newd$time <- seq(0,4500,length=np)
## predict and plot the survival function...
fv <- predict(b,newdata=newd,type="response",se=TRUE)
plot(newd$time,fv$fit,type="l",ylim=c(0.,1),xlab="time",
ylab="survival",lwd=2)
## add crude one s.e. intervals...
lines(newd$time,fv$fit+fv$se.fit,col="grey")
lines(newd$time,fv$fit-fv$se.fit,col="grey")
## and intervals based on cumulative hazard s.e...
se <- fv$se.fit/fv$fit
lines(newd$time,exp(log(fv$fit)+se))
lines(newd$time,exp(log(fv$fit)-se))
## 7.8.1 time dependent
## copy functions from ?cox.pht in mgcv...
app <- function(x,t,to) {
## wrapper to approx for calling from apply...
y <- if (sum(!is.na(x))<1) rep(NA,length(to)) else
approx(t,x,to,method="constant",rule=2)$y
if (is.factor(x)) factor(levels(x)[y],levels=levels(x)) else y
} ## app
tdpois <- function(dat,event="z",et="futime",t="day",
status="status1",id="id") {
## dat is data frame. id is patient id; et is event time; t is
## observation time; status is 1 for death 0 otherwise;
## event is name for Poisson response.
if (event %in% names(dat)) warning("event name in use")
require(utils) ## for progress bar
te <- sort(unique(dat[[et]][dat[[status]]==1])) ## event times
sid <- unique(dat[[id]])
prg <- txtProgressBar(min = 0, max = length(sid), initial = 0,
char = "=",width = NA, title="Progress", style = 3)
## create dataframe for poisson model data
dat[[event]] <- 0; start <- 1
dap <- dat[rep(1:length(sid),length(te)),]
for (i in 1:length(sid)) { ## work through patients
di <- dat[dat[[id]]==sid[i],] ## ith patient's data
tr <- te[te <= di[[et]][1]] ## times required for this patient
## Now do the interpolation of covariates to event times...
um <- data.frame(lapply(X=di,FUN=app,t=di[[t]],to=tr))
## Mark the actual event...
if (um[[et]][1]==max(tr)&&um[[status]][1]==1) um[[event]][nrow(um)] <- 1
um[[et]] <- tr ## reset time to relevant event times
dap[start:(start-1+nrow(um)),] <- um ## copy to dap
start <- start + nrow(um)
setTxtProgressBar(prg, i)
}
close(prg)
dap[1:(start-1),]
} ## tdpois
## model fitting...
data(pbc)
pbcseq$status1 <- as.numeric(pbcseq$status==2) ## deaths
pb <- tdpois(pbcseq) ## conversion
pb$tf <- factor(pb$futime) ## add factor for event time
b <- bam(z ~ tf - 1 + trt + s(sqrt(protime)) + s(platelet) +
s(age) + s(bili) + s(albumin) + s(sqrt(ast)),
family=poisson,data=pb,discrete=TRUE,nthreads=2)
chaz <- tapply(fitted(b),pb$id,sum) ## cum. hazard by subject
d <- tapply(pb$z,pb$id,sum) ## censoring indicator
mrsd <- d - chaz ## Martingale residuals
drsd <- sign(mrsd)*sqrt(-2*(mrsd + d*log(chaz))) ## deviance
te <- sort(unique(pb$futime)) ## event times
di <- pbcseq[pbcseq$id==25,] ## data for subject 25
## interpolate to te using app from ?cox.pht...
pd <- data.frame(lapply(X=di,FUN=app,t=di$day,to=te))
pd$tf <- factor(te)
X <- predict(b,newdata=pd,type="lpmatrix")
eta <- drop(X%*%coef(b)); H <- cumsum(exp(eta))
J <- apply(exp(eta)*X,2,cumsum)
se <- diag(J%*%vcov(b)%*%t(J))^.5
par(mfrow=c(1,2))
plot(stepfun(te,c(1,exp(-H))),do.points=FALSE,ylim=c(0.7,1),
ylab="S(t)",xlab="t (days)",main="",lwd=2)
lines(stepfun(te,c(1,exp(-H+se))),do.points=FALSE)
lines(stepfun(te,c(1,exp(-H-se))),do.points=FALSE)
rug(pbcseq$day[pbcseq$id==25]) ## measurement times
er <- pbcseq[pbcseq$id==25,]
plot(er$day,er$ast);lines(te,pd$ast)
## 7.9 Location scale
library(MASS);library(mgcv)
b <- gam(list(accel~s(times,bs="ad"),~s(times,bs="ad")),
family=gaulss,data=mcycle)
## 7.9.1 Extreme rainfall
library(mgcv);library(gamair);data(swer)
b0 <- gam(list(exra ~ s(nao)+ s(elevation)+ climate.region+
te(N,E,year,d=c(2,1),k=c(20,5)),
~ s(year)+ s(nao)+ s(elevation)+ climate.region+ s(N,E),
~ s(elevation)+ climate.region),family=gevlss,data=swer)
b <- gam(list(exra~ s(nao)+s(elevation)+climate.region+s(N,E),
~ s(year)+ s(elevation)+ climate.region+ s(N,E),
~ climate.region),family=gevlss,data=swer)
plot(b,scale=0,scheme=c(1,1,3,1,1,3),contour.col="white",pages=1)
mu <- fitted(b)[,1];rho <- fitted(b)[,2]; xi <- fitted(b)[,3]
fv <- mu + exp(rho)*(gamma(1-xi)-1)/xi
Fi.gev <- function(z,mu,sigma,xi) { ## GEV inverse cdf.
xi[abs(xi)<1e-8] <- 1e-8 ## approximate xi=0, by small xi
x <- mu + ((-log(z))^-xi-1)*sigma/xi
}
mb <- coef(b);Vb <- vcov(b) ## posterior mean and cov
b1 <- b ## copy fitted model object to modify
n.rep <- 1000; br <- rmvn(n.rep,mb,Vb) ## posterior sim
n <- length(fitted(b))
sim.dat <- cbind(data.frame(rep(0,n*n.rep)),swer$code)
for (i in 1:n.rep) {
b1$coefficients <- br[i,] ## copy sim coefs to gam object
X <- predict(b1,type="response");ii <- 1:n + (i-1)*n
sim.dat[ii,1] <- Fi.gev(runif(n),X[,1],exp(X[,2]),X[,3])
}
stm <- tapply(sim.dat[,1],sim.dat[,2],mean)
st98 <- tapply(sim.dat[,1],sim.dat[,2],quantile,probs=0.98)
## 7.10 Multivariate
library(mgcv); library(gamair); data(mpg)
b <- gam(list(city.mpg ~ fuel +style +drive +s(weight) +s(hp)
+ s(make,bs="re"),
hw.mpg ~ fuel +style +drive +s(weight) +s(hp)
+ s(make,bs="re")),
family = mvn(d=2) , data = mpg)
b1 <- gam(list(city.mpg ~ fuel +style +drive +s(hp) +s(weight)
+ s(make,bs="re"),
hw.mpg ~ fuel +style +drive +s(make,bs="re"),
1+2 ~ s(weight) +s(hp) -1),
family = mvn(d=2) , data = mpg)
## 7.11 FDA
## 7.11.1 scalar-on-function
data(gas)
b <- gam(octane~s(nm,by=NIR,k=50),data=gas)
par(mfrow=c(1,2))
plot(b,scheme=1,col=1)
plot(fitted(b),gas$octane)
## Prostate...
data(prostate)
b <- gam(type ~ s(MZ,by=intensity,k=100),family=ocat(R=3),
data=prostate,method="ML")
par(mfrow=c(1,3))
plot(b,rug=FALSE,scheme=1,xlab="Daltons",ylab="f(D)",
cex.lab=1.6,cex.axis=1.4)
pb <- predict(b,type="response") ## matrix of class probs
plot(factor(prostate$type),pb[,3])
qq.gam(b,rep=100,lev=.95)
prostate$type1 <- prostate$type - 1 ## recode for multinom
b1 <- gam(list(type1 ~ s(MZ,by=intensity,k=100),
~ s(MZ,by=intensity,k=100)),
family=multinom(K=2),data=prostate)
plot(b1,pages=1,scheme=1,rug=FALSE)
## 7.11.2 Canadian weather
require(gamair);require(lattice);data(canWeather)
xyplot(T~time|region,data=CanWeather,type="l",groups=place)
aic <- reml <- rho <- seq(0.9,0.99,by=.01)
for (i in 1:length(rho)) {
b <- bam(T ~ region + s(time,k=20,bs="cr",by=region) +
s(time,k=40,bs="cr",by=latitude),
data=CanWeather,AR.start=time==1,rho=rho[i])
aic[i] <- AIC(b); reml[i] <- b$gcv.ubre
}
## End(Not run)
Solution code for Chapter 7 GAMs in Practice: mgcv
Description
R code for Chapter 7 exercise solutions.
Author(s)
Simon Wood <simon@r-project.org>
Maintainer: Simon Wood <simon@r-project.org>
References
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, CRC
See Also
Examples
library(gamair); library(mgcv)
## Q.1
## a)
data(hubble)
h1 <- gam(y~s(x),data=hubble)
plot(h1) ## model is curved
h0 <- gam(y~x,data=hubble)
h1;h0
AIC(h1,h0)
## b)
gam.check(h1) # oh dear
h2 <- gam(y~s(x),data=hubble,family=quasi(var=mu))
gam.check(h2) # not great, but better
h2
## Q.2
## a)
library(MASS)
par(mfrow=c(2,2))
mc <- gam(accel~s(times,k=40),data=mcycle)
plot(mc,residuals=TRUE,se=FALSE,pch=1)
## b)
mc1 <- lm(accel~poly(times,11),data=mcycle)
termplot(mc1,partial.resid=TRUE)
## c)
mc2 <- gam(accel~s(times,k=11,fx=TRUE),data=mcycle)
plot(mc2,residuals=TRUE,se=FALSE,pch=1)
## d)
mc3 <- gam(accel~s(times,k=11,fx=TRUE,bs="cr"),data=mcycle)
plot(mc3,residuals=TRUE,se=FALSE,pch=1)
## e)
par(mfrow=c(1,1))
plot(mcycle$times,residuals(mc))
## f)
mcw <- gam(accel~s(times,k=40),data=mcycle,
weights=c(rep(400,20),rep(1,113)))
plot(mcw,residuals=TRUE,pch=1)
rsd <- residuals(mcw)
plot(mcycle$times,rsd)
var(rsd[21:133])/var(rsd[1:20])
## g)
gam(accel~s(times,k=40,m=3),data=mcycle,
weights=c(rep(400,20),rep(1,113)))
## Q.3
## b)
library(MASS)
n <- nrow(mcycle)
A <- matrix(0,n,n)
for (i in 1:n) {
mcycle$y<-mcycle$accel*0;mcycle$y[i] <- 1
A[,i] <- fitted(gam(y~s(times,k=40),data=mcycle,sp=mc$sp))
}
## d)
plot(mcycle$times,A[,65],type="l",ylim=c(-0.05,0.15))
## e)
for (i in 1:n) lines(mcycle$times,A[,i])
## f)
par(mfrow=c(2,2))
mcycle$y<-mcycle$accel*0;mcycle$y[65] <- 1
for (k in 1:4) plot(mcycle$times,fitted(
gam(y~s(times,k=40),data=mcycle,sp=mc$sp*10^(k-1.5))
),type="l",ylab="A[65,]",ylim=c(-0.01,0.12))
## Q.4
## a)
par(mfrow=c(1,1))
w <- c(rep(400,20),rep(1,113))
m <- 40;par(mfrow=c(1,1))
sp <- seq(-13,12,length=m) ## trial log(sp)'s
AC1 <- EDF <- rep(0,m)
for (i in 1:m) { ## loop through s.p.'s
b <- gam(accel~s(times,k=40),data=mcycle,weights=w,
sp=exp(sp[i]))
EDF[i] <- sum(b$edf)
AC1[i] <- acf(residuals(b),plot=FALSE)$acf[2]
}
plot(EDF,AC1,type="l");abline(0,0,col=2)
## Not run:
## Q.5 - a bit slow - a few seconds
## a)
data(co2s)
attach(co2s)
n <- length(co2) ## series length - needed for the prediction frames below
plot(c.month,co2,type="l")
## b)
b<-gam(co2~s(c.month,k=300,bs="cr"))
## c)
pd <- data.frame(c.month=1:(n+36))
fv <- predict(b,pd,se=TRUE)
plot(pd$c.month,fv$fit,type="l")
lines(pd$c.month,fv$fit+2*fv$se,col=2)
lines(pd$c.month,fv$fit-2*fv$se,col=2)
## d)
b2 <- gam(co2~s(month,bs="cc")+s(c.month,bs="cr",k=300),
knots=list(month=seq(1,13,length=10)))
## e)
pd2 <- data.frame(c.month=1:(n+36),
month=rep(1:12,length.out=n+36))
fv <- predict(b2,pd2,se=TRUE)
plot(pd$c.month,fv$fit,type="l")
lines(pd$c.month,fv$fit+2*fv$se,col=2)
lines(pd$c.month,fv$fit-2*fv$se,col=2)
## End(Not run)
## Not run:
## Q.6 - a bit slow - a few seconds
data(ipo)
n<-nrow(ipo)
## create lagged variables ...
ipo$ir1 <- c(NA,ipo$ir[1:(n-1)])
ipo$ir2 <- c(NA,NA,ipo$ir[1:(n-2)])
ipo$ir3 <- c(NA,NA,NA,ipo$ir[1:(n-3)])
ipo$ir4 <- c(NA,NA,NA,NA,ipo$ir[1:(n-4)])
ipo$dp1 <- c(NA,ipo$dp[1:(n-1)])
ipo$dp2 <- c(NA,NA,ipo$dp[1:(n-2)])
ipo$dp3 <- c(NA,NA,NA,ipo$dp[1:(n-3)])
ipo$dp4 <- c(NA,NA,NA,NA,ipo$dp[1:(n-4)])
## fit initial model and look at it ...
b<-gam(n.ipo~s(ir1)+s(ir2)+s(ir3)+s(ir4)+s(log(reg.t))+
s(dp1)+s(dp2)+s(dp3)+s(dp4)+s(month,bs="cc")+s(t,k=20),
data=ipo,knots=list(month=seq(1,13,length=10)),
family=poisson,gamma=1.4)
par(mfrow=c(3,4))
plot(b,scale=0)
summary(b)
## re-fit model dropping ir4 ...
b1 <- gam(n.ipo~s(ir1)+s(ir2)+s(ir3)+s(log(reg.t))+s(dp1)+
s(dp2)+s(dp3)+s(dp4)+s(month,bs="cc")+s(t,k=20),
data=ipo,knots=list(month=seq(1,13,length=10)),
family=poisson,gamma=1.4)
par(mfrow=c(3,4))
plot(b1,scale=0)
summary(b1)
## residual checking ...
gam.check(b1)
par(mfrow=c(1,1))
acf(residuals(b1))
## End(Not run)
## Q.7
data(wine)
wm<-gam(price~s(h.rain)+s(s.temp)+s(h.temp)+s(year),
data=wine,family=Gamma(link=identity),gamma=1.4)
plot(wm,pages=1,residuals=TRUE,pch=1,scale=0)
acf(residuals(wm))
gam.check(wm)
predict(wm,wine,se=TRUE)
## Q.8
## a)
par(mfrow=c(1,1))
data(blowfly)
bf <- blowfly
plot(bf$day,bf$pop,type="l")
## b)
## prepare differenced and lagged data ...
n <- nrow(bf)
bf$dn <- c(NA,bf$pop[2:n]-bf$pop[1:(n-1)])
lag <- 6
bf$n.lag <- c(rep(NA,lag),bf$pop[1:(n-lag)])
bf1 <- bf[(lag+1):n,] # strip out NAs, for convenience
## fit model, note no intercept ...
b<-gam(dn~n.lag+pop+s(log(n.lag),by=n.lag)+
s(log(pop),by=-pop)-1,data=bf1)
plot(b,pages=1,scale=-1,se=FALSE) ## effects
plot(abs(fitted(b)),residuals(b))
acf(residuals(b))
## c)
fv <- bf$pop
e <- rnorm(n)*0 ## increase multiplier for noisy version
min.pop <- min(bf$pop);max.pop <- max(bf$pop)
for (i in (lag+1):n) { ## iteration loop
dn <- predict(b,data.frame(n.lag=fv[i-lag],pop=fv[i-1]))
fv[i] <- fv[i-1]+dn + e[i];
fv[i]<-min(max.pop,max(min.pop,fv[i]))
}
plot(bf$day,fv,type="l")
## Not run:
## Q.9 - takes several minutes
## a)
data(chl)
pairs(chl,pch=".")
## b)
fam <- quasi(link=log,var=mu^2)
cm <- gam(chl ~ s(I(chl.sw^.4),bs="cr",k=20)+
s(I(bath^.25),bs="cr",k=60)+s(jul.day,bs="cr",k=20),
data=chl,family=fam,gamma=1.4)
gam.check(cm)
summary(cm)
## c)
## create fit and validation sets ...
set.seed(2)
n<-nrow(chl);nf <- floor(n*.9)
ind <- sample(1:n,nf,replace=FALSE)
chlf <- chl[ind,];chlv <- chl[-ind,]
## fit to the fit set
cmf<-gam(chl ~ s(I(chl.sw^.4),bs="cr",k=20)+
s(I(bath^.25),bs="cr",k=60)+s(jul.day,bs="cr",k=20),
data=chlf,family=fam,gamma=1.4)
## evaluate prop. dev. explained for validation set
y <- chlv$chl;w <- y*0+1
mu <- predict(cmf,chlv,type="response")
pred.dev <- sum(fam$dev.resids(y,mu,w))
null.dev <- sum(fam$dev.resids(y,mean(y),w))
1-pred.dev/null.dev # prop dev. explained
## End(Not run)
## Not run:
## Q.10 - a few seconds run time
## a)
g1<-gamm(weight ~ Variety + s(Time)+
s(Time,by=ordered(Variety)),data=Soybean,
family=Gamma(link=log), random=list(Plot=~Time))
plot(g1$lme) ## standard mean variance plot
par(mfrow=c(1,3))
plot(g1$gam,residuals=TRUE,all.terms=TRUE,scale=0) ## gam plot
## b)
summary(g1$gam) ## evidence for variety dependence
## could also do following ....
g2 <- gamm(weight~s(Time),family=Gamma(link=log),
data=Soybean,random=list(Plot=~Time))
g3 <- gamm(weight~Variety+s(Time),family=Gamma(link=log),
data=Soybean,random=list(Plot=~Time))
## following only a rough guide, but also supports g1 ...
AIC(g1$lme,g2$lme,g3$lme)
## Q.11
data(med); head(med) ## look at data
data(coast)
## initial plots...
plot(med$lo,med$la,cex=0.2+med$count^.5/10,col="grey",
pch=19,xlab="lo",ylab="la",main="mackerel")
ind <- med$count==0
points(med$lo[ind],med$la[ind],cex=0.1,pch=19)
lines(coast)
## ... survey seems to cover spawning area this time!
require(mgcv)
m1 <- gam(count~s(lo,la,k=100)+s(T.surf)+s(T.20)+s(I(b.depth^.5))+s(Sal20)+
s(ship,bs="re")+offset(log(vol)),data=med,select=TRUE,family=tw)
gam.check(m1) ## mean variance relationship not quite right?
m2 <- gam(count~s(lo,la,k=100)+s(T.surf)+s(T.20)+s(I(b.depth^.5))+s(Sal20)+
s(ship,bs="re")+offset(log(vol)),data=med,select=TRUE,family=nb)
gam.check(m2)
par(mfrow=c(1,2)) ## re-check residuals vs fitted
plot(fitted(m1)^.5,residuals(m1));plot(fitted(m2)^.5,residuals(m2))
AIC(m1,m2) ## neg bin much better
plot(m2,pages=1) ## effects
## End(Not run)
Chicago air pollution and death rate data
Description
Daily air pollution and death rate data for Chicago.
Usage
data(chicago)
Format
A data frame with 7 columns and 5114 rows. Each row refers to one day. The columns are:
- death
total deaths (per day).
- pm10median
median particles in the 2.5-10 micron size range per cubic m
- pm25median
median particles < 2.5 microns per cubic m (more dangerous).
- o3median
Ozone in parts per billion
- so2median
Median sulphur dioxide measurement
- time
time in days
- tmpd
temperature in degrees Fahrenheit
Details
See the NMMAPSdata package for fuller details. Note that there are missing values in some fields.
Source
Roger D. Peng, Leah J. Welty and Aiden McDermott. R package NMMAPSdata.
References
Peng, R.D. and Welty, L.J. (2004) The NMMAPSdata package. R News 4(2).
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R
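Examples
## Not run:
## A cut-down sketch of the section 7.4 model from the Chapter 7 code:
## a Poisson GAM for daily deaths with a smooth trend in time. k is
## reduced from the book's 200 to keep the example reasonably quick;
## rows with missing covariates are dropped by the default na.action.
require(gamair); require(mgcv)
data(chicago)
ap <- gam(death ~ s(time,bs="cr",k=100) + pm10median + so2median +
          o3median + tmpd, data=chicago, family=poisson)
summary(ap)
## End(Not run)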
Chlorophyll data
Description
Data relating to the calibration of remotely sensed satellite data. The SeaWiFS satellite provides estimates of chlorophyll concentration at the ocean surface from measurements of ocean surface colour. It is of interest to use these estimates to predict direct bottle measurements of chlorophyll concentration.
Usage
data(chl)
Format
A data frame with 6 columns and 13840 rows. The columns are:
- lon
longitude
- lat
latitude
- jul.day
Julian day (i.e. day of year starting at Jan 1st.)
- bath
Ocean depth in metres.
- chl
direct chlorophyll concentration measured at given location from a bottle sample.
- chl.sw
chl. conc. as measured by Seawifs Satellite
Source
https://oceancolor.gsfc.nasa.gov/SeaWiFS/
and the World Ocean Database.
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
data(chl)
with(chl,plot(chl,chl.sw))
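## Not run:
## A cut-down sketch of the calibration model from the Chapter 7
## exercise solutions (Q.9): the same transformed covariates and quasi
## family, but with smaller basis dimensions to reduce the run time.
require(mgcv)
cm <- gam(chl ~ s(I(chl.sw^.4),bs="cr",k=10)+s(I(bath^.25),bs="cr",k=20)+
          s(jul.day,bs="cr",k=10),data=chl,
          family=quasi(link=log,var=mu^2),gamma=1.4)
summary(cm)
## End(Not run)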
Atmospheric CO2 at South Pole
Description
Monthly CO2 concentration in parts per million at the South Pole.
Usage
data(co2s)
Format
A data frame with 3 columns and 507 rows. The columns are:
- co2
atmospheric CO2 concentration in parts per million
- c.month
cumulative number of months since Jan 1957
- month
month of year
Source
http://cdiac.esd.ornl.gov/trends/co2/
References
Keeling, C.D. and T.P. Whorf (2000) Atmospheric CO2 records from sites in the SIO air sampling network. In Trends: A Compendium of Data on Global Change. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S. Department of Energy, Oak Ridge, Tenn., USA
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
data(co2s)
with(co2s,plot(c.month,co2,type="l",ylab=
expression(paste(CO[2]," in ppm.")),xlab="Month since Jan. 1957"))
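## A sketch of the decomposition fit from the Chapter 7 exercise
## solutions (Q.5): a cyclic seasonal smooth plus a smooth long term
## trend. k for the trend is reduced from the solutions' 300, for speed.
require(mgcv)
b <- gam(co2~s(month,bs="cc")+s(c.month,bs="cr",k=100),data=co2s,
         knots=list(month=seq(1,13,length=10)))
plot(b,pages=1)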
European coastline from -11 to 0 East and from 43 to 59 North
Description
The data are longitudes (degrees E) and latitudes (degrees N) defining points that can be joined up to get the European coastline in the rectangle (-11E,43N)-(0E,59N). Discontinuous sections of coast are separated by NA's.
Usage
data(coast)
Format
A data frame with 2 columns.
- lon
Longitude in degrees East for points used to define the coast.
- lat
Latitude in degrees North for points used to define the coast.
Details
lon, lat together define the co-ordinates of points that can be joined up in order to plot the coastline. The original data come from the NOAA www site given below, but have been substantially thinned, to a much lower resolution than the source.
Author(s)
Simon Wood.
References
Originally from...
http://rimmer.ngdc.noaa.gov/coast/
Examples
data(coast)
# plot the entire coast .....
plot(coast$lon,coast$lat,type="l")
# or draw it clipped to whatever the current plot is....
lines(coast$lon,coast$lat,col="blue")
Engine wear versus size data
Description
Data on engine wear against engine size for 19 Volvo car engines.
Usage
data(engine)
Format
A data frame with 2 columns and 19 rows. Each row refers to one engine model. The columns are:
- wear
an index of engine wear rate.
- size
cylinder capacity in litres.
Details
See the source for further details.
Source
Originally from...
http://www3.bc.sympatico.ca/Volvo_Books/engine3.html
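Examples
## A minimal sketch, not from the book's scripts: plot the data and
## overlay a default penalized regression spline fit (basis dimension
## kept small, given only 19 observations).
require(gamair); require(mgcv)
data(engine)
with(engine, plot(size,wear,xlab="size (litres)",ylab="wear index"))
b <- gam(wear~s(size,k=6),data=engine)
sz <- seq(min(engine$size),max(engine$size),length=100)
lines(sz,predict(b,data.frame(size=sz)))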
Octane rating data
Description
The octane rating of a fuel determines its resistance to ‘knocking’: the higher the octane rating, the higher the compression ratio at which an engine can run. Traditionally, octane measurement involves comparing the knocking resistance of fuel samples to that of standard mixtures in special variable compression ratio engines. This is expensive relative to obtaining the near infra-red spectrum of a sample, so it would be useful to be able to predict octane rating from the spectrum.
Usage
data(gas)
Format
A three item list
- octane
Octane rating of gasoline (petrol) sample.
- NIR
A matrix each row of which contains the near infra-red reflectance spectrum of the corresponding gasoline sample.
- nm
Matrix of same dimension as NIR containing wavelengths at which measurements were taken.
Details
A scalar-on-function regression (also known as ‘signal regression’) works quite well for these data.
Source
Originally from the pls package:
https://cran.r-project.org/package=pls
Examples
require(gamair);require(mgcv)
data(gas)
## plot some spectra...
with(gas,plot(nm[1,],NIR[1,],type="l",ylab="log(1/R)",
xlab="wavelength (nm)",col=1))
text(1000,1.2,"octane");text(1000,1.2-.1,gas$octane[1],col=1)
for (i in 2:8) { lines(gas$nm[i,],gas$NIR[i,],col=i)
text(1000,1.2-.1*i,gas$octane[i],col=i)
}
## Fit scalar on function regression...
b <- gam(octane~s(nm,by=NIR,k=50),data=gas)
gam.check(b)
par(mfrow=c(1,2))
plot(b,scheme=1)
plot(fitted(b),gas$octane,xlab="fitted octane",
ylab="observed octane");abline(0,1)
Hen Harriers Eating Grouse
Description
Data on the rate at which Hen Harriers consume Grouse as a function of Grouse density.
Usage
data(harrier)
Format
A data frame with 2 columns and 37 rows. The columns are:
- Grouse.Density
Density of Grouse per square kilometre.
- Consumption.Rate
Number of Grouse consumed per Hen Harrier per day.
Details
Data have been read from Figure 1 of Asseburg et al. (2005)
Source
Asseburg, C., S. Smout, J. Matthiopoulos, C. Fernandez, S. Redpath, S. Thirgood and J. Harwood (2005) The functional response of a generalist predator. Web preprint
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
data(harrier)
with(harrier,plot(Grouse.Density,Consumption.Rate))
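## A speculative sketch (not from the book's scripts): add a smooth of
## consumption rate on grouse density to the plot above. The book treats
## these data via a functional response model, so this default fit is
## only a rough visual summary.
require(mgcv)
m <- gam(Consumption.Rate~s(Grouse.Density,k=5),data=harrier)
gd <- seq(0,max(harrier$Grouse.Density),length=200)
lines(gd,predict(m,data.frame(Grouse.Density=gd)))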
Hubble Space Telescope Data
Description
Data on distances and velocities of 24 galaxies containing Cepheid stars, from the Hubble space telescope key project to measure the Hubble constant.
Usage
data(hubble)
Format
A data frame with 3 columns and 24 rows. The columns are:
- Galaxy
A (factor) label identifying the galaxy.
- y
The galaxy's relative velocity in kilometres per second.
- x
The galaxy's distance in Mega parsecs. 1 parsec is 3.09e13 km.
Details
Cepheids are variable stars which have a known relationship between brightness and period. Hence the distance to galaxies containing these stars can be estimated from the observed brightness of the Cepheid, relative to its absolute brightness as predicted by its period. The velocity of the galaxy can be estimated from its mean red-shift.
The data can be used to get a reasonably good idea of the age of the universe. A data free alternative estimate of 6000 years is given in the reference (not the source!).
Source
Tables 4 and 5 of Freedman et al. 2001. The Astrophysical Journal 553:47-72
References
Freedman et al. (2001) Final results from the Hubble space telescope key project to measure the Hubble constant. The Astrophysical Journal (553), 47-72.
http://www.icr.org/pubs/imp/imp-352.htm
NUCLEAR DECAY: EVIDENCE FOR A YOUNG WORLD - IMPACT No. 352 October 2002 by D. Russell Humphreys, Ph.D.
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
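Examples
## A sketch of the Chapter 1 calculation: estimate Hubble's constant by
## regression through the origin, then invert it for a rough age of the
## universe (ignoring any deceleration).
data(hubble)
hub.mod <- lm(y~x-1,data=hubble)
h <- coef(hub.mod)        ## Hubble's constant, km/s per Mpc
h.si <- h/3.09e19         ## per second (1 Mpc is about 3.09e19 km)
1/h.si/(60^2*24*365.25)   ## age in years - roughly 13 billion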
Initial Public Offering Data
Description
Data on the relationship between the number of initial public offerings (of shares in a company) and other potentially important variables. It is probably necessary to lag some of the explanatory variables.
Usage
data(ipo)
Format
A data frame with 6 columns and 156 rows. The columns are:
- n.ipo
number of initial public offerings each month.
- ir
the average initial return (volume weighted): this is the percentage difference between the offer price of shares and the price after the first day of trading.
- dp
the average percentage difference between middle of the price range proposed at first filing of the IPO, and the eventual offer price.
- reg.t
the average time between filing and offer.
- t
time, in months.
- month
month of the year (1 = January).
Source
http://schwert.ssb.rochester.edu
References
Lowry, M. and G.W. Schwert (2002) IPO market cycles: Bubbles or sequential learning? The Journal of Finance 67(3), 1171-1198
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
data(ipo)
pairs(ipo)
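## As the description notes, lagged covariates are probably needed; a
## one month lag of the initial return can be created as follows (the
## Chapter 7 exercise solutions lag ir and dp by up to 4 months)...
n <- nrow(ipo)
ipo$ir1 <- c(NA,ipo$ir[1:(n-1)])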
Egg data from 1992 mackerel survey
Description
The data relate to the distribution of mackerel eggs and were collected as part of the 1992 mackerel survey aimed at assessing the mackerel spawning stock biomass using the daily egg production method.
Usage
data(mack)
Format
A data frame with 16 columns. Each row corresponds to one sample of eggs.
- egg.count
The number of stage I eggs in this sample.
- egg.dens
The number of stage I eggs per square metre of sea surface, produced per day. This is calculated from egg.count and other information about sampling net size and egg stage duration.
- b.depth
The sea bed depth at the sampling location.
- c.dist
The distance from the sample location to the 200m contour measured in degrees as if degrees latitude equalled degrees longitude, which actually they don't.
- lon
The longitude of the sample station in degrees east.
- lat
The latitude of the sample station in degrees north.
- time
The time of day (in hours) at which the sample was taken.
- salinity
The salinity (saltiness) of the water at the sampling location.
- flow
Reading from the flow meter attached to the sampling net - used for calibration.
- s.depth
The depth that the sampling net started sampling from (the net is dropped to this depth and then hauled up to the surface, filtering eggs etc out of the water as it goes).
- temp.surf
The temperature at the sea surface at the sampling location.
- temp.20m
The temperature 20m down at the sampling location.
- net.area
The area of the sampling net in square metres.
- country
A code identifying the country responsible for the boat that took this sample.
- vessel
A code identifying the boat that took this sample.
- vessel.haul
A code uniquely identifying this sample, given that the vessel is known.
Details
At each of a number of stations located as defined in lon and lat, mackerel eggs were sampled by hauling a fine net up from deep below the sea surface to the sea surface. The egg count data are obtained from the resulting samples, and these have been converted to (stage I) eggs produced per metre squared per day - the egg density data. Other possibly useful predictor variable information has been recorded, along with identification information, and some information that is probably useless!
Source
The data are effectively a combination of datasets mackerel and smacker from the sm library. They were originally analyzed using GAMs by:
Borchers, D.L., S.T. Buckland, I.G. Priede and S. Ahmadi (1997) "Improving the precision of the daily egg production method using generalized additive models". Can. J. Fish. Aquat. Sci. 54:2727-2742.
Examples
data(mack)
# plot the egg densities against location
plot(mack$lon,mack$lat,cex=0.2+mack$egg.dens/150,col="red")
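## Not run:
# A cut-down sketch of the section 7.5 model: a Tweedie GAM for egg
# count with log net area as an offset. k is reduced from the book's
# 100, and most candidate covariates are dropped, to keep this quick.
require(mgcv)
mack$log.net.area <- log(mack$net.area)
gm <- gam(egg.count~s(lon,lat,k=40)+s(temp.20m)+offset(log.net.area),
          data=mack,family=tw,method="REML")
gm
## End(Not run)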
Prediction grid data for 1992 mackerel egg model
Description
This data frame provides a regular grid of values of some predictor variables useful for modelling mackerel egg abundances. Its main purpose is to enable mackerel egg densities to be predicted over a regular spatial grid within the area covered by the 1992 mackerel egg survey (see mack), using a fitted generalised additive model.
Usage
data(mackp)
Format
A data frame with 8 columns. Each row corresponds to one spatial location within the survey area. The columns are as follows:
- lon
Longitude of the gridpoint in degrees east
- lat
Latitude of the gridpoint in degrees north.
- b.depth
The sea bed depth at the gridpoint.
- c.dist
The distance from the gridpoint to the 200m sea bed depth contour.
- salinity
Salinity interpolated onto the grid (from mack measurements).
- temp.surf
Surface temperature interpolated onto the grid (from mack data).
- temp.20m
Temperature at 20m interpolated from mack data.
- area.index
An indexing vector that enables straightforward copying of the other variables into a matrix suitable for plotting against longitude and latitude using image(). See the example below.
Details
The grid is defined on a series of 1/4 degree lon-lat squares.
References
Borchers, D.L., S.T. Buckland, I.G. Priede and S. Ahmadi (1997) "Improving the precision of the daily egg production method using generalized additive models". Can. J. Fish. Aquat. Sci. 54:2727-2742.
Examples
## example of how to use `area.index' to paste gridded info.
## into a square grid (of NA's) for plotting
data(mackp)
lon<-seq(-15,-1,1/4);lat<-seq(44,58,1/4)
zz<-array(NA,57*57)
zz[mackp$area.index]<-mackp$b.depth
image(lon,lat,matrix(zz,57,57))
Data from 2010 horse mackerel and mackerel egg survey
Description
The data relate to the distribution of horse mackerel (meh, Trachurus trachurus) eggs and mackerel (med, Scomber scombrus) eggs and were collected as part of the 2010 mackerel survey aimed at assessing the mackerel spawning stock biomass using the daily egg production method.
Usage
data(med)
data(meh)
Format
A data frame with the following columns. Each row corresponds to one sample of eggs.
- count
The number of stage I eggs in this sample.
- la
sample station latitude
- lo
sample station longitude
- vol
volume of water sampled
- T.surf
surface temperature in centigrade
- T.x
temperature at x metres depth.
- T1.x
Second temperature measurements.
- Sal20
Salinity at 20m depth
- b.depth
seabed depth in metres, for med only.
- lon
The longitude of the sample station in degrees east.
- lat
The latitude of the sample station in degrees north.
- time
The time of day (in hours) at which the sample was taken.
- salinity
The salinity (saltiness) of the water at the sampling location.
- period
sampling period
- country
Country responsible for sample
- ship
Vessel ID
- DT
sample date and time
- ID
Sample ID
- gear
type of sampling gear used
The remaining fields are undocumented.
Details
The original data files do not always exactly match the file documentation, so these data should not be treated as definitive.
Source
ICES Eggs and Larvae Dataset 2012, ICES, Copenhagen
http://eggsandlarvae.ices.dk/Download.aspx
Examples
require(gamair)
par(mfrow=c(1,2))
data(meh);data(med);data(coast)
# plot the egg counts against location
plot(meh$lo,meh$la,cex=0.2+meh$count^.5/10,col="grey",
pch=19,xlab="lo",ylab="la",main="horse mackerel")
ind <- meh$count==0
points(meh$lo[ind],meh$la[ind],cex=0.1,pch=19)
lines(coast)
# same for med
plot(med$lo,med$la,cex=0.2+med$count^.5/10,col="grey",
pch=19,xlab="lo",ylab="la",main="mackerel")
ind <- med$count==0
points(med$lo[ind],med$la[ind],cex=0.1,pch=19)
lines(coast)
Data on automobile efficiency on town streets and highway.
Description
Fuel efficiency in miles per gallon for a variety of cars in the USA.
Usage
data(mpg)
Format
A data frame listing fuel efficiency and other car characteristics including
- symbol
Insurers measure of relative riskiness of car from -3 (safe) to 3 (risky)
- loss
average insurance loss payment per insured vehicle per year.
- hw.mpg
Fuel consumption on highway as miles per US gallon.
- city.mpg
Fuel consumption in town as miles per US gallon.
- make
Name of car maker.
- fuel
2 level factor: gas or diesel.
- aspir
2 level factor: std or turbo.
- doors
2 level factor: two or four.
- style
Factor indicating style of car.
- drive
3 level factor indicating front, rear or all wheel drive: fwd, rwd or 4wd.
- eng.loc
Engine location
- wb
wheel base in inches
- length
in inches
- width
in inches
- height
in inches
- weight
in pounds
- eng.type
Factor giving engine type
- cylinders
Factor for number of cylinders
- eng.cc
cubic capacity of engine in cubic inches.
- fuel.sys
fuel system
- bore
in inches
- stroke
in inches
- comp.ratio
compression ratio
- hp
horse power
- rpm
maximum RPM
- price
in US dollars
Details
Data were collected by Jeffrey C. Schlimmer from 1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook. 2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038 3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Source
https://archive.ics.uci.edu/ml/datasets/Automobile
References
Wood, S.N. (2006) Generalized Additive Models: An Introduction with R
Examples
require(gamair);require(mgcv)
data(mpg)
b <- gam(list(city.mpg~fuel+style+drive+s(weight)+s(hp)+s(make,bs="re"),
hw.mpg~fuel+style+drive+s(weight)+s(hp)+s(make,bs="re")),
family=mvn(d=2),data=mpg)
plot(b,pages=1,scheme=1)
Prostate cancer screening data
Description
Protein mass spectrographs for patients with normal, benign enlargement and cancer of the prostate gland.
Usage
data(prostate)
Format
A three item list
- type
1 for normal, 2 for benign enlargement and 3 for cancerous.
- intensity
A matrix with rows corresponding to measurements in type. Each row is a normalized spectral intensity measurement for the protein mass given in MZ.
- MZ
Matrix corresponding to intensity giving the protein masses in Daltons. Actually all rows are identical.
Details
See the source article for fuller details. The intensity data here have been smoothed so that each measurement is an average of 40 adjacent measurements from the raw spectrum. The intensity data have also been rounded to 3 significant figures. This pre-processing was done to reduce the dataset size to something reasonable for distribution.
Source
Originally from the msProstate package version 1.0.2.
References
Adam, B-L. Y. Qu, J.W. Davis et al. (2002) Serum Protein Fingerprinting Coupled with a Pattern-matching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research 62:3609-3614
Examples
require(gamair);require(mgcv)
data(prostate)
## plot some spectra...
par(mfrow=c(2,3),mar=c(5,5,3,1))
ind <- c(1,163,319)
lab <- list("Healthy","Enlarged","Cancer")
for (i in 1:3) {
plot(prostate$MZ[ind[i],],prostate$intensity[ind[i],],type="l",ylim=c(0,60),
xlab="Daltons",ylab="Intensity",main=lab[[i]],cex.axis=1.4,cex.lab=1.6)
lines(prostate$MZ[ind[i],],prostate$intensity[ind[i]+2,]+5,col=2)
lines(prostate$MZ[ind[i],],prostate$intensity[ind[i]+4,]+10,col=4)
}
## treat as ordered cat control, bph, cancer
b <- gam(type ~ s(MZ,by=intensity,k=100),family=ocat(R=3),
data=prostate,method="ML")
## results...
pb <- predict(b,type="response")
plot(b,rug=FALSE,scheme=1,xlab="Daltons",ylab="f(D)",
cex.lab=1.6,cex.axis=1.4,main="a")
plot(factor(prostate$type),pb[,3],cex.lab=1.6,cex.axis=1.4,main="b")
qq.gam(b,rep=100,lev=.95,cex.lab=1.6,cex.axis=1.4,main="c")
Sitka spruce growth data.
Description
Tree growth data under enhanced ozone and control conditions.
Usage
data(sitka)
Format
A data frame with 1027 rows and 5 columns:
- id.num
identity of the tree: 1...79.
- order
time order ranking within each tree.
- days
since 1st January, 1988.
- log.size
log of tree ‘size’.
- ozone
1 - enhanced ozone treatment; 0 - control.
Details
The data were analysed in Crainiceanu CM, Ruppert D, Wand MP (2005) using WinBUGS, and in Wood (2016) using auto-generated JAGS code.
Source
The SemiPar package, from:
Diggle, P.J., Heagerty, P., Liang, K.-Y. and Zeger, S.L. (2002) Analysis of Longitudinal Data (2nd ed.) OUP.
References
Wood SN (2016) "Just Another Gibbs Additive Modeller: Interfacing JAGS and mgcv" Journal of Statistical Software 75
Crainiceanu C.M., Ruppert D. and Wand M.P. (2005). "Bayesian Analysis for Penalized Spline Regression Using WinBUGS." Journal of Statistical Software, 14(14).
Examples
require(gamair); require(lattice)
data(sitka)
xyplot(log.size~days|as.factor(ozone),data=sitka,type="l",groups=id.num)
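## The 'random wiggly curves' model of section 7.7.4: a common smooth
## of days plus ozone effects, with smooth tree specific departures via
## a factor smooth interaction (this reproduces the Chapter 7 code).
require(mgcv)
sitka$id.num <- as.factor(sitka$id.num)
b <- gamm(log.size~s(days)+ozone+ozone:days+
          s(days,id.num,bs="fs",k=5),data=sitka)
plot(b$gam,pages=1)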
Sole Eggs in the Bristol Channel
Description
Data on Sole Egg densities in the Bristol Channel (West Coast of England, UK.) The data are from 5 research cruises undertaken for the purpose of measuring Sole egg densities. Samples were taken at each of a number of sampling stations, by hauling a net vertically through the water column. Sole eggs were counted and assigned to one of four developmental stages.
Usage
data(sole)
Format
A data frame with 7 columns and 1575 rows. The columns are:
- la
latitude of sampling station
- lo
longitude of sampling station
- t
time of sampling: actually the time of the midpoint of the cruise on which the sample was taken, measured in Julian days (days since January 1st).
- eggs
egg density per square metre of sea surface.
- stage
the developmental stage (1 to 4) to which the sample relates.
- a.0
lower age limit for the stage (i.e. age of youngest possible egg in this sample).
- a.1
upper age limit of this stage (i.e. age of oldest possible egg in sample).
Source
Dixon (2003)
References
Dixon, C.E. (2003) Multi-dimensional modelling of physiologically and temporally structured populations. PhD thesis. University of St Andrews
Horwood, J. (1993) The Bristol Channel sole (Solea solea (L.)): a fisheries case study. Advances in Marine Biology 29, 215-367.
Horwood, J. and M. Greer Walker (1990) Determinacy of fecundity in sole (Solea solea) from the Bristol Channel. Journal of the Marine Biological Association of the United Kingdom 70, 803-813.
Wood (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
require(gamair)
data(sole);data(coast)
par(mfrow=c(2,3))
sample.t <- unique(sole$t) ## the five cruise midpoint times
stage <- 1                 ## plot stage 1 egg densities only
for (i in 1:5) {
  egg <- sole[sole$stage==stage & sole$t==sample.t[i],]
  plot(egg$lo,egg$la,xlab="lo",ylab="la",main=paste("day",sample.t[i]),cex=egg$eggs/4,
       xlim=range(sole$lo),ylim=range(sole$la),cex.axis=1.5,cex.lab=1.5,cex.main=1.5)
  points(egg$lo,egg$la,pch=".",col=2)
  lines(coast)
}
## boundary definition list and knots suitable for soap film smoothing
bnd <- list(list(lo=c(-6.74,-5.72,-5.7 ,-5.52,-5.37,-5.21,-5.09,-5.02,
-4.92,-4.76,-4.64,-4.56,-4.53,-4.3,-4.16,-3.8 ,-3.8,-5.04,-6.76,
-6.74),
la=c(50.01,50.02,50.13,50.21,50.24,50.32,50.41,50.54,50.59,50.64,
50.74,50.86,51.01,51 ,51.2,51.22,51.61,51.7,51.7,50.01)))
knt <- list(lo=c(-4.643,-5.172,-5.638,-6.159,-6.665,-6.158,-5.656,-5.149,
-4.652,-4.154,-3.901,-4.146,-4.381,-4.9,-5.149,-5.37,-5.866,-6.36,-6.635,
-6.12,-5.626,-5.117,-4.622,-4.695,-4.875,-5.102,-5.609,-5.652,-5.141,
-5.354,-5.843,-6.35,-6.628,-6.127,-5.63,-5.154,-5.356,-5.652,-5.853,
-6.123),
la=c(51.626,51.61,51.639,51.638,51.376,51.377,51.373,51.374,51.374,
51.376,51.379,51.226,51.129,51.194,51.083,51.147,51.129,51.151,50.901,
50.891,50.959,50.958,50.942,50.728,50.676,50.818,50.825,50.684,50.693,
50.568,50.564,50.626,50.397,50.451,50.443,50.457,50.325,50.193,50.322,
50.177))
points(knt$lo,knt$la,pch=19,col=2,cex=.6)
lines(bnd[[1]]$lo,bnd[[1]]$la,col=2)
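The boundary and knots above are set up for a soap film smooth; a hedged sketch of such a fit for the stage 1 densities follows. It is illustrative only and much simpler than the models developed in the book; the Tweedie family is an assumption.
require(mgcv)
egg1 <- sole[sole$stage==1,] ## stage 1 samples only
lo <- egg1$lo; la <- egg1$la
ok <- inSide(bnd, lo, la)    ## drop samples falling outside the boundary
b <- gam(eggs ~ s(lo,la,bs="so",xt=list(bnd=bnd)), knots=knt,
         family=tw(), method="REML", data=egg1[ok,])
plot(b, scheme=2)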
Sperm competition data I
Description
Data relating sperm count to time since last inter-pair copulation, and to the proportion of that time the couple spent together, for 15 couples living in Manchester, UK.
Usage
data(sperm.comp1)
Format
A data frame with 4 columns and 15 rows. The columns are:
- subject
An identifier for the subject/couple.
- time.ipc
Time since last inter-pair copulation, in hours.
- prop.partner
Proportion of time.ipc that the couple had spent together.
- count
Sperm count in millions.
Details
The sperm counts reported are total counts in ejaculate from a single copulation, for each of 15 couples. Also recorded are the time since the couple's previous copulation, and the proportion of that time that the couple had spent together. The data are from volunteers from Manchester University and were gathered to test theories about human sperm competition. See the source article for further details.
Source
Baker, R.R. and Bellis, M.A. (1993) 'Human sperm competition: ejaculate adjustment by males and the function of masturbation.' Animal Behaviour 46:861-885.
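The original help page gives no example; the following is a minimal sketch of the kind of regression the data were gathered to support (an illustration, not a recommended analysis):
require(gamair)
data(sperm.comp1)
## does ejaculate size increase with time apart, allowing for
## the proportion of that time the couple spent together?
m <- lm(count ~ time.ipc + prop.partner, data=sperm.comp1)
summary(m)
par(mfrow=c(2,2)); plot(m) ## standard residual checks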
Sperm competition data II
Description
Data relating the average number of sperm ejaculated per copulation to physical characteristics of the partners involved, for 24 heterosexual couples from Manchester, UK.
Usage
data(sperm.comp2)
Format
A data frame with 10 columns and 24 rows. The columns are:
- pair
an identifier for the couple. These labels correspond to those given in sperm.comp1.
- n
the number of copulations over which the average sperm count has been calculated.
- count
the average sperm count in millions, per copulation.
- f.age
age of the female, in years.
- f.height
height of the female, in cm.
- f.weight
weight of the female, in kg.
- m.age
age of the male, in years.
- m.height
height of the male, in cm.
- m.weight
weight of the male, in kg.
- m.vol
volume of one male testis in cubic cm.
Details
In the source article, these data are used to argue that males invest more reproductive effort in heavier females, on the basis of regression modelling. It is worth checking for outliers.
Source
Baker, R.R. and Bellis, M.A. (1993) 'Human sperm competition: ejaculate adjustment by males and the function of masturbation.' Animal Behaviour 46:861-885.
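A hedged sketch of the outlier check suggested above (assumed, not part of the original help page):
require(gamair)
data(sperm.comp2)
m <- lm(count ~ f.weight, data=sperm.comp2) ## count vs female weight
summary(m)
plot(m, which=5) ## residuals vs leverage: look for influential couples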
Stomatal area and CO2
Description
Simulated ('fake') data on average stomatal area for 6 trees, each grown under one of two CO2 concentrations.
Usage
data(stomata)
Format
A data frame with 3 columns and 24 rows. The columns are:
- area
mean stomatal area.
- CO2
label for which CO2 treatment the measurement relates to.
- tree
label for individual tree.
Details
The context for these simulated data is given in section 6.1 of the source book.
Source
Simulated data; see the reference below.
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
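No example is given in the original help page; the following minimal sketch of the pseudoreplication point the data illustrate is an assumption about the intended analysis, not the book's exact code:
require(gamair)
data(stomata)
## average the replicate measurements within each tree, then
## compare the two CO2 treatments at the tree level
tree.means <- aggregate(area ~ tree + CO2, data=stomata, FUN=mean)
summary(lm(area ~ CO2, data=tree.means))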
Swiss 12 hour extreme rainfall
Description
Records the most extreme 12-hour total rainfall each year for 65 Swiss weather stations. The data cover the period 1981-2015, although not all stations start in 1981.
Usage
data(swer)
Format
The swer data frame has the following columns:
- year
The year of observation.
- exra
The highest rainfall observed in any 12 hour period in that year, in mm.
- nao
Annual North Atlantic Oscillation index, based on the difference of normalized sea level pressure (SLP) between Lisbon, Portugal and Stykkisholmur/Reykjavik, Iceland. Positive values are generally associated with wetter and milder weather over Western Europe.
- location
The measuring station location name.
- code
Three letter code identifying the station.
- elevation
metres above sea level.
- climate.region
One of 12 distinct climate regions.
- N
Degrees north.
- E
Degrees east.
Details
The actual extreme rainfall measurements are digitized from plots in the MeteoSwiss reports for each station. The error associated with digitization can be estimated from the error in the digitized year values, since the true values are then known exactly. This translates into a mean square error in rainfall of about 0.1% of the station maximum, and a maximum error of about 0.3% of station maximum.
Source
Mostly from the MeteoSwiss website.
NAO data from:
Hurrell, James & National Center for Atmospheric Research Staff (Eds). Last modified 16 Aug 2016. "The Climate Data Guide: Hurrell North Atlantic Oscillation (NAO) Index (station-based)."
Examples
require(gamair);require(mgcv)
data(swer)
## GEV model, over-simplified for speed...
system.time(b <- gam(list(exra~s(elevation)+ climate.region,
~s(elevation),~1),family=gevlss,data=swer))
plot(b,pages=1,scale=0,scheme=1)
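The fitted object has three linear predictors. As a hedged follow-up, following the pattern used in mgcv's gevlss help (where the columns of fitted() are location, log scale and shape, in that assumed order):
mu  <- fitted(b)[,1] ## location
rho <- fitted(b)[,2] ## log scale
xi  <- fitted(b)[,3] ## shape
head(cbind(mu, scale=exp(rho), shape=xi))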
Diabetic retinopathy in Wisconsin
Description
The data, originally from the gss package, record whether or not diabetic patients developed retinopathy, along with three possible predictors.
Usage
data(wesdr)
Format
The wesdr data frame has the following columns:
- ret
binary variable: 1 = retinopathy, 0 = not.
- bmi
Body mass index (weight in kg divided by square of height in metres)
- gly
Glycosylated hemoglobin - the percentage of hemoglobin bound to glucose in the blood. This reflects long-term average blood glucose levels: less than 6% is typical of non-diabetics, but such levels are only rarely achieved by diabetic patients.
- dur
Duration of disease in years.
Details
Retinopathy is a common complication in diabetic patients, and the interest is in predicting the risk using the measured predictors.
Source
Data are from Chong Gu's gss package.
Examples
require(gamair);require(mgcv)
data(wesdr)
## Smooth ANOVA model...
k <- 5
b <- gam(ret~s(dur,k=k)+s(gly,k=k)+s(bmi,k=k)+ti(dur,gly,k=k)+
ti(dur,bmi,k=k)+ti(gly,bmi,k=k),select=TRUE,
data=wesdr,family=binomial(),method="ML")
ow <- options(warn=-1) ## avoid complaint about zlim
plot(b,pages=1,scheme=1,zlim=c(-3,3))
options(ow)
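An assumed quick follow-up to the fit above: a crude in-sample classification check from the fitted probabilities (the 0.5 threshold is arbitrary).
phat <- predict(b, type="response") ## fitted retinopathy probabilities
table(observed=wesdr$ret, predicted=as.numeric(phat > 0.5))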
Bordeaux Wines
Description
Data on prices and growing characteristics of 25 Bordeaux wines from 1952 to 1998.
Usage
data(wine)
Format
A data frame with 7 columns and 47 rows. The columns are:
- year
year of production
- price
average price of the wines as a percentage of the 1961 price.
- h.rain
mm of rain in the harvest month.
- s.temp
Average temperature (C) over the summer preceding harvest.
- w.rain
mm of rain in the winter preceding harvest.
- h.temp
average temperature (C) at harvest.
- parker
a rating of the wine quality (see source for details).
Source
http://schwert.ssb.rochester.edu/a425/a425.htm
References
Wood, S.N. (2006, 2017) Generalized Additive Models: An Introduction with R. CRC
Examples
data(wine)
pairs(wine[,-7])
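The original example stops at the pairs plot; a hedged sketch of the sort of weather regression associated with the source page follows (an illustration, not the book's model; lm drops any incomplete rows by default).
m <- lm(log(price) ~ h.rain + s.temp + w.rain + h.temp + year, data=wine)
summary(m)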