%%% R-courseEH.Rnw
%% Author: emanuelheitlinger@gmail.com
\documentclass[12pt,a4paper]{article}
\usepackage[debugshow,final]{graphics}
\usepackage{float}
\usepackage{lscape}
\begin{document}
\title{Transcript of Mick Crawley's R course 2010
Imperial College London, Silwood Park}
\author{Emanuel G Heitlinger}
\date{}
\maketitle
\setkeys{Gin}{width=\linewidth}
Disclaimer: The following document is a private transcript of Mick
Crawley's R course. I am a participant in this course and my write-up
has in no way been approved by Mick Crawley (whose ideas underlie the
code and teaching concepts) or any of his staff.
\section*{Day 7: Nested analysis and variance components}
\subsection*{Nested Analysis}
We take a look at a classic example of pseudoreplication.
<<>>=
rat <- read.table("rats.txt", header=TRUE)
attach(rat)
Treatment <- factor(Treatment)
Rat <- factor(Rat)
Liver <- factor(Liver)
@
In this experiment 3 treatments were administered to 2 rats
each. From these 6 rats three pieces of liver were taken, and the
glycogen content of each of the 18 pieces was measured twice, giving a
total of 36 data points.
Clearly it would be wrong to analyse the data as if there were 12
replicates per treatment: there are only 2 genuine replicates per
treatment, the 2 rats, i.e.\ the spatial scale immediately below the
factor of interest. Data of this form carry a big risk of the wrong
analysis being conducted.
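The structure of such a design can be sketched with simulated data;
the glycogen values below are random numbers invented purely to
illustrate the layout, none of them come from rats.txt:
<<>>=
## hypothetical sketch of the nested layout: 3 treatments, 2 rats each,
## 3 liver pieces per rat, 2 readings per piece = 36 rows
sim <- expand.grid(Reading=1:2, Liver=1:3, Rat=1:2, Treatment=1:3)
sim$Glycogen <- rnorm(nrow(sim), mean=140, sd=10)  # made-up values
nrow(sim)            # 36 data points in total ...
table(sim$Treatment) # ... 12 per treatment, but only 2 genuine replicates (rats)
@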
One very easy way of getting the analysis right is to take means over
the pseudoreplicates.
<<>>=
trm.mean <- as.vector(tapply(Glycogen, list(Treatment, Rat), mean))
trm <- factor(rep(c(1,2,3),2))
summary(aov(trm.mean~trm))
@
Not a very impressive experiment! But is there not more information in
the data? For comparing treatments, \textbf{no}! But we can learn
something about the variance between different pieces of rat liver and
about the variance between repeated measurements (the measurement
error). The means for these factors are uninteresting, as their factor
levels are uninformative.
First the calculations by hand. In nested designs we have to use a
special correction factor: the uncorrected sum of squares from the
spatial scale \textbf{above}.
So for the highest level everything stays the same:\\
$SS_{treat}=\displaystyle\frac{\sum{T^2}}{n}-\frac{(\sum{y})^2}{nk}$\\
Where $n$ is the number of observations in a treatment and $k$ is the
number of treatments. But\\
$SS_{Rats}=\displaystyle\frac{\sum{R^2}}{n_{rats}}-\frac{\sum{T^2}}{n_{treat}}$\\
$SS_{liver.bits}=\displaystyle\frac{\sum{L^2}}{n_{bits}}-\frac{\sum{R^2}}{n_{rat}}$\\
$SS_{measure}=\displaystyle\frac{\sum{y^2}}{1}-\frac{\sum{L^2}}{n_{bits}}$\\
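As a quick sanity check that these four sums of squares really
partition the total, here is a toy calculation on made-up balanced
data (a sketch, independent of the rats data):
<<>>=
## toy check: the four nested sums of squares add up to the total SS
set.seed(1)
d <- expand.grid(rep=1:2, piece=factor(1:3), rat=factor(1:2), trt=factor(1:3))
d$y <- rnorm(36)
Tsum <- with(d, tapply(y, trt, sum))                          # treatment totals
Rsum <- with(d, tapply(y, interaction(trt, rat), sum))        # rat totals
Lsum <- with(d, tapply(y, interaction(trt, rat, piece), sum)) # liver-piece totals
ss.trt     <- sum(Tsum^2)/12 - sum(d$y)^2/36
ss.rat     <- sum(Rsum^2)/6  - sum(Tsum^2)/12
ss.piece   <- sum(Lsum^2)/2  - sum(Rsum^2)/6
ss.reading <- sum(d$y^2)     - sum(Lsum^2)/2
all.equal(ss.trt + ss.rat + ss.piece + ss.reading,
          sum(d$y^2) - sum(d$y)^2/36)  # TRUE: the terms telescope
@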
We can make use of our half-way generalized anova functions to
calculate these sums of squares (note the use of combined factor
levels to get the right $n$):
<<>>=
CF <- function(x) sum(x)^2/length(x)
Q.aov <- function(x, fac){
sum(as.vector((tapply(x, fac, sum)^2)/tapply(x, fac, length)))
}
ss.treat <- Q.aov(Glycogen, Treatment)-CF(Glycogen)
ss.rats <- Q.aov(Glycogen, Treatment:Rat)-Q.aov(Glycogen, Treatment)
ss.liver <- Q.aov(Glycogen, Treatment:Rat:Liver)-Q.aov(Glycogen, Treatment:Rat)
ss.measure <- sum(Glycogen^2)-Q.aov(Glycogen, Treatment:Rat:Liver)
ss.total <- sum(Glycogen^2)-CF(Glycogen)
ss.all <- c(SS.treat=ss.treat, SS.rats=ss.rats, SS.liver=ss.liver, SS.measure=ss.measure, SS.total=ss.total)
ss.all
@
The next key thing to understand is the degrees of freedom. Take some
time to see what the code does: for each level it takes the number of
units nested within each unit of the level above, minus one,
multiplied by the number of units at the level above.
<<>>=
df.treat <- nlevels(Treatment)-1
df.rats <- (nlevels(Rat)-1)*nlevels(Treatment)
df.bits <- (nlevels(Liver)-1)*nlevels(Rat:Treatment)
df.measure <- (2-1)*nlevels(Rat:Treatment:Liver)
df.total <- length(Glycogen)-1
df.all <- c(DF.treat=df.treat, DF.rats=df.rats, DF.bits=df.bits, DF.measure=df.measure, DF.total=df.total)
@
Based on this it is easy to get the variances (mean squares):
<<>>=
ms.all <- ss.all/df.all
names(ms.all) <- sapply(names(ms.all),
function(x) gsub("SS", "MS", x))
ms.all
@
The next important thing to understand is that the F-statistics are
calculated not against the error variance but against the
\textbf{variance of the spatial scale immediately below}. The reason
is that here no mean square is a pure error variance: each one still
includes the error variances of all the spatial scales below it.
<<>>=
Fstat <- sapply(1:3, function(i) ms.all[i]/ms.all[i+1])
names(Fstat) <- sapply(names(Fstat),
function(x) gsub("MS", "Fstat", x))
Fstat
@
The procedure for the probabilities is much the same: each p-value
comes from the F-distribution with the degrees of freedom of the
spatial scale in question and of the scale immediately below.
<<>>=
p.val <- sapply(1:3, function (i) 1- pf(Fstat[i], df.all[i], df.all[i+1]))
names(p.val) <- sapply(names(p.val),
function(x) gsub("Fstat", "p.val", x))
p.val
@
Now first note that this elaborate analysis yields exactly the same
p-value for differences between treatments as simply taking means over
the pseudoreplicates.
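That this is no coincidence of the rats data can be seen on simulated
numbers: in any balanced nested design the F-test of the treatment
mean square against the rat mean square gives exactly the p-value of
the one-way anova on the rat means (a self-contained sketch, not using
rats.txt):
<<>>=
set.seed(2)
d <- expand.grid(rep=1:6, rat=factor(1:2), trt=factor(1:3))
d$y <- rnorm(36, mean=as.numeric(d$trt))
## p-value from the anova on the 6 rat means
m <- with(d, tapply(y, list(trt, rat), mean))
p.means <- summary(aov(as.vector(m) ~ factor(rep(1:3, 2))))[[1]][1, "Pr(>F)"]
## p-value from the nested F-test: MS(treatment) / MS(rats within treatment)
ss.t <- with(d, sum(tapply(y, trt, sum)^2)/12 - sum(y)^2/36)
ss.r <- with(d, sum(tapply(y, interaction(trt, rat), sum)^2)/6 -
             sum(tapply(y, trt, sum)^2)/12)
p.nested <- 1 - pf((ss.t/2)/(ss.r/3), 2, 3)
all.equal(p.means, p.nested)  # TRUE
@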
Then we can compare our results with R's built-in anova for this kind
of analysis:
<<>>=
summary(aov(Glycogen~Treatment+Error(Treatment/Rat/Liver)))
@
Note that there are no F-values or probabilities in this output, but
trust me, our hand-calculated ones are right, as the agreement of the
Sum Sq and mean squares shows.\\
The differences in means between the so-called \textbf{random effects}
(rats, liver pieces and measurements) are not interesting, as their
factor levels are uninformative. But we can work out how much these
effects contribute to the overall variance; this may help us plan a
better experimental design.
We do this simply by subtracting from the variance at a given level
the variance of the level immediately below, which gives the
additional variance contributed at that level. We then express it as a
percentage of the overall variance. This simple technique is called
\textbf{variance component analysis}.
<<>>=
var.diff <- (c(sapply(1:3,
function (i) ms.all[[i]]-ms.all[[i+1]]), ms.all[[4]]))
ms.all[[1]]==sum(var.diff)
var.comp <- (var.diff/sum(var.diff))*100
names(var.comp) <- c("Treatment %", "Rats %", "Liver %", "Measurement %")
var.comp
@
This tells us that with a properly designed experiment we could hope
to get significant results for the different treatments. The variances
for different liver pieces and for the measurement error are low
compared to the variance between individual rats. The experiment
should therefore concentrate on a bigger sample size for rats; it
would, e.g., be clever to compare several genotypes of inbred lines
(this would also give informative factor levels instead of rat 1, 2, 3
and so on).
\subsection*{More complex examples and variance component analysis in R}
\subsubsection*{A fly experiment}
Read in some data: three male flies were each mated to four females
(12 females in total; note that female \textbf{is} nested in male
despite its repeated factor levels: the females labelled 1 to 4 are
different individuals for each male). Each mating was repeated once
per pair of flies.
<<>>=
fly <- read.table("flies.txt", header=TRUE)
attach(fly)
@
Besides aov, R provides lme (from the nlme package) to analyse
nested data.
<<>>=
library(nlme)
fm <- lme(eye~1, random= ~1|male/female)
summary(fm)
@
Note that restricted maximum likelihood (REML) was used to estimate
the model parameters; if we wanted to do model simplification we would
have to specify method="ML".
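A hypothetical sketch of such a refit, on simulated data of the same
shape as the fly experiment rather than on flies.txt: update with
method="ML" refits the model by maximum likelihood, which is what
likelihood-based comparisons of models with different fixed effects
require.
<<>>=
library(nlme)
## made-up data shaped like the fly experiment: 3 males x 4 females x 2 matings
set.seed(3)
d <- expand.grid(mating=1:2, female=factor(1:4), male=factor(1:3))
d$eye <- rnorm(24, mean=100, sd=5)
fm.reml <- lme(eye ~ 1, random = ~1 | male/female, data=d)
fm.ml   <- update(fm.reml, method="ML")  # refit for likelihood-based comparison
fm.ml$method
@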
We extract the variances from our lme model object using VarCorr and express them as percentage of the total variance.
<<>>=
temp <- as.numeric(VarCorr(fm)[,"Variance"])
vars <- temp[!is.na(temp)]
comp <- vars/sum(vars)*100
names(comp) <- c("male %", "female %", "same m and f %")
comp
@
Most variation is between females mated to the same male. Different
males seem to have little influence, and repeated breeding attempts
between the same individuals hardly vary at all.
Now we can ask whether the 15\% of variation attributable to males is
significant. We fit a comparison model with a new grouping factor that
simply numbers the 12 females: since female was nested in male,
ignoring male introduces no pseudoreplication among females.
<<>>=
nf <- gl(12, 2)
fm2 <- lme(eye~1, random= ~1|nf)
summary(fm2)
anova(fm, fm2)
@
The new model ignoring males is not significantly poorer than the old
one including them. But note that the sample size was extremely small
(3 males). In a bigger experiment with more males this could become
significant.
\subsubsection*{Mixed effect models for temporal pseudoreplication}
We read in data from a classic field experiment. There are two levels
of fertilizer treatment, 6 replicate plants for each of these levels,
and 5 measurements per plant (temporal pseudoreplicates). Root biomass
is the response.
<<>>=
grow <- read.table("fertilizer.txt", header=TRUE)
attach(grow)
@
First we explore the data graphically, using a groupedData object.
<<>>=
library(lattice)
#trellis.par.set(col.whitebg())
gr <- groupedData(root~week|plant, outer = ~fertilizer, grow)
grpl <- plot(gr)
@
<<fig=TRUE>>=
print(grpl)
@
To understand the fixed effect it might be better to group the 6
replicates in each treatment.
<<>>=
grpl2 <- plot(gr, outer=T)
@
<<fig=TRUE>>=
print(grpl2)
@
Now we start modelling. Our fixed effect is root~fertilizer; the
random effects are ~week|plant. We can write them like this because we
have a continuous random effect; for a categorical random effect we
would write ~1| instead.
<<>>=
model <- lme(root~fertilizer, random=~week|plant)
summary(model)
@
From all this complex output we only need to learn that the
unfertilized controls show a reduction in root mass of about 1.03,
with a standard error of about 0.203 on 10 residual degrees of
freedom, and that this is highly significant.
We can compare this to an anova restricted to week 10:
<<>>=
model2 <- aov(root~fertilizer, subset=(week==10))
summary(model2)
summary.lm(model2)
detach(grow)
@
The estimate is in about the same range as in the complex model: lme
gave us a slightly lower estimate of the difference but higher
significance. Most importantly, a correctly specified complex model
avoids arbitrary decisions about which subset of a time series to
use. The reason for the differences is that lme uses Best Linear
Unbiased Predictors (BLUPs); read about them elsewhere!
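The BLUPs themselves can be inspected with ranef and coef; here is a
sketch on made-up data shaped like the fertilizer experiment (the
per-plant intercepts and slopes below are invented, the real data stay
in fertilizer.txt):
<<>>=
library(nlme)
set.seed(4)
d <- expand.grid(week=seq(2, 10, 2), plant=factor(1:12))
b0 <- rnorm(12, sd=0.5)  # invented per-plant intercept deviations
b1 <- rnorm(12, sd=0.1)  # invented per-plant slope deviations
d$root <- 1 + b0[d$plant] + (0.8 + b1[d$plant]) * d$week + rnorm(60, sd=0.3)
m <- lme(root ~ week, random = ~week | plant, data=d)
head(ranef(m))  # BLUPs: per-plant deviations, shrunk towards zero
head(coef(m))   # fixed effects + BLUPs = per-plant intercepts and slopes
@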
\subsubsection*{Analysis of split plot experiments}
Split-plot experiments can be distinguished from other spatially
pseudoreplicated designs by their informative factor levels. Here the
means of the factors are interesting and should be compared. Clever
designs put the most interesting factor on the deepest spatial scale,
and hence give it the most genuine replicates.
<<>>=
spyield <- read.table("splityield.txt", header=TRUE)
attach(spyield)
@
In this data set a field was split into four blocks. In each block one
half was irrigated and the other was not. Each irrigation sub-block
was sown at three densities, and finally three fertilizer treatments
were applied within each irrigation-density sub-sub-block.
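The nesting can be made explicit by sketching the layout; the level
names below are hypothetical, the real ones are in splityield.txt:
<<>>=
## sketch of the split-plot layout: 4 blocks, split for irrigation,
## split again for density, split once more for fertilizer
layout <- expand.grid(fertilizer=c("N", "P", "NP"),
                      density=c("low", "medium", "high"),
                      irrigation=c("control", "irrigated"),
                      block=c("A", "B", "C", "D"))
nrow(layout)                                     # 72 plots in total
nrow(unique(layout[, c("block", "irrigation")])) # only 8 whole plots for irrigation
@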
We do this one last time by hand, using the semi-automated functions:
<<>>=
sst <- sum(yield^2)-CF(yield)
ssb <- Q.aov(yield, block)-CF(yield)
ssi <- Q.aov(yield, irrigation)-CF(yield)
@
Here things change. The error term is the interaction between blocks
and all factors at that plot size or larger. We calculate the
interaction term, using combined factor levels again, and subtract the
block and irrigation sums of squares. The large-plot error sum of
squares is therefore:
<<>>=
ssbi <- Q.aov(yield, block:irrigation)-ssb-ssi-CF(yield)
SS.big <- c(ssb, ssi, ssbi)
@
At this point we can produce the first anova table. We just need to
get the degrees of freedom right to obtain variances and
F-statistics. For the error term these are the 8 large plots in the
experiment minus 1, minus the irrigation and block df.
<<>>=
df.b <- nlevels(block)-1
df.i <- nlevels(irrigation)-1
df.bi <- nlevels(block:irrigation)-1-df.i-df.b
DF.big <- c(df.b, df.i, df.bi)
ms.i <- ssi # df.i is 1, so the mean square equals the sum of squares
ms.bi <- ssbi/df.bi
fstat.i <- ms.i/ms.bi
tval.i <- 1-pf(fstat.i, df.i, df.bi)
@
Now we can move on to the spatial scale below: density. We calculate
the sums of squares for density; for the density-irrigation
interaction; and for the density-irrigation-block interaction.
<<>>=
ssd <- Q.aov(yield, density)-CF(yield)
ssid <- Q.aov(yield, density:irrigation)-ssi-ssd-CF(yield)
ssbid <- Q.aov(yield, density:irrigation:block)-ssid-ssd-sum(SS.big)-CF(yield)
SS.med <- c(ssd, ssid, ssbid)
@
We get the degrees of freedom to calculate variances and Fstats.
<<>>=
df.d <- nlevels(density)-1
df.id <- df.d*df.i
df.bid <- nlevels(irrigation:density:block) -1 -df.d -df.id -sum(DF.big)
DF.med <- c(df.d, df.id, df.bid)
MS.med <- SS.med/DF.med
fstat.d <- MS.med[1]/MS.med[3]
fstat.id <- MS.med[2]/MS.med[3]
tval.d <- 1-pf(fstat.d, df.d, df.bid)
tval.id <- 1-pf(fstat.id, df.id, df.bid)
@
Finally we can move on to the smallest plots.
<<>>=
ssf <- Q.aov(yield, fertilizer)-CF(yield)
ssif <- Q.aov(yield, fertilizer:irrigation)-ssi-ssf-CF(yield)
ssdf <- Q.aov(yield, fertilizer:density)-ssd-ssf-CF(yield)
ssidf <- Q.aov(yield, fertilizer:density:irrigation) -ssid -ssif -ssdf -ssi -ssd -ssf -CF(yield)
SS.small <- c(ssf, ssif, ssdf, ssidf)
sse <- sst-sum(c(SS.big,SS.med,SS.small))
@
And from this, via degrees of freedom, variances and F-statistics, we get to the probabilities.
<<>>=
df.f <- nlevels(fertilizer)-1
DF.small <- c(df.f=nlevels(fertilizer)-1, df.if=df.i*df.f, df.df=df.f*df.d, df.idf=df.i*df.d*df.f)
df.err <- length(yield)-1-sum(DF.big)-sum(DF.med)-sum(DF.small)
MS.small <- SS.small/DF.small
MS.small
ms.err <- sse/df.err
Fstat.small <- MS.small/ms.err
Fstat.small
tval.small <- sapply(1:4, function (i) 1- pf(Fstat.small[[i]], DF.small[[i]], df.err))
tval.small
@
So this gives three tables this time:
\begin{table}[ht]
\begin{tabular}{rrrrrr}
\hline
Source & SS & df & MS (var) & F & p \\
\hline
Block & \Sexpr{round(ssb,3)} & \Sexpr{round(df.b,3)} & & \\
Irrigation& \Sexpr{round(ssi, 3)} & \Sexpr{round(df.i,3)} & \Sexpr{round(ms.i,3)} &\Sexpr{round(fstat.i,3)} & \Sexpr{round(tval.i,3)} \\
Error & \Sexpr{round(ssbi,3)} & \Sexpr{round(df.bi,3)} & \Sexpr{round(ms.bi,3)} & & \\
\hline
\end{tabular}
\end{table}
\begin{table}[ht]
\begin{tabular}{rrrrrr}
\hline
Source & SS & df & MS (var) & F & p \\
\hline
Density & \Sexpr{round(ssd,3)} & \Sexpr{round(df.d,3)} & \Sexpr{round(MS.med[[1]],3)} & \Sexpr{round(fstat.d,3)} & \Sexpr{round(tval.d,3)} \\
Irrigation:Density& \Sexpr{round(ssid,3)}&\Sexpr{df.id} & \Sexpr{round(MS.med[[2]],3)} & \Sexpr{round(fstat.id,3)} & \Sexpr{round(tval.id,3)} \\
Error & \Sexpr{round(ssbid,3)} & \Sexpr{round(df.bid,3)} & \Sexpr{round(MS.med[[3]],3)} & & \\
\hline
\end{tabular}
\end{table}
\begin{table}[ht]
\begin{tabular}{rrrrrr}
\hline
Source & SS & df & MS (var) & F & p \\
\hline
Fertilizer & \Sexpr{round(ssf,3)} & \Sexpr{round(DF.small[[1]])} & \Sexpr{round(MS.small[[1]],3)} & \Sexpr{round(Fstat.small[[1]],3)} & \Sexpr{round(tval.small[[1]],3)} \\
Irrigation:Fertilizer& \Sexpr{round(ssif,3)}&\Sexpr{DF.small[[2]]} & \Sexpr{round(MS.small[[2]],3)} & \Sexpr{round(Fstat.small[[2]],3)} & \Sexpr{round(tval.small[[2]],3)} \\
Density:Fertilizer& \Sexpr{round(ssdf,3)}&\Sexpr{DF.small[[3]]} & \Sexpr{round(MS.small[[3]],3)} & \Sexpr{round(Fstat.small[[3]],3)} & \Sexpr{round(tval.small[[3]],3)} \\
Irrigation:Density:Fertilizer& \Sexpr{round(ssidf,3)}&\Sexpr{DF.small[[4]]} & \Sexpr{round(MS.small[[4]],3)} & \Sexpr{round(Fstat.small[[4]],3)} & \Sexpr{round(tval.small[[4]],3)} \\
Error & \Sexpr{round(sse,3)} & \Sexpr{round(df.err,3)} & \Sexpr{round(ms.err,3)} & & \\
\hline
\end{tabular}
\end{table}
Finally we can model in R and will understand the output just calculated by hand.
<<>>=
mod <- aov(yield~irrigation*density*fertilizer+Error(block/irrigation/density/fertilizer))
summary(mod)
@
And plot the model.
<<fig=TRUE>>=
par(mfrow=c(2,2))
interaction.plot(fertilizer, density, yield)
interaction.plot(fertilizer, irrigation, yield)
interaction.plot(density, irrigation, yield)
@
\subsubsection*{An even more complex split-split-split-split-plot experiment}
<<>>=
splot <- read.table("splitplot.txt", header=TRUE)
attach(splot)
@
I will not do this by hand...
<<>>=
model <- aov(Biomass~Insect*Mollusc*Rabbit*Lime*Competition*Nutrient+
Error(Block/Rabbit/Lime/Competition/Nutrient))
summary(model)
@
Dauntingly complex! We can use interaction.plot to display the
interaction terms two at a time.
<<fig=TRUE>>=
par(mfrow=c(2,2))
interaction.plot(Rabbit,Mollusc, Biomass)
interaction.plot(Rabbit, Nutrient, Biomass)
interaction.plot(Nutrient, Mollusc, Biomass)
@
This demonstrates that the strong effect of the nutrient treatment
(bottom left) is hardly altered by mollusc or rabbit. Clearly this
model needs simplification.
\end{document}