---
jupyter:
jupytext:
notebook_metadata_filter: all,-language_info
split_at_heading: true
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.13.7
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Hypothesis testing with the general linear model
**Note**: This page has some relatively advanced mathematics. You do not need
to fully understand this mathematics to follow this course.
## General linear model reprise
This page starts at the same place as [introduction to the general linear model](glm_intro.Rmd).
```{python}
# Import numerical and plotting libraries
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
# Only show 6 decimals when printing
np.set_printoptions(precision=6)
```
In that page, we had questionnaire measures of psychopathy from 12 students:
```{python}
psychopathy = np.array([11.416, 4.514, 12.204, 14.835,
8.416, 6.563, 17.343, 13.02,
15.19 , 11.902, 22.721, 22.324])
```
We also had skin-conductance scores from the palms of the each of the same 12
students, to get a measure of how sweaty they are:
```{python}
clammy = np.array([0.389, 0.2 , 0.241, 0.463,
4.585, 1.097, 1.642, 4.972,
7.957, 5.585, 5.527, 6.964])
```
We believe that the `clammy` score has some straight-line relationship to
the `psychopathy` scores. $n$ is the number of elements in `psychopathy`
and `clammy`: $n = 12$. Call the 12 values for `psychopathy` $\vec{y} =
[y_1, y_2, .... , y_n]$. The 12 values for `clammy` are $\vec{x} = [x_1,
x_2, ... , x_n]$. Our straight line model is:
$$
\newcommand{\yvec}{\vec{y}}
\newcommand{\xvec}{\vec{x}}
\newcommand{\evec}{\vec{\varepsilon}}
\newcommand{Xmat}{\boldsymbol X}
\newcommand{\bvec}{\vec{\beta}}
\newcommand{\bhat}{\hat{\bvec}}
\newcommand{\yhat}{\hat{\yvec}}
\newcommand{\ehat}{\hat{\evec}}
\newcommand{\cvec}{\vec{c}}
\newcommand{\rank}{\textrm{rank}}
$$
$$
y_i = c + b x_i + e_i
$$
where $c$ is the intercept, $b$ is the slope, and $e_i$ is the remainder of
$y_i$ after subtracting $c + b x_i$.
We then defined a new vector $\evec = [e_1, e_2, ... e_n]$ for remaining
error, and rewrote the same formula in vector notation:
$$
\yvec = c + b \xvec + \evec
$$
We defined a new $n=12$ element vector $\vec{1}$ containing all ones, and
used this to build a two-column *design matrix* $\Xmat$, with first column
$\vec{1}$ and second column $\vec{x}$. This allowed us to rewrite the vector
formulation as a matrix multiplication and addition:
$$
\yvec = \Xmat \bvec + \evec
$$
where $\bvec$ is:
$$
\left[
\begin{array}{\bvec}
c \\
b \\
\end{array}
\right]
$$
Using the matrix formulation of the general linear model, we found the least
squares *estimate* for $\bvec$ is:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
The formula above applies when $\Xmat^T \Xmat$ is invertible. Generalizing to
the case where $\Xmat^T \Xmat$ is not invertible, the least squares estimate
is:
$$
\bhat = \Xmat^+ \yvec
$$
where $\Xmat^+$ is the [Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse) of $\Xmat$.
The `^` on $\bhat$ reminds us that this is an *estimate* of $\bvec$. We
derived this $\bhat$ estimate from our sample, hoping that it will be a
reasonable estimate for the $\bvec$ that applies to the whole population.
# The residual error
$\bhat$ gives us a corresponding estimate of $\evec$:
$$
\ehat = \yvec - \Xmat \bhat
$$
The least squares criterion that we used to derive $\bhat$ specifies that
$\bhat$ is the vector giving us the smallest sum of squares of $\ehat$. We can
write that criterion for $\bhat$ like this:
$$
\bhat = \textrm{argmin}_{\bvec} \sum_{i=1}^n e_i^2
$$
Read this as “$\bhat$ is the value of the vector $\bvec$ that gives the
minimum value for the sum of the squared residual errors”.
From now on, we will abbreviate $\sum_{i=1}^n e_i^2$ as $\sum e_i^2$, assuming
it is the sum over all elements index $1 .. n$.
Remembering the definition of the dot product, we can also write $\sum e_i^2$
as the dot product of $\ehat$ with itself:
$$
\sum e_i^2 \equiv \ehat \cdot \ehat
$$
Read $\equiv$ as “equivalent to”. We can also express $\sum e_i^2$ as the
matrix multiplication of $\ehat$ as a row vector with $\ehat$ as a column
vector. Because we assume that vectors are column vectors in matrix
operations, we can write that formulation as:
$$
\sum e_i^2 \equiv \ehat^T \ehat
$$
## Unbiased estimate of population variance
We will soon need an unbiased estimate of the population variance. The
population variance is $\frac{1}{N} \sum e_i^2$ where the population has $N$
elements, and $e_1, e_2, ... e_N$ are the remaining errors for all $N$
observations in the population.
However, we do not have all $N$ observations in the population, we only have a
$n$-size *sample* from the population. In our particular case $n=12$.
We could use the sample variance as this estimate: $\frac{1}{n} \sum e_i$.
Unfortunately, for [reasons](https://en.wikipedia.org/wiki/Bessel%27s_correction) we don’t have space to go
into, this is a *biased estimate of the population variance*.
To get an unbiased estimate of the variance, we need to allow for the number
of independent columns in the design $\Xmat$. The number of independent
columns in the design is given by the [matrix rank](matrix_rank.Rmd) of
$\Xmat$. Specifically, if $\rank(\Xmat)$ is the matrix rank of $\Xmat$, an
unbiased estimate of population variance is given by:
$$
\hat\sigma^2 = \frac{1}{n - \rank(\Xmat)} \sum e_i^2
$$
For example, we saw in the [worked example of GLM](mean_test_example.Rmd),
that when we have a single regressor, and $\rank(\Xmat) = 1$, we divide the
sum of squares of the residuals by $n - 1$ where $n$ is the number of rows in
the design. This $n-1$ divisor is [Bessel’s
correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
We will also use these terms below:
* $\rank(\Xmat)$: *degrees of freedom of the design*;
* $n - \rank(\Xmat)$: *degrees of freedom of the error*.
## Hypothesis testing
We used contrast vectors to form particular linear combinations of the
parameter estimates in $\bhat$. For example, we used the contrast vector
$\cvec = [0, 1]$ to select the estimate for $b$ – the slope of the line:
$$
b = [0, 1] \bhat
$$
## t tests using contrast vectors
The formula for a t statistic test on any linear combination of the parameters
in $\bhat$ is:
$$
\newcommand{\cvec}{\vec{c}}
t = \frac{\cvec^T \bhat}
{\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}
$$
where $\hat{\sigma^2}$ is our unbiased estimate of the population variance.
Here is the t statistic calculation in Python:
```{python}
# Data vector
y = psychopathy
# Covariate vector
x = clammy
# Contrast vector as column vector
c = np.array([[0, 1]]).T
n = len(y)
# Design matrix
X = np.ones((n, 2))
X[:, 1] = x
# X.T X is invertible
iXtX = npl.inv(X.T @ X)
# Least-squares estimate of B
B = iXtX @ X.T @ y
e = y - X @ B
# Degrees of freedom of design
rank_x = npl.matrix_rank(X)
# The two columns are not colinear, so rank is 2
rank_x
```
```{python}
# Unbiased estimate of population variance
df_error = n - rank_x
s2_hat = e @ e / df_error
t = c.T @ B / np.sqrt(s2_hat * c.T @ iXtX @ c)
t
```
# F tests
F tests are another way to test hypotheses about the linear models. They are
particularly useful for testing whether there is a significant reduction in
the residual error when adding one or more regressors.
The simplest and generally most useful way of thinking of F test is as a test
comparing two models: a *full model* and a *reduced model*. The full model
contains the regressors that we want to test. We will use $\Xmat_f$ for the
full model. The reduced model is a model that does not contain the regressors
we want to test, but does contain all other regressors in the full model
. We will use $\Xmat_r$ for the reduced model.
In our case, $\Xmat_f$ is the model containing the `clammy` regressor, as
well as the column of ones that models the intercept.
$\Xmat_r$ is our original model, that only contains the column of ones.
If the full model is a better fit to the data than the reduced model, then
adding the new regressor(s) will cause a convincing drop in the size of
residuals.
The F test is a measure that reflects the drop in the magnitude of squared
residuals as a result of adding the new regressors.
Now we define the $SSR(\Xmat_r)$ and $SSR(\Xmat_f)$. These are the Sums of
Squares of the Residuals of the reduced and full model respectively.
$$
\bhat_r = \Xmat_r^+ \yvec \\
\hat\evec_r = \yvec - \Xmat_r \bhat_r \\
SSR(\Xmat_r) = \hat\evec_r^T \hat\evec_r \\
$$
$$
\bhat_f = \Xmat_f^+ \yvec \\
\hat\evec_f = \yvec - \Xmat_f \bhat_f \\
SSR(\Xmat_f) = \hat\evec_f^T \hat\evec_f
$$
$ESS = SSR(\Xmat_r) - SSR(\Xmat_f)$ is the Extra Sum of Squared residuals
explained by the full compared to the reduced model. The top half of the
ratio that forms the F statistic is $ESS / \nu_1$, where $\nu_1$ is the number
of extra independent regressors (columns) in $\Xmat_f$ compared to $\Xmat_r$.
Specifically:
$$
\nu_1 = \rank(\Xmat_f) - \rank(\Xmat_r)
$$
The bottom half of the F statistic is the estimated variance $\hat{\sigma^2}$
from the full model. This can also be written as $SSR(\Xmat_f) / \nu_2$ where
$\nu_2$ is the *degrees of freedom of the error*:
$$
\begin{eqnarray}
F_{\nu_1, \nu_2} & = &
\frac{
(\hat\evec_r^T \hat\evec_r - \hat\evec_f^T \hat\evec_f)
/ \nu_{1} }
{\hat\evec_f^T \hat\evec_f / \nu_{2}} \\
& = &
\frac{
(\textrm{SSR}(\Xmat_r) - \textrm{SSR}(\Xmat_f)) / \nu_1}
{\textrm{SSR}(\Xmat_f) / \nu_2}
\end{eqnarray}
$$
Here is the F-statistic calculation in Python:
```{python}
# We already have X, e, rank_x, for the full model, from
# the t calculation
X_f, e_f, rank_f = X, e, rank_x
# Now calculate the same for the reduced model
X_r = np.ones((n, 1))
iXtX_r = npl.inv(X_r.T @ X_r)
B_r = iXtX_r @ X_r.T @ y
e_r = y - X_r @ B_r
rank_r = npl.matrix_rank(X_r) # One column, rank 1
rank_r
```
```{python}
# Calculate the F statistic
SSR_f = e_f @ e_f
SSR_r = e_r @ e_r
nu_1 = rank_f - rank_r
F = ((SSR_r - SSR_f) / nu_1) / (SSR_f / (n - rank_f))
F
```
For reasons that we haven’t explained here, the F statistic for a single
column is the square of the t statistic testing the same column:
```{python}
t ** 2
```