# Hypothesis testing with the general linear model

Note: This page has some relatively advanced mathematics. You do not need to fully understand this mathematics to follow this course.

## General linear model reprise

This page starts at the same place as introduction to the general linear model.

# Import numerical and plotting libraries
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
# Only show 6 decimals when printing
np.set_printoptions(precision=6)


In that page, we had questionnaire measures of psychopathy from 12 students:

psychopathy = np.array([11.416,   4.514,  12.204,  14.835,
                        8.416,   6.563,  17.343,  13.02,
                        15.19 ,  11.902,  22.721,  22.324])


We also had skin-conductance scores from the palms of each of the same 12 students, to get a measure of how sweaty they are:

clammy = np.array([0.389,  0.2  ,  0.241,  0.463,
                   4.585,  1.097,  1.642,  4.972,
                   7.957,  5.585,  5.527,  6.964])


We believe that the clammy scores have some straight-line relationship to the psychopathy scores. $$n$$ is the number of elements in psychopathy and clammy: $$n = 12$$. Call the 12 values for psychopathy $$\vec{y} = [y_1, y_2, ... , y_n]$$. The 12 values for clammy are $$\vec{x} = [x_1, x_2, ... , x_n]$$. Our straight line model is:

$\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}} \newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}} \newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}} \newcommand{\ehat}{\hat{\evec}} \newcommand{\cvec}{\vec{c}} \newcommand{\rank}{\textrm{rank}}$
$y_i = c + b x_i + e_i$

where $$c$$ is the intercept, $$b$$ is the slope, and $$e_i$$ is the remainder of $$y_i$$ after subtracting $$c + b x_i$$.

We then defined a new vector $$\evec = [e_1, e_2, ... e_n]$$ for remaining error, and rewrote the same formula in vector notation:

$\yvec = c + b \xvec + \evec$
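To make this concrete, here is a small sketch using the arrays above, with made-up (hypothetical) values for the intercept and slope, just to show what the errors are:

# Sketch: errors left over after subtracting a guessed line from the data.
c_guess, b_guess = 10, 0.9        # hypothetical intercept and slope
e_guess = psychopathy - (c_guess + b_guess * clammy)   # one error per student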

We defined a new $$n=12$$ element vector $$\vec{1}$$ containing all ones, and used this to build a two-column design matrix $$\Xmat$$, with first column $$\vec{1}$$ and second column $$\vec{x}$$. This allowed us to rewrite the vector formulation as a matrix multiplication and addition:

$\yvec = \Xmat \bvec + \evec$

where $$\bvec$$ is:

$\begin{split} \left[ \begin{array}{c} c \\ b \\ \end{array} \right] \end{split}$
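For example, here is a minimal sketch of building this design matrix in NumPy, using the clammy scores above (the same construction appears again in the calculations further down the page):

# Sketch: build the two-column design matrix from a ones vector and clammy.
n = len(clammy)              # n = 12 rows, one per student
X = np.ones((n, 2))          # first column: the vector of ones
X[:, 1] = clammy             # second column: the clammy scores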

Using the matrix formulation of the general linear model, we found the least squares estimate for $$\bvec$$ is:

$\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec$

The formula above applies when $$\Xmat^T \Xmat$$ is invertible. Generalizing to the case where $$\Xmat^T \Xmat$$ is not invertible, the least squares estimate is:

$\bhat = \Xmat^+ \yvec$

where $$\Xmat^+$$ is the Moore-Penrose pseudoinverse of $$\Xmat$$.
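As a quick check (a small sketch using the arrays above), npl.pinv gives the same estimate as the $$(\Xmat^T \Xmat)^{-1} \Xmat^T$$ formula when $$\Xmat^T \Xmat$$ is invertible:

# Sketch: pseudoinverse and normal-equations estimates agree for this design.
X_chk = np.ones((len(clammy), 2))
X_chk[:, 1] = clammy
B_normal_eq = npl.inv(X_chk.T @ X_chk) @ X_chk.T @ psychopathy
B_pinv = npl.pinv(X_chk) @ psychopathy
assert np.allclose(B_normal_eq, B_pinv)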

The hat (^) on $$\bhat$$ reminds us that this is an estimate of $$\bvec$$. We derived this $$\bhat$$ estimate from our sample, hoping that it will be a reasonable estimate for the $$\bvec$$ that applies to the whole population.

## The residual error

$$\bhat$$ gives us a corresponding estimate of $$\evec$$:

$\ehat = \yvec - \Xmat \bhat$

The least squares criterion that we used to derive $$\bhat$$ specifies that $$\bhat$$ is the vector giving us the smallest sum of squares of $$\ehat$$. We can write that criterion for $$\bhat$$ like this:

$\bhat = \textrm{argmin}_{\bvec} \sum_{i=1}^n e_i^2$

Read this as “$$\bhat$$ is the value of the vector $$\bvec$$ that gives the minimum value for the sum of the squared residual errors”.
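We can see this criterion in action with a small numerical sketch (using the data above): nudging the parameters away from the least-squares estimate can only increase the sum of squared residuals:

# Sketch: perturbing the least-squares estimate increases the sum of squared errors.
X_chk = np.ones((len(clammy), 2))
X_chk[:, 1] = clammy
B_chk = npl.pinv(X_chk) @ psychopathy
ss_at_bhat = np.sum((psychopathy - X_chk @ B_chk) ** 2)
B_nudged = B_chk + np.array([0.1, -0.1])   # an arbitrary small change
ss_nudged = np.sum((psychopathy - X_chk @ B_nudged) ** 2)
assert ss_nudged > ss_at_bhat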

From now on, we will abbreviate $$\sum_{i=1}^n e_i^2$$ as $$\sum e_i^2$$, assuming the sum is over all elements indexed $$1 .. n$$.

Remembering the definition of the dot product, we can also write $$\sum e_i^2$$ as the dot product of $$\ehat$$ with itself:

$\sum e_i^2 \equiv \ehat \cdot \ehat$

Read $$\equiv$$ as “equivalent to”. We can also express $$\sum e_i^2$$ as the matrix multiplication of $$\ehat$$ as a row vector with $$\ehat$$ as a column vector. Because we assume that vectors are column vectors in matrix operations, we can write that formulation as:

$\sum e_i^2 \equiv \ehat^T \ehat$
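Here is a small sketch confirming that these three forms give the same number, using an arbitrary example vector:

# Sketch: sum of squares == dot product == row vector times column vector.
vec = np.array([1.5, -2.0, 0.5])          # any example vector
ss = np.sum(vec ** 2)
assert np.isclose(ss, vec @ vec)          # dot product of vec with itself
row = vec[np.newaxis, :]                  # vec as a 1 x n row vector
col = vec[:, np.newaxis]                  # vec as an n x 1 column vector
assert np.isclose(ss, (row @ col)[0, 0])  # matrix multiplication form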

## Unbiased estimate of population variance

We will soon need an unbiased estimate of the population variance. The population variance is $$\frac{1}{N} \sum e_i^2$$ where the population has $$N$$ elements, and $$e_1, e_2, ... e_N$$ are the remaining errors for all $$N$$ observations in the population.

However, we do not have all $$N$$ observations in the population; we only have a sample of size $$n$$ from the population. In our particular case, $$n=12$$.

We could use the sample variance as this estimate: $$\frac{1}{n} \sum e_i^2$$. Unfortunately, for reasons we don’t have space to go into, this is a biased estimate of the population variance.

To get an unbiased estimate of the variance, we need to allow for the number of independent columns in the design $$\Xmat$$. The number of independent columns in the design is given by the matrix rank of $$\Xmat$$. Specifically, if $$\rank(\Xmat)$$ is the matrix rank of $$\Xmat$$, an unbiased estimate of population variance is given by:

$\hat\sigma^2 = \frac{1}{n - \rank(\Xmat)} \sum e_i^2$

For example, we saw in the worked example of GLM that, when we have a single regressor and $$\rank(\Xmat) = 1$$, we divide the sum of squares of the residuals by $$n - 1$$, where $$n$$ is the number of rows in the design. This $$n-1$$ divisor is Bessel’s correction.
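Here is a quick sketch of that case with the psychopathy scores above: for a design containing only the column of ones, the residuals are the deviations from the mean, and dividing their sum of squares by $$n - 1$$ matches NumPy’s Bessel-corrected variance (ddof=1):

# Sketch: intercept-only design gives the Bessel-corrected variance of the data.
n_obs = len(psychopathy)
resid = psychopathy - np.mean(psychopathy)       # residuals from the mean
var_unbiased = np.sum(resid ** 2) / (n_obs - 1)  # divide by n - rank(X) = n - 1
assert np.isclose(var_unbiased, np.var(psychopathy, ddof=1))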

We will also use these terms below:

- $$\rank(\Xmat)$$: degrees of freedom of the design;

- $$n - \rank(\Xmat)$$: degrees of freedom of the error.

## Hypothesis testing

We used contrast vectors to form particular linear combinations of the parameter estimates in $$\bhat$$. For example, we used the contrast vector $$\cvec = [0, 1]$$ to select the estimate for $$b$$ – the slope of the line:

$\hat{b} = [0, 1] \bhat$
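Here is a tiny sketch of the mechanics, using a made-up parameter vector just for illustration:

# Sketch: the contrast [0, 1] picks out the second parameter (the slope estimate).
c_eg = np.array([0, 1])
B_eg = np.array([9.8, 0.8])   # hypothetical [intercept, slope] estimates
assert c_eg @ B_eg == B_eg[1]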

## t tests using contrast vectors

The formula for a t statistic test on any linear combination of the parameters in $$\bhat$$ is:

$\newcommand{\cvec}{\vec{c}} t = \frac{\cvec^T \bhat} {\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}$

where $$\hat{\sigma}^2$$ is our unbiased estimate of the population variance.

Here is the t statistic calculation in Python:

# Data vector
y = psychopathy
# Covariate vector
x = clammy
# Contrast vector as column vector
c = np.array([[0, 1]]).T
n = len(y)
# Design matrix
X = np.ones((n, 2))
X[:, 1] = x
# X.T X is invertible
iXtX = npl.inv(X.T @ X)
# Least-squares estimate of B
B = iXtX @ X.T @ y
e = y - X @ B
# Degrees of freedom of design
rank_x = npl.matrix_rank(X)
# The two columns are not collinear, so the rank is 2
rank_x

2

# Unbiased estimate of population variance
df_error = n - rank_x
s2_hat = e @ e / df_error
t = c.T @ B / np.sqrt(s2_hat * c.T @ iXtX @ c)
t

array([[1.914389]])


## F tests

F tests are another way to test hypotheses about linear models. They are particularly useful for testing whether there is a significant reduction in the residual error when adding one or more regressors.

The simplest and generally most useful way of thinking of the F test is as a test comparing two models: a full model and a reduced model. The full model contains the regressors that we want to test. We will use $$\Xmat_f$$ for the full model. The reduced model is a model that does not contain the regressors we want to test, but does contain all the other regressors in the full model. We will use $$\Xmat_r$$ for the reduced model.

In our case, $$\Xmat_f$$ is the model containing the clammy regressor, as well as the column of ones that models the intercept.

$$\Xmat_r$$ is our original model, which contains only the column of ones.

If the full model is a better fit to the data than the reduced model, then adding the new regressor(s) will cause a convincing drop in the size of residuals.

The F test is a measure that reflects the drop in the magnitude of squared residuals as a result of adding the new regressors.

Now we define $$SSR(\Xmat_r)$$ and $$SSR(\Xmat_f)$$. These are the Sums of Squares of the Residuals for the reduced and full models respectively.

$\begin{split} \bhat_r = \Xmat_r^+ \yvec \\ \hat\evec_r = \yvec - \Xmat_r \bhat_r \\ SSR(\Xmat_r) = \hat\evec_r^T \hat\evec_r \\ \end{split}$
$\begin{split} \bhat_f = \Xmat_f^+ \yvec \\ \hat\evec_f = \yvec - \Xmat_f \bhat_f \\ SSR(\Xmat_f) = \hat\evec_f^T \hat\evec_f \end{split}$

$$ESS = SSR(\Xmat_r) - SSR(\Xmat_f)$$ is the Extra Sum of Squared residuals explained by the full compared to the reduced model. The top half of the ratio that forms the F statistic is $$ESS / \nu_1$$, where $$\nu_1$$ is the number of extra independent regressors (columns) in $$\Xmat_f$$ compared to $$\Xmat_r$$. Specifically:

$\nu_1 = \rank(\Xmat_f) - \rank(\Xmat_r)$

The bottom half of the F statistic is the estimated variance $$\hat{\sigma}^2$$ from the full model. This can also be written as $$SSR(\Xmat_f) / \nu_2$$, where $$\nu_2$$ is the degrees of freedom of the error:

$\begin{split} \begin{eqnarray} F_{\nu_1, \nu_2} & = & \frac{ (\hat\evec_r^T \hat\evec_r - \hat\evec_f^T \hat\evec_f) / \nu_{1} } {\hat\evec_f^T \hat\evec_f / \nu_{2}} \\ & = & \frac{ (\textrm{SSR}(\Xmat_r) - \textrm{SSR}(\Xmat_f)) / \nu_1} {\textrm{SSR}(\Xmat_f) / \nu_2} \end{eqnarray} \end{split}$

Here is the F-statistic calculation in Python:

# We already have X, e, rank_x, for the full model, from
# the t calculation
X_f, e_f, rank_f = X, e, rank_x
# Now calculate the same for the reduced model
X_r = np.ones((n, 1))
iXtX_r = npl.inv(X_r.T @ X_r)
B_r = iXtX_r @ X_r.T @ y
e_r = y - X_r @ B_r
rank_r = npl.matrix_rank(X_r)  # One column, rank 1
rank_r

1

# Calculate the F statistic
SSR_f = e_f @ e_f
SSR_r = e_r @ e_r
nu_1 = rank_f - rank_r
F = ((SSR_r - SSR_f) / nu_1) / (SSR_f / (n - rank_f))
F

3.664886189966517


For reasons that we haven’t explained here, the F statistic for a single column is the square of the t statistic testing the same column:

t ** 2

array([[3.664886]])
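We can confirm the two agree numerically (a one-line sketch using the F and t values calculated above):

# Sketch: the F statistic equals the square of the t statistic for this contrast.
assert np.isclose(F, (t ** 2).item())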