Subtracting the mean from columns or rows

We often want to do operations like subtract the mean from the columns or rows of a 2D array. For example, here is a 4 by 3 array:

import numpy as np
import matplotlib.pyplot as plt
# Display array values to 6 digits of precision
np.set_printoptions(precision=6, suppress=True)
arr = np.array([[3., 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]])
arr
array([[3., 1., 4.],
       [1., 5., 9.],
       [2., 6., 5.],
       [3., 5., 8.]])

Let’s say I wanted to remove the mean across the columns (the row mean). Here is the row mean:

# Mean across the second (column) axis
row_means = np.mean(arr, axis=1)
row_means
array([2.666667, 5.      , 4.333333, 5.333333])

This is a 1D array:

row_means.shape
(4,)

I want to do something like the following, but in a neater and faster way:

# Use a loop to subtract the mean from each row
de_meaned = arr.copy()
for i in range(arr.shape[0]):  # iterate over rows
    de_meaned[i] = de_meaned[i] - row_means[i]
# The rows now have very near 0 mean
np.mean(de_meaned, axis=1)
array([0., 0., 0., 0.])

An inefficient way using “np.outer”

One way of doing this subtraction is to expand the 1D shape (4,) mean vector out to a shape (4, 3) array, where the new columns are all the same as the (4,) mean vector. In fact you can do this with np.outer and a vector of ones:

means_expanded = np.outer(row_means, np.ones(3))
means_expanded
array([[2.666667, 2.666667, 2.666667],
       [5.      , 5.      , 5.      ],
       [4.333333, 4.333333, 4.333333],
       [5.333333, 5.333333, 5.333333]])

Now we can subtract this expanded array to remove the row means:

re_de_meaned = arr - means_expanded
# The row means are now very close to zero
np.mean(re_de_meaned, axis=1)
array([0., 0., 0., 0.])

This is an example of vectorizing. We worked out a way of doing the operation we wanted by using arrays, rather than having to loop over the rows of the matrix.

An efficient way using NumPy broadcasting

Our example array is shape (4, 3):

arr.shape
(4, 3)

Above we used np.outer to make a new array of shape (4, 3) that replicates the shape (4,) row mean values across 3 columns. We then subtracted the new (4, 3) mean array from the original to remove the row means.

NumPy broadcasting is a way to get to the same outcome, but without creating a new (4, 3) shaped array. Although broadcasting takes a while to get used to, it usually results in code that is more concise and saves memory by avoiding large temporary arrays. In our case, the temporary means array of shape (4, 3) is very small, but if arr had many more rows and / or columns, then the temporary means array could be very large.

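See NumPy broadcasting for a detailed description of how broadcasting works. Here, we can summarize by saying that broadcasting works out what full arrays we will need by replicating rows or columns or planes until the shapes of the two input arrays match. It does this by comparing the shapes starting from the last dimension: two dimensions are compatible if they are equal, or if one of them has length 1.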
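As a quick check of that summary, here is a minimal sketch that asks NumPy what shape it would broadcast to, without making any arrays. It assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the variable names are only for illustration.

```python
import numpy as np

# Shapes are compared from the last (rightmost) dimension backwards.
# A length-1 dimension can be stretched to match the other shape.
shape_a = np.broadcast_shapes((4, 3), (4, 1))
# A missing leading dimension is treated as length 1, then stretched.
shape_b = np.broadcast_shapes((4, 3), (3,))
print(shape_a, shape_b)

# A shape (4,) array does not broadcast against shape (4, 3), because
# the last dimensions (4 and 3) differ and neither is 1.
try:
    np.broadcast_shapes((4, 3), (4,))
    incompatible = False
except ValueError:
    incompatible = True
print(incompatible)
```

This is why the 1D row_means array cannot be subtracted from arr directly, and has to be given a length-1 second axis first.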

Here is the broadcasting way of subtracting the row means:

# Make row_means into column vector so numpy knows to replicate
# the columns during broadcasting.
row_means_col_vec = np.reshape(row_means, (4, 1))  # Better: np.newaxis.
broadcast_demeaned = arr - row_means_col_vec
np.mean(broadcast_demeaned, axis=1)
array([0., 0., 0., 0.])
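The reshape above hard-codes the number of rows. As the code comment hints, there are ways to get the same column vector without doing that. The sketch below shows two options we might use: indexing with np.newaxis, and the keepdims argument to np.mean. The names col_vec_newaxis and col_vec_keepdims are made up for this example.

```python
import numpy as np

arr = np.array([[3., 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]])
row_means = np.mean(arr, axis=1)

# Add a length-1 second axis by indexing; works for any number of rows.
col_vec_newaxis = row_means[:, np.newaxis]
# Or ask np.mean to keep the averaged axis as a length-1 axis.
col_vec_keepdims = np.mean(arr, axis=1, keepdims=True)
print(col_vec_newaxis.shape, col_vec_keepdims.shape)

# Either column vector broadcasts against the (4, 3) array.
demeaned = arr - col_vec_keepdims
print(np.allclose(np.mean(demeaned, axis=1), 0))
```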

When NumPy sees arr - row_means_col_vec it notices that arr is shape (4, 3) and row_means_col_vec is shape (4, 1). It can’t do an elementwise operation like subtract with these shapes, so it will try to work out if it can expand any missing or length 1 dimensions in the input arrays to make the shapes match. In this case, it sees that it can replicate the column of row_means_col_vec 3 times to make an array shape (4, 3). It does this in an efficient way that re-uses the memory from the first column to make up the data for the other columns, therefore saving memory compared to creating a new full (4, 3) array.
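One way to see the memory re-use for ourselves is with np.broadcast_to, which returns a read-only view with the broadcast shape. This is only a sketch of the idea; the subtraction handles the replication internally and does not need us to build this view. The name expanded is made up for this example.

```python
import numpy as np

arr = np.array([[3., 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]])
row_means_col_vec = np.reshape(np.mean(arr, axis=1), (4, 1))

# A (4, 3) view onto the (4, 1) data; no values are copied.
expanded = np.broadcast_to(row_means_col_vec, arr.shape)
print(expanded.shape)
# The stride along the second axis is 0 bytes: moving one column to
# the right lands on the same memory, so each value appears 3 times.
print(expanded.strides[1])
# The view uses the memory of the original column vector.
print(np.shares_memory(expanded, row_means_col_vec))
```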

You can see what NumPy is going to do when it tries to do elementwise operations on arrays of these shapes by using np.broadcast_arrays:

# Show what arrays NumPy will broadcast to.
bc_arr, bc_row_means = np.broadcast_arrays(arr, row_means_col_vec)
# The (4, 3) array is unchanged when broadcasting.
print(np.all(bc_arr == arr))
# The (4, 1) array has its column replicated to give a (4, 3) array.
bc_row_means
True
array([[2.666667, 2.666667, 2.666667],
       [5.      , 5.      , 5.      ],
       [4.333333, 4.333333, 4.333333],
       [5.333333, 5.333333, 5.333333]])
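Subtracting the column means is the simpler case, because the shape (3,) vector of column means already lines up with the last axis of the (4, 3) array, so no reshape is needed. Here is a minimal sketch; col_means and col_demeaned are names made up for this example.

```python
import numpy as np

arr = np.array([[3., 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]])

# Mean across the first (row) axis gives one mean per column.
col_means = np.mean(arr, axis=0)
print(col_means.shape)

# Shape (3,) is treated as (1, 3), then replicated down the 4 rows.
col_demeaned = arr - col_means
# The column means are now very close to zero.
print(np.allclose(np.mean(col_demeaned, axis=0), 0))
```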