PKGW: How-To: Take a 2D weighted average in Numpy

2018 October 15

Say that you have a 2D image stored as a Numpy array. Along with the data themselves, you also have a noise image of the uncertainty associated with each pixel. You’d like to bin your image along both axes to increase your signal-to-noise. How does one go about that?

I don’t know of one single preferred tool for doing such averaging — you more or less have to roll your own code. I’ll give the 80% solution here. It's a bit tricky to cover all possible corner cases, but the basic operations should hopefully seem straightforward if you've taken a data analysis class. There's one sneaky trick to writing efficient Python code to do the operation, though.

Fundamentally you want to take a weighted average, where the weight that you use is the inverse variance of each measurement — that is, the inverse of the squares of the standard deviations. Starting with the most straightforward case, imagine you have an array of n uncertain measurements, and you want to take their weighted average in this way. We can write:

# "data" is the array of measurements. "uncerts" is the array
# of their uncertainties / standard deviations.
data = [....]
uncerts = [...]

# "weights" are the weights to use when averaging: inverse variances.
weights = uncerts ** -2

# This is how you take a weighted average:
wt_avg = (data * weights).sum() / weights.sum()

# And if you work out the stats, this is the uncertainty
# of "wt_avg" when things are well-behaved:
uncert_wt_avg = 1 / np.sqrt(weights.sum())

For the first wrinkle, imagine that you have an array of 80 numbers, and you want to take the weighted average of in groups of 10 numbers — eight groups in all. The clever trick is that you can do this efficiently by using the np.reshape() function to pretend that your array is two-dimensional, and then using the axis= optional argument to the sum() function to only sum over one axis:

# We have 80 measurements and 80 uncerts; nonsense values
# used here
data = np.random.normal(size=80)
uncerts = np.ones(80)

# Re-interpret the arrays as 8-by-10 2D arrays
r_data = data.reshape((8, 10))
r_uncerts = uncerts.reshape((8, 10))

# Proceed as before, but only sum along one axis each time. We sum
# along the 1-th axis, which is counted starting from zero -- so
# we're summing along the rightmost axis, the one of size 10.
r_weights = r_uncerts ** -2
wt_avg = (r_data * r_weights).sum(axis=1) / r_weights.sum(axis=1)
uncert_wt_avg = 1 / np.sqrt(r_weights.sum(axis=1))

If you run the above code, you should find that wt_avg and uncert_wt_avg are 8-element 1D arrays, as intended.

You can use the same basic trick for a 2D image by turning it four-dimensional. The resulting code is very similar. Say we have an 80-by-120 image that we want to bin by 10 pixels along both dimensions:

# Data and uncertainty images
data = np.random.normal(size=(80, 120))
uncerts = np.ones((80, 120))

# Re-interpret as 4D
r_data = data.reshape((8, 10, 12, 10))
r_uncerts = uncerts.reshape((8, 10, 12, 10))

# Proceed as before, but now sum along two axes.
r_weights = r_uncerts ** -2
wt_avg = (r_data *
          r_weights).sum(axis=(1,3)) / r_weights.sum(axis=(1,3))
uncert_wt_avg = 1 / np.sqrt(r_weights.sum(axis=(1, 3)))

Finally, things are often trickier in the real world since your array dimensions might not be so neatly divisible into groups. The most convenient thing to do is drop a few samples off the edges of your arrays if needed. There’s no single “best” way to do so: I usually end up experimenting with numbers to trade off between how big I want each bin to be, and how many samples I’m comfortable discarding. Given a target bin size on each axis, though, here’s a sketch of how you might chop your arrays down to the desired size:

def chop_and_reshape_array(arr, y_bin_size, x_bin_size):
    ny, nx = arr.shape
    n_y_bins = ny // y_bin_size
    n_x_bins = nx // x_bin_size
    arr = arr[:n_y_bins * y_bin_size,
              :n_x_bins * x_bin_size]
    return arr.reshape((n_y_bins, y_bin_size, n_x_bins, x_bin_size))

Here we are using my preferred convention of mentally labeling the two axes of a 2D array [y, x]: this corresponds to how 2D arrays are visualized in Python.

Questions or comments? For better or worse this website isn’t interactive, so send me an email or, uh, Toot me.

To get notified of new posts, try subscribing to my lightweight newsletter or my RSS/Atom feed. No thirsty influencering — you get alerts about what I’m writing; I get warm fuzzies from knowing that someone’s reading!

See a list of all how-to guides.

On GitHub you can propose a revision to this guide or view its revision history.