Millions of Medians

5 Minute Read

Difficult software problems often reveal the power behind data structures and algorithms. Recently I encountered a challenging problem:

Given a large array of numbers m, calculate the medians of consecutive subsets of n values, scanning from left to right.

Here's a small-scale example:

m = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
n = 3

Consecutive subsets of 3 values would be:

[0, 1, 2], [1, 2, 3], [2, 3, 4] ... [7, 8, 9]

Medians of consecutive subsets can be useful in data science, for example, when studying a list of bank transactions and comparing a given day's spending with the median of previous days. If, say, a given transaction is ten times greater than the median of the previous 30 days, the transaction may be fraudulent.

But, imagine you had millions of numbers as your raw data. What algorithmic approach can possibly hack that?

The trival solution

Working in Java 8, I started with a trivial solution. To take a median in the most straightforward way, we sort the numbers and find the "middle" number. For sets of an even number of values, there are two middle numbers, so we average those. To sort the numbers, I used the Insertion Sort algorithm.

static void insertionSort(int[] arr) {
    for (int i = 1; i < arr.length; i++) {
        int key = arr[i];
        int j = i - 1;

        while (j >= 0 && arr[j] > key) {
            arr[j + 1] = arr[j];
            j = j - 1;
        }

        arr[j + 1] = key;
    }
}

static float trivialMedian(int[] arr) {
    insertionSort(arr);

    int mid = (int)(arr.length / 2);

    if (arr.length % 2 == 0) {
        return (float)(arr[mid-1] + arr[mid]) / 2;
    }

    return (float)arr[mid];
}

This solution did not work at all. Iterating on all subsets, it was far too slow and the script never finished. It was clear something about my understanding of the problem needed to fundamentally change.

Taking a median requires sorting the numbers... can our sorting be simplified?

Using a frequency table

I realized handling consecutive subsets of numbers means that, from one subset to the next, only one value of the subset is actually changed. Let's look at our list again:

[0, 1, 2], [1, 2, 3], [2, 3, 4] ... [7, 8, 9]

Going from [0, 1, 2] to [1, 2, 3], speaking simply of the numbers themselves, 0 is replaced by 3. This is just one small change per iteration. Yet, the subset needs to be sorted again. So, we need a data structure that allows for 1) quick lookup/replacement of any value and 2) quick sorting.

A simple array of numbers is not a good structure, since on each iteration, we would need to perform a search for the number in question, swap it with the new number, and then re-sort the list, changing the position of at most half the numbers.

Instead, a frequency table meets our two requirements really well. On a small scale, a frequency table (in code) turns this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

... into this:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Essentially, we are using the array indeces as our "buckets":

0 = 1
1 = 1
2 = 1
3 = 1
4 = 1
5 = 1
6 = 1
7 = 1
8 = 1
9 = 1

In practice, a frequency table is most useful when we already know the range of values, and that range is somewhat limited:

static int[] countsFromArray(int[] arr, int max) {
    int counts[] = new int[max + 1];

    for (int i = 0; i < arr.length; i++) {
        counts[arr[i]]++;
    }

    return counts;
}

Iterating through our subsets, swapping a number between iterations is easy, since we only need to increment and decrement once like so:

subset[prevValue]--;
subset[nextValue]++;

Since a frequency table is by nature already sorted, finding the median means only updating our array with its values and taking its middle number.

static void fillFromCounts(int[] arr, int[] counts) {
    int pos = 0;

    for (int i = 0; i < counts.length; i++) {
        int count = counts[i];
        while (0 < count) {
            arr[pos++] = i;
            count--;
        }
    }
}

With this approach, the final program was able to process millions of medians in a matter of milliseconds.

Cheers!


Updated May 22, 2020.