In which I rediscover something first described in 1957, and learn a little Hypothesis along the way.

## Binary search

This is binary search, implemented in Python:

def binary_search(haystack, needle): lo = 0 hi = len(haystack) - 1 found = None iterations = 0 while lo <= hi and found is None: iterations += 1 mid = lo + (hi - lo) // 2 if haystack[mid] == needle: found = mid elif haystack[mid] > needle: hi = mid - 1 else: lo = mid + 1 return (found, iterations)

It returns both the index of the found element and the number of iterations, for reasons which will become apparent in section 3.

How do we know it’s right? Well, let’s test it. I decided to do this with Hypothesis, a property-based testing tool. Here’s a property that an element in the list is found by `binary_search`

:

from hypothesis import given from hypothesis.strategies import lists, integers @given( haystack=lists(integers(), min_size=1), index=integers() ) def test_needle_in_haystack(haystack, index): haystack.sort() needle = haystack[index % len(haystack)] found_index, _ = binary_search(haystack, needle) assert found_index >= 0 assert found_index < len(haystack) assert haystack[found_index] == needle

Given a sorted nonempty list of integers, and an index into that list, the element at that position should be found by `binary_search`

.

We should also test the other case, elements *not* in the list shouldn’t have an index returned:

@given( haystack=lists(integers()), needle=integers() ) def test_needle_might_be_in_haystack(haystack, needle): haystack.sort() found_index, _ = binary_search(haystack, needle) if needle in haystack: assert found_index >= 0 assert found_index < len(haystack) assert haystack[found_index] == needle else: assert found_index is None

## Interpolation search

Binary search is pretty good, but I found myself wondering one day while doing Advent of Code if we could do better by not splitting the search space in the middle, but biasing our split by assuming the data is distributed linearly. After all, if you look in a dictionary for “binary” you don’t start by opening it to “M”.

This is interpolation search, it’s like binary search, but different:

def interpolation_search(haystack, needle): lo = 0 hi = len(haystack) - 1 found = None iterations = 0 while lo <= hi and found is None: iterations += 1 if needle < haystack[lo] or needle > haystack[hi]: # a new special case break elif haystack[lo] == haystack[hi]: # a new special case if needle == haystack[lo]: found = lo else: break else: # a more complex calculation mid = lo + int((((hi - lo) / (haystack[hi] - haystack[lo])) * (needle - haystack[lo]))) if haystack[mid] == needle: found = mid elif haystack[mid] > needle: hi = mid - 1 else: lo = mid + 1 return (found, iterations)

It’s a bit more complex, we’ve got two new special cases: one for if the needle is not in the haystack at all, and one for if all the elements in the haystack are equal. We’ve also got a more complex `mid`

calculation, trying to figure out where in the haystack the needle will appear.

We can use Hypothesis to compare our two search functions against each other:

@given( haystack=lists(integers()), needle=integers() ) def test_interpolation_equiv_binary(haystack, needle): haystack.sort() found_index_b, _ = binary_search(haystack, needle) found_index_i, _ = interpolation_search(haystack, needle) if found_index_b is None: assert found_index_i is None else: assert found_index_i is not None assert haystack[found_index_b] == haystack[found_index_i]

This is a common trick with property-based testing (and lots of types of testing, really): implement a simpler version of your thing and test that the more complex “real” implementation behaves the same as the simpler “test” implementation.

I intentionally didn’t do this:

@given( haystack=lists(integers()), needle=integers() ) def test_interpolation_equal_binary(haystack, needle): haystack.sort() found_index_b, _ = binary_search(haystack, needle) found_index_i, _ = interpolation_search(haystack, needle) assert found_index_b == found_index_i

Because the functions can differ if the needle is present in the haystack multiple times (eg, looking for `0`

in `[0,0,1]`

), and that’s fine.

## Is it faster?

Given our fancy midpoint calculation, the interpolation search *must* be better than (ie, do no more iterations than) binary search, right?

@given( haystack=lists(integers(), min_size=1), index=integers() ) def test_interpolation_beats_binary(haystack, index): haystack.sort() needle = haystack[index % len(haystack)] _, iterations_b = binary_search(haystack, needle) _, iterations_i = interpolation_search(haystack, needle) assert iterations_i <= iterations_b

Wrong.

==================================== FAILURES =================================== ________________________ test_interpolation_beats_binary ________________________ @given( > haystack=lists(integers(), min_size=1), index=integers() ) def test_interpolation_beats_binary(haystack, index): interpolation-search.py:101: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ haystack = [0, 1, 3], index = 64 @given( haystack=lists(integers(), min_size=1), index=integers() ) def test_interpolation_beats_binary(haystack, index): haystack.sort() needle = haystack[index % len(haystack)] _, iterations_b = binary_search(haystack, needle) _, iterations_i = interpolation_search(haystack, needle) > assert iterations_i <= iterations_b E assert 2 <= 1 interpolation-search.py:111: AssertionError ---------------------------------- Hypothesis ----------------------------------- Falsifying example: test_interpolation_beats_binary(haystack=[0, 1, 3], index=64) ====================== 1 failed, 3 passed in 0.34 seconds =======================

We have a counterexample where binary search wins: with the list `[0, 1, 3]`

and the index 64 (which gives a `needle`

of 1), binary search finds it in 1 iteration but interpolation search takes 2.

Let’s step through that example:

iteration | binary search | interpolation search | ||||||
---|---|---|---|---|---|---|---|---|

lo | hi | mid | found | lo | hi | mid | found | |

0 | 0 | 2 | False | 0 | 2 | False | ||

1 | 0 | 2 | 1 | True | 0 | 2 | 0 | False |

2 | 0 | 1 | 1 | True |

In iteration 1, the binary search picks the middle element, which is the right answer. But the interpolation search doesn’t. It’s thrown off by the assumption we’ve made in the `mid`

calculation: that the values will be linearly distributed. If they’re not, the biasing of the interpolation search towards one end of the search space will work against us.

Sadly, my idle thought about a biased search hasn’t revolutionised computer science. Better luck next time.