I'd like to share this one unpopular but awesome algorithm that I only heard a mention of once. Kapun's algorithm finds a hash collision for moduli as large as $$$10^{18}$$$. I know, we have lots of algorithms that do much better than that (and I'm writing an article on that at the moment, keep tuned), but Kapun's algorithm is really surprising in its simplicity. That its correctness is so difficult to prove is of more surprise even.
Here we go.
A polynomial hash of a string $$$s$$$ is defined as
We want to find two strings with the same polynomial hash using only two characters: $$$0$$$ and $$$1$$$.
We can re-formulate this problem in another way. Let $$$a_i = b^i \mod M$$$, then we want to find two distinct subsets of $$$a$$$ with the same sum modulo $$$M$$$. Now forget about the modulus: let's just find two subsets with the same sum.
Firstly, when is this possible? There are $$$2^n$$$ possible subsets and $$$n(M - 1) - n + 1 = n (M - 2) + 1$$$ distinct sums. If $$$2^n > n (M - 2) + 1$$$, that is, $$$\frac{2^n}{n} \gg M$$$, there are certainly two subsets with the same sum. Generating $$$M$$$ or even just $$$\sqrt{M}$$$ strings (if you use birthday paradox) is impossible for large $$$M$$$, but there's still a way through.
It turns out there is a deterministic algorithm that attempts to solve this problem that is very easy to implement!
Algorithm
Suppose we have a list of $$$n$$$ integers in range $$$[1; R]$$$ each: $$$[a_1, a_2, \dots, a_n]$$$. Kapun's algorithm attempts to find two subsets of these integers with equal sum.
The algorithm is as follows. Instead of finding two subsets, we will find an expression of kind $$$\pm a_{i_1} \pm a_{i_2} \dots \pm a_{i_k} = 0$$$ for some $$$i$$$ and some signs. It's easy to see that this is equivalent to finding two subsets with the same sum. Initially, we have $$$n$$$ trivial expressions $$$a_1, a_2, \dots, a_n$$$. The algorithm is iterative; at each step we do the following:
- If we have an expression with value $$$0$$$ at the moment, return this expression.
- Otherwise, sort the expressions by their values.
- Replace the current expressions $$$[b_1, b_2, \dots, b_k]$$$ by $$$[b_2 - b_1, b_4 - b_3, b_6 - b_5, \dots]$$$, that is, the expressions are divided into pairs and each pair is replaced by its difference. If $$$k$$$ is odd, $$$b_k$$$ is simply dropped.
- Repeat from the beginning.
If we end up with $$$b = [x]$$$ and $$$x > 0$$$, the algorithm has failed.
Theorem
Kapun's algorithm necessarily succeeds if $$$R > \alpha + n^{\beta \ln n}$$$, where $$$\alpha, \beta$$$ are some small constants.
Proof. It's obvious that if there exists a counterexample for a particular $$$(n, R)$$$ tuple, it also exists for $$$(n, R + 1)$$$. Therefore we can define $$$f(n)$$$ as the smallest $$$R$$$ for which a counterexample exists for specific $$$n$$$.
Also, let's denote by operator $$$\sigma a$$$ the sorted array $$$a$$$, and by $$$\Delta a$$$ the neighbour-difference array of $$$\sigma a$$$. For instance, $$$\Delta [10, 5, 9, 7] = \Delta [5, 7, 9, 10] = [2, 1]$$$, and $$$\sigma [5, 1, 2] = [1, 2, 5]$$$.
For a particular $$$n$$$, let us consider the smallest countertest, the "smallest" meaning the one with the smallest $$$R$$$. Let's say it's some $$$[a_1, a_2, \dots, a_n]$$$. The algorithm sorts this array, so without loss of generality assume $$$a$$$ is sorted.
As long as $$$a_1 > 0$$$, the subsequent behavior of the algorithm only depends on $$$\Delta a$$$, right? So we can subtract $$$a_1 - 1$$$ from all elements of $$$a$$$, because this operation does not affect the difference array, $$$a_1$$$ stays positive and $$$R = a_n$$$ decreases. Therefore, in the optimal countertest $$$a_1 = 1$$$.
Notice that an array of kind $$$[1, a_2, a_3, a_4, \dots]$$$ has the same difference array as an array of kind $$$[1, a_2, a_2, a_4 - (a_3 - a_2), \dots]$$$, so in the optimal array $$$a_2 = a_3, a_4 = a_5$$$, and so on. Therefore the optimal countertest is of kind $$$[1, x, x, y, y, z, z, \dots]$$$.
Let us denote $$$a' = \Delta a$$$. Then $$$a = [1, 1 + a'_1, 1 + a'_1, 1 + a'_1 + a'_2, 1 + a'_1 + a'_2, \dots]$$$. In particular, $$$a_n = 1 + a'_1 + a'_2 + \dots$$$. Notice that the mapping between $$$a'$$$ and the optimal $$$a$$$ is a bijection! This means that instead of searching for $$$a$$$ of size $$$n$$$ with minimal $$$a_n = R$$$, we can forget about $$$a$$$ and search for $$$a'$$$ of size $$$n' = \lfloor \frac{n}{2} \rfloor$$$ with the minimal sum of $$$a'$$$.
What can we say about the optimal $$$a'$$$? We only care about its sum, so without loss of generality assume it's sorted. As long as $$$a'_1 > 0$$$, the behavior of the algorithm only depends on $$$\Delta a'$$$, so we can subtract $$$a'_1 - 1$$$ from all elements of $$$a'$$$, because the sum of the elements (non-strictly) decreases. Therefore $$$a'_1 = 1$$$. We can also assume the array $$$a'_2 = a'_3, a'_4 = a'_5$$$ and so on, because subtracting a constant from a suffix of the array decreases the sum. Therefore the optimal array is of kind $$$[1, x, x, y, y, z, z, \dots]$$$.
Let us denote $$$a{''} = \Delta a'$$$. Then $$$a' = [1, 1 + a{''}_1, 1 + a{''}_1, 1 + a{''}_1 + a{''}_2, 1 + a{''}_1 + a{''}_2, \dots]$$$. The sum of $$$a'$$$ is some weighted sum of $$$a{''}$$$. Notice that the mapping between $$$a{''}$$$ and the optimal $$$a'$$$ is a bijection, so we can search for $$$a{''}$$$ of size $$$n{''} = \lfloor \frac{n'}{2} \rfloor$$$ with a particular weighted sum of $$$a'$$$ minimized.
You get the drill. Every time we have some weight function $$$\mu(a) = \lambda_1 a_1 + \lambda_2 a_2 + \dots$$$, and we want to find $$$a$$$ with the minimal weight. $$$\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$$$ holds for all reductions except the first one (when $$$\lambda = [0, 0, \dots, 0, 1]$$$), so due to the rearrangement inequality the optimal $$$a$$$ is sorted. Subtracting $$$a_1 - 1$$$ from all elements of $$$a$$$ decreases $$$\mu(a)$$$ because $$$\lambda_i > 0$$$. Subtracting a constant from a suffix of $$$a$$$ reduces $$$\mu(a)$$$ for the same reason, so the optimal $$$a$$$ is of kind
Minimizing $$$\mu(a)$$$ is equivalent to minimizing $$$\mu'(a')$$$, where
It's easy to see from this formula that $$$\lambda'$$$ is always increasing, so the argument can be continued by induction.
Now, what do we get at the end? We have $$$a = [x]$$$ and $$$\mu(a) = \lambda_1 a_1$$$; it's obvious that the optimal $$$a$$$ is $$$[1]$$$. Treating that as a neighbour-difference array of some $$$a'$$$ we get $$$a' = [1, 2]$$$ or $$$a' = [1, 2, 2]$$$ for the optimal $$$a'$$$, depending on the parity. Treating that as a neighbour-difference array of some $$$a{''}$$$ we get $$$a{''} = [1, 2, 2, 4]$$$ or $$$a{''} = [1, 2, 2, 4, 4]$$$ in the former case and $$$a{''} = [1, 2, 2, 4, 4, 6]$$$ or $$$a{''} = [1, 2, 2, 4, 4, 6, 6]$$$ in the latter case.
It takes a little while to notice the following fact.
Let us consider an infinite sequence $$$G$$$ with the following properties:
- $$$G_1 = 1$$$,
- $$$\Delta G = G$$$,
- among such $$$G$$$ it is the lexicographically smallest one.
Let us prove that the optimal $$$a$$$ for a given $$$n$$$ (arbitrary) and $$$\mu$$$ (with $$$\lambda$$$ increasing) is the prefix of $$$G$$$ of length $$$n$$$, by induction. For $$$n = 1$$$ this is obvious. Induction step: say we have already proven the fact for all lesser $$$n$$$. We have also already derived that the optimal $$$a$$$ is of kind $$$[1, x, x, y, y, z, z, \dots]$$$. Due to induction $$$a' = \Delta a$$$ is optimal for $$$\mu'$$$ if $$$a'$$$ is a prefix of $$$G$$$ of length $$$n'$$$. Therefore, $$$a$$$ satisfies the following conditions:
- $$$a_1 = 1$$$,
- $$$\Delta a = [G_1, G_2, \dots, G_n]$$$,
- among such $$$a$$$ it is the lexicographically smallest one.
It's easy to see that this implies that $$$a$$$ is the prefix of $$$G$$$ of length 1. The proof by induction is complete.
We now know that $$$G$$$ gives the smallest counterexample to Kapun's algorithm, for lots of definitions of "smallest" (namely, for all $$$\mu$$$ with non-decreasing $$$\lambda$$$). Therefore $$$f(n) = G_n$$$.
The sequence $$$G$$$ is as follows:
This series is given by A018819 and is asymptotically
with high precision. Therefore Kapun's algorithm works for all
QED.
Upper bound
The bound obtained above can be seen as a kind of "lower bound", meaning that, if you have an array and you want to find two subsets, it makes no sense to decrease $$$R$$$ any further than that bound.
It it reasonable to attempt to provide an upper bound in the same fashion, i.e. some $$$R$$$ such that there is little chance that Kapun's algorithm succeeds anymore.
This bound is, unfortunately, much less rigid than the lower bound. It is derived on assumption that Kapun's algorithm works best when $$$a$$$ is selected randomly uniformly from $$$[1; R]^n$$$, which is of course false. We can only hope that whatever bound we derive is of the right magnitude.
So let's forget about math and all and assume $$$a$$$ is selected randomly. Then, after sorting, $$$a_{i+1} - a_i \approx \frac{R}{n}$$$. So after running one iteration of Kapun's algorithm we have $$$R' \approx \frac{R}{n}$$$ and $$$n' \approx n$$$. After all $$$\log_2 n$$$ operations we have
We want $$$R' \ge 1$$$ or something, so $$$R > n^{(\log_2 n) / 2}$$$. This bound turns out to be really similar to the rigid lower bound, so maybe we weren't that much off after all.
As far as I am aware, this is what people used to call the "proof" of Kapun's algorithm, which is obviously wrong, but at least the correct proof explains away why it worked so well.
Application to hashing
In our use case, $$$R = M - 1$$$. After rearranging the inequality of the lower bound we get
which translates to $$$n \approx 625$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 8996$$$ for $$$M \approx 10^{18}$$$.
For the upper bound,
which translates to $$$n \approx 212$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 1958$$$ for $$$M \approx 10^{18}$$$.
In reality, the truth is somewhere in the middle. For $$$M \approx 10^{18}$$$, the 50-th percentile is $$$n \approx 3500$$$ and the 95-th percentile is $$$n \approx 4000$$$.
Implementation
Licensed under GPLv3 or any later version at your option for any use cases, and under MIT for use in competitive programming tournaments.