Analysis of polynomial hashing

Правка en3, от purplesyringa, 2022-02-17 19:28:28

Unfortunately, most derivations of the chance of polynomial hashing collision are invalid, wrong, or misleading, and finding reliable public sources with proofs is incredibly difficult. This article is a formal analysis of the method. The goal of this article is to complement well-known empirical facts with theory, provide boundaries on the probability of collision, justify common choices, and advise against certain popular parameters.

There is no code in this article, and it's mostly theoretical. It's more of a summa of everything we know about polynomial hashing in integers rather than a guide for beginners. You'll probably find something of interest still. I do, however, describe some methods of cracking hashes, which I can provide code and additional explanation for if someone asks in the comments, and some general guidelines in the last section of the present article.

Table of contents

  • The concept of polynomial hashing
  • Classical justification of the coprimality requirement
  • The pitfall
  • Probability of size-bounded collision
  • Probability of value-bounded collision
  • Back to basics
  • Conditions of surjectivity
  • Handling lengths
  • Side note on coprimality
  • The two scenarios
  • Reversal
  • Thue-Morse sequence
  • Attempt to reuse Thue-Morse sequence for other moduli
  • Birthday paradox
  • Lower bounds for anti-hash tests
  • Particularities
  • Key takeaways

The concept of polynomial hashing

This section is an introduction on polynomial hashing that contains well-known facts. Feel free to skip to the next one if you are familiar with everything explained here.

A hash of an object $$$s$$$ is an object $$$H(s)$$$ (usually an integer) that satisfies the following conditions:

  • $$$s \equiv t \implies H(s) = H(t)$$$
  • $$$H(s) = H(t) \implies s \equiv t$$$ with high probability

$$$\equiv$$$ might denote something as simple as equality, e.g. if $$$s$$$ and $$$t$$$ are strings, or some potentially more complicated equivalence relation.

The classic approach to hashing a string $$$s = s_0 s_1 s_2 \dots s_{n-1}$$$ is polynomial hashing. The polynomial hash of a string is defined as

$$$ H(s) = (s_0 + s_1 b + s_2 b^2 + \dots + s_{n-1} b^{n-1}) \mod M, $$$

where $$$s_i$$$ is the $$$i$$$-th character of the string, $$$b$$$ is an arbitrary integer larger to or equal than than the order of the alphabet, and $$$M$$$ is an arbitrary modulus.

If we ignore the modulus, $$$H(s)$$$ is simply $$$s$$$ written in a positional numeral system with base $$$b$$$. $$$b \ge \lvert \Sigma \rvert$$$ ensures that $$$H(s)$$$ is unique among all $$$s$$$ (that is, if the alphabet is a series of consecutive integers). Alas, this means $$$H(s)$$$ may be an arbitrarily large number, so we take the result modulo $$$M$$$. This means all hashes fit within a finite integral type, but that hash collisions are theoretically possible, if with low probability. The problem is to compute said probability.

Classical justification of the coprimality requirement

For polynomial hashing, a hash collision is said to occur if $$$s \ne t$$$ yield $$$H(s) = H(t)$$$, that is, if

$$$ (s_0 - t_0) + (s_1 - t_1) b + (s_2 - t_2) b^2 + \dots + (s_{n-1} - t_{n-1}) b^{n-1} = 0 \pmod{M}. $$$

We assume that the strings $$$s$$$ and $$$t$$$ are of equal length here, because this is a more practical case and it does not affect the probability estimate much. Note that finding a hash collision is (almost) equivalent to finding a non-zero string $$$u = s - t$$$ with null hash, except that the alphabet is twofold (i.e. $$$\Sigma \oplus (-\Sigma)$$$), where $$$\oplus$$$ denotes pairwise summation/Minkowski sum).

Suppose now that $$$u = u_0 u_1 u_2 \dots u_{n-1}$$$ is a string, and we want to compute its probability of having a zero hash—this is almost identical to computing the probability of a hash collision among two random strings.

Consider any $$$u$$$ with zero hash. For which $$$x$$$ is

$$$ u' = u_0 (u_1 + x) u_2 \dots u_{n-1} $$$

also a string of null hash? This requirement is equivalent to $$$xb = 0 \pmod M$$$, and this equation has $$$\frac{M}{(b, M)}$$$ solutions modulo $$$M$$$. If $$$x$$$ was added to $$$u_2$$$, there would be $$$\frac{M}{(b^2, M)}$$$ solutions, and so on. Therefore, if $$$b$$$ and $$$M$$$ have a common divisor, there are several solutions for $$$x$$$, which is obviously worse for hash collision rate. Hence $$$b \perp M$$$ is the better choice.

The pitfall

Forget $$$x$$$ and compute the probability that $$$u$$$ has a zero hash directly.

$$$ H(u) = (u_0 + u_1 b + u_2 b^2 + \dots + u_{n-1} b^{n-1}) \mod M. $$$

If $$$b, M, u_1, u_2, \dots, u_{n-1}$$$ were fixed and $$$u_0$$$ was a variable, how many solutions for $$$H(u)$$$ would there be? Obviously, only one: that is, $$$u_0 = -b H(u_1 u_2 \dots u_{n-1}) \mod M$$$. Summing over all $$$u_1, u_2, \dots, u_{n-1}$$$, we get that exactly $$$M^{n-1}$$$ strings have a zero hash. Therefore, the probability of a hash collision is $$$\frac{1}{M}$$$, regardless of whether $$$b$$$ and $$$M$$$ are coprime.

The problem is that we don't really want to know the probability of a hash collision between two random strings—it's going to be $$$\frac{1}{M}$$$ regardless of our choice of $$$b$$$ and $$$M$$$.

What we really want to know in practice is whether two similar strings $$$s$$$ and $$$t$$$ have the same hash. In other words, we are interested in the probability that $$$H(u) = 0$$$ when $$$u$$$ is small, for some definition of 'small'.

The case considered in the classical derivation is that $$$u$$$ has exactly one non-zero element $$$u_i$$$. For a given $$$i$$$, the probability of collision is then $$$\frac{(b^i, M)}{M}$$$. Summing over all $$$i$$$ we get

$$$ p_1 = \frac{1}{nM} \cdot \sum\limits_{i=0}^{n-1} (b^i, M). $$$

This is easy to illustrate with a toy example of $$$b = 0$$$. In this case, $$$p_0 \approx 1$$$, because all strings of kind $$$c u_1 u_2 \dots u_{n-1}$$$ have the same hash $$$c$$$, so the hash does not change when the string is fluctuated starting with $$$i = 1$$$, which is a better metric than the average collision rate.

If we consider $$$p_1$$$ the target metric, $$$b \perp M$$$ comes automatically. However, whether $$$M$$$ is prime does not seem to impact the probability.

Probability of size-bounded collision

It is reasonable to improve our collision model by analyzing the practical use cases more thoroughly. In real life, the compared strings are often similar, and the anti-hash tests are small enough, which leads us to the following observation.

Let us consider the probability $$$p_k$$$ of a string $$$u$$$ of length $$$n$$$ with at most $$$k$$$ non-zero elements having a zero hash. Since $$$p_1 = \frac{1}{M}$$$ only for coprime $$$b$$$ and $$$M$$$, it is reasonable to limit the following discussion to the case of $$$b \perp M$$$.

For a particular $$$k$$$, $$$p_k$$$ equals the expectation of the probability of collision among all $$$k$$$-subsets of the $$$n$$$ indices, where the $$$k$$$-subset denotes the subset of indices that are allowed to be non-zero. For a particular subset $$$i_1, i_2, \dots, i_k$$$, we get an equation of kind:

$$$ u_{i_0} b^{i_0} + u_{i_1} b^{i_1} + \dots + u_{i_k} b^{i_k} = 0 \pmod M. $$$

As $$$b \perp M$$$, $$$b$$$ is invertible modulo $$$M$$$ and hence the equation can be rewritten as

$$$ u_{i_0} = -u_{i_1} b^{i_1 - i_0} - \dots - u_{i_k} b^{i_k - i_0} \pmod M. $$$

Therefore any combination of $$$u_{i_1}, \dots, u_{i_k}$$$ yields a single value for $$$u_{i_0}$$$, and the probability is $$$\frac{1}{M}$$$. This probability is the same for each $$$k$$$-subset, and hence $$$p_k = \frac{1}{M}$$$.

This preliminary result has important impact: in most practical use cases, the probability of collision is $$$\frac{1}{M}$$$ independent of primality of $$$M$$$ (as long as $$$b \perp M$$$), whether the strings differ in a particular number of characters, at particular indices, or under any or none conditions, as long as the numbers $$$u_i$$$ are treated as black boxes, that is, arbitrary numbers from $$$\mathbb{Z}/M\mathbb{Z}$$$.

Probability of value-bounded collision

However, in some tasks, the values of $$$u_i$$$ are, in fact, limited by the alphabet $$$\Sigma$$$. In this section we will calculate the probability of a string $$$u$$$ of length $$$n$$$ yielding a zero hash under the condition that $$$u_i \in \Sigma$$$.

Consider a Markov process on $$$\mathbb{Z}/M\mathbb{Z}$$$, where we start at $$$0$$$ and we move from $$$i$$$ to $$$bi + c$$$ with probability $$$\frac{1}{\Sigma}$$$, where $$$c \in \Sigma$$$. This is equivalent to generating a hash of a string character by character, adding elements to the beginning one by one. The transition matrix $$$\mathbf{A}$$$ indexed by $$$(\mathbb{Z}/M\mathbb{Z})^2$$$ depends on the modulus $$$M$$$, the base $$$b$$$ and the alphabet $$$\Sigma$$$. The probability of collision $$$p(n)$$$ is then the probability of arriving at $$$0$$$ after $$$n$$$ transitions, that is, $$$v^T \mathbb{A}^n v$$$, where $$$v$$$ is the column vector having $$$v_i = \delta_{i,0}$$$.

For example, if $$$\Sigma = \mathbb{Z}/M\mathbb{Z}$$$, that is, all integers are in the alphabet, $$$\mathbf{A}_{i,j} = \frac{1}{M}$$$ universally, and similarly $$$\mathbf{A}^n$$$ has the same value, and $$$p = \frac{1}{M}$$$, which is a result we have obtained before.

More generally,

$$$ \mathbf{A}_{i,j} = \frac{[((j - ib) \mod M) \in \Sigma]}{\lvert \Sigma \rvert}. $$$

Notice that each row is a cyclic shift of the previous row by $$$b$$$.

It is reasonable to compute the probability for large $$$n$$$, that is, the limit

$$$ p = \lim\limits_{n \to \infty} p(n) = v^T \mathbb{A}^\infty v. $$$

This is about the moment where everything breaks down.

Firstly, let us consider the "best" case. For regular matrices, that is, the matrices for which

$$$ \exists n \ \forall i, j \ \mathbf{A}^n_{i,j} \ne 0, $$$

it is known that

$$$ \mathbf{A}^\infty = \begin{bmatrix} \pi^T \\ \pi^T \\ \vdots \\ \pi^T \end{bmatrix}, $$$

where $$$\pi$$$ is the equilibrium vector of $$$\mathbf{A}$$$, that is, the (only) solution to

$$$ \pi^T \mathbf{A} = \pi^T \iff \mathbf{A}^T \pi = \pi $$$

having absolute value $$$1$$$. As $$$b \perp M$$$, the sum of each column of $$$\mathbf{A}$$$ is $$$1$$$ and therefore $$$\pi_i = \frac{1}{M}$$$ is the (only solution), and $$$p = \frac{1}{M}$$$ as a consequence.

For irregular matrices, such decomposition most likely won't work and there will be $$$\mathbf{A}_{i,j} > \frac{1}{M}$$$, potentially increasing the likelihood of collision. So our goal is to determine which $$$(b, M, \Sigma)$$$ yield regular matrices.

Back to basics

The condition that $$$\mathbf{A}$$$ is regular can be rewritten in terms of the hash function. Basically, $$$\mathbf{A}$$$ is regular if and only if there is $$$n$$$ such that $$$H$$$ is surjective for strings of length exactly $$$n$$$. This is a trivial reformulation, and is an obvious fact, but this is an insight as to why surjectivity is a sufficient condition.

To present an example, $$$\Sigma = \{1\}$$$ allows exactly one string of length $$$n$$$, and therefore $$$\mathbf{A}$$$ is irregular and we cannot determine the behavior of $$$H$$$ for infinitely long strings. $$$\Sigma = \{1, 2\}$$$, on the other hand, yields a regular transition matrix for big reasonable $$$b, M$$$ and so the probability of collision is $$$\frac{1}{M}$$$.

Conditions of surjectivity

In practice we often have $$$\Sigma = \{0, 1, 2, \dots, b-1\}, b > 1$$$. Does this yield surjective $$$H$$$? The answer is yes, and we can prove this by repeated division. To generate a string $$$s$$$ of length $$$n$$$ with hash $$$h$$$, let

$$$ s_i = \left\lfloor \frac{h}{b^i} \right\rfloor \mod b. $$$

This string quite obviously has the correct hash, and $$$s_i \in \Sigma$$$.

The case of $$$\Sigma = \{x, x+1, x+2, \dots, x+b-1\}$$$ is similar because we can simply subtract $$$x (1 + b + b^2 + \dots + b^{n-1})$$$ from $$$h$$$ before applying the algorithm above.

We have also already seen that $$$\Sigma = \{x\}$$$ is not surjective, which coincides to exclusion of $$$b = 1$$$ from the special case above.

A more tricky question is whether $$$\Sigma = \{0, 1\}$$$ yields surjective $$$H$$$. The answer is yes: solve $$$b^i = 1 \pmod M$$$ for $$$i$$$ and let the first $$$M - 1$$$ roots be $$$i_1, i_2, \dots, i_{M-1}$$$. Then a string with zeroes everywhere but in $$$h$$$ of those roots has hash $$$h$$$. Of course, this is not a practical answer, but we will return to this shortly.

If $$$\Sigma = \{x, y\}$$$, the answer is yes iff $$$x - y \perp M$$$. More generally, if for some $$$p \vert M$$$ all elements of $$$\Sigma$$$ are congruent modulo $$$p$$$, $$$H$$$ is not surjective.

In practice, if the modulus of choice happens to have a small divisor, say, 13, and there is a testcase that only uses letters A and N, the probability of collision is thirteen times greater than usual. Note that this implies that for a "good" hash, $$$M$$$ has to have no small divisors.

I failed to provide a closed predicate for surjectivity of $$$H$$$, but it seems to me from my experiments that as long as the congruence condition above fails, $$$H$$$ is surjective.

In most practical applications, $$$\Sigma$$$ is either a segment of integers of a union of segments, and in either case $$$H$$$ is surjective.

Handling lengths

It is also reasonable to consider collisions between hashes of strings of different lengths. Firstly, we expect $$$0 \not\in \Sigma$$$, because otherwise there are many trivial collisions.

Start with a toy example of $$$\Sigma = \{1\}$$$, then a hash of a string of length $$$n$$$ is always $$$1 + b + b^2 + \dots + b^{n-1}$$$.

Two hashes collide if for $$$n < m$$$:

$$$ 1 + b + b^2 + \dots + b^{n-1} = 1 + b + b^2 + \dots + b^{m-1} \pmod M, $$$

or, equivalently,

$$$ b^n (1 + b + b^2 + \dots + b^{m-n-1}) = 0 \mod M. $$$

As $$$b \perp M$$$, this is equivalent to

$$$ 1 + b + b^2 + \dots + b^{m-n-1} = 0 \mod M. $$$

If $$$b - 1 \perp M$$$ also holds, we can use the formula for sum of geometric series, and after multiplying by $$$b - 1$$$ we get:

$$$ b^{m-n} = 1 \mod M \iff \mathrm{ord}_M(b) \vert m - n. $$$

This is a bit of a worrying result as is. Think about it: every time you implement a polynomial hash, there is a chance that two strings of kind AA...A and different length will collide, and the probability of such a collision is of order

$$$ q = \frac{1}{\mathrm{ord}_M(b)}, $$$

so we generally want to use $$$b$$$ of large order.

From a practical viewpoint, this is a condition on $$$M$$$ more than on $$$b$$$. We know that $$$\mathrm{ord}_M(b) \le \phi(M)$$$, and $$$\phi(M)$$$ is largest when $$$M$$$ is prime. This is reason enough to choose prime $$$M$$$ when hashing strings of different length.

This also gives a condition on moduli for double hashing. Double hashing modulo $$$M_1$$$ and $$$M_2$$$ is equivalent to hashing modulo $$$[M_1, M_2]$$$, and for distinct primes we have

$$$ \phi(M) = M \left( 1 - \frac{1}{M_1} \right) \left( 1 - \frac{1}{M_2} \right), $$$

which is only slightly different from $$$M - 1$$$ if $$$M_1$$$ and $$$M_2$$$ are large enough.

Anyway, of such moduli it's reasonable to choose the ones with large orders for all small $$$b$$$.

The first prime number greater than $$$10^9$$$ is $$$M = 10^9 + 7$$$, a very popular choice among competitive programmers. Among all $$$b$$$ from $$$2$$$ through $$$99$$$, 43 numbers have order $$$M - 1$$$ and 55 numbers have order $$$\frac{M - 1}{2}$$$. This is more or less as much as you can expect.

The next prime is $$$M = 10^9 + 9$$$, and this is a nightmare. There are:

  • 20 numbers of order $$$M - 1$$$,
  • 18 numbers of order $$$\frac{M - 1}{2}$$$,
  • 4 numbers of order $$$\frac{M - 1}{3}$$$,
  • 7 numbers of order $$$\frac{M - 1}{4}$$$,
  • 8 numbers of order $$$\frac{M - 1}{6}$$$,
  • 12 numbers of order $$$\frac{M - 1}{8}$$$,
  • 29 numbers of order $$$\frac{M - 1}{9}$$$ and less, including one number of order $$$\frac{M - 1}{1336}$$$.

It's amusing to realize that if you were so unlikely as to choose $$$b = 86$$$ for this prime, your collision probability would be 1336 times as large as that of the 20 numbers of order $$$M - 1$$$. Knowing the birthday paradox, this means a collision is very likely among only as many as about 1000 strings.

For $$$M = 998244353$$$, a less popular choice but still the one remembered by most due to its use in NTT, there are:

  • 36 numbers of order $$$M - 1$$$,
  • 22 numbers of order $$$\frac{M - 1}{2}$$$,
  • 9 numbers of order $$$\frac{M - 1}{4}$$$,
  • 11 numbers of order $$$\frac{M - 1}{7}$$$,
  • 4 numbers of order $$$\frac{M - 1}{8}$$$,
  • 16 numbers of order $$$\frac{M - 1}{14}$$$ and less, including one number of order $$$\frac{M - 1}{1792}$$$.

The reason for such a huge difference in orders is, of course, the factorization of $$$M - 1$$$. So the condition of the modulus for a polynomial hash to be used on strings of different length is not only that $$$M$$$ is a prime, but that $$$M - 1$$$ has no small factors (except 2, which you can't avoid, consider the factorization $$$10^9 + 6 = 2 \cdot 500000003$$$).

Side note on coprimality

Recall that we assumed $$$b - 1 \perp M$$$ to compute the sum of geometric series. Even if that does not hold, we still have

$$$ b^t - 1 = (b - 1)(1 + b + b^2 + \dots + b^{t-1}) = (b - 1) h, $$$

so if $$$h = 0$$$ then $$$b^t = 1$$$ and hence $$$\mathrm{ord}_M(b) \vert t$$$, but the opposite does not hold. This, in theory, somewhat reduces the probability of collision, but not much.

The two scenarios

To proceed any further, we have to realize that there are two slightly different scenarios for hashing:

  1. $$$M$$$ and $$$b$$$ are chosen once and are more or less fixed, the hashed data is either random, or trusted, or both, or collisions only slightly deteriorate the complexity.

  2. $$$M$$$ and $$$b$$$ are generated dynamically, and the hashed data is potentially untrusted, has patterns, or hash collisions carry a great risk (e.g. affect the correctness of the program rather than performance).

Scenario 1 is the usual case for most real-life applications, because an attack on a hash table of a high enough magnitude is all but impossible to execute. Indeed, Java's string hashCode uses a polynomial hash with $$$b = 31$$$ and $$$M = 2^{32}$$$ and is used as-is almost everywhere. GCC's std::hash<std::string> does not use a polynomial hash but something very similar to it, with $$$b = 1540483477$$$ and $$$M = 2^{32}$$$. Modern VS uses FNV-1 which is almost a polynomial hash too, with $$$b = 100000001b3_{\text{hex}}$$$ and $$$M = 2^{32}$$$.

The reason standard libraries of most programming languages do not use a polynomial hash with a prime modulus is that their intended scenario (i.e. scenario 1) does not require that for reliability.

Reversal

In competitive programming scenario 2 is much more common because test cases are usually pre-prepared. What we want to know is the distribution of $$$M$$$ and $$$b$$$ for which given input yields a collision.

In other words, given a string $$$u$$$, for which $$$M$$$ and $$$b$$$ will the hash of $$$u$$$ vanish?

With $$$M$$$ fixed, this is equivalent to solving the polynomial equation

$$$ H(u) = u_0 + u_1 b + u_2 b^2 + \dots + u_{n-1} b^{n-1} = 0 \pmod M $$$

for $$$b$$$.

For prime $$$M$$$, $$$\mathbb{F}_M$$$ is a field and so there are no more than $$$n - 1$$$ solutions to this equation. Therefore for prime $$$M$$$ the probability of collision is of order $$$\frac{n - 1}{M}$$$.

This bound is reached for e.g.

$$$ H(u) = (b - 2)(b - 3)(b - 4) \dots (b - n), $$$

but the coefficients of this polynomial might be too high for some problems, because $$$b$$$ is usually an upper bound of $$$\Sigma$$$.

For composite moduli, however, zero divisors exist in the ring $$$\mathbb{Z}/M\mathbb{Z}$$$, so a polynomial might have more than $$$n - 1$$$ roots.

The following few sections will research universal collisions. A universal collision is a string that yields zero hash for all (or most) $$$b$$$. If we want to generate a universal collision and $$$M$$$ is prime, this means $$$n \approx M$$$ which is rather impractical. If we assume $$$n \le 10^6$$$ and $$$M = 10^9 + 7$$$, this means $$$H(u) = 0$$$ with probability lower than $$$10^{-3}$$$. This is the reason $$$b$$$ is often generated randomly.

So the question is really: are there polynomials with about $$$M$$$ roots with small coefficients and how do we generate them?

Thue-Morse sequence

This is a classical anti-hash test for $$$M$$$ that is a power of two. Let $$$\Sigma = \{-1, 1\}$$$. Thue-Morse sequence of order $$$k$$$ is defined recursively as

$$$ s_0 = [1], \\ s_k = s_{k-1} \mathbin\Vert \overline{s_{k-1}}, $$$

where $$$\mathbin\Vert$$$ denotes concatenation and $$$\overline{u}$$$ is the conjugate of string $$$u$$$, that is, $$$1$$$ is replaced by $$$-1$$$ and vice versa.

The first few strings are $$$[1], [1, -1], [1, -1, -1, 1], [1, -1, -1, 1, -1, 1, 1, -1]$$$, and so on. It is easy to see from the recurrence that the hash of such a string equals

$$$ H(s_k) = (1 - b^{2^{k-1}}) H(s_{k-1}), $$$

and therefore

$$$ H(s_k) = \prod\limits_{i=0}^{k-1} (1 - b^{2^i}). $$$

After applying the difference of two squares identity we get

$$$ H(s_k) = \prod\limits_{i=0}^{k-1} \left( (1 - b) \prod\limits_{j=0}^{i-1} (1 + b^{2^j}) \right). $$$

For odd $$$b$$$, every single parenthesis is even and hence $$$2^{\frac{k(k+1)}{2}} \vert H(s_k)$$$. So if $$$M$$$ is a power of two up to $$$\frac{k(k+1)}{2}$$$, the hash is zero regardless of the value of $$$b$$$.

In practice we have $$$M = 2^{32}$$$, so $$$k = 8$$$ and $$$n = 2^k = 256$$$. This means that an anti-hash tests as short as that of 256 characters exists, using only two letters of the alphabet, say, A and C. Of course, choosing the two latters of distance 16, say A and Q, helps decrease the length down to $$$128$$$.

Attempt to reuse Thue-Morse sequence for other moduli

The trick behind the Thue-Morse sequence is that $$$1 + b^x$$$ is always zero modulo $$$2$$$, and then we compose several such parentheses to obtain a zero modulo a power of two.

Let us consider how the Thue-Morse sequence behaves for other powers of primes, that is, $$$M = p^m$$$. For instance, for $$$M = 3^m$$$:

  • If $$$b = 1 \pmod 3$$$, only parentheses of kind $$$1 - b$$$ are zero modulo 3.
  • If $$$b = 2 \pmod 3$$$, $$$1 + b^{2^0}$$$ is zero modulo 3, but then we take $$$b^2 = 1 \pmod 3$$$ and we're back to square one.

Either way, there are only about $$$k$$$ parentheses yielding a zero, so we have to take $$$k = m$$$ and then $$$n = 2^k = M^(\log_3(2))$$$, which is borderline reasonable for $$$M$$$ of order $$$10^9$$$.

For $$$M = 5^m$$$ we have:

  • $$$b = 1 \pmod 5$$$ gives $$$k$$$ zero parentheses,
  • $$$b = 2 \pmod 5 \implies b^2 = 4 \pmod 5 \implies b^4 = 1 \pmod 5$$$ gives $$$k - 2$$$ zero parentheses,
  • $$$b = 3 \pmod 5 \implies b^2 = 4 \pmod 5 \implies b^4 = 1 \pmod 5$$$ gives $$$k - 2$$$ zero parentheses,
  • $$$b = 4 \pmod 5 \implies b^2 = 1 \pmod 5$$$ gives $$$k - 1$$$ zero parentheses.

The situation gets worse for other powers of primes. For instance, for $$$M = 7^m$$$ we have $$$b = 2 \pmod 7 \implies b^2 = 4 \pmod 7 \implies b^4 = 2 \pmod 7 \implies \dots$$$, this goes on indefinitely and does not ever reach $$$6$$$ which would be necessary to get a zero parenthesis.

This qualitative difference between $$$n = \Theta(2^\sqrt{m})$$$ for $$$p = 2$$$ and much larger $$$n$$$ for $$$p \ge 3$$$ is astounding and has to be elaborated on. One notable detail on the Thue-Morse sequence is that it only uses powers of $$$b$$$ of kind $$$b^{2^j}$$$, that is, the numbers obtained by repeated squaring of $$$b$$$. We want

$$$ b^{2^j} = -1 \pmod p, $$$

which is equivalent to

$$$ b^{2^{j+1}} = 1 \pmod p $$$

and

$$$ \mathrm{ord}_p(b) \vert 2^{j+1}. $$$

We know that $$$\mathrm{ord}_p(b) \vert p - 1$$$, which for $$$p = 2$$$ translates to $$$\mathrm{ord}_p(b) = 1$$$ and hence the equation holds for all $$$j$$$. For other $$$p$$$, $$$\mathrm{ord}_p(b)$$$ has to be a power of two, which almost always fails in practice.

Let's consider an almost arbitrary iterative construction like Thue-Morse's, that is defined recursively as

$$$ H(s_k) = P_k(b) H(s_{k-1}), $$$

where $$$P_k(b)$$$ is some polynomial with all coefficients either $$$0$$$, $$$1$$$ or $$$-1$$$ (for otherwise we'd quickly run out of alphabet), and each power of $$$b$$$ with non-zero coefficient is divisible by the length of the string $$$s_{k-1}$$$. For instance, Thue-Morse sequence uses $$$P_k(b) = 1 - b^{2^{k-1}}$$$ because the length of $$$k$$$-th string is $$$2^k$$$.

If we assume every iteration increases the string length $$$r$$$ times, then

$$$ P_k(b) = \sum\limits_{i=0}^{r-1} \alpha_i b^{ir^{k-1}}, $$$

or

$$$ P_k(b) = Q_k(b^{r^{k-1}}), \\ Q_k(x) = \sum\limits_{i=0}^{r-1} \alpha_i x^i, $$$

where $$$\alpha_i \in \{-1, 0, 1\}$$$. Thue-Morse sequence uses $$$Q_k(x) = 1 - x$$$. As $$$P_k(b)$$$ is expected to be zero modulo $$$p$$$ for all $$$b$$$, this is a reasonable choice modulo $$$2$$$, but it's unsatisfactory for $$$p > 2$$$.

For other $$$p$$$ we want $$$Q_k(x) = (1 - x) (2 - x) \dots ((p - 1) - x)$$$, which implies $$$r = p$$$ and that the coefficients are no longer small.

For instance, even for $$$p = 3$$$ we have $$$Q_k(x) = 2 - 3x + x^2$$$, which translates to

$$$ s_0 = [1], \\ s_1 = [2, -3, 1], \\ s_2 = [4, -6, 2, -6, 9, -3, 2, -3, 1], \\ s_3 = [8, -12, 4, -12, 18, -6, 4, -6, 2, -12, 18, -6, 18, -27, 9, -6, 9, -3, 4, -6, 2, -6, 9, -3, 2, -3, 1], $$$

and we are already out of English alphabet. This only gives a zero hash for all $$$b$$$ up to $$$M = 3^3$$$, after that it's fifty-fifty for $$$M$$$ up to $$$3^6$$$, and for larger $$$M$$$ almost all $$$b$$$ yield non-zero hash.

This implies that no analogue of Thue-Morse sequence exists for moduli other than $$$2^m$$$ or perhaps $$$3^m$$$ constructed using only concatenation and alphabet conjugation.

Birthday paradox

We have shown that there is no relatively easy way to generate a universal collision, so it is reasonable to return to the start and forget about the polynomial interpretation for a minute.

There are $$$\lvert \Sigma \rvert^n$$$ strings of length $$$n$$$, and due to pigeonhole principle there are two strings with distinct hashes if $$$\lvert \Sigma \rvert^n > M$$$. We can lower the bound further using birthday paradox, which gives us a bound similar to $$$\lvert \Sigma \rvert^n > 2 \sqrt{M}$$$. In practice for $$$M = 10^9 + 7$$$ this means $$$n \ge 16$$$ for two-character alphabet and $$$n \ge 4$$$ for 26-letter alphabet, although the latter is a gross underestimate resulting from the probabilistic nature of the birthday paradox.

More formally, in this context the birthday paradox says that among $$$2 \sqrt{M}$$$ polynomials in $$$\mathbb{Z}/M\mathbb{Z}[x]$$$ with coefficients from $$$\Sigma$$$ there will be two polynomials with the same value at $$$b$$$ with high probability.

If we apply the polynomial remainder theorem we can see that this is equivalent to having two polynomials with the same remainder modulo $$$x - b$$$, and this is a key to generating universal collisions.

Denote $$$P(x) = (x - b_1) (x - b_2) (x - b_3) \dots (x - b_k)$$$. Then if two polynomials have the same remainder modulo $$$P(x)$$$, they generate a collision for all $$$b$$$ in $$$\{b_1, b_2, \dots, b_k\}$$$. There are $$$M^k$$$ potential remainders modulo $$$P(x)$$$. We can apply the pigeonhole principle and get

$$$ \lvert \Sigma \rvert^n > M^k \iff k < n \log_M(\lvert \Sigma \rvert). $$$

Using birthday paradox we obtain

$$$ \lvert \Sigma \rvert^n \approx \sqrt{M^k} \iff k \approx 2n \log_M \lvert \Sigma \rvert. $$$

Therefore, a single string of length $$$n$$$ can be a collision for about $$$t = 2n \log_M(\lvert \Sigma \rvert)$$$ distinct $$$b$$$. If we want a universal collision we will need about $$$\frac{M}{t}$$$ strings of length $$$n$$$ each, resulting in the total string length

$$$ L = \frac{Mn}{t} = \frac{M \log M}{2 \log \lvert \Sigma \rvert} $$$

for appropriate $$$n$$$. As we will see from the next section, generating a collision for a single given $$$b$$$ requires $$$n \approx \log_2 M$$$, so this method generates an anti-hash test about $$$2 \log_2 \lvert \Sigma \rvert$$$ times better than that.

Note that this argument can be reversed: a given string can be an anti-hash test for only $$$n - 1$$$ distinct $$$b$$$ (if $$$M$$$ is prime, of course), which gives a lower bound $$$L > M$$$. In practice, if $$$M = 10^9 + 7$$$ and $$$\lvert \Sigma \rvert = 26$$$, we have $$$L = 3.18 M$$$, meaning that this algorithm generates almost optimal tests for large $$$\lvert \Sigma \rvert$$$.

Lower bounds for anti-hash tests

It does not seem easy, if possible at all, to generate a universal collision for moduli other than powers of two or maybe three purely theoretically. We don't even quite understand how to generate a universal collision in $$$\mathcal{\tilde{O}}(M)$$$. Perhaps we will have to give up on that and search for collisions for fixed $$$b$$$.

Prior to that it is reasonable to analyze polynomial hash functions via brute force for a moment.

There are two metrics of consequence for hash functions: their relative "surjectivity" or filling factor, so to speak, that is, the ratio of possible hashes to $$$M$$$ for a given length $$$n$$$, and the length of the minimum non-trivial string that gives a zero hash.

I got the following results. For a fixed $$$M$$$, I found the smallest $$$n$$$ such that $$$H(s)$$$ is surjective, using only $$$\Sigma = \{0, 1\}$$$. Of course, that also depends on $$$b$$$, so I computed the maximum of said values over all $$$b$$$ except $$$0, 1$$$ and $$$M - 1$$$. The results are as follows:

  • $$$M = 10^3 + 9 \implies n = 83$$$,
  • $$$M = 10^4 + 7 \implies n = 22$$$,
  • $$$M = 10^5 + 3 \implies n = 630$$$.

Computing the maximum over all $$$b$$$ for larger $$$M$$$ is too computationally difficult, so I computed the same metric among all $$$b$$$ from $$$2$$$ to $$$200$$$ only. I obtained:

  • $$$M = 10^3 + 9 \implies n = 29$$$,
  • $$$M = 10^4 + 7 \implies n = 19$$$,
  • $$$M = 10^5 + 3 \implies n = 24$$$,
  • $$$M = 10^6 + 3 \implies n = 28$$$,
  • $$$M = 10^7 + 19 \implies n = 28$$$,
  • $$$M = 10^8 + 7 \implies n = 32$$$,
  • $$$M = 10^9 + 7 \implies n = 36$$$.

For 75% fill factor,

  • $$$M = 10^3 + 9 \implies n = 18$$$,
  • $$$M = 10^4 + 7 \implies n = 16$$$,
  • $$$M = 10^5 + 3 \implies n = 19$$$,
  • $$$M = 10^6 + 3 \implies n = 23$$$,
  • $$$M = 10^7 + 19 \implies n = 25$$$,
  • $$$M = 10^8 + 7 \implies n = 28$$$,
  • $$$M = 10^9 + 7 \implies n = 32$$$.

It should be noted that for most composite $$$M$$$ there exists $$$b$$$ for which surjectivity does not hold for all $$$n$$$.

The minimum $$$n$$$ having an anti-hash test is:

  • $$$M = 10^3 + 9 \implies n = 15$$$,
  • $$$M = 10^4 + 7 \implies n = 18$$$,
  • $$$M = 10^5 + 3 \implies n = 21$$$,
  • $$$M = 10^6 + 3 \implies n = 24$$$,
  • $$$M = 10^7 + 19 \implies n = 27$$$,
  • $$$M = 10^8 + 7 \implies n = 31$$$,
  • $$$M = 10^9 + 7 \implies n = 34$$$.

This means that $$$n$$$ is indeed of order $$$\log_2 M$$$ as predicted by assuming hashes of different strings to be random and uniformly distributed. Generating an anti-hash test for particular $$$M$$$ and $$$b$$$ is not particularly troublesome: a $$$\Theta(n 2^{n/2})$$$ algorithm based on MitM springs to mind instantaneously.

Assuming $$$\Sigma = \{0, 1, 2, \dots, 12\}$$$ we have, for the anti-hash test:

  • $$$M = 10^3 + 9 \implies n = 7$$$,
  • $$$M = 10^4 + 7 \implies n = 10$$$,
  • $$$M = 10^5 + 3 \implies n = 14$$$,
  • $$$M = 10^6 + 3 \implies n = 17$$$,
  • $$$M = 10^7 + 19 \implies n = 20$$$,
  • $$$M = 10^8 + 7 \implies n = 23$$$.

In practice, although $$$n$$$ is a third lower, this takes more time due to the sheer size of the alphabet. There's also a neat bonus in that a solution in $$$\Sigma = \{0, 1\}$$$ translates to 25 solutions in $$$\Sigma = \{0, 1, 2, \dots, 25\}$$$ corresponding to multiplication of the original solution by numbers $$$1$$$ to $$$25$$$.

Particularities

The MitM method has more theoretical than practical value, because birthday paradox allows to construct a collision in about $$$\Theta(\sqrt{M})$$$ time too. However, the relationship $$$n \propto \log M$$$ is an important result. It demonstrates that polynomial hash behaves like a uniformly random function, as acquired from experimental data. This also explains why birthday paradox-based generation works so well.

One important distinction, however, is that while birthday paradox allows us to generate two strings with the same hash, this method generates a single string with a given hash value, which is a more difficult problem.

The idea is that if we consider $$$t$$$ random strings and $$$t > M$$$, there have to be two strings with the same hash among them due to pigeonhole principle. The birthday paradox helps decrease this boundary to about $$$t > 2 \sqrt{M}$$$, assuming the hashes are more or less uniformly distributed. This means the birthday paradox attack works for $$$M = 10^9$$$, but not much larger.

Can we do better?

It turns out that we can. For $$$M$$$ of order up to $$$10^{18}$$$ we can use Kapun's algorithm which I have explained in this article. Unfortunately, getting any further with simple algorithms is rather difficult, so we'll have to resort to linear algebra now.

For fixed $$$n, b, M$$$, the function $$$H(u)$$$ can be interpreted as a linear form:

$$$ H(u) = \begin{bmatrix} b^0 & b^1 & b^2 & \dots & b^{n-1} \end{bmatrix} \cdot u. $$$

We would like to find any vector with small coefficients such that $$$H(u) = 0$$$.

Due to the rank-nullity theorem the kernel of $$$H$$$ is a linear space of dimension $$$n - 1$$$, one obvious basis of which is

$$$ e_1 = \begin{bmatrix}-b^1 & 1 & 0 & 0 & \dots & 0\end{bmatrix} \\ e_2 = \begin{bmatrix}-b^2 & 0 & 1 & 0 & \dots & 0\end{bmatrix} \\ e_3 = \begin{bmatrix}-b^3 & 0 & 0 & 1 & \dots & 0\end{bmatrix} \\ \vdots \\ e_{n-1} = \begin{bmatrix}-b^{n-1} & 0 & 0 & 0 & \dots & 1\end{bmatrix}. $$$

We would like to find a vector with small coefficients that is a linear combination of the basis vectors. Better yet, find $$$n - 1$$$ linearly independent vectors satisfying the same condition, that is, we want to find a different basis of the same linear space with additional constraints on the magnitude of the components.

This problem is known as lattice basis reduction. There are surprisingly few algorithms to solve it. All of them have enormously large time complexity, but at least it's polynomial, and in practice they work much better. The algorithm I'd recommend is LLL or BKZ. After reducing the bases we have $$$n - 1$$$ strings, each of which has a zero hash, and the coefficients (that is, characters) are small enough if $$$n$$$ is big enough. This translates to $$$n \approx 15$$$ for $$$M \approx 10^9$$$ and $$$n \approx 30$$$ for $$$M \approx 10^{18}$$$.

An obligatory disclaimer: these strings are built on alphabet $$$\Sigma = \{-k, -k + 1, \dots, k - 1, k\}$$$, but in practice the alphabet is of kind, say, $$$\Sigma = \{97, 98, \dots, 122\}$$$ if we're talking lowercase letters. You can add $$$97 + k$$$ to each character of the string and still have multiple strings with a colliding hash, but notice that the hash won't be zero.

Moreover, this algorithm allows us to construct anti-hash tests for multiple tuples of parameters at once. This requires us to construct our original basis in a slightly different way. Unfortunately, we can't reuse our current construction because the first character of a string would depend on the parameters used, but we can use the following trick.

Notice that if we let

$$$ e_0 = \begin{bmatrix}1 & 0 & 0 & 0 & \dots & 0 & C b^0\end{bmatrix} \\ e_1 = \begin{bmatrix}0 & 1 & 0 & 0 & \dots & 0 & C b^1\end{bmatrix} \\ e_2 = \begin{bmatrix}0 & 0 & 1 & 0 & \dots & 0 & C b^2\end{bmatrix} \\ e_3 = \begin{bmatrix}0 & 0 & 0 & 1 & \dots & 0 & C b^3\end{bmatrix} \\ \vdots \\ e_{n-1} = \begin{bmatrix}0 & 0 & 0 & 0 & \dots & 1 & C b^{n-1}\end{bmatrix}, $$$

then each basis vector is basically a concatenation of some string and its hash multiplied by some constant $$$C$$$. Due to linearity of $$$H$$$ each vector of the reduced basis will also contain a concatenation of a string and its hash times $$$C$$$. As the coefficients of the reduced basis are low, the reduction algorithm will attempt to reduce $$$C$$$ times hash too. If we make $$$C$$$ large enough, this means at least a few of the rows (though probably not all of them) will have $$$0$$$ in their last column, meaning a zero hash.

For multiple hashes, this translates to

$$$ e_0 = \begin{bmatrix}1 & 0 & 0 & 0 & \dots & 0 & C (b_1^0 \mod M_1) & C (b_2^0 \mod M_2) & \dots & C (b_t^0 \mod M_t)\end{bmatrix} \\ e_1 = \begin{bmatrix}0 & 1 & 0 & 0 & \dots & 0 & C (b_1^1 \mod M_1) & C (b_2^1 \mod M_2) & \dots & C (b_t^1 \mod M_t)\end{bmatrix} \\ e_2 = \begin{bmatrix}0 & 0 & 1 & 0 & \dots & 0 & C (b_1^2 \mod M_1) & C (b_2^2 \mod M_2) & \dots & C (b_t^2 \mod M_t)\end{bmatrix} \\ e_3 = \begin{bmatrix}0 & 0 & 0 & 1 & \dots & 0 & C (b_1^3 \mod M_1) & C (b_2^3 \mod M_2) & \dots & C (b_t^3 \mod M_t)\end{bmatrix} \\ \vdots \\ e_{n-1} = \begin{bmatrix}0 & 0 & 0 & 0 & \dots & 1 & C (b_1^{n-1} \mod M_1) & C (b_2^{n-1} \mod M_2) & \dots & C (b_t^{n-1} \mod M_t)\end{bmatrix}. $$$

Reducing this basis results in approximately half rows having zero hashes.

In practice, generating an anti-hash test for ten prime moduli of order $$$10^9$$$ with 26-letter alphabet requires as little as $$$n = 90, C = 100$$$ and takes less than two seconds on my relatively slow laptop. On the contrary, generating tests for $$$\lvert \Sigma \rvert < 8$$$ is incredibly difficult.

Key takeaways

The modulus has to be prime. If it's not, you're losing precision. If it has small divisors, you're losing precision. If it's divisible by a power of two, you're losing precision.

$$$M - 1$$$ must not have small divisors except the trivial $$$2$$$. Preferably prime times $$$2$$$.

There are no requirements on the base except its coprimality with the modulus and being greater than $$$\max \Sigma$$$.

Modulus $$$10^9 + 7$$$ is fine. Moduli $$$10^9 + 9$$$, $$$998244353$$$ and $$$2^{31} - 1$$$ are bad.

The base should be selected randomly.

If you absolutely have to use a modulus with non-prime $$$(M - 1) / 2$$$, make sure to choose the base of a high enough order.

A universal collision for a prime modulus cannot be generated unless the length of the input may be about as large as the modulus.

Polynomial hashes do behave like uniformly random integers starting from $$$n > \log_2 M$$$.

A collision with $$$\Sigma = \{0, 1\}$$$ can be generated for fixed $$$M$$$ and $$$b$$$ as long as $$$M \approx 10^{18}$$$ or less.

A collision with slightly larger $$$\Sigma$$$ can be generated for fixed $$$M$$$ and $$$b$$$ as long as $$$M < 10^{100}$$$ and possibly even larger. This includes collisions for multiple parameters, where we assume the $$$M$$$ in the complexity is the product of the moduli.

Теги polynomial hash, hashing

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en5 Английский purplesyringa 2022-11-12 02:21:14 45
en4 Английский purplesyringa 2022-02-23 14:12:55 104 Fix multiple typos
en3 Английский purplesyringa 2022-02-17 19:28:28 2 Fix typo
en2 Английский purplesyringa 2022-02-16 20:53:01 126
en1 Английский purplesyringa 2022-02-16 19:39:25 37659 Initial revision (published)