Hi CF! During this past weekend I was reading up on Montgomery transformation, which is a really interesting and useful technique to do fast modular multiplication. However, all of the explanations I could find online felt very unintuitive for me, so I decided to write my own blog on the subject.
Fast modular multiplication
Let $$$P=10^9+7$$$ and let $$$a$$$ and $$$b$$$ be two numbers in $$$[0,P)$$$. Our goal is to calculate $$$a \cdot b \% P$$$ without ever actually calling $$$\% P$$$. This is because calling $$$\% P$$$ is very costly.
Montgomery reduction of $$$a \cdot b$$$
It turns out that the trick to calculate $$$a \cdot b \% P$$$ efficently is to calculate $$$a \cdot b \cdot 2^{-32} \% P$$$ efficiently. So the goal for this section will be to figure out how to calculate $$$a \cdot b \cdot 2^{-32} \% P$$$ efficently. $$$a \cdot b \cdot 2^{-32} \% P$$$ is called the Montgomery reduction of $$$a \cdot b$$$, denoted by $$$\text{m_reduce}(a \cdot b)$$$.
Idea (easy case)
Suppose that $$$a \cdot b$$$ just happens to be divisible by $$$2^{32}$$$. Then $$$(a \cdot b \cdot 2^{-32}) \% P = (a \cdot b) \gg 32$$$. Which runs super fast!
Idea (general case)
Can we do something similar if $$$a \cdot b$$$ is not divisible by $$$2^{32}$$$? The answer is yes! The trick is to find some integer $$$m$$$ such that $$$(a \cdot b + m \cdot P)$$$ is divisible by $$$2^{32}$$$. Then $$$a \cdot b \cdot 2^{-32} \% P = (a \cdot b + m \cdot P) \cdot 2^{-32} \% P = (a \cdot b + m \cdot P) \gg 32$$$.
So how do we find such an integer $$$m$$$? We want $$$(a \cdot b + m \cdot P) \% 2^{32} = 0$$$ so $$$m = (-a \cdot b \cdot P^{-1}) \% 2^{32}$$$. So if we precalculate $$$(-P^{-1}) \% 2^{32}$$$ then calculating $$$m$$$ can be done blazingly fast.
Montgomery transformation
Since the Montgomery reduction divides $$$a \cdot b$$$ by $$$2^{32}$$$, we would like some some way of multiplying by $$$2^{32} \% P$$$. The operation $$$x \cdot 2^{32} \% P$$$ is called the Montgomery transform of $$$x$$$, denoted by $$$\text{m_transform}(x)$$$.
The trick to implement $$$\text{m_transform}$$$ efficently is to make use of the Montgomery reduction. Note that $$$\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \% P))$$$, so if we precalculate $$$2^{64} \% P$$$, then $$$\text{m_transform}$$$ also runs blazingly fast.
Montgomery multiplication
Using $$$\text{m_reduce}$$$ and $$$\text{m_transform}$$$ there are multiple different ways of calculating $$$a \cdot b \% P$$$ effectively. One way is to run $$$\text{m_transform}(\text{m_reduce}(a \cdot b))$$$. This results in two calls to $$$\text{m_reduce}$$$ per multiplication.
Another common way to do it is to always keep all integers transformed in the so called Montgomery space. If $$$a' = \text{m_transform}(a)$$$ and $$$b' = \text{m_transform}(b)$$$ then $$$\text{m_transform}(a \cdot b \% P) = \text{m_reduce}(a' \cdot b')$$$. This effectively results in one call to $$$\text{m_reduce}$$$ per multiplication, however you now have to pay to move integers in to and out of the Montgomery space.
Example implementation
Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as an example. Implement it in C++ if you want it to run fast.
P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r)
def m_reduce(ab):
m = ab * Pinv % r
return (ab + m * P) // r
def m_transform(a):
return m_reduce(a * r2)
# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a) # mult a by 2^32
b_prim = m_transform(b) # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim) # divide a' * b' by 2^32
prod = m_reduce(prod_prim) # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod)) # prints 123456789 * 35 % 1000000007 = 320987587