Рассмотрим такую задачу: даны три целых числа $$$x$$$, $$$y$$$ и $$$m$$$, $$$0 \leqslant x,y < m < 2^{32}$$$, найти $$$xy\bmod m$$$. По-хорошему хотелось бы просто перемножить эти два числа, а потом применить операцию остатка:
uint32_t prod(const uint32_t x, const uint32_t y, const uint32_t m)
{
return x * y % m;
}
Как вы, возможно, догадываетесь, это решение неверное. Всё дело в том, что в такой процедуре возможно переполнение: операция x * y
выполняется в типе uint32_t
, и на самом деле промежуточным результатом выполнения этой операции будет не $$$xy$$$, а $$$xy\bmod2^{32}$$$. Если после этого взять результат по модулю $$$m$$$, он может отличаться от правильного:
Выход прост — перемножать необходимо в большем типе:
uint64_t prod_uint64(const uint64_t x, const uint64_t y, const uint64_t m)
{
return x * y % m;
}
Если так делать, то, поскольку $$$xy<2^{64}$$$, это произведение точно не переполнится, и после взятия результата по модулю получится правильный ответ.
Вопрос: а что делать, если $$$x$$$, $$$y$$$ и $$$m$$$ могут быть больше? Предлагаю следующее.
- Бинарное умножение. Так же, как и бинарное возведение в степень, существует бинарное умножение: чтобы посчитать $$$xy$$$, посчитаем $$$x\left\lfloor\frac y2\right\rfloor$$$, сложим это число само с собой и, возможно, добавим ещё $$$x$$$. Это потратит $$$\mathcal O(\log y)$$$ действий, но среди них не будет ничего, кроме сложения и вычитания!
uint64_t sum(const uint64_t x, const uint64_t y, const uint64_t m)
{
uint64_t ans = x + y;
if (ans < x || ans >= m)
ans -= m;
return ans;
}
uint64_t prod_binary(const uint64_t x, const uint64_t y, const uint64_t m)
{
if (y <= 1)
return y ? x : 0;
uint64_t ans = prod_binary(x, y >> 1, m);
ans = sum(ans, ans, m);
if (y & 1)
ans = sum(ans, x, m);
return ans;
}
- Умножение с помощью
int128
. Чтобы перемножить два 32-битных числа, нужна 64-битная промежуточная переменная. А чтобы перемножить два 64-битных числа, нужна 128-битная переменная! В современных 64-битных компиляторах C++ (кроме разве что Microsoft® Visual C++®) есть специальный тип__int128
, позволяющий осуществлять операции над 128-битными числами.
int64_t prod_uint128(const uint64_t x, const uint64_t y, const uint64_t m)
{
return (unsigned __int128)x * y % m;
}
- Умножение с помощью вещественного типа. Что такое $$$xy\bmod m$$$? Это на самом деле $$$xy-cm$$$, где $$$c=\left\lfloor\frac{xy}m\right\rfloor$$$. Давайте тогда попробуем посчитать $$$c$$$, а отсюда найдём $$$xy\bmod m$$$. При этом заметим, что нам не требуется находить $$$c$$$ прямо точно. Что будет, если мы случайно посчитаем, скажем, $$$c-4$$$? Тогда при подсчёте остатка мы вычислим $$$xy-(c-4)m=xy-cm+4m=xy\bmod m+4m$$$. На первый взгляд, это не то, что нам надо. Но если $$$m$$$ не слишком велико и $$$xy\bmod m+4m$$$ не вылезло из 64-битного типа, то после этого можно по-честному взять остаток и получить ответ.
Получается такой код:
uint64_t prod_double(const uint64_t x, const uint64_t y, const uint64_t m)
{
uint64_t c = (double)x * y / m;
int64_t ans = int64_t(x * y - c * m) % int64_t(m);
if (ans < 0)
ans += m;
return ans;
}
uint64_t prod_long_double(const uint64_t x, const uint64_t y, const uint64_t m)
{
uint64_t c = (long double)x * y / m;
int64_t ans = int64_t(x * y - c * m) % int64_t(m);
if (ans < 0)
ans += m;
return ans;
}
double
достаточно точный для этой задачи, если $$$x$$$, $$$y$$$ и $$$m$$$ меньше $$$2^{57}$$$. long double
же хватает на числа, меньшие $$$2^{63}$$$, однако стоит помнить, что long double
для этого должен быть 80-битным, а это верно не на всех компиляторах: например, в Microsoft® Visual C++® long double
— то же самое, что double
.
Обратите внимание, что этот способ неприменим, если $$$m>2^{63}$$$: в этом случае ans
нельзя хранить в int64_t
, так как, возможно, $$$\mathtt{ans}\geqslant 2^{63}$$$ и произойдёт переполнение, из-за которого выполнится ветка (ans < 0)
и мы получим неверный ответ.
Видно, что Microsoft® Visual C++® отстаёт от остальных компиляторов в развитии в наличии технических средств для умножения больших чисел по модулю, поэтому, если мы хотим, чтобы функция работала на всех компиляторах быстро, ей необходима какая-то свежая идея. К счастью, именно такую идею в 1960-м году изобрёл Анатолий Карацуба.
- Умножение Карацубы. Идея изначально использовалась для быстрого умножения длинных чисел. А именно, пусть $$$x$$$ и $$$y$$$ — два неотрицательных целых числа, меньших $$$N^2$$$. Поделим их с остатком на $$$N$$$: $$$x=Nx_1+x_0$$$, $$$y=Ny_1+y_0$$$. Тогда искомое $$$xy$$$ можно найти как $$$N^2x_1y_1+Nx_1y_0+Nx_0y_1+x_0y_0=N\cdot\bigl(N\cdot x_1y_1+\left(x_0+x_1\right)\left(y_0+y_1\right)-x_1y_1-x_0y_0\bigr)+x_0y_0$$$. Как видно, это преобразование свело умножение $$$x$$$ и $$$y$$$ 1) к $$$\mathcal O(1)$$$ сложениям и вычитаниям чисел, не превосходящих $$$N^4$$$; 2) к трём умножениям чисел, не превосходящих $$$2N$$$ (а именно $$$x_0y_0$$$, $$$x_1y_1$$$ и $$$\left(x_0+x_1\right)\left(y_0+y_1\right)$$$); 3) к двум умножениям на $$$N$$$ чисел, не превосходящих $$$2N^2$$$.
Пункт 1) практически всегда очень прост. В случае с длинными числами пункт 3) также прост: можно взять $$$N$$$, равное степени двойки, и тогда он осуществляется как обычный двоичный сдвиг (оператор<<
в C++). Поэтому по существу Карацуба свёл одно умножение чисел, меньших $$$N^2$$$, к трём умножениям чисел, меньших $$$2N$$$. Если эти умножения также свести методом Карацубы, по мастер-теореме асимптотика этого метода составит $$$\Theta\left(\log^{\log_23}N\right)$$$ вместо наивного $$$\Theta\left(\log^2N\right)$$$.
Но нам не потребуется использовать рекурсию, ведь, если длину $$$x$$$ и $$$y$$$ уменьшить вдвое, мы уже сможем воспользоватьсяprod_uint64
илиprod_double
. Сложность в нашем случае составляет пункт 3): подобрать такое $$$N$$$, чтобы, во-первых, оно было меньше $$$2^{32}$$$ либо в крайнем случае чуть-чуть больше, во-вторых, чтобы на него можно было быстро умножать числа порядка $$$N^2$$$. Оба требования выполнятся, если взять $$$N=\mathrm{round}\left(\sqrt m\right)$$$: действительно, тогда при $$$m<2^{64}$$$ верно $$$N<2^{32}$$$, а $$$\left|m_0\right|=\left|m-N^2\right|\leqslant N<2^{32}$$$; тогда $$$xN=\left(x_1N+x_0\right)N=x_1N^2+x_0N\equiv x_0N-x_1m_0\pmod m$$$, и оба умножения здесь осуществляются над числами порядка $$$N$$$.
Внимательный читатель заметит, что здесь мы обзавелись серьёзной проблемой: извлечение квадратного корня из целого числа. Если вы умеете применять в этой задаче умножение Карацубы, обходя данную проблему (в том числе достаточно быстро находить квадратный корень, написать более быстрый или более короткий код, чем у меня), напишите, пожалуйста, в комментариях!
Поскольку находить произведение $$$\left(x_0+x_1\right)\left(y_0+y_1\right)$$$ оказалось крайне неприятно (напомню,prod_double
не работает при $$$m>2^{63}$$$), я всё же решил просто вычислить $$$x_0y_1$$$ и $$$x_1y_0$$$ по-честному — таким образом, это не метод Карацубы в истинном смысле, так как я трачу четыре умножения чисел порядка $$$N$$$.
uint64_t dif(const uint64_t x, const uint64_t y, const uint64_t m)
{
uint64_t ans = x - y;
if (ans > x)
ans += m;
return ans;
}
bool check_ge_rounded_sqrt(const uint64_t m, const uint64_t r)
{
return ((r >= 1ull << 32) || r * (r + 1) >= m);
}
bool check_le_rounded_sqrt(const uint64_t m, const uint64_t r)
{
return (r == 0 || ((r <= 1ull << 32) && r * (r - 1) < m));
}
bool check_rounded_sqrt(const uint64_t m, const uint64_t r)
{
return check_ge_rounded_sqrt(m, r) && check_le_rounded_sqrt(m, r);
}
uint64_t rounded_sqrt(const uint64_t m)
{
uint64_t r = floorl(.5 + sqrtl(m));
if (!check_ge_rounded_sqrt(m, r))
while (!check_ge_rounded_sqrt(m, ++r));
else if (!check_le_rounded_sqrt(m, r))
while (!check_le_rounded_sqrt(m, --r));
return r;
}
uint64_t prod_karatsuba_aux(const uint64_t x, const uint64_t N, const int64_t m0, const uint64_t m)
{
uint64_t x1 = x / N;
uint64_t x0N = (x - N * x1) * N;
if (m0 >= 0)
return dif(x0N, x1 * (uint64_t)m0, m);
else
return sum(x0N, x1 * (uint64_t)-m0, m);
}
uint64_t prod_karatsuba(const test& t)
{
uint64_t x = t.x, y = t.y, m = t.modulo;
uint64_t N = rounded_sqrt(t.modulo);
int64_t m0 = m - N * N;
uint64_t x1 = t.x / N;
uint64_t x0 = t.x - N * x1;
uint64_t y1 = t.y / N;
uint64_t y0 = t.y - N * y1;
uint64_t x0y0 = sum(x0 * y0, 0, m);
uint64_t x0y1 = sum(x0 * y1, 0, m);
uint64_t x1y0 = sum(x1 * y0, 0, m);
uint64_t x1y1 = sum(x1 * y1, 0, m);
return sum(prod_karatsuba_aux(sum(prod_karatsuba_aux(x1y1, N, m0, m), sum(x0y1, x1y0, m), m), N, m0, m), x0y0, m);
}
Видно, что на самом деле единственное, что даёт нам тут метод Карацубы — что если вы найдёте большое число $$$N$$$, на которое вы умеете быстро умножать, то вы сможете умножать любые два числа по модулю. Фактически, если бы модуль $$$m$$$ был фиксированным, и было бы много запросов умножения по этому фиксированному модулю, то метод Карацубы был бы молниеносным, поскольку самая затратная операция в нём — это квадратный корень. Хотелось бы, таким образом, взять, например, $$$N=2^{32}$$$ и сделать всё то же самое, что в прошлом пункте, но без квадратного корня. Увы, но я не придумал, как умножать на $$$2^{32}$$$. Можно было бы написать примерно такую функцию:
uint64_t prod_double_small(const uint64_t x, const uint64_t y, const uint64_t m)
{
uint64_t c = (double)x * y / m;
uint64_t ans = (x * y - c * m) % m;
return ans;
}
Она вычисляет произведение по модулю при условии, что uint64_t c = (double)x * y / m
посчиталось абсолютно точно. Но гарантировать, что оно точно посчитается, не представляется возможным, поскольку $$$\frac{xy}m$$$ вполне может оказаться на $$$10^{-18}$$$ меньше, чем какое-то целое число, и типа double
не хватит, чтобы это заметить. Именно эту проблему обходит функция prod_karatsuba_aux
. Если вы смогли её обойти хитрее, добро пожаловать в комментарии.
Ниже приведены три таблицы по разным компиляторам (да славится имя MikeMirzayanov, ведь именно благодаря нецелевому использованию его Polygon'а я это сделал), в каждой таблице строки соответствуют различным функциям, столбцы соответствуют максимальной разрешённой битности $$$x$$$, $$$y$$$ и $$$m$$$. Если указано CE, значит, с данным компилятором программа не скомпилируется, а если WA — может выдать неверный ответ. В противном случае указано время работы функции на Intel® Core™ i3-8100 CPU @ 3.60GHz. Погрешность приблизительно равна одной-двум наносекундам, на самых медленных функциях может доходить до десяти наносекунд.
Microsoft® Visual C++® 2010
Метод 32 бита 57 битов 63 бита 64 бита prod_uint64 7 ns WA WA WA prod_binary 477 ns 847 ns 889 ns 870 ns prod_uint128 CE CE CE CE prod_double 66 ns 95 ns WA WA prod_long_double 66 ns 98 ns WA WA prod_karatsuba 128 ns 125 ns 138 ns 139 ns GNU G++17
Метод 32 бита 57 битов 63 бита 64 бита prod_uint64 4 ns WA WA WA prod_binary 455 ns 774 ns 841 ns 845 ns prod_uint128 CE CE CE CE prod_double 26 ns 36 ns WA WA prod_long_double 29 ns 20 ns 19 ns WA prod_karatsuba 82 ns 81 ns 91 ns 88 ns GNU G++17 (64 bit)
Метод 32 бита 57 битов 63 бита 64 бита prod_uint64 8 ns WA WA WA prod_binary 313 ns 550 ns 604 ns 630 ns prod_uint128 17 ns 34 ns 30 ns 30 ns prod_double 23 ns 22 ns WA WA prod_long_double 23 ns 24 ns 23 ns WA prod_karatsuba 65 ns 65 ns 69 ns 66 ns
Поэтому базовый рецепт такой: если доступен unsigned __int128
, то использовать его, если доступен 80-битный long double
, то его тоже должно всегда хватать, а в противном случае надо, если хватает double
, использовать double
, иначе применить метод Карацубы.
При желании можете попробовать применить эти идеи на задачах специального контеста.
Автокомментарий: текст был обновлен пользователем orz (предыдущая версия, новая версия, сравнить).
Auto comment: topic has been updated by orz (previous revision, new revision, compare).
KACTL also has a nice implementation of modmul: https://github.com/kth-competitive-programming/kactl/blob/main/content/number-theory/ModMulLL.h
Docs: https://github.com/kth-competitive-programming/kactl/blob/main/doc/modmul-proof.pdf
Yes, KACTL 134725980 is a quarter faster than my implementation 134726237 but has a shorter range: it fails on 63-bit numbers 134725895 while my submission 134726249 doesn't.
I don't have coach mode so I can't see the gym submissions.
This might be a bug that should be reported to simonlindholm. But looking at the git history, it might also be intentional to only support M < 0.394 * 2^64 = 7.2e18 for the sake of speed: https://github.com/kth-competitive-programming/kactl/pull/174
I agree that this behavior was a bit surprising since the comment warning about it is written in unrendered latex.
It's a bit shameful for Codeforces because I granted all the rights that I was able to grant given the contest is public. As far as I understand, you should be able to watch the submissions once you sent an accepted one, so if you copy the snippet from the statement and replace
// TODO
withreturn (unsigned __int128)t.x * t.y % t.modulo;
it should be OK.Are you sure you don't have a coach mode? I don't see a checkbox «I trust this user» in your account.
Автокомментарий: текст был обновлен пользователем orz (предыдущая версия, новая версия, сравнить).
Auto comment: topic has been updated by orz (previous revision, new revision, compare).
Thanks for the great blog! I was curious about how a custom implementation of
__int128
compares to your implementation. I'll try to update this comment with updates hopefully soon enough with the results.Just a minor nitpick: the sample code that's provided has some undefined behaviour issues (for instance, shifting by 64 is undefined, it would be better to use either
~0ULL
oruint64_t(-1)
for those purposes).What do you mean by «custom implementation of
__int128
»? Making a pared down version of it that is able of everything we need and supported by all compilers?Apologies for not having clarified earlier.
I was talking about an implementation along the lines of usual long division, and not the complete implementation of 128 bit arithmetic, so yes.
A reference to such an algorithm is The Art of Computer Programming, Vol. 2, Algorithm D, where the base of computation can be set to $$$2^{32}$$$ to get what we want.
An implementation of the mentioned algorithm needs the following kinds of operations:
Using a "place" as a block of 32 bits would get us what we want to do, on a computer that supports 64-bit arithmetic.
Unfortunately, the algorithm is quite involved to implement. An (possibly buggy) implementation can be seen here (not my implementation). It uses 16 bit integers rather than 32 bit ones though, but it shouldn't be hard to fix.
Actually I already have this code written for a solution in some old problem, I should definitely make use of it. The only thing missing is the modulo operation although, and writing it faster than $$$\mathcal O(\log(\mathrm{Ans}))$$$ might be laborious.
Did you mean that you have code for division of 128-bit unsigned integers (possibly stored in a struct) by a 64-bit unsigned integer, analogous to the code in the link?
Implementing multiplication of two 64-bit integers to get a 128-bit integer and subtraction of two 128-bit integers is fairly trivial (for multiplication, an implementation roughly identical to
u256::mult
here works, and for subtraction where the result is guaranteed to be non-negative, we only need to maintain a single carry).This should lead to at most a marginal overhead for implementing the modulo operation from the division operation.
I checked, I have a version with logarithmic division. Maybe it will be rather fast after a bit of optimization.
I implemented the code as given in their link but for 32 bits (and tried to identify places it can go wrong, but I'm not too sure since I don't understand the algorithm completely).
It gives a marginal speedup (10-20%) over the Karatsuba implementation.
UPD: I was able to optimize it to 1716ms (compared to
unsigned __int128
's 1201ms and Karatsuba's 2496ms). 134767943Please try in on a compiler without 128-bit integer.
Karatsuba performs faster in that case (3603ms vs 3432ms), so manually implementing the modulus operation doesn't make sense I guess.
I was able to further optimize the performance by roughly 400ms on 32 bit architectures, by using fast divide instructions like
divl
and restructuring code for those instructions to work. 134785220 makes me believe that this can be further optimized, if someone tries to actually understand how the algorithm works.UPD: Loop unrolling (and noticing that $$$n$$$ is constant) shaves off about 300ms more, and gives a runtime of about 3057ms 134787237.
Very nice job!
As for the nitpick, I kinda screwed up, and this undefined behavior leaked into every statement and every solution of all four special problems, so I will only be able to fix it after a good sleep. Is it there somewhere in the post?
I can't find such an issue with the post itself, however it does seem to be the case in the last problem that asks for operations on 64 bit modulos. The sample code uses
BITS = 64
and does1ll << BITS
.A better way of doing that would be
(-uint64_t(1)) >> (64 - BITS)
, unlessBITS = 0
, which doesn't seem to be feasible for any positive modulo.According to the standard,
1ull << 64
is zero, so I will just replace all1ll
with1ull
and thus handle the undefined behavior.I just don't get why you went for
1ull << 64
when~0ull
works just fine, is shorter and much clearer: "set all 64 bits of ull to 1".I have four versions of a problem which contain a substring
1ll << BITS
.BITS
is a constant integer 32, 57, 63, or 64 in different versions of the problem.nor noticed that in the case
BITS == 64
my code's behavior is undefined. I agreed, but noted that replacing1ll
with1ull
does the job.0ull
is great only forBITS == 64
whereas my aim is to have the same snippets of code in all four versions of the problem, differing only in the declaration of constantBITS
.Oh right, makes sense. It's still an ugly case where you have to dig in documentation to find out if it's not doing something it shouldn't, and gives annoying warnings in gcc by default.
Perhaps, but I allow such undesirable elements in pieces of code that are to be learnt by heart and used when necessary. Modular multiplication is one of these cases. If it is thoroughly tested and proven to work, it just fulfills its only purpose. Even if it's formally undefined and Sir Compiler swears, I won't have, for example, to check this function if I have a bug somewhere in the code.
until you upgrade compiler and it teaches you what undefined means
orz
orz
omg you tagged yourself XD
Why in the sum function of the Binary multiplication part, it's: ~~~~~ if (ans < x || ans >= m) ans -= m; ~~~~~ and not: ~~~~~ if (ans < 0) ans += m; if (ans >= m) ans -= m; ~~~~~ ?
Because
ans
is unsigned and instead ofans >= m
it can overflow and become some number less thanx
.UPD. It's not about
ans
being unsigned, it's about its ability to overflow, and it's able to overflow because $$$m$$$ can be greater than $$$2^{63}$$$.What about the non-recursive implementation of binary multiplication and the Karatsuba algorithm?
Karatsuba is non-recursive, only one iteration is performed.
I didn't try rewriting the binary multiplication in non-recursive manner, but expect negligible acceleration.
I wrote it now and the non-recursive version of binary multiplication is about 20-35% faster (dependent on compiler version)
And what about sweet rhythm?
Auto comment: topic has been updated by orz (previous revision, new revision, compare).
How about Barrett reduction? I think it might be useful for mass modulo operations
As far as I know, even for 32-bit mods, you need 128-bit integer multiplication and shift support, and 64-bit division support. Generalizing to 64-bit mods, it seems like implementing it would be kinda hard too.
Either hard or impractically slow.
I've never done it, but I think implementing the 64-bit version of the Barrett reduction by managing four 64-bit pieces of the intermediate product by hand (see my comment below) isn't that hard (at least compared to the Karatsuba algorithm), and I'm fairly certain that it would still run faster than 20ns on CodeForces (~70 CPU cycles). There is a huge computational advantage in knowing the modulo in advance.
We can use Barrett reduction to multiply modulo 16-bit number. For example we can use fft four times (or even less, when answer is small enough) for 16-bit prime, then recover whole answer using Chinese remainder theorem. Since we don't perform any modulo operation (inside fft ofc) it might speed up our fft pretty much. I think you should also consider 16-bit multiplications in your blog (and 32-bit with barrett reduction, why not)
As I said below, on x86_64 there's 128-bit multiplication although the result is still in two 64-bit registers.
there are mulq and imulq asm instructions, which store the result of multiplication in 2 64-bit variables. In theory, you can use
_umul128
msvc intrinsic (seems to be not working on codeforces, so inline asm is the only way)It's not very hard. The tough part is calculating the remainder afterwards.
The
int128
type is just two regular 64-bit registers bundled together. When you cast a 64-bit value to it and then immediately multiply it by another 64-bit value, compiler just replaces it with the 64x64-bitmul
instruction you mentioned.https://en.algorithmica.org/hpc/arithmetic/integer/#128-bit-integers
There are faster methods to do modular multiplication when modulo is constant.
The one currently used in compilers is called Barrett reduction. It uses a similar approach as the floating-point arithmetic one, but it sticks to integer types and uses precomputed magic constants to ensure that the results are always rounded correctly. Compilers do it automatically e. g. when the modulo is a 32-bit compile-time constant such as $$$10^9+7$$$, but in general it needs four times the bits to store its intermediate product, so for larger sizes you'd need to split results in halves and do the arithmetic manually since there is no 196- or 256-bit multiplication in hardware.
Daniel Lemire recently came up with a much simpler algorithm, which is the current state-of-the-art for 32-bit integers. It has been generalized with a similar technique for the 64-bit case, although I'm not sure if it will be faster for the general 128-by-64 reduction.
There is also a way to do it without quadruple-precision at all: Montgomery multiplication. It is not just a pure reduction though: it first transforms input integers into a "space" where it can perform modular multiplication cheaply, and then transforms results back when they are needed. There is some overhead associated with these transforms, and so it is only beneficial when you have multiple modular multiplications you need to perform (e. g. binary exponentiation). This representation is the one used in computational number theory algorithms that deal with big integers — mainly because you'd want to use SIMDized loops for most if not all operations, and it does not have 128-bit multiplication at all.
I've implemented the rho algorithm for factoring a 64-bit number (which involves a lot of 64-by-64 multiplications modulo a 64-bit number) and tried several reduction methods, Montgomery multiplication becoming the fastest. The implementation matches the speed of factorization in Sagemath, so I believe Montgomery multiplication is still the fastest when you need to do a large chain of modular multiplications with a fixed modulo, and Lemire reduction is the fastest when you need to do just one multiplication and have the result immediately.
(By the way, I'm writing a more comprehensive article on the topic, but it isn't ready yet. Stay tuned.)
The algorithm found by Daniel Lemire that you mentioned seems (not sure if this is correct) to have been called Barrett reduction in some competitive programming libraries. For instance, here. Is there a difference between this implementation and the one you mention?
I read the notes by Lemire. He states he is also good in case of fixed modulus. I feel I need to extend the article and the contest with fixed modulo algorithms and special problems.
I'll just add some references here in case you want to extend the article.
For some algorithms for fixed mods (constexpr or not), this repo could be a super valuable resource IMO. Their implementations are pretty heavily constant-optimized and are pretty educational too.
Also, while discussing the subject of the post on the AC Discord server, pajenegod mentioned this blog post with the same name. There's also a clever technique they mentioned much earlier while discussing about fast modular operations during matrix multiplication by a given modulus (which gives good speedups when the modulus is small) where you keep accumulating multiples of the modulus if necessary until you reach some threshold, and by looking at one of the highest bits, you can check if you need to subtract a specific large multiple of modulus from it or not. An example for reference.
$$$|m_0| = |m-N^2|$$$, you're missing a minus that makes a huge difference
Sorry, yes, the Russian version of the blog doesn't contain the mistake, so I didn't notice.
Btw, I think you're focusing too much on compiler differences instead of CPU differences. On x86_64, we have multiplication of 64-bit inputs into a 128-bit output, stored as 2 registers (
%rdx:%rax
), and inline assembly means we don't need to give a shit about what types the compiler supports. Implementing the modulo is the second part. For instance, your__int128
example does exactly that: onemulq
, move the outputs to correct registers and call a builtin function__umodti3
(or__udivmodti4
if you want to go deeper), which apparently works on 128-bit inputs using__attribute__((mode(TI)))
.tl;dr this implementation is pretty low-level but we can go a level or two lower and then implement
Auto comment: topic has been updated by orz (previous revision, new revision, compare).
Автокомментарий: текст был обновлен пользователем orz (предыдущая версия, новая версия, сравнить).
Auto comment: topic has been updated by orz (previous revision, new revision, compare).
If you look into the previous revisions of this plog post you'll find out that the code snippets didn't render (however, I do remember that at some point they rendered well). Now I fixed if at the cost of the correct enumeration: every item in the enumeration is 1.
This happened because the enumeration item must contain in a single paraphaph whereas I had to break the paragraph with a linebreak right before the
~~~~~
of the snippet. Does anyone here know how to make a code snippet without a linebreak? Or how to add a linebreak without breaking the enumeration item?Отличный пост! Автор, пиши ещё.
Спасибо, мне весьма приятно.
Автокомментарий: текст был обновлен пользователем orz (предыдущая версия, новая версия, сравнить).
Hi, I came across this blogpost and noticed it does not test using compiler intrinsics to achieve this in MSVC. I use a modmul version in my template which looks like this
I didn't try to micro-optimize, but I wonder if something like this would be faster on x64 MSVC than the karatsuba-like implementation.
I believe there was a 128-bit float type, both normal (quad-float) and decimal, in GCC. Could this work for 64 bit modulos? If that could work, I think it may be faster or almost equal to using the 128-bit integer type.
I dont understand this blog. Why not just do (x mod m)*(y mod m) mod m. This is obviously equal to xy mod m.
But also, given that m < 2^32, (x mod m) and (y mod m) will also be smaller than 2^32. Therefore (x mod m) * (y mod m) will be smaller than 2^64 so you can just use long long in c++?
Can someone pls explain why this is not considered or why I am wrong?
Yes, this blog discusses doing fast modulo arithmetic over a modulo that could be up to 2^64, that could not fit in an int where you either have to use int_128, binary multiplication or anything else.
I see, thank you kind sir.
Hi there, sorry for necroposting.
I noticed that the problem of the blog have us multiply 2 original numbers $$$a$$$ and $$$b$$$ then have its remainder, but usually in other problems we'll have to do a series of calculations in a row, each with 2 numbers that aren't original themselves but are remainders of another calculation. Because of this, binary multiplication won't work since $$$b$$$ is a remainder and we can't just divide it by $$$2$$$; we could probably multiply it by the inverse of $$$2$$$ instead but by leading us to another multiplication this doesn't really solve the issue.
I doubt that Karatsuba works for this particular problem either, so the way I see it that we're stuck with floating points and
unsigned __int128
, one of them may fail on 64-bit modulus and the other isn't always available. Is there any other approach for this, or perhaps could you prove me wrong and show that binary multiplication works? (I'm glad to be proven wrong)Hello.
No, it does not matter whether two integers we are dealing with are the original integers or remainders of some intermediate results. We don't care, we're solving an explicitly formulated subproblem of multiplying two integers modulo a third one. We definitely can express one of the numbers $$$b$$$ as $$$2k+r$$$ where $$$r \in \{0, 1\}$$$ and then deal with two multiplications $$$a \cdot 2k$$$ and $$$a \cdot r$$$ separately.
The idea behind Karatsuba's method and the way how we apply it are also explained in the text. Please read the blog more carefully. Or, alternatively, just copy any of the functions implemented in the blog and paste in a snippet from the special contest, try to solve some of problems there.
Thanks for your response, I definitely overlooked quite some things!
a * b % mod = a * (b % mod) % mod So binary multiplication works
Damn..orz Sir!!