Fast modular multiplication

№	Пользователь	Рейтинг
1	tourist	3993
2	jiangly	3743
3	orzdevinwang	3707
4	Radewoosh	3627
5	jqdai0815	3620
6	Benq	3564
7	Kevin114514	3443
8	ksun48	3434
9	Rewinding	3397
10	Um_nik	3396

№	Пользователь	Вклад
1	cry	167
2	Um_nik	163
3	maomao90	162
3	atcoder_official	162
5	adamant	159
6	-is-this-fft-	158
7	awoo	156
8	TheScrasse	154
9	Dominater069	153
10	nor	152

I wrote an article/blog about how to do fast modular multiplication:

https://simonlindholm.github.io/files/bincoef.pdf

tl;dr:

avoid latency-bound loops
dynamic modulus is slow, constant modulus is fast
if you perform many multiplications with the same dynamic modulus, you can do what the compiler does and use Barrett reduction (involves some precomputation)
it is actually possible to beat the compiler if you accept a result in [0, 2*MOD):

uint64_t reduce(uint64_t a) {
  return a - (uint64_t)((__uint128_t(-1ULL / MOD) * a) >> 64) * MOD;
}

Same goes for addition and subtraction: if you can live with a result in [0, 2*MOD), just do a + b or a - b + MOD and skip the range correction that brings the result into [0, MOD). Delay modular reductions far as possible, ideally combining them with multiplications. While being mindful of overflows, of course.

on 32-bit, use Montgomery multiplication instead, to avoid __uint128_t
if really desperate, combine Montgomery multiplication with SIMD; this runs 3x faster than Barrett reduction when AVX2 is available
for larger numbers (e.g. multiplication of 64-bit numbers), use floating-point based methods

Блог пользователя simonlindholm