Blog entries - Codeforces

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	153
7	djm03178	152
7	adamant	152
9	luogu_official	150
10	awoo	147

Spheniscine's blog

Unofficial Editorial — Back to School '24

By Spheniscine, history, 5 months ago, In English

Contest hosted on DMOJ https://dmoj.ca/contest/bts24

First, for each node we want to find $$$src[i]$$$ : the nearest tree to each node, and $$$ds[i]$$$, the distance to that tree. These can be found via multi-source breadth-first search. To ensure that ties are broken correctly, ensure that $$$T$$$ is sorted in increasing order before entering them in the queue.

Now consider the park to be a tree graph rooted at node $$$1$$$. Let $$$c(u, v)$$$ be the cost of starting from node $$$u$$$ and going to node $$$v$$$ (without considering what the dog does after reaching node $$$v$$$). It is easy to answer $$$c(u, v)$$$ if $$$u$$$ and $$$v$$$ are neighbors in the graph; $$$c(u, v) = 1$$$ if $$$src[u] = src[v]$$$ and $$$ds[u] > ds[v]$$$ (because the dog is already chasing in that direction), otherwise $$$c(u, v) = 2 \cdot ds[u] + 1$$$, as the dog has to chase the squirrel before you could bring him back to node $$$u$$$ to step to node $$$v$$$.

It can be seen that for nodes that aren't neighbors $$$u$$$ and $$$v$$$, if the simple path between them is $$$u, p_1, p_2, \ldots, p_m, v$$$, $$$c(u, v) = c(u, p_1) + c(p_1, p_2) + \ldots + c(p_m, v)$$$.

To find arbitrary $$$c(u, v)$$$ quickly, precalculate $$$w_{up}[i]$$$ and $$$w_{dn}[i]$$$, these are cumulative costs for paths going up toward or down from the root of the tree graph. $$$w_{up}[i] = w_{up}[P_i] + c(i, P_i)$$$, and $$$w_{dn}[i] = w_{dn}[P_i] + c(P_i, i)$$$.

To answer a query for $$$x, y$$$, first find $$$l$$$, the lowest common ancestor (LCA) of nodes $$$x$$$ and $$$y$$$, using your favorite method (link, link). Then then answer is given by $$$c(x, y) + (2 \cdot ds[y]) = (w_{up}[x] - w_{up}[l]) + (w_{dn}[y] - w_{dn}[l]) + (2 \cdot ds[y])$$$: the cost of the path from $$$x$$$ upward to $$$l$$$, the path from $$$l$$$ downward to $$$y$$$, and we must not forget to include the last chase at node $$$y$$$.

Note that it is possible to "cheese" subtask 1 if you can spot the method:

Spoiler

Time complexity: $$$O((N+Q) \log N)$$$

Full text and comments »

Spheniscine
5 months ago
0

Unofficial AtCoder Beginner Contest 237 Editorial

By Spheniscine, history, 3 years ago, In English

A – Not Overflow

Spoiler

B – Matrix Transposition

Spoiler

C – kasaka

Spoiler

D – LR insertion

Spoiler

E – Skiing

Spoiler

F – |LIS| = 3

Spoiler

G – Range Sort Query

Spoiler

Ex – Hakata

Spoiler

Let $$$sub_1, ..., sub_{|sub|}$$$ be all distinct palindromic substrings of $$$S$$$, this can be constructed using hashmap in $$$O(|S|^3)$$$. This problem thus amounts to finding the maximum antichain among $$$sub$$$, where the partial order relation $$$s < t$$$ holds if $$$s$$$ is a proper substring of $$$t$$$.

To do so, we can use Dilworth's theorem to transform this problem to a bipartite matching problem. We could create a flow graph with the following vertices:

$$$U_i$$$, $$$V_i$$$ for $$$i \in [1, |sub|]$$$
$$$src$$$ (the source node), $$$snk$$$ (the sink node).

And the following edges, all with a capacity of $$$1$$$:

$$$src \rightarrow U_i$$$, $$$V_i \rightarrow snk$$$ for $$$i \in [1, |sub|]$$$
$$$U_i \rightarrow V_j$$$ for all $$$i, j$$$ where $$$sub_i > sub_j$$$ (in other words, $$$sub_j$$$ is a proper substring of $$$sub_i$$$)...

And here we run into a problem. $$$|sub|$$$ is $$$O(|S|^2)$$$, so there could be up to $$$O(|S|^4)$$$ edges in this graph. If this doesn't exhaust the memory limit, finding the max-flow with Dinic/Hopcroft-Karp almost certainly would exceed the time limit with $$$O(E\sqrt V) = O(|S|^5)$$$. [Update: Actually this is fine, as there can only be $$$O(|S|)$$$ distinct palindromes, so the flow would run at $$$O(|S|^{2.5})$$$. Proof: Define a substring $$$S[l...r]$$$ as "earlier" than another substring if its $$$r$$$ is lower. For each $$$r$$$ there is at most one $$$l$$$ that $$$S[l...r]$$$ is the earliest occurrence of a palindromic substring. This can be proven by contradiction: suppose that $$$S[i...r]$$$ and $$$S[j...r]$$$ ($$$i < j$$$ WLOG) are both palindromic, and are both earliest occurrences of themselves. However, since $$$S[i...r]$$$ is a palindrome, $$$S[i...r-(j-i)] = \text{rev}(S[j...r]) = S[j...r]$$$, hence $$$S[j...r]$$$ can't be the earliest occurrence of itself.]

To reduce the number of edges in the flow graph, we could, instead of only including palindromic substrings in $$$sub$$$, include all distinct substrings of $$$S$$$. (Let's call the original set of palindromic substrings $$$pal$$$).

Our new flow graph would also have two vertices for every $$$sub_i$$$, as well as $$$src$$$ and $$$snk$$$. It would have the following edges:

$$$1$$$-capacity: $$$src \rightarrow U_i$$$, $$$V_i \rightarrow snk$$$ only if $$$sub_i \in pal$$$
We still want to allow flow from $$$U_i \rightarrow V_j$$$ for all $$$sub_i > sub_j$$$...
- Let $$$sub_i$$$ be a substring of length $$$\geq 2$$$. Let $$$sub_j$$$ be $$$sub_i$$$ without its last character, similarly $$$sub_k$$$ is $$$sub_i$$$ without its first character. Note that all proper substrings of $$$sub_i$$$ is a substring of at least one of $$$sub_j$$$ and $$$sub_k$$$.
- Thus we can add these edges, all infinite-capacity:
  - $$$U_i \rightarrow V_j$$$, $$$U_i \rightarrow V_k$$$ for all $$$sub_i$$$ of length $$$\geq 2$$$
  - $$$V_x \rightarrow U_x$$$ for all $$$x \in [1, |sub|]$$$. This allows flow to "continue on" to lesser substrings.

This keeps the number of edges down to $$$O(|S|^2)$$$. The final answer is $$$|pal| - maxFlow$$$.

Time complexity: $$$O(|S|^3)$$$

Note: Codeforces also quite recently had another problem involving maximum antichain and Dinic's max flow: 1630F - Making It Bipartite

Full text and comments »

atcoder, shortest path, dp, linked list, segment tree, maxflow, dilworth

+115

Spheniscine
3 years ago
12

Please change the math renderer back

By Spheniscine, history, 4 years ago, In English

The math renderer was changed/updated recently but it looks much worse than the previous version.

First off, it takes much longer to fully render; before it does it looks like this, which is small and hard to read:

Then when it does fully render, it's quite blurry:

Full text and comments »

+374

Spheniscine
4 years ago
7

Unofficial AtCoder Beginner Contest 191 Editorial

By Spheniscine, history, 4 years ago, In English

A – Vanishing Pitch

Spoiler

B – Remove It

Spoiler

C – Digital Graffiti

Spoiler

D – Circle Lattice Points

Spoiler

E – Come Back Quickly

Spoiler

F – GCD or MIN

Spoiler

Full text and comments »

Spheniscine
4 years ago
8

"Splitmixes" Pseudorandom Permuter, aka "WS-APHF"

By Spheniscine, history, 4 years ago, In English

Forgive the quirkiness of the title; I am not aware of a name for this concept, therefore I just made a few up.

This blogpost is about a particular way to construct what I call a "pseudorandom permuter". It is distinct from a regular pseudorandom number generator (PRNG) in that it is constructed such that the sequence will never repeat until all $$$M = 2^k$$$ integers in its machine-integer state space (32- or 64-bit) has been used. In other words, it's a way to produce a permutation of $$$[0, M)$$$ that's (hopefully) statistically indistinguishable from a truly random permutation.

The generator is made up of two parts, a "discrete Weyl sequence", and an "avalanching perfect hash function".

A "discrete Weyl sequence" is defined by a parameterized function $$$W_{s, \gamma}: [0, M) \rightarrow [0, M)$$$ where the $$$i$$$-th member of the sequence is $$$W_{s, \gamma}(i) = (s + \gamma \cdot i) \mod M$$$. The two parameters can be chosen using a standard random number generator, with the caveat that $$$\gamma$$$ must be odd (this can be easily ensured by a | 1 instruction). This ensures it is coprime with the machine integer size $$$M$$$.

The advantage of the discrete Weyl sequence is that as it is parameterized, it is hard to predict by an adversarial test (whether by the problemsetter or by the "hacking" mechanic). The disadvantage however, is that it is extremely regular, and this regularity may be exploited by adversarial tests.

In comes the second part. An "avalanching perfect hash function" is a function that is "perfect" in the sense that it maps $$$[0, M) \rightarrow [0, M)$$$ without collisions, i.e. $$$a \neq b \Leftrightarrow f(a) \neq f(b)$$$, and exhibits the avalanche property, which is that flipping one bit in the input should flip roughly half the bits of the output without a regularly predictable pattern. Note that our discrete Weyl sequence fits the definition of a perfect hash function, however it doesn't exhibit the avalanche property.

Constructing a good APHF is more difficult, however since it doesn't have to be parameterized, we can simply use a fixed function. One option is to use the "multiply-xorshift" construction from the "splitmix64" generator:

function splitmix64_aphf(z: uint64): uint64 {
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
	z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
	return z ^ (z >> 31);
}

where ^ denotes the binary xor operator, and >> the unsigned right-shift operator. I also use another option for 32-bit integers, taken from this offsite blogpost by Chris Wellons:

function int32_aphf(z: uint32): uint32 {
	z = (z ^ (z >> 16)) * 0x7feb352d;
	z = (z ^ (z >> 15)) * 0x846ca68b;
	return z ^ (z >> 16);
}

We thus define our final hash function $$$f_{s, \gamma}(x) = A(W_{s, \gamma}(x))$$$, where $$$A$$$ is our chosen APHF. This combines the advantages of both parts; easy parametrization ("seedability") and good avalanche properties that make it hard to predict, as long as the adversary doesn't have access to the parameters or the generated values. (However, it is possible for an adversary with access to the generated values to reverse the function to find the parameters and thus predict the function; thus this construction is unsuitable for cryptographic purposes.)

Note that the actual splitmix64 generator is actually a construction of this type with a fixed $$$\gamma$$$ value. Hence my alternate name, "Splitmixes", as this construction allows easy generation of an unpredictable hash function with similar properties.

Applications

This generator should not be used like a regular random-number generator, as its "no-collision until period" property will fail statistical tests that test the "birthday paradox" properties of a PRNG, i.e. generating a large number of values and checking if the number of collisions conforms to the statistically expected value.

However, it is this very property that makes it more useful than regular PRNGs for, e.g.,

hash maps with integer keys (use $$$f_{s, \gamma}(x)$$$ for the hash-key of $$$x$$$)
priority values in treaps (simply have a counter variable and use $$$f_{s, \gamma}(0), f_{s, \gamma}(1),$$$ etc, and you're guaranteed never to produce the same priority unless you generate more than 4-billion-plus values)

Full text and comments »

Spheniscine
4 years ago
0

[AtCoder] Unofficial HHKB Programming Contest 2020 Editorial

By Spheniscine, history, 4 years ago, In English

A – Keyboard

Spoiler

B – Futon

Spoiler

C – Neq Min

Spoiler

D – Squares

Spoiler

E – Lamps

Spoiler

F – Random Max

Spoiler

Thanks to Everule for help with this problem.

Let's denote the random variates with $$$X_i$$$ (capital X) to avoid confusion with the parametric variable $$$x$$$ in the following notations.

Let $$$F(x) := \Pr[\max X \leq x]$$$, the cumulative distribution function (CDF) of the maximum variate. There is a formula for deriving the expected value from the CDF of a random variate:

$$$\displaystyle \operatorname{E}[\max X] = \int\limits_0^\infty (1-F(x))\,dx - \int\limits^0_{-\infty} F(x)\,dx$$$

The latter integral can be ignored as $$$F(x) = 0$$$ for all $$$x \leq 0$$$. We can also adjust the upper bound of the left integral to $$$\max R$$$ as $$$F(x) = 1$$$ when $$$x \geq \max R$$$. Thus we may simplify:

$$$\displaystyle \operatorname{E}[\max X] = \int\limits_0^{\max R} (1-F(x))\,dx = \max R - \int\limits_0^{\max R} F(x)\,dx$$$

Now let's figure out the behavior of $$$F(x)$$$. We may note that it's the product of the CDFs of all the individual random variates:

$$$\displaystyle F(x) = \Pr[\max X \leq x] = \prod_{i=1}^N \Pr[X_i \leq x]$$$

Let $$$f_i(x) := \Pr[X_i \leq x]$$$. Let's decompose $$$f_i(x)$$$, the CDF for a single variate:

$$$\displaystyle f_i(x) = \begin{cases} 1 & \text{if } x \geq R_i;\\ \dfrac {x - L_i} {R_i - L_i} & \text{if } L_i < x < R_i;\\ 0 & \text{if } x \leq L_i. \end{cases}$$$

Note that the middle expression is a degree-$$$1$$$ polynomial. To analyze the behavior of $$$F(x)$$$ we can imagine a sweep line on decreasing $$$x$$$. When $$$x$$$ touches the $$$R_i$$$ of one of the given intervals, its corresponding $$$f_i(x)$$$ "switches on" and contributes its middle expression to the product. Then as soon as $$$x$$$ touches $$$\max L$$$ one of the CDFs goes to $$$0$$$, which sets the product to $$$0$$$ for good. (This matches intuition, as the maximum variate could never be less than $$$\max L$$$). From this we can conclude that $$$F(x)$$$ is a piecewise function, with each piece being describable by a polynomial. We can integrate each piece individually, and their sum would be the integral of $$$F(x)$$$.

What does this mean in terms of implementation? We could at first set variable $$$E := \max R$$$, and sort all intervals in non-increasing (descending) order of $$$R_i$$$ (henceforth we assume $$$R_i \geq R_{i+1}$$$ for all $$$1 \leq i \leq N-1$$$). Let $$$m$$$ denote the number of intervals where $$$R_i > \max L$$$. Let $$$S_{1..m} := R_{1..m}$$$, and $$$S_{m+1} := \max L$$$, which together denote the change-points of $$$F(x)$$$.

Now let $$$p_0(x)$$$ be the polynomial $$$1$$$ (storing all polynomials as arrays of coefficients). As you iterate $$$i$$$ over $$$[1, m]$$$, construct the new polynomial:

$$$p_{i}(x) := p_{i-1}(x) \cdot \dfrac {x - L_i} {R_i - L_i}$$$

Then integrate this polynomial over $$$[S_{i+1}, S_{i}]$$$ (the formula being derivable from the power rule for integrating polynomials):

$$$\displaystyle \int\limits_{S_{i+1}}^{S_{i}} p_{i}(x)\,dx = \int\limits_{S_{i+1}}^{S_{i}} \left(\sum_{k=0}^i c_k x^k\right)\,dx = \sum_{k=0}^i \dfrac { c_k \cdot (S_i^{k+1} - S_{i+1}^{k+1}) } { k + 1 }$$$

Subtract each integral from $$$E$$$ and we have our desired expected value. Don't forget to multiply by the required scaling factor $$$(N+1)! \cdot \prod_{i=1}^N (R_i - L_i)$$$ before outputting.

This works in $$$\tilde O(N^2)$$$ as the polynomial increases in length by $$$1$$$ each iteration, taking about $$$\tilde O(n)$$$ to multiply and to integrate. The tildes (~) over the O's indicate hidden $$$\log$$$ factors from modular inverses and exponentiation. It can also be optimized to strict $$$O(N^2)$$$ using the multiple inverse trick and by accumulating the powers instead of exponentiating each time.

Full text and comments »

Spheniscine
4 years ago
9

Unofficial ACL Beginner Contest Editorial

By Spheniscine, history, 4 years ago, In English

A – Repeat ACL

Спойлер

Easy programming-language-knowledge check typical of the first task of AtCoder beginner contests. The following is a passing Kotlin submission:

fun main() {
    val k = readLine()!!.toInt()
    val ans = "ACL".repeat(k)
    println(ans)
}

B – Integer Preference

Спойлер

C – Connect Cities

Спойлер

D – Flat Subsequence

Спойлер

E – Replace Digits

Спойлер

F – Heights and Pairs

Спойлер

Define a bad pair as a pair where the two persons' heights are equal.

Let's define a useful function $$$\displaystyle pairs(n) := \frac{(2n)!}{n!\cdot 2^n}$$$, the number of ways to make $$$n$$$ pairs (good or bad) out of $$$2n$$$ people. Explanation: Start with any of $$$(2n)!$$$ permutations. When you pair them up, you don't care about the order within the pair, so you divide by $$$2$$$ a total of $$$n$$$ times. Then you also don't care about the order between the pairs, so you divide by $$$n!$$$.

The main idea involves the Inclusion–exclusion principle.

In general, principle of inclusion-exclusion (PIE) can be summarized with the following steps:

Identify a set of "violations", i.e. a set of rules you want to obey and specific requirements for each rule to be broken.
Identify a way to calculate $$$F(S)$$$: for every subset $$$S$$$ of possible violations, the number of ways to violate at least that subset (note that we don't need to know how to calculate the number of ways to violate exactly that subset).
The number of ways to obey all rules (i.e. not commit any violations) will then equal $$$(\sum _ {|S| \text{ is even}}F(S)) - (\sum _ {|S| \text{ is odd}}F(S)) = \sum _ S (-1) ^ {|S|} \cdot F(S)$$$ (note that zero is also even)

In this problem there are up to $$$N$$$ possible violations, i.e. the $$$i$$$-th pair of people being of equal height, i.e. a bad pair. Note that we don't actually care about the order of the pairs, so we can just push all the "bad pairs" to the beginning, so we actually only have to worry about $$$O(N)$$$ possible subsets of violations. Let's define $$$G(k) = \sum _ {|S|=k}F(S)$$$, and calculate the final answer as $$$(\sum _ {k \text{ is even}}G(k)) - (\sum _ {k \text{ is odd}}G(k)) = \sum _{k=0} ^N (-1) ^ {k} \cdot G(k)$$$

Let $$$f(k)$$$ be the number of ways to match up $$$k$$$ bad pairs from the $$$2N$$$ given people. We can thus see that $$$G(k) = f(k) \cdot pairs(N-k)$$$; note we don't care about accidentally making more bad pairs, as it's accounted for in the definitions of $$$F(S)$$$ and $$$G(k)$$$.

Now let's figure out how to find $$$f(k)$$$. First note that the exact heights don't matter, only their equality. We can divide the people into components with the same height as each other. Collect all component sizes into array $$$a$$$. For example, if $$$h = [1, 1, 1, 2, 2, 4]$$$, then $$$a = [3, 2, 1]$$$. This can be easily done using sorting and run-length compression.

Then from each component $$$a_i$$$, construct array $$$b_i$$$, indexed over $$$[0, \lfloor a_i/2 \rfloor]$$$, where $$$b_i[j]$$$ denotes how many ways are there to fix $$$j$$$ bad pairs from this component. $$$\displaystyle b_i[j] = \binom {a_i}{2j} \cdot pairs(j)$$$, since we take $$$2j$$$ people from the component and make $$$j$$$ pairs out of them.

In the case of a single component, $$$f(j) = b_1[j]$$$, but how do we combine the results of several components? Let's assume we have only two components. Then we may take $$$0$$$ bad pairs from one component and $$$j$$$ from the other, or $$$1$$$ from one and $$$j-1$$$ from the other, etc., therefore:

$$$\displaystyle f(j) = b_1[0] \cdot b_2[j] + b_1[1]\cdot b_2[j-1] + b_1[2]\cdot b_2[j-2] + ... + b_1[j]\cdot b_2[0] = \sum_{i=0}^j b_1[i] \cdot b_2[j-i] $$$.

This is called a convolution, and is the same formula for when we multiply two polynomials of the form $$$p(x) = B_0x^0 + B_1x^1 + B_2x^2 + ... + B_nx^n$$$ and wishing to find the $$$j$$$-th coefficient. The case with more components is similar, except we must convolve all the $$$b_i$$$ arrays, i.e. find the product of several polynomials of different lengths.

[One can visualize this by treating $$$x^k$$$ as a formal symbol meaning "ways to make $$$k$$$ bad pairs", which naturally add together when their coefficients are multiplied]

However, computing convolutions for two polynomials of size $$$n$$$ and $$$m$$$ naively is $$$O(nm)$$$, which is too slow. There is just the right tool in the AtCoder library for that; the function convolution. This function uses the Fast Fourier Transform method to convolve two polynomials in $$$O((n+m) \log (n+m))$$$. However note that it only accepts two arguments, and convolving multiple polynomials could be inefficient if called in the wrong order; imagine if you had about $$$N/2$$$ polynomials, and all of them are short except for one. Then imagine if you convolved a short one to the longest one over and over; suddenly your submission takes $$$O(N^2 \log N)$$$ and times out.

Fortunately there is a simple fix: just put all the polynomials into a queue. While there is more than one polynomial in that queue, withdraw two from the front, convolve them, then put the result back in the end of the queue.

This works in $$$O(N \log ^2 N)$$$, the worst case being $$$N$$$ polynomials of degree $$$1$$$ (length $$$2$$$), which would take $$$O(\log N)$$$ "cycles" to fully combine, with the convolutions taking a time of $$$O(N \log N)$$$ each cycle.

And thus finally, you are able to obtain $$$f(k)$$$ by reading the coefficients of the final polynomial, and can apply the inclusion-exclusion formula above to obtain the answer.

Full text and comments »

Spheniscine
4 years ago
10

Atcoder Beginner Contest 178 Unofficial Editorial

By Spheniscine, history, 4 years ago, In English

A – Not

Spoiler

B – Product Max

Spoiler

C – Ubiquity

Spoiler

D – Redistribution

Spoiler

E – Dist Max

Spoiler

F – Contrast

Spoiler

Let $$$C$$$ be the array that holds the answer sequence. At first, we can greedily put any value of $$$B$$$ that doesn't match $$$A_i$$$ for each position from left to right.

If this strategy fails, it means that we are in a position where $$$A_i = z$$$ and all remaining members of $$$B$$$ are $$$z$$$. If we put all the remaining $$$z$$$ values into $$$C$$$, the situation looks like this:

A = [   < z   ][           z           ][    > z    ]
C = [ maybe z ][  not z  ][            z            ]
                          {TROUBLE ZONE}

So to fix the "trouble zone", we need to swap the $$$z$$$ values in it with some other values. It can be seen that the only values we can swap with lie in the "maybe $$$z$$$" zone; we can't swap with "not $$$z$$$" as that will cause another collision.

Thus we can iterate through the "maybe $$$z$$$" zone to try to find enough non-$$$z$$$ values to swap with. If we run out of values to swap with without eliminating all collisions, the answer is No, as there are no more values to swap with.

Full text and comments »

Spheniscine
4 years ago
9

Addendum: Optimized variant of Barrett reduction + proof re: ultimate NTT article

By Spheniscine, history, 5 years ago, In English

This is an extension of my article proposing the "ultimate" NTT. Here I will present a simplified and optimized variant of the Barrett reduction algorithm on that page. As it relaxes certain bounds of the standard parameters of the algorithm, I will also present a proof of the correctness of the variant, at least for this specific case.

First, the algorithm:

function mulMod(a: Long, b: Long): Long {
    xh = multiplyHighUnsigned(a, b) // high word of product
    xl = a * b // low word of product

    g = multiplyHighUnsigned( (xh << 2) | (xl >>> 62) , BARR_R)

    t = xl - g * MOD - MOD
    t += (t >> 63) & MOD
    return t
}

Notice that the middle routine has been greatly simplified; instead of computing the high and middle words of $$$xr$$$, we simply take the top 64 bits of $$$x$$$, in effect computing $$$\displaystyle \left\lfloor \left\lfloor \frac{x}{2^{62}} \right\rfloor \frac{r}{2^{64}} \right\rfloor$$$ in place of $$$\displaystyle \left\lfloor \frac {xr}{4^{63}} \right\rfloor$$$. Experimentally, this seems to work (and indeed, this variant is how Barrett reduction is normally implemented for multi-word cases), however a naive adjustment of the bounds of the Barrett reduction algorithm would relax the pre-normalization bound of $$$t$$$ to $$$[-m, 2m)$$$. This is bad news, as there is a possibility of overflowing into the sign bit, thus forcing the use of a modulus one bit shorter (e.g. $$$4611686018326724609$$$, $$$g = 3$$$) so that we could easily correct the result with two normalization rounds.

So the following is a proof of correctness for this relaxed algorithm, at least for the very specific use case in the described NTT:

On the Project Nayuki article on Barrett reduction (note that I will use $$$n$$$ instead of $$$m$$$ for the modulus in accordance to the notation on that page), one important lemma is the inequality

$$$\displaystyle\frac{x}{n} - 1 < \frac{xr}{4^k} ≤ \frac{x}{n} \implies \frac{x}{n} - 2 < \left\lfloor \frac{x}{n} - 1 \right\rfloor ≤ \left\lfloor \frac{xr}{4^k} \right\rfloor ≤ \frac{x}{n}$$$

We shall show that this inequality holds even if $$$\dfrac{xr}{4^k}$$$ is replaced by $$$\dfrac{(x-\delta) r}{4^k}$$$, with $$$0 ≤ \delta < 2^{62}$$$, the bits that have been omitted.

First we need to obtain a tighter bound on $$$r$$$. Given that $$$r$$$, $$$k$$$, and $$$n$$$ are fixed in our algorithm, we can verify that $$$\dfrac {4^k} n - r < 2^{-9}$$$. Let this value be $$$\varepsilon$$$. We thus obtain $$$\displaystyle \frac{4^k}{n} - \varepsilon < r < \frac{4^k}{n}$$$

Multiply by $$$x-\delta \geq 0$$$: $$$\displaystyle (x-\delta)\left(\frac{4^k}{n} - \varepsilon\right) ≤ (x-\delta)r ≤ (x-\delta)\frac{4^k}{n}$$$

Given that $$$\delta$$$ is nonnegative we can relax the right bound to $$$x\dfrac{4^k}{n}$$$.

Divide by $$$4^k$$$: $$$\displaystyle (x-\delta)\left(\frac{1}{n} - \frac \varepsilon {4^k}\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

Recompose the leftmost expression: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) + \frac{\delta\varepsilon} {4^k} ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$\displaystyle\frac{\delta\varepsilon} {4^k} \geq 0$$$, so relax the bound: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$x < n^2 < 4^k \implies \dfrac{x}{4^k} < 1$$$, so further relax the bound: $$$\displaystyle \frac x n - \left({\varepsilon} + \frac \delta n\right) < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$\delta < 2^{62}$$$ while $$$n$$$ is very slightly below $$$2^{63}$$$, so it can be verified that $$$\dfrac \delta n$$$ must be $$$< \dfrac34$$$.

$$$\dfrac34 + \varepsilon < 1$$$, therefore $$$\displaystyle \frac x n - 1 < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

We can thus follow the rest of the proof in the Nayuki article to prove that our $$$t$$$ is in $$$[-n, n)$$$.

Samples (Kotlin, so YMMV on benchmarks for other languages)

74853038 for 993E - Nikita and Order Statistics, 328 ms faster

74852214 for 986D - Perfect Encoding, 452 ms faster

Addendum

There are other applications of this proof besides NTT. For example, I found a safe prime $$$9223372036854771239$$$, which is a good property for a rolling-hash modulus to have, as any number in $$$[2, m-2]$$$ will have multiplicative order at least $$$\dfrac{m-1}{2}$$$. It seems that $$$\varepsilon$$$ and $$$\dfrac\delta n - \dfrac 1 2$$$ will both be rather small for moduli so close below a power of 2.

Full text and comments »

fft, ntt, proof

Spheniscine
5 years ago
0

Notes on FFT / NTT, and the "ultimate" NTT with modulus > 9 * 10^18

By Spheniscine, history, 5 years ago, In English

Prerequisites: you need to be familiar with both modular arithmetic and Fast Fourier Transform / number theoretic transform. The latter can be a rather advanced topic, but I personally found this article helpful. Note that there are some relatively easy optimizations/improvements that can be done, like precalculating the modular inverse of $$$n$$$, the "bit-reversed-indices" (can be done in $$$O(n)$$$ with DP), as well as the powers of $$$\omega_n$$$. Also useful is modifying it so that the input array length can be any power of two $$$\leq n$$$; some problems require multiplying polynomials of many different lengths, and you'd rather the runtime be $$$O(n \log n)$$$ over the sums of lengths, rather than (number of polynomials * maximum capacity).

Also helpful is knowing how to use NTT to solve problems modulo $$$998244353$$$, like 1096G - Lucky Tickets and 1251F - Red-White Fence. Note that for some problems it's easier to think of FFT not as multiplying polynomials, but of finding multiset $$$C$$$ as the pairwise sums of multisets $$$A$$$ and $$$B$$$, in the form of arrays $$$A[i] =$$$ number of instances of $$$i$$$ in multiset $$$A$$$. This is equivalent to multiplying polynomials of the form $$$\sum _{i=0} ^n A[i]x^i$$$.

Note that $$$\omega_n$$$ can be easily found via the formula $$$g ^ {(m-1) / n} \ \text{ mod } m$$$, provided that:

$$$m$$$ is prime
$$$g$$$ is any primitive root modulo $$$m$$$. It is easiest to find this before hand and then hardcode it in the submission. You can either implement the algorithm yourself or use Wolfram Alpha to find it via the query PrimitiveRoot[m]. (Spoiler alert, $$$g = 3$$$ works well for $$$998244353$$$)
$$$n$$$ divides $$$m-1$$$ evenly. As $$$n$$$ is typically rounded up to the nearest power of 2 for basic FFT implementations, this is easiest when $$$m$$$ is of the form $$${a \cdot 2^{k} + 1}$$$ where $$$2^k \geq n$$$. This is why $$$998244353$$$ is a commonly-appearing modulus; it's $$$119 \cdot 2^{23} + 1$$$. Note that this modulus also appears in many problems that don't require FFT/NTT; this is a deliberate "crying-wolf" strategy employed by puzzle writers, so that you can't recognize immediately that a problem requires FFT/NTT via the given modulus.

Now onto the main topic, the "ultimate" NTT.

Rationale: There are a few problems like 993E - Nikita and Order Statistics that require FFT, however the results aren't output with any modulus, and indeed may exceed the range of a 32-bit integer type. There are several usual solutions for these types of problems:

Do NTT with two moduli and restore the result via Chinese Remainder Theorem. This has several prominent disadvantages:
1. Slow, as the NTT routine has to be done twice.
2. Complicated setup, as several suitable moduli must be found, and their primitive roots calculated
3. Restoring the result with CRT requires either brute force or multiplications modulo $$$pq$$$, which may overflow even 64-bit integer types.
Do FFT with complex numbers and floating point types. Disadvantages are:
1. Could be slow due to heavy floating-point arithmetic. Additionally, JVM-based languages (Java, Kotlin, Scala) suffer complications here, as representing complex numbers with object-based tuples adds a significant overhead.
2. Limited precision due to rounding error. Typically the problems are constructed such that it won't be a problem if care is taken in the implementation, but won't it be nice to just not to have to worry about it?

To solve these problems, I propose the "ultimate" NTT solution — just use one huge modulus. The one I use is $$$m = 9223372036737335297 = 549755813881 \cdot 2^{24} + 1, g = 3$$$. This is just over a hundred million less than $$$2^{63} - 1$$$, the maximum value of a signed 64-bit integer.

However, this obviously raises the issue of how to safely do modular arithmetic with such huge integers. Addition is complicated by possible overflow into the sign bit, thus the usual if(x >= m) x -= m won't work. Instead, first normalize $$$x$$$ into the range $$$[-m, m)$$$; this is easily done with subtracting $$$m$$$ before any addition operation. Then do x += (x >> 63) & m. This has the effect of adding $$$m$$$ to $$$x$$$ if and only if $$$x$$$ is negative.

The elephant in the room however is multiplication. The usual method requires computing a 126-bit product, then doing a modulo operation over a 63-bit integer; this could be slow and error-prone to implement. C++ users could rejoice, as Codeforces recently implemented support for 128-bit integers via this update. However, before you celebrate too early, there are still issues: this may not be available on other platforms, and I can't imagine that straight 128-bit modulo 64-bit is exactly the fastest operation in the world, so the following might still be helpful even to C++ users.

A rather simple-to-code general-case method is to "double-and-add" similar to modular exponentiation, however it is too slow for the purposes of this article; FFT implementation speed is heavily bound by the speed of multiplications.

My favored technique for this problem is called Barrett reduction. The explanation is (adapted from this website)

Choose integer $$$k$$$ such that $$$2^k > m $$$. $$$k = 63$$$ works for our modulus.
Precompute $$$\displaystyle r = \left\lfloor \frac {4^k} m \right\rfloor$$$. For our modulus this is $$$9223372036972216319$$$, which just overflows a signed 64-bit integer. You can either store it as unsigned, or the as the signed 64-bit representation $$$-9223372036737335297$$$ (yes, this just happens to be $$$-m$$$, which is quite a neat coincidence)
Multiply the two integers. Note that we need all 126 bits of this product. The low bits can be easily obtained via straight multiplication, however the high bits need some bit-shifting arithmetic tricks to obtain. Adapt the following Java code, taken from here for your preferred language:

multiplyHighUnsigned code

public static long multiplyHighUnsigned(long x, long y) {
    long x_high = x >>> 32;
    long y_high = y >>> 32;
    long x_low = x & 0xFFFFFFFFL;
    long y_low = y & 0xFFFFFFFFL;
        
    long z2 = x_low * y_low;
    long t = x_high * y_low + (z2 >>> 32);
    long z1 = t & 0xFFFFFFFFL;
    long z0 = t >>> 32;
    z1 += x_low * y_high;
    return x_high * y_high + z0 + (z1 >>> 32);
}

Let the product be $$$x$$$. Then calculate $$$\displaystyle t = x - \left\lfloor \frac{xr}{4^k} \right\rfloor m - m$$$. This requires some adaptation of grade-school arithmetic; pseudocode in spoiler below. $$$t$$$ is guaranteed to be in the range $$$[-m, m)$$$, so the t += (t >> 63) & m trick should work to normalize it. In the following pseudocode, BARR_R $$$= r$$$ and MOD $$$= m$$$. Also, ^ represents bitwise xor, not exponentiation.

mulMod pseudocode

function mulMod(a: Long, b: Long): Long {
    xh = multiplyHighUnsigned(a, b) // high word of product
    xl = a * b // low word of product

    xrh = multiplyHighUnsigned(xh, BARR_R) // high word of xr
    xrm = multiplyHighUnsigned(xl, BARR_R) // middle word of xr, first part
    add = xh * BARR_R // second part of middle word
    xrm += add // add them
    if(add ^ (1L << 63) > xrm ^ (1L << 63)) xrh++ // carry, see note 1

    t = xl - ((xrh << 2) | (xrm >>> 62)) * MOD - MOD // see note 2
    t += (t >> 63) & MOD
    return t
}

Note 1: The carry step does a little trick for unsigned comparison (flip high bits and compare as signed). If the addend is greater than the result, an overflow must have occurred, so we perform the carry.

Note 2: $$$t$$$ fits in 64 bits, so we only need the low word of $$$x$$$. We reconstruct $$$\left\lfloor \frac{xr}{4^{63}} \right\rfloor$$$ with the bits from xrh and xrm. Note we don't need the low word of $$$xr$$$ as it's discarded by the floor operation. When we multiply this factor with $$$m$$$, we ignore the overflow as the final result is guaranteed to fit in 64 bits.

Update: I have further optimized the Barrett reduction step, however it requires a new proof of correctness, so I'm keeping the old variant on this page. For info on the optimized version, read this article.

And that's it for this blogpost. This should be sufficient to solve most integer-based FFT problems in competitive programming. Note that if you need negative numbers in your polynomials, you can input them modulo $$$m$$$, then assume any integer in the output that exceeds $$$m/2$$$ is negative; this works as long as no number in the output has an absolute value greater than $$$m/2$$$, yet another advantage of using such a huge modulus.

Note that for arbitrary polynomials of length $$$n$$$ with input values not exceeding $$$a$$$, the maximum possible value in the product polynomial is around $$$a^2n$$$.

Samples

74644979 for 993E - Nikita and Order Statistics

74641269 for 986D - Perfect Encoding — this is particularly interesting, because this huge modulus allows 6 digits per word in the custom big-integer implementation instead of the recommended 3, cutting down the FFT size (as well as the size of other arithmetic operations) by half. With this tight a time limit, I'm not sure this problem is solvable in Kotlin otherwise.

Addendum

There are a small minority of problems that may ask to perform FFT with an inconvenient modulus like $$$10^9 + 7$$$, or an arbitrarily given one in the input (do any exist on Codeforces?). $$$a^2n$$$ in this case could overflow even this large modulus, but it can be handled with the "multiplication with arbitrary modulus" method listed in the CP-algorithms article. (I heard this is called MTT; also known as "FFT-mod") Even then, the large modulus this article covers might be helpful, as it reduces the number of decompositions you must make.

More addendum (December 2021)

It seems this method is significantly faster when run on a 64-bit system:

130750094 (1808 ms, Rust ran on 32-bit before an update circa November, IIRC)

140255354 (576 ms, after 64-bit update)

It even beats out my implementations using floats (140229918, 717 ms) and CRT/Garner's algorithm with two 32-bit moduli (140230689, 842 ms)

Unfortunately Java and Kotlin, at this time of writing, still seems to run on a VM that's effectively 32-bit.

There is also another interesting prime modulus, as noted in this offsite blogpost: $$$18446744069414584321$$$ (hex: 0xffff_ffff_0000_0001, $$$g = 7$$$). Due to its bitpattern, you can use some adds, subtracts, and shifts to do the reduction under this modulus, instead of the multiplies needed by Barrett reduction.

However my implementation of it ended up slightly slower (140308499, 639 ms). It seems multiplies are just that good on 64-bit systems. It might still be worth considering for its slightly higher range, huge capacity (can theoretically support FFTs up to $$$2^{32}$$$ in length, though you'd likely MLE first). It might also be friendlier to 32-bit systems (needs testing). This modulus, as noted in the blog post, also has other interesting properties; chiefly that $$$2$$$ is a $$$192^{\text{nd}}$$$ root of unity; this allows some roots of unity to be expressed in powers of two, which allows using shifts instead of multiplication in some cases. However I've not figured out how to efficiently use this fact (it might be more helpful for mixed-radix variants of FFT, which I've not learned how to implement).

Either modulus can also be used to emulate convolutions in other arbitrary moduli, such as $$$10^9 + 7$$$ (https://judge.yosupo.jp/submission/70563) and $$$2^{64}$$$ (https://judge.yosupo.jp/submission/70564), by splitting each coefficient into $$$k$$$ "place values" for a particular base (I use $$$2^{16}$$$ in these examples), then assigning $$$2k - 1$$$ slots for each coefficient (the extra slots allow the convolution to "spill over"), essentially emulating multivariate convolution with a single FFT. The speed overhead of this method, however, is quite significant; my implementation using CRT/Garner's (https://judge.yosupo.jp/submission/70678) is much faster in this case.

Full text and comments »

fft, ntt, modular arithmetic

+183

Spheniscine
5 years ago
4

Issues with math formatting? [FIXED]

By Spheniscine, history, 5 years ago, In English

I have noticed some of my older blogs and comments have some of their $$$\TeX$$$ mathematical expressions incorrectly rendered in their own line instead of in the line of text, which can be really distracting to read.

Examples: https://codeforces.me/blog/entry/72593 ($$$\mathbb Z /p \mathbb Z$$$, and the large numbers in the spoilers)

https://codeforces.me/blog/entry/73211#comment-575285 This comment had it especially bad

I'm not sure what the cause of this is, and there is no discernible pattern as to which expressions and comments are affected.

Upd: It also affects others' comments too, like: https://codeforces.me/blog/entry/73211?#comment-575152

Full text and comments »

codeforces, bug, latex

Spheniscine
5 years ago
0

Tutorial for Advent of Code 2019 day 22 part 2

By Spheniscine, history, 5 years ago, In English

Advent of Code is a website that releases programming puzzles every December from the 1st to the 25th. The puzzles have two parts, and the second part isn't revealed until you solve the first part.

This year, there have been many questions regarding part 2 of day 22, as it involves modular arithmetic in a way that hasn't been seen in previous puzzles there. I have been asked for a tutorial for it on Reddit, but I'm posting it here due to better support for mathematical notation.

Part 2 summary

First off, if you are unfamiliar with modular arithmetic, I encourage you to read my other blog post, Modular Arithmetic for Beginners. The most important thing to understand from there is how the four basic operations (addition, subtraction, multiplication, division) can be redefined to work in modular ($$$\mathbb Z / p \mathbb Z$$$ field) arithmetic. I also recommend you try some of the puzzles linked (This one is appropriately Christmas-themed too). The terminology and notations I'll use will also be similar.

It is possible to solve this without explicitly invoking modular arithmetic (see this post for an example) but they basically amount to rediscovering certain properties of modular arithmetic anyway, as such that's the "language" I'll use for this tutorial.

The first thing to notice is that each of the given instructions amount to a transformation on the position of each card. Let $$$m$$$ represent the number of cards in the deck.

"deal into new stack" moves cards from position $$$x$$$ to position $$$m - x - 1$$$. We can write this as $$$f(x) = m - x - 1$$$.
"cut $$$n$$$" moves cards from position $$$x$$$ to position $$$x - n\ \text{ mod } m$$$ (note how indices "wrap around" to the other side), Thus $$$f(x) = x - n\ \text{ mod } m$$$. Note this also works for the version with negative $$$n$$$.
"deal with increment $$$n$$$" moves cards from position $$$x$$$ to position $$$n \cdot x\ \text{ mod } m$$$. Thus, $$$f(x) = n \cdot x\ \text{ mod } m$$$

The next thing to notice is that each transformation can be rewritten in the form of $$$f(x) = ax + b\ \text{ mod } m$$$. This is called a linear congruential function, and forms the basis for linear congruential generators, a simple type of pseudorandom number generator.

"deal into new stack": $$$f(x) = -x - 1\ \text{ mod } m$$$, so $$$a = -1, b = -1$$$
"cut $$$n$$$": $$$f(x) = x - n\ \text{ mod } m$$$, so $$$a = 1, b = -n$$$
"deal with increment $$$n$$$": $$$f(x) = n \cdot x\ \text{ mod } m$$$, so $$$a = n, b = 0$$$

The next step is to see what happens when you compose two arbitrary linear congruential functions $$$f(x) = ax + b\ \text{ mod } m$$$ and $$$g(x) = cx + d\ \text{ mod } m$$$. So what do you get if you evaluate $$$g(f(x))$$$? By substitution:

$$$g(f(x)) = c(ax + b) + d\ \text{ mod } m$$$

As I established in Modular Arithmetic for Beginners, algebra works with modular residues similarly to real numbers, so let's expand:

$$$g(f(x)) = acx + bc + d\ \text{ mod } m$$$

An alternate notation for $$$g(f(x))$$$ is $$$f\ ;g(x)$$$ ("first apply $$$f$$$, then apply $$$g$$$"). Further more, we can abstract all linear congruential functions (LCF) as its own data type, consisting of a tuple $$$(a, b)$$$. Thus we can implement a compose operation between two LCFs:

$$$(a, b)\ ; (c, d) = (ac \text{ mod } m, bc + d\ \text{ mod } m)$$$

We store all LCF coefficients modulo $$$m$$$, to avoid them growing too big and slowing down our program or taking all available memory. Also note the possible pitfall: when multiplying two residue values, it's possible to overflow even the 64-bit data type, as our modulus for part 2 exceeds $$$2^{32}$$$. If you use a language with an arbitrary-size number type (e.g. Java's BigInteger), that'd do. If you don't, you might consider either importing a third-party library, or you can implement "multiplication by doubling" via a process similar to "exponentiation by squaring" described on the Modular Arithmetic for Beginners page. It's not the most efficient way to get around this problem, but for our purposes it'd work fine.

With these methods, you can translate all the steps in your puzzle input into LCFs, then compose them successively ($$$f_1\ ; f_2\ ; f_3\ ;...$$$) into a single LCF. Let's call it $$$F$$$. Recall that we've stored it as a tuple $$$(a, b)$$$ representing the equation $$$F(x) = ax + b\ \text{ mod } m$$$.

$$$F$$$ now represents a single shuffle as directed by your input. You can test your implementation by re-solving part 1 with it. (note that you have to re-compose it with a different modulus for each part.)

However, we require over a hundred trillion shuffles (we'll call this number $$$k$$$), so we can't just compose $$$F$$$ into itself naively. There are two approaches to solving this problem:

Method 1:

Compose $$$F$$$ into itself $$$k$$$ times algebraically. Do a few rounds by hand on paper and you'll notice a pattern emerge:

$$$F^k(x) = a^k x + (a^{k-1} + a^{k-2} + ... + a^1 + 1) b\ \text{ mod } m$$$. The more-formal notation would be $$$\displaystyle F^k(x) = a^k x + \sum_{i=0}^{k-1} ba^i\ \text{ mod } m$$$.

The summation forms a geometric series. According to that Wikipedia page, we can thus transform the expression to: $$$\displaystyle F^k(x) = a^k x + \frac{ b(1 - a^k) } { 1 - a } \ \text{ mod } m$$$

Exponentiation by squaring is required to perform the exponentiations, and modular multiplicative inverse to perform the division.

Method 2:

Notice that composing LCFs are associative, i.e. $$$(f\ ; g)\ ; h = f\ ; (g\ ; h)$$$.

Proof: $$$h(g(f(x))) = h(f\ ; g(x)) = g\ ; h(f(x)) = f\ ; g\ ; h(x)$$$

Note that this doesn't uniquely apply to LCFs; composition of any "pure functions" (i.e. always returns the same output for a given input) is associative. This is computationally useful for any function that can be composed in a similar manner to what we do here with LCFs before invoking them with actual parameters.

The "exponentiation by squaring" method works on all associative operations, so it can be adapted to the compose operation as well. We can thus obtain $$$F^k$$$, $$$F$$$ composed into itself $$$k$$$ times.

For example, we could adapt the pseudocode for exponentiation by squaring (from Modular Arithmetic for Beginners) like this:

function pow_compose(f: lcf, k: int64) → lcf:
    g := lcf(1, 0)
    while k > 0:
        if k is odd:
            g := compose(g, f)
        k := ⌊k/2⌋
        f := compose(f, f)
    return g

Notice how multiplication is replaced with the compose function we used to compose the steps in the input, and the initial multiplicative identity of 1 is replaced with the "identity LCF" $$$f(x) = x \text{ mod } m$$$, i.e. $$$(1, 0)$$$

Continued:

Now for the last step. Notice that we're not asked for where card $$$x$$$ ends up, but rather what card is in position $$$x$$$. To do that, we need to invert the function $$$F^k$$$. Let $$$F^k(x) = Ax + B\ \text{ mod } m$$$.

We invert it by substitution: $$$x = A \cdot F^{-k}(x) + B\ \text{ mod } m$$$, therefore $$$F^{-k}(x) = \dfrac {x - B} A\ \text{ mod } m$$$. Again this requires an application of modular multiplicative inverse.

Plug in $$$2020$$$ for $$$x$$$, and we should finally get our solution. Whew.

Addendum: matrix representation of LCFs

Full text and comments »

advent of code, modular arithmetic

Spheniscine
5 years ago
1

Modular Arithmetic for Beginners

By Spheniscine, history, 5 years ago, In English

Introduction

If you're new to the world of competitive programming, you may have noticed that some tasks, typically combinatorial and probability tasks, have this funny habit of asking you to calculate a huge number, then tell you that "because this number can be huge, please output it modulo $$$10^9 + 7$$$". Like, it's not enough that they ask you to calculate a number they know will overflow basic integer data types, but now you need to apply the modulo operation after that? Even worse are those that say you need to calculate a fraction $$$\frac pq$$$ and ask you to output $$$r$$$ where $$$r \cdot q \equiv p \pmod m$$$... not only do you have to calculate a fraction with huge numbers, how in the world are you going to find $$$r$$$?

Actually, the modulo is there to make the calculation easier, not harder. This may sound counterintuitive, but once you know how modular arithmetic works, you'll see why too. Soon you'll be solving these problems like second nature.

Terminology and notation

For convenience, I will define the notation $$$n \text{ mod } m$$$ (for integers $$$n$$$ and $$$m$$$) to mean $$$n - \left\lfloor \dfrac nm \right\rfloor \cdot m$$$, where $$$\lfloor x \rfloor$$$ is the largest integer that doesn't exceed $$$x$$$. (This should always produce an integer between $$$0$$$ and $$$m-1$$$ inclusive.) This may or may not correspond to the expression n % m in your programming language (% is often called the "modulo operator" but in some instances, it's more correct to call it the "remainder operator"). If -8 % 7 == 6, you're fine, but if it is -1, you'll need to adjust it by adding $$$m$$$ to any negative results. If you use Java/Kotlin, the standard library function Math.floorMod(n, m) does what we need.

Also for convenience, I will also define the $$$\text{ mod }$$$ operator to have lower precedence than addition or subtraction, thus $$$ax + b\ \text{ mod } m \Rightarrow (ax + b) \text{ mod } m$$$. This probably does not correspond with the precedence of the % operator.

The value $$$m$$$ after the modulo operator is known as the modulus. The result of the expression $$$n \text{ mod } m$$$ is known as $$$n$$$'s residue modulo $$$m$$$.

You may also sometimes see the notation $$$expr_1 \equiv expr_2 \pmod m$$$. This is read as "$$$expr_1$$$ is congruent to $$$expr_2$$$ modulo $$$m$$$", and is shorthand for $$$expr_1 \text{ mod } m = expr_2 \text{ mod } m$$$.

"Basic" arithmetic

First off, some important identities about the modulo operator:

$$$(a \text{ mod } m) + (b \text{ mod } m)\ \text{ mod } m = a + b\ \text{ mod } m$$$

$$$(a \text{ mod } m) - (b \text{ mod } m)\ \text{ mod } m = a - b\ \text{ mod } m$$$

$$$(a \text{ mod } m) \cdot (b \text{ mod } m)\ \text{ mod } m = a \cdot b\ \text{ mod } m$$$

These identities have the very important consequence in that you generally don't need to ever store the "true" values of the large numbers you're working with, only their residues $$$\text{ mod } m$$$. You can then add, subtract, and multiply with them as much as you need for your problem, taking the modulo as often as needed to avoid integer overflow. You may even decide to wrap them into their own object class with overloaded operators if your language supports them, though you may have to be careful of any object allocation overhead. If you use Kotlin like I do, consider using the inline class feature.

But what about division and fractions? That's slightly more complicated, and requires a concept called the "modular multiplicative inverse". The modular multiplicative inverse of a number $$$a$$$ is the number $$$a^{-1}$$$ such that $$$a \cdot a^{-1}\ \text{ mod } m = 1$$$. You may notice that this is similar to the concept of a reciprocal, but here we don't want a fraction; we want an integer, specifically an integer between $$$0$$$ and $$$m-1$$$ inclusive.

But how do you actually find such a number? Bruteforcing all numbers to a prime number close to a billion will usually cause you to exceed the time limit. There are two faster ways to calculate the inverse: the extended GCD algorithm, and Fermat's little theorem. Though the extended GCD algorithm is more versatile and sometimes slightly faster, the Fermat's little theorem method is more popular, simply because it's almost "free" once you implement exponentiation, which is also often a useful operation in itself, so that's what we'll cover here.

Fermat's little theorem says that as long as the modulus $$$m$$$ is a prime number ($$$10^9 + 7$$$ is prime, and so is $$$998\ 244\ 353$$$, another common modulus in these problems), then $$$a^m \text{ mod } m = a \text{ mod } m$$$. Working backwards, $$$a^{m-1} \text{ mod } m = 1 = a \cdot a^{m-2}\ \text{ mod } m$$$, therefore the number we need is $$$a^{m-2} \text{ mod } m$$$.

Note that this only works for $$$a \text{ mod } m \neq 0$$$, because there is no number $$$x$$$ such that $$$0 \cdot x\ \text{ mod } m = 1$$$. In other words, you still can't divide by $$$0$$$, sorry.

Multiplying $$$m-2$$$ times would still take too long; therefore a trick known as exponentiation by squaring is needed. It's based on the observation that for positive integer $$$n$$$, that if $$$n$$$ is odd, $$$x^n=x( x^{2})^{\frac{n - 1}{2}}$$$, while if $$$n$$$ is even, $$$x^n=(x^{2})^{\frac{n}{2}}$$$. It can be implemented recursively by the following pseudocode:

function pow_mod(x, n, m):
    if n = 0 then return 1
    t := pow_mod(x, ⌊n/2⌋, m)
    if n is even:
        return t · t  mod m
    else:
        return t · t · x  mod m

Or iteratively as follows:

function pow_mod(x, n, m):
    y := 1
    while n > 0:
        if n is odd:
            y := y · x  mod m
        n := ⌊n/2⌋
        x := x · x  mod m
    return y

Now that you know how to compute the modular multiplicative inverse (to refresh, $$$a^{-1} = a^{m-2} \text{ mod } m$$$ when $$$m$$$ is prime), you can now define the division operator:

$$$a / b\ \text{ mod } m = a \cdot b^{-1}\ \text{ mod } m$$$

This also extends the $$$\text{ mod }$$$ operator to rational numbers (i.e. fractions), as long as the denominator is coprime to $$$m$$$. (Thus the reason for choosing a fairly large prime; that way puzzle writers can avoid denominators with $$$m$$$ as a factor). The four basic operations, as well as exponentiation, will still work on them as usual. Again, you generally never need to store the fractions as their "true" values, only their residues modulo $$$m$$$.

Congratulations! You have now mastered $$$\mathbb Z / p \mathbb Z$$$ field arithmetic! A "field" is just a fancy term from abstract algebra theory for a set with the four basic operators (addition, subtraction, multiplication, division) defined in a way that works just like you've learned in high-school for the rational and real numbers (however division by zero is still undefined), and $$$\mathbb Z / p \mathbb Z$$$ is just a fancy term meaning the set of integers from $$$0$$$ to $$$p - 1$$$ treated as residues modulo $$$p$$$.

This also means that algebra works much like the way you learned in high school. How to solve $$$3 = 4x + 5\ \text{ mod } 10^9 + 7$$$? Simply pretend that $$$x$$$ is a real number and get $$$x = -1/2\ \text{ mod } 10^9 + 7 = 500000003$$$. (Technically, all $$$x$$$ whose residue is $$$500000003$$$, including rationals, will satisfy the equation.)

You can also now take advantage of combinatoric identities, like $$$\displaystyle \binom{n}{k} = \frac{n!}{k! (n-k)!}$$$. The factorials can be too big to store in their true form, but you can store their modular residues instead, then use modular multiplicative inverse to do the "division".

There are only a few things you need to be careful of, like:

divisions through modular multiplicative inverse would be slower than the other operations ($$$O (\log m)$$$ instead of $$$O(1)$$$), so you may want to cache/memoize the inverses you use frequently in your program.
comparisons (once you represent a number by its modulo residue, comparisons are generally meaningless, as your $$$1$$$ might "really" be $$$m + 1$$$, $$$10^{100} m + 1$$$, $$$-5m + 1$$$, or even $$$\dfrac {1} {m+1}$$$)
exponentiation (when evaluating $$$x^n \text{ mod } m$$$, you can't store $$$n$$$ as $$$n \text{ mod } m$$$. If $$$n$$$ turns out to be really huge, you need to calculate it modulo $$$\varphi(m)$$$ instead, where $$$\varphi$$$ stands for Euler's totient function. If $$$m$$$ is prime, $$$\varphi(m) = m - 1$$$. Note that this new modulus will then usually not be prime, thus "division" in it will not be reliable (you can still use the extended GCD algorithm, but only for numbers coprime to the new modulus), but you can still use the other three operators. In abstract algebra theory, $$$\mathbb Z / n \mathbb Z$$$ is a "ring" rather than a "field" when $$$n$$$ isn't prime due to this loss). Do be careful about the special case $$$0^0$$$, which should typically be defined as 1, while $$$0^{\varphi(m)}$$$ would still be $$$0$$$.

Puzzles

Here are some simpler puzzles that require a modulo answer:

1281C - Cut and Paste

1279D - Santa's Bot

1178C - Tiles

1248C - Ivan the Fool and the Probability Theory

935D - Fafa and Ancient Alphabet

300C - Beautiful Numbers

Full text and comments »

modulo, modular arithmetic

Spheniscine
5 years ago
26

Notes on using Kotlin for competitive programming

By Spheniscine, history, 5 years ago, In English

I pretty much exclusively use Kotlin for competitive programming, mostly because it's the language I'm currently most comfortable with. Here are some scattered notes and tidbits about my experience which I think might be useful to others; if you have any tips/suggestions, feel free to let me know.

Primer

Kotlin has an official primer for competitive programming. However, the IO code suggested there is only so-so; it's definitely better than Scanner, but you definitely can save a lot of runtime in heavy input problems by using the classic Java combination of BufferedReader + StringTokenizer

My current IO template

@JvmField val INPUT = System.`in`
@JvmField val OUTPUT = System.out

@JvmField val _reader = INPUT.bufferedReader()
fun readLine(): String? = _reader.readLine()
fun readLn() = _reader.readLine()!!
@JvmField var _tokenizer: StringTokenizer = StringTokenizer("")
fun read(): String {
    while (_tokenizer.hasMoreTokens().not()) _tokenizer = StringTokenizer(_reader.readLine() ?: return "", " ")
    return _tokenizer.nextToken()
}
fun readInt() = read().toInt()
fun readDouble() = read().toDouble()
fun readLong() = read().toLong()
fun readStrings(n: Int) = List(n) { read() }
fun readLines(n: Int) = List(n) { readLn() }
fun readInts(n: Int) = List(n) { read().toInt() }
fun readIntArray(n: Int) = IntArray(n) { read().toInt() }
fun readDoubles(n: Int) = List(n) { read().toDouble() }
fun readDoubleArray(n: Int) = DoubleArray(n) { read().toDouble() }
fun readLongs(n: Int) = List(n) { read().toLong() }
fun readLongArray(n: Int) = LongArray(n) { read().toLong() }

@JvmField val _writer = PrintWriter(OUTPUT, false)
inline fun output(block: PrintWriter.() -> Unit) { _writer.apply(block).flush() }

The output function allows me to wrap all code within the main function in an output { ... } block, and call println etc. within it. It's automatically flushed when the block is finished.

Useful features

A lot less boilerplate than Java. Members are public by default. Type inference means a lot less "Pokémon speak". Variables and functions can be declared straight in the top-level of the file. (basically the equivalent of static functions). Fields have implicit getters and setters that can easily be overridden when necessary.
PHP-like string templates, e.g. "$id $cost"
Extension functions – syntactic sugar for static functions; gives more natural "afterthought" syntax, as well as allowing direct access to public members of the receiver
data classes – basically custom tuples. Allows convenient destructuring declarations too.
Has access to the data structures in the Java standard library (TreeMap, HashMap, PriorityQueue etc.), and also can use BigInteger and BigDecimal if needed
Functional idioms for collection manipulation – map, fold, filter, etc.
Sequences – lazy sequence generators, potentially infinite, has standard collection manipulation functions too. Using the sequence { ... } block function allows building a sequence using a scoped yield(value) function, reminiscent of Python
inline classes – allows the creation of a new type that wraps over a base type, but that is represented by an underlying type at runtime. Especially useful for "modulo $$$10^9 + 7$$$" problems, as I keep code for a ModInt class that overloads the arithmetic operators appropriately, but is represented as a plain int in JVM runtime. Keep in mind that they are experimental as of Kotlin 1.3, but that's fine for CP in my opinion
unsigned integer types in the standard library that use the inline class feature. Not used very often, but handy if needed
inline functions – tells the compiler to inline the function to call sites. Useful for higher-order functions (JVM won't need to create a new object for the lambda) as well as small functions that are called very often; basically, anything you might use a macro for in C++, you probably want to use an inline fun or inline val for
tailrec fun – tail recursion optimization
run block function – great way to write code that needs to shortcut (e.g. return@run "NO") without having to write a new function and pass every relevant argument
functions in functions – functions can be defined within e.g. the main function, so again, no having to pass lots of arguments or global variables. Keep in mind that these are represented as objects during runtime. It's too bad they can't be inline as of yet

Potential pitfalls

Generic wrappers for JVM primitive types can cause TLE for some problems. Use primitive arrays (IntArray etc.) whenever possible to avoid this, but see next point
Inherits the hack-prone quicksort from Java for primitive arrays. Easiest solution is to use generic arrays or lists instead, but due to the performance benefit of primitive arrays, I've took the trouble to write code that shuffles them randomly before sorting.
For Kotlin versions < 1.3.30, there is a bug that will throw an exception when using .asJavaRandom() on instances of kotlin.random.Random, including Kotlin's default instance. Either use Java's own Random class, or steal this wrapper:

class JavaRandom(val kt: Random) : java.util.Random(0) {
    override fun next(bits: Int): Int = kt.nextBits(bits)
    override fun setSeed(seed: Long) {}
}

Full text and comments »

kotlin

Spheniscine
5 years ago
25