A brief introduction into the applications of square root decomposition↵
==================↵
↵
Square root decomposition is the process of separating a structure of size $O(N)$ into $O(\sqrt{N})$ "blocks" of size $O(\sqrt{N})$ each, in a way that aids manipulation of the entire structure. Square root decomposition is extremely versatile. Some of its most well-known use cases are:↵
↵
- answering queries on a static array, with methods such as Mo's algorithm or block precomputation↵
- "lazy" brute force modifications, where it is easy to query an entire block, but tag pushing isn't obvious↵
- separating objects based on a threshold, when there exists both an $O(x)$ algorithm and an $O(n/x)$ algorithm↵
↵
In this tutorial we will walk through multiple types of square root decomposition.↵
↵
Answering queries on a static array, offline (Mo's algorithm)↵
---↵
↵
Consider a problem where we are asked to find the answer for certain intervals $[l,r]$. We can't quickly compute the answer for an arbitrary interval, but we know how to transition quickly from $[l,r]$ to $[l,r\pm1]$ and $[l\pm1,r]$, reusing the information maintained while answering $[l,r]$. The number of transitions we need to get from $[l_1,r_1]$ to $[l_2,r_2]$ is $|l_1-l_2|+|r_1-r_2|$.
↵
If there were only two intervals to answer, such transitioning wouldn't help us. However, if there are *many* intervals, choosing a good transition route drastically reduces the total time needed. Finding the best route quickly is allegedly NP-hard, so we will focus on finding a "good enough" route instead.
↵
Consider the following scheme for 2D Manhattan TSP: We do square root decomposition on the $x$ coordinate, separating the grid into vertical blocks, $n$ cells tall and $B$ cells wide. There are $n/B$ such blocks.↵
↵
For an arbitrary block, call the number of points we need to visit in it $C$. To get to all of them, we can visit them in order from top to bottom: we spend at most $O(BC)$ transitions moving left and right, and $O(n)$ transitions going down.↵
↵
If for each block we need $O(BC+n)$ transitions, in total we will need $O(\sum(BC+n))=O(Bq+n^2/B)$ transitions, since the $C$'s over all blocks sum to $q$. Selecting a block size of $B=O(\sqrt{n})$ gives a total transition count of $O(q\sqrt{n}+n^2/\sqrt{n})=O((n+q)\sqrt{n})$.
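
As a side note, here is a minimal sketch of picking the block size at runtime (the helper name `pick_block_size` is my own): minimizing $qB+n^2/B$ exactly gives $B\approx n/\sqrt q$, though plain $\sqrt n$ is just as good asymptotically when $q=O(n)$.

```cpp
#include <algorithm>
#include <cmath>

// Minimizing q*B + n*n/B gives B ~ n / sqrt(q); plain sqrt(n)
// is equally good asymptotically when q = O(n).
int pick_block_size(int n, int q) {
    return std::max(1, (int)(n / std::sqrt((double)std::max(q, 1))));
}
```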
↵
Sorting the queries to achieve this ordering can be implemented as follows:
↵
```cpp↵
int B; // block size, typically around sqrt(n)

struct query {
    int l, r, id;
    bool operator<(const query& o) const {
        if (l / B == o.l / B) {
            // same block: sort by the right endpoint
            return r < o.r;
        } else {
            // different blocks: sort by block number
            return l < o.l;
        }
    }
};
```↵
↵
Observe that between consecutive blocks, the right pointer has to run from the bottom of the "grid" all the way back up. We can avoid this with a zig-zag pattern, gaining a bit of a constant-factor optimization:
↵
```cpp↵
int B; // block size, typically around sqrt(n)

struct query {
    int l, r, id;
    bool operator<(const query& o) const {
        if (l / B == o.l / B) {
            // same block: alternate the sweep direction of the right endpoint
            if ((l / B) % 2 == 0) return r < o.r;
            else return r > o.r;
        } else {
            return l < o.l;
        }
    }
};
```↵
↵
In the following we will look at some examples.↵
↵
### Range distinct query (SPOJ DQUERY)↵
↵
Statement: Given a static array and queries of the form $[l,r]$, count the number of different values that occur in $[l,r]$.
↵
The naive transition is intuitive: we maintain a table $v$ where $v[x]$ is the number of times $x$ occurred in the interval that we are maintaining. When adding an element $x$ such that $v[x]=0$, increment the answer; when deleting an element $x$ such that $v[x]=1$, decrement the answer.
↵
**However**, this needs many random accesses to this table, which is very cache-unfriendly, leading to a massive constant. Can we do better?↵
↵
Consider arrays $pre[i]$ and $nxt[i]$: $pre[i]$ is the previous position where the value $a[i]$ occurs, and $nxt[i]$ is the next such position (set to $-\infty$ and $+\infty$ respectively if no such position exists). We can use these arrays to quickly check whether or not some element occurs in a range.
↵
- Case 1: $[l,r]\rightarrow[l,r+1]$. We add 1 if and only if $a[r+1]$ has not occurred in $[l,r]$, or equivalently, $pre[r+1] < l$.
- Case 2: $[l,r]\rightarrow[l,r-1]$. We subtract 1 iff $a[r]$ has not occurred in $[l,r-1]$, or equivalently, $pre[r] < l$.
- Case 3: $[l,r]\rightarrow[l-1,r]$. We add 1 iff $a[l-1]$ has not occurred in $[l,r]$, or equivalently, $r < nxt[l-1]$.
- Case 4: $[l,r]\rightarrow[l+1,r]$. We subtract 1 iff $a[l]$ has not occurred in $[l+1,r]$, or equivalently, $r < nxt[l]$.
↵
This leads to a much quicker solution, which runs reasonably fast even for $n=10^6$ with a direct implementation of Mo's.
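
For concreteness, here is a self-contained sketch of the whole solution with these transitions (1-indexed; the I/O format, block-size choice and variable names are my own assumptions, not taken from the judge):

```cpp
#include <bits/stdc++.h>
using namespace std;

int B; // block size

struct Query {
    int l, r, id;
    bool operator<(const Query& o) const {
        if (l / B != o.l / B) return l < o.l;
        return ((l / B) % 2 == 0) ? r < o.r : r > o.r; // zig-zag order
    }
};

int main() {
    int n;
    scanf("%d", &n);
    vector<int> a(n + 1), pre(n + 1), nxt(n + 2);
    map<int, int> last; // last position of each value seen so far
    for (int i = 1; i <= n; i++) {
        scanf("%d", &a[i]);
        pre[i] = last.count(a[i]) ? last[a[i]] : 0; // 0 acts as "minus infinity"
        last[a[i]] = i;
    }
    map<int, int> next_pos; // next occurrence, filled scanning backwards
    for (int i = n; i >= 1; i--) {
        nxt[i] = next_pos.count(a[i]) ? next_pos[a[i]] : n + 1; // n+1 acts as "infinity"
        next_pos[a[i]] = i;
    }

    int q;
    scanf("%d", &q);
    vector<Query> qs(q);
    for (int i = 0; i < q; i++) { scanf("%d%d", &qs[i].l, &qs[i].r); qs[i].id = i; }
    B = max(1, (int)(n / sqrt((double)max(q, 1))));
    sort(qs.begin(), qs.end());

    vector<int> ans(q);
    int l = 1, r = 0, cur = 0; // current interval [l, r] and its distinct count
    for (const Query& Q : qs) {
        while (r < Q.r) { r++; if (pre[r] < l) cur++; } // case 1
        while (l > Q.l) { l--; if (nxt[l] > r) cur++; } // case 3
        while (r > Q.r) { if (pre[r] < l) cur--; r--; } // case 2
        while (l < Q.l) { if (nxt[l] > r) cur--; l++; } // case 4
        ans[Q.id] = cur;
    }
    for (int i = 0; i < q; i++) printf("%d\n", ans[i]);
    return 0;
}
```

The four `while` loops are exactly Cases 1–4 above, and the sort already uses the zig-zag comparator.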
↵
### Range inversion query (naive version)↵
↵
Statement: Given a static array and queries of the form $[l,r]$, count the number of pairs $(i,j)$ with $l\le i<j\le r$ such that $a[i]>a[j]$. Target complexity: $O(n\log n\sqrt n)$.
↵
Every time we add or erase an element, we consider how much this element contributes to the answer. If we maintain a Fenwick tree over the values currently in the interval, we can quickly find out how many inversions involve a certain value. Each transition is $O(\log n)$; hence the overall complexity is $O(n\log n\sqrt n)$.
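
A minimal sketch of just the transitions, using a Fenwick tree over compressed values (the helper names and the tiny demo in `main` are my own; the Mo driver loop would be the same as in the previous example):

```cpp
#include <bits/stdc++.h>
using namespace std;

int n;
vector<int> a;           // values compressed to 1..m
vector<int> bit;         // Fenwick tree over compressed values
long long cur = 0;       // inversion count of the current interval
int cnt_in_interval = 0; // number of elements currently inserted

void bit_add(int v, int d) { for (; v < (int)bit.size(); v += v & -v) bit[v] += d; }
int  bit_sum(int v)        { int s = 0; for (; v > 0; v -= v & -v) s += bit[v]; return s; }

// how many inserted values are strictly greater / strictly less than a[i]
int greater_than(int i) { return cnt_in_interval - bit_sum(a[i]); }
int less_than(int i)    { return bit_sum(a[i] - 1); }

void add_right(int i)    { cur += greater_than(i); bit_add(a[i], +1); cnt_in_interval++; }
void add_left(int i)     { cur += less_than(i);    bit_add(a[i], +1); cnt_in_interval++; }
void remove_right(int i) { bit_add(a[i], -1); cnt_in_interval--; cur -= greater_than(i); }
void remove_left(int i)  { bit_add(a[i], -1); cnt_in_interval--; cur -= less_than(i); }

int main() {
    // toy demo: compress a sample array and insert it from the right
    vector<int> raw = {3, 1, 4, 1, 5};
    n = (int)raw.size();
    vector<int> vals = raw;
    sort(vals.begin(), vals.end());
    vals.erase(unique(vals.begin(), vals.end()), vals.end());
    a.assign(n, 0);
    for (int i = 0; i < n; i++)
        a[i] = (int)(lower_bound(vals.begin(), vals.end(), raw[i]) - vals.begin()) + 1;
    bit.assign(vals.size() + 1, 0);

    for (int i = 0; i < n; i++) add_right(i);
    printf("%lld\n", cur); // 3 inversions in {3, 1, 4, 1, 5}
    return 0;
}
```

Since the comparisons are strict, querying the tree just before or just after removing the element gives the same count.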
↵
### **Advanced**: Sweepline Mo / Offline Again↵
↵
This is [problem:617E], except there are multiple favorite numbers.↵
↵
Statement: Given a static array, a static set $S$ of size $C$ and queries of the form $[l,r]$, count the number of pairs $(i,j)$ such that $a[i]\text{ xor }a[j]\in S$. $O(nC+n\sqrt n)$ time; $O(n)$ memory. Assume $q=O(n)$.↵
↵
↵
Notice that $a[i]\text{ xor }a[j]\in S$ is equivalent to $\exists v\in S:a[i]\text{ xor }v=a[j]$. We could consider each $v$ separately and run $C$ instances of Mo's, but that would be $O(nC\sqrt n)$. Consider the transitions: we add an element $j$ and want to know the number of indices $i$ such that $a[i]\text{ xor }a[j]\in S$ and $i\in[l,r]$.
↵
This count can be written as a difference of prefix counts: the answer for $i\in[l,r]$ equals the answer for $i\in[1,r]$ minus the answer for $i\in[1,l-1]$. There are going to be $O(n\sqrt n)$ transitions. We "bucket" each transition at its corresponding $r$ and $l-1$. Then we run a sweep line: move from index $1$ to index $n$, add the value at the current index into the structure, and process the transitions bucketed at this index offline, accumulating their contributions to the corresponding queries. We get an $O(nC+n\sqrt n)$ method, but it uses $O(n\sqrt n)$ memory and generally has an atrocious constant.
↵
The key observation for reducing the memory usage is that the transitions are always contiguous: when we move from one interval to another, the endpoints $r$ and $l-1$ are not arbitrary. One of them (either $r$ or $l-1$) stays fixed and contributes to the query over a contiguous range of positions $j$, while the other is adjacent to the position it contributes to, so it can be handled with a special check. This brings the memory down to $O(n)$.
↵
### Range inversion query, again↵
↵
Target complexity: $O(n\sqrt n)$.
↵
Consider sweepline Mo. In general, sweepline Mo is a method to reduce the number of *modifications* of the Mo data structure to $O(n)$, while the number of *queries* to it stays $O(n\sqrt n)$. This means we can afford modifications costing up to $O(\sqrt n)$ and queries costing $O(1)$ while staying within $O(n\sqrt n)$ overall.
↵
For this problem we need to support point increment/decrement in $O(\sqrt n)$ and prefix sum query in $O(1)$. Flip it around to get an equivalent problem: suffix increment/decrement in $O(\sqrt n)$ and point query in $O(1)$.↵
↵
Separate this auxiliary array into blocks of size $O(\sqrt n)$. For each block, maintain a value $lazy$. When we increment/decrement a suffix, for the block that intersects the suffix but is not completely covered by it, we change the elements by brute force; for every fully covered block, we modify its $lazy$ tag instead. To query a position, we add the stored per-position value to its block's $lazy$ tag.
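
A minimal sketch of this auxiliary structure (0-indexed; the struct name and the demo are mine):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Suffix add in O(sqrt(n)), point query in O(1).
struct SuffixAddPointQuery {
    int n, B;
    vector<long long> val;  // per-position values, updated by brute force
    vector<long long> lazy; // per-block pending addition

    SuffixAddPointQuery(int n_)
        : n(n_), B(max(1, (int)sqrt((double)n_))), val(n_, 0), lazy(n_ / B + 1, 0) {}

    // add d to every position in [p, n-1]
    void suffix_add(int p, long long d) {
        int b = p / B, end = min(n, (b + 1) * B);
        for (int i = p; i < end; i++) val[i] += d;                       // partial block
        for (int blk = b + 1; blk <= (n - 1) / B; blk++) lazy[blk] += d; // full blocks
    }

    long long point_query(int p) const { return val[p] + lazy[p / B]; }
};

int main() {
    SuffixAddPointQuery s(10);
    s.suffix_add(3, 2);  // +2 on positions 3..9
    s.suffix_add(7, -1); // -1 on positions 7..9
    printf("%lld %lld %lld\n", s.point_query(2), s.point_query(5), s.point_query(8)); // 0 2 1
    return 0;
}
```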
↵
---↵
↵
For people who think there is not much of a difference between $O(n\sqrt n)$ and $O(n\log n\sqrt n)$: When $n=100000$, the former runs in $300ms$ while the latter runs in $3000ms$. Normally time limits are $1000ms$.↵
↵
Answering queries on a static array, online↵
---↵
↵
When the data structure for Mo's is small enough, you can first do block decomposition with block size $B$ and then build the data structure for every interval of blocks, creating $O((N/B)^2)$ structures.
↵
Then, assuming the transition time for this data structure is still $O(T)$, querying an arbitrary interval can be done in $O(TB)$.↵
↵
Normally, initialization of each structure is $O(B)$ and each transition is $O(1)$, so it is best to choose a block size of $O(\sqrt n)$, giving $O(n\sqrt n)$ precalculation and $O(\sqrt n)$ query time. Here we look at some examples:
↵
### [problem:522D]↵
↵
Statement: Given a static array and queries of the form $[l, r]$, for each query calculate $\min(|i-j| : i\ne j\wedge a[i]=a[j]\wedge i,j\in[l,r])$. $n\le 500000$.
↵
Doing Mo's for this problem looks a bit intractable because although insertion is easy using $pre$ and $nxt$ arrays, it seems impossible to delete values and "roll back" the $\min$ value. Hence we consider block decomposition, which would only require insertion.↵
↵
Decompose this array into contiguous blocks of size $B$; for each contiguous interval of blocks, compute the answer, for a total precomputation complexity of $O((N/B)N)$.
↵
Note that if we want the minimum distance, then when we fix an $i$, the only $j$ values that could possibly contribute to the answer are $pre[i]$ and $nxt[i]$. As such, when extending a precomputed block interval, we check whether $pre[i]$ and $nxt[i]$ lie in the interval we need to answer, and if so we try to update the minimum, for a query complexity of $O(B)$.
↵
The total time complexity is $O(n^2/B+qB)$; selecting $B=O(\sqrt n)$ is best, giving a total of $O((n+q)\sqrt n)$ with a tiny constant. Note that input and output for this problem take up a massive portion of the total running time.
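
Here is a self-contained sketch of this approach (0-indexed internally; the I/O handling and variable names are my own, and I print $-1$ when no equal pair exists):

```cpp
#include <bits/stdc++.h>
using namespace std;
const int INF = INT_MAX;

int main() {
    int n, q;
    scanf("%d%d", &n, &q);
    vector<int> a(n), pre(n), nxt(n);
    map<int, int> seen;
    for (int i = 0; i < n; i++) {
        scanf("%d", &a[i]);
        pre[i] = seen.count(a[i]) ? seen[a[i]] : -1;
        seen[a[i]] = i;
    }
    seen.clear();
    for (int i = n - 1; i >= 0; i--) {
        nxt[i] = seen.count(a[i]) ? seen[a[i]] : n;
        seen[a[i]] = i;
    }

    int B = max(1, (int)sqrt((double)n));
    int nb = (n + B - 1) / B;
    // ans[i][j] = best answer using only pairs fully inside blocks i..j
    vector<vector<int>> ans(nb, vector<int>(nb, INF));
    for (int i = 0; i < nb; i++) {
        int best = INF;
        for (int p = i * B; p < n; p++) {
            if (pre[p] >= i * B) best = min(best, p - pre[p]);
            if ((p + 1) % B == 0 || p == n - 1) ans[i][p / B] = best;
        }
    }

    while (q--) {
        int l, r;
        scanf("%d%d", &l, &r);
        l--; r--;                       // to 0-indexed
        int cl = l / B + (l % B != 0);  // first block fully inside [l, r]
        int cr = (r + 1) / B - 1;       // last block fully inside [l, r]
        int best = INF;
        if (cl > cr) {
            for (int p = l; p <= r; p++)               // no full block: brute force
                if (nxt[p] <= r) best = min(best, nxt[p] - p);
        } else {
            best = ans[cl][cr];
            for (int p = l; p < cl * B; p++)           // left partial block
                if (nxt[p] <= r) best = min(best, nxt[p] - p);
            for (int p = (cr + 1) * B; p <= r; p++)    // right partial block
                if (pre[p] >= l) best = min(best, p - pre[p]);
        }
        printf("%d\n", best == INF ? -1 : best);
    }
    return 0;
}
```

The boundary loops rely on the observation above: any optimal pair is of the form $(i, nxt[i])$, so it suffices to check $nxt$ for positions in the left partial block and $pre$ for positions in the right partial block.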
↵
### [problem:617E] (online version)
↵
Statement: Given a static array, a number $k$ and queries of the form $[l,r]$, for each query calculate the number of $(i,j)$ such that $a[i]\text{ xor }a[j]=k$.↵
↵
$a[i]\text{ xor }a[j]=k \iff a[i]\text{ xor }k=a[j]$; we precalculate the answer for every interval of blocks and then count the contribution of the new boundary elements to that block interval. When adding an element $a[i]$, we want to know how many elements with value $a[i]\text{ xor }k$ there are in the current interval.
↵
First of all, the current interval is a chunk of whole blocks plus at most $O(B)$ extra elements, so we can directly maintain an occurrence-count array over those $O(B)$ elements. For the block chunk, notice that the count "# of $x$ in blocks $l$ through $r$" can be written as a difference of prefix counts: precalculate "# of $x$ in blocks $1$ through $r$", and we're done.
↵
But not quite; this method takes $O(B\max A)$ memory for the chunk count array. Notice that there are at most $O(n)$ relevant values for $a[i]$ and $a[i]\text{ xor }k$; hash these values and we get $O(nB+(n/B)^2)$ memory and $O((n+q)B+(n/B)^2)$ time. Choosing $B=O(\sqrt n)$ gives $O((n+q)\sqrt n)$.
↵
### Range distinct query revisited↵
↵
Statement: Range distinct query but online↵
↵
Obviously, direct precomputation together with the $pre[i]$ and $nxt[i]$ arrays leads to an $O((n+q)\sqrt n)$ running time. But can we do better?
↵
Notice that the bottleneck is the $O((n/B)n)$ precomputation, because for each block we need to compute the answers for all intervals that have this block as their first block. Yet we only store $O((n/B)(n/B))$ values, so any decomposition method of this shape is lower-bounded by that. Here we will reach this lower bound.
↵
Consider the definition of the answer for an arbitrary interval $[l,r]$:↵
↵
$$A_{[l,r]}=(\sum_i[pre[i]+1\le l\wedge i\le r])-(\sum_i[pre[i]+1\le l\wedge i\le l-1])$$↵
↵
Let's put the points $(pre[i]+1, i)$ onto the 2D plane. Define a function $p(x,y)$ as the number of points whose x-coordinate is at most $x$ and whose y-coordinate is at most $y$, i.e., a 2D prefix sum. Our answer is now
↵
$$A_{[l,r]}=p(l,r)-p(l,l-1)$$↵
↵
but $p(x,y)$ takes $O(n^2)$ time to calculate. Consider transforming the $A$ expression to incorporate decomposition into blocks of size $B$:↵
↵
$$A_{[l,r]}=(\sum_i[pre[i]/B+1\le l\wedge i/B\le r])-(\sum_i[pre[i]/B+1\le l\wedge i/B\le l-1])$$↵
↵
Now this is a prefix sum on the points $(pre[i]/B+1,i/B)$, where division is rounded down. This can obviously be computed in $O((n/B)^2)$.
↵
Consider the optimal choice of $B$. The time complexity is↵
↵
$$O(\frac{n^2}{B^2}+qB)$$↵
↵
Selecting $B=O(\sqrt[3]{n})$ gives us a total time complexity of↵
↵
$$O((n+q)\sqrt[3]{n})$$↵
↵
Because of its tiny constant, for $n=10^6$ this runs faster than the persistent segment tree approach, which is the best known alternative. Its memory usage is also much, much smaller.
↵
More complicated examples
---↵
↵
### Optimizing brute force: [problem:911G]↵
↵
Statement: Given an array and operations of the form $l,r,x,y$, set $a[i]=y$ for all $i$ such that $i\in[l,r]$ and $a[i]=x$. Here $1\le a[i],x,y\le10^5$.↵
↵
We observe that we can apply such an operation to the entire array pretty quickly by storing the locations where each value occurs in something similar to a vector and doing small-to-large merging.
↵
Do block decomposition on this array into blocks of size $B$. For blocks that are completely contained in an operation, do small-to-large merging; otherwise, rebuild the block and apply the modification by brute force.
↵
I don't know how to prove the true time complexity rigorously, but we can upper bound it. If there are no brute force changes, each block takes a total of $O(B\log B)$ time, because whenever an element is moved, the size of the vector it is in at least doubles. If brute force rebuilds on a block completely break this amortization, the moving can take $O((N+Q)B\log B)$ time, for a total of $O((N+Q)B\log B+QN/B)$. Assume $Q=O(N)$ to simplify: $O(NB\log B+N^2/B)$. Selecting $B=\sqrt{\frac{N}{\log N}}$ gives
↵
$$O(N\sqrt{\frac{N}{\log N}}\log \sqrt{\frac{N}{\log N}}+\frac{N^2}{\frac{\sqrt{N}}{\sqrt{\log N}}}) \le O(N\sqrt{N\log N})$$↵
↵
Note here that the $\log$ is inside the square root sign.↵
↵
### Value distance query↵
↵
Statement: Given a static array and queries of the form $x, y$, find $\min(|i-j|:a[i]=x,a[j]=y)$.↵
↵
There are at most $O(\sqrt n)$ values that occur more than $\sqrt n$ times. Consider a pair $x,y$ such that at least one of $x,y$ occurs more than $\sqrt n$ times; we try to precompute the answers for all such pairs, as there are at most $O(n\sqrt n)$ of them.
↵
For every value $x$ that occurs more than $\sqrt n$ times, we sweep the array twice, once forwards and once backwards. We keep track of the last location where $x$ occurred in the direction we are sweeping, and use it to try to update the answer for the pair $(x, a[i])$, where $i$ is the current position of the sweep pointer.
↵
As well as that, for each value keep a sorted vector of the indices where said value occurs.↵
↵
When answering a query, if we have already precomputed the answer, we use it directly; otherwise both values occur at most $\sqrt n$ times, and we run two pointers on the corresponding position vectors. The total time complexity is $O((n+q)\sqrt n)$.
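
A self-contained sketch of this threshold technique (the value compression, the heavy threshold $B$, and the I/O format are my own choices; queries are given as raw values $x,y$ and I print $-1$ if one of them never occurs):

```cpp
#include <bits/stdc++.h>
using namespace std;
const int INF = INT_MAX;

int main() {
    int n, q;
    scanf("%d%d", &n, &q);
    vector<int> raw(n);
    for (int& x : raw) scanf("%d", &x);

    // compress values to 0..m-1 and collect sorted occurrence lists
    vector<int> vals = raw;
    sort(vals.begin(), vals.end());
    vals.erase(unique(vals.begin(), vals.end()), vals.end());
    int m = (int)vals.size();
    vector<int> a(n);
    vector<vector<int>> occ(m);
    for (int i = 0; i < n; i++) {
        a[i] = (int)(lower_bound(vals.begin(), vals.end(), raw[i]) - vals.begin());
        occ[a[i]].push_back(i);
    }

    int B = max(1, (int)sqrt((double)n));
    vector<int> heavyId(m, -1);
    vector<vector<int>> best; // best[h][v] = min distance between heavy value h and value v
    for (int v = 0; v < m; v++) {
        if ((int)occ[v].size() <= B) continue;
        heavyId[v] = (int)best.size();
        best.emplace_back(m, INF);
        for (int dir = 0; dir < 2; dir++) {           // forward and backward sweeps
            int last = -1;
            for (int step = 0; step < n; step++) {
                int i = (dir == 0) ? step : n - 1 - step;
                if (a[i] == v) last = i;
                if (last != -1)
                    best[heavyId[v]][a[i]] = min(best[heavyId[v]][a[i]], abs(i - last));
            }
        }
    }

    while (q--) {
        int x, y;
        scanf("%d%d", &x, &y);
        int cx = (int)(lower_bound(vals.begin(), vals.end(), x) - vals.begin());
        int cy = (int)(lower_bound(vals.begin(), vals.end(), y) - vals.begin());
        int res = INF;
        if (cx < m && vals[cx] == x && cy < m && vals[cy] == y) {
            if (heavyId[cx] != -1)      res = best[heavyId[cx]][cy];
            else if (heavyId[cy] != -1) res = best[heavyId[cy]][cx];
            else {
                // both values are light: two pointers over the sorted occurrence lists
                const vector<int>& A = occ[cx];
                const vector<int>& C = occ[cy];
                for (size_t i = 0, j = 0; i < A.size() && j < C.size(); ) {
                    res = min(res, abs(A[i] - C[j]));
                    if (A[i] < C[j]) i++; else j++;
                }
            }
        }
        printf("%d\n", res == INF ? -1 : res);
    }
    return 0;
}
```

The `best` table has one row per heavy value, i.e. at most $n/B$ rows of $m$ entries, matching the $O(n\sqrt n)$ pair bound above.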
↵
### [Ynoi2008] rplexq↵
↵
Statement: Given a rooted tree and queries of the form $(l,r,x)$, for each query find the number of $i,j$ such that $l\le i<j\le r$ and the LCA of $i$ and $j$ is $x$.↵
↵
First of all, we answer these queries offline and attach the $[l,r]$ values to the corresponding node $x$. We do dsu on tree, maintaining, via small-to-large merging, an implicit segment tree over the indices of the nodes in the current subtree.
↵
If we don't even care about the intervals, the amount of pairs that have LCA at node $x$ is equal to↵
↵
$$\binom{sz_x}2-\sum\binom{sz_j}2$$↵
↵
where $j$ ranges over the children of $x$.
↵
When a node has a fairly small number of children, we can answer its queries by brute force, as prefix sum queries on an implicit segtree have a small enough constant. But when a node has a lot of children AND a lot of queries on it, this is much too slow.
↵
First off, we do a special check for the heavy child, so as to guarantee that the number of nodes we process is on the order of $O(n\log n)$. We then take the light children and condense the indices. Now we can run Mo's on these indices, where the quantity we maintain is $\sum\binom{colors}2$. Due to the ridiculously high initialization constant of Mo's, it might be better to fall back to brute force when the number of queries or children is small enough.
↵
Note that the complexity of Mo's on a sequence of length $n$ with $m$ queries is $O(n\sqrt m)$. Because of
↵
$$\sum n_i\sqrt m_i\le(\sum n_i)\sqrt{\max m}$$↵
↵
the total time complexity is upper-bounded by $O(n\log n\sqrt m)$. [user:ODT,2020-10-01] claims that the time complexity can be improved to something between $O(n\sqrt{m\log n})$ and $O(n\sqrt m)$, but I do not know how to prove this.
↵
---↵
↵
If you have other interesting square root decomposition problems, please share them in the comments below and I'll include them in the next edit!