A brief introduction into the applications of square root decomposition
Square root decomposition is the process of separating a structure of size $$$O(N)$$$ into $$$O(\sqrt{N})$$$ "blocks" of size $$$O(\sqrt{N})$$$ each, in a way that aids manipulation of the entire structure. Square root decomposition is extremely versatile. Some of its most well-known use cases are:
- answering queries on a static array, with methods such as Mo's algorithm or block precomputation
- "lazy" brute force modifications, where it is easy to query an entire block, but tag pushing isn't obvious
- separating objects based on a threshold, when there exists both an $$$O(x)$$$ algorithm and an $$$O(n/x)$$$ algorithm
In this tutorial we will walk through multiple types of square root decomposition.
Answering queries on a static array, offline (Mo's algorithm)
Consider a problem where we are asked to find the answer for certain intervals $$$[l,r]$$$. We can't quickly compute the answer for an arbitrary interval, but we know how to transition to $$$[l,r\pm1]$$$ and $$$[l\pm1,r]$$$ fast given some information remaining from $$$[l,r]$$$ answer calculation. The number of transitions we need to do to get from $$$[l_1,r_1]$$$ to $$$[l_2,r_2]$$$ is $$$|l_1-l_2|+|r_1-r_2|$$$.
If there are only two intervals we need to answer such transitioning wouldn't help us. However, if there are many intervals, choosing a good transitioning route will drastically reduce the total time needed. Finding the best transition route quick is allegedly NP-hard, so we will focus on estimating a "good enough" route.
Consider the following scheme for 2D Manhattan TSP: We do square root decomposition on the $$$x$$$ coordinate, separating the grid into vertical blocks, $$$n$$$ cells tall and $$$B$$$ cells wide. There are $$$n/B$$$ such blocks.
For an arbitrary block, call the number of points we need to visit in it $$$C$$$. To get to all of them, we can visit them in order from top to bottom: we spend at most $$$O(BC)$$$ transitions moving left and right, and $$$O(n)$$$ transitions going down.
If for each block we need $$$O(BC+n)$$$ transitions, in total we will need $$$O(\sum(BC+n))$$$ = $$$O(Bq+n^2/B)$$$ transitions. Selecting a block size of $$$B=O(n/\sqrt{q})$$$ gives us a total transition count of $$$O(qn/\sqrt{q}+n\sqrt{q})$$$ = $$$O(n\sqrt{q})$$$.
Implementation of sorting these transitions can be done as follows:
int B = ???;
struct query {
int l, r, id;
const bool operator<(const query& o) const {
if(l/B == o.l/B) {
// they are in the same block, sort going down
return r < o.r;
} else {
// they are not in the same block, sort by block number
return l < o.l;
}
}
}
Observe here that between each block, we need to run from the bottom of the "grid" all the way up. We can prevent this by using a zig-zag pattern, getting a bit of a constant optimization:
int B = ???;
struct query {
int l, r, id;
const bool operator<(const query& o) const {
if(l/B == o.l/B) {
if((l/B) % 2 == 0) return r < o.r;
else return r > o.r;
} else {
return l < o.l;
}
}
}
In the following we will look at some examples.
Range distinct query (SPOJ DQUERY)
Statement: Given a static array and queries of the form $$$[l,r]$$$, count the number of different values that occur between $$$[l,r]$$$.
The naive transition is intuitive: we maintain a table $$$v$$$ where $$$v[x]$$$ is the number of times $$$x$$$ occured in the interval that we are maintaining. When adding an element $$$x$$$ such that $$$v[x]=0$$$, increment the answer; when deleting an element $$$x$$$ such that $$$v[x]=1$$$, decrement the answer.
However, this needs many random accesses to this table, which is very cache-unfriendly, leading to a massive constant. Can we do better?
Consider arrays $$$pre[i]$$$ and $$$nxt[i]$$$, which equal the last location of the element $$$i$$$ and the next location of the element $$$i$$$, respectively (equal to some infinity if this element doesn't exist.) We can use these arrays to quickly check whether or not some element occurs in an range.
- Case 1: $$$[l,r]\rightarrow[l,r+1]$$$. We add 1 if and only if $$$r+1$$$ has not occured in $$$[l,r]$$$, or equivalently, $$$pre[r+1] < l$$$.
- Case 2: $$$[l,r]\rightarrow[l,r-1]$$$. We subtract 1 iff $$$r$$$ has not occured in $$$[l,r-1]$$$.
- Case 3: $$$[l,r]\rightarrow[l-1,r]$$$. We add 1 iff $$$l-1$$$ has not occurned in $$$[l,r]$$$, or equivalently, $$$r < nxt[l-1]$$$.
- Case 4: $$$[l,r]\rightarrow[l+1,r]$$$. We subtract 1 iff $$$l$$$ has not occured in $$$[l+1,r]$$$.
This leads to a much quicker solution, which runs plausibly fast even for $$$n=10^6$$$ even if we directly implement Mo's.
Range inversion query (naive version)
Statement: Given a static array and queries of the form $$$[l,r]$$$, count the number of pairs $$$(i,j)$$$ such that $$$a[i]>a[j]$$$. $$$O(n\log n\sqrt n)$$$.
Every time we add or erase an element, we consider how much this element contributes to the answer. If we maintain a Fenwick tree corresponding to the current interval, we can quickly find out how many inversions involve a certain value. Transition is $$$O(\log n)$$$; hence overall it is $$$O(n\log n\sqrt n)$$$.
Advanced: Sweepline Mo / Offline Again
This is 617E - XOR and Favorite Number, except there are multiple favorite numbers.
Statement: Given a static array, a static set $$$S$$$ of size $$$C$$$ and queries of the form $$$[l,r]$$$, count the number of pairs $$$(i,j)$$$ such that $$$a[i]\text{ xor }a[j]\in S$$$. $$$O(nC+n\sqrt n)$$$ time; $$$O(n)$$$ memory. Assume $$$q=O(n)$$$.
Notice that $$$a[i]\text{ xor }a[j]\in S$$$ is equivalent to $$$\exists v\in S:a[i]\text{ xor }v=a[j]$$$. We can consider each $$$v$$$ separately and run $$$C$$$ instances of Mo's, but that would be $$$O(nC\sqrt n)$$$. Consider the transitions: we add an element $$$j$$$ and want to know the amount of $$$i$$$ such that $$$a[i]\text{ xor }a[j]\in S$$$ and $$$i\in[l,r]$$$.
Call the transition delta $$$f(l,r,p)$$$ where $$$p$$$ is the element we are adding. Removing can be done by subtracting $$$f(l,r-1,p)$$$ or $$$f(l+1,r,p)$$$ depending on the direction. Sweepline Mo can be used when this transition function is differentiable: i.e. it satisfies $$$f(l,r,p)=f(1,r,p)-f(1,l-1,p)$$$.
The main idea is to first run a "dry" Mo and find all $$$f(1,x,p)$$$ that we need to calculate. Then, we do a sweep-line on $$$x$$$, updating the data structure accordindly, and calculating the $$$f(1,x,p)$$$ that involve this $$$x$$$. We generally needs the "calculation" part to be $$$O(1)$$$, like normal Mo, but we make more wiggle room for the update part.
However, if we store all $$$(x,p)$$$, we get $$$O(n\sqrt n)$$$ memory and a ridiculous constant factor. Notice that because of the Mo transition order, we can separate all $$$p$$$ into several types, given a fixed $$$x$$$:
- $$$p=x$$$
- $$$p=x+1$$$
- $$$p\in[A,B]$$$
For the first and second type, we calculate $$$ansPrev[i]$$$ and $$$ansNext[i]$$$ and don't update anything, leaving these arrays for the second "wet" Mo to use.
For the third type, we find all $$$(p,A,B)$$$ that result from the dry Mo. There are at most $$$4(M-1)$$$ such intervals because the movement of any end of the Mo interval creates exactly one interval. Then, on the corresponding $$$x$$$, we iterate through the interval and add the transition value we found to the respective query.
Example implementation:
sort(qs, qs + m);
cl = 1, cr = 0;
for(int i = 0; i < m; i++) {
if (cr < qs[i].r) {
iv[cl - 1].push_back({cr + 1, qs[i].r, i, -1});
cr = qs[i].r;
}
if (qs[i].l < cl) {
iv[cr].push_back({qs[i].l, cl - 1, i, 1});
cl = qs[i].l;
}
if (qs[i].r < cr) {
iv[cl - 1].push_back({qs[i].r + 1, cr, i, 1});
cr = qs[i].r;
}
if (cl<qs[i].l) {
iv[cr].push_back({cl, qs[i].l - 1, i, -1});
cl = qs[i].l;
}
}
for(int i = 1; i <= n; i++) {
ansPrev[i] = ansPrev[i-1] + cgc[ar[i]];
add(ar[i]);
for (auto [l, r, i, c] : iv[i]) {
ll g = 0;
for(int p = l; p <= r; p++)
g += cgc[ar[p]];
ans[i] += c * g;
}
ansThere[i] = ansThere[i - 1] + cgc[ar[i]];
}
for(int i = 0; i < m; i++) {
curans += ans[i];
fans[qs[i].i] = curans + ansPrev[qs[i].r] + ansThere[qs[i].l-1];
}
for(int i = 0; i < m; i++) {
print(fans[i]);
pc('\n');
}
Special implementation details:
- Here
add
just updates thecgc
array, which is a population count of the things in the current prefix xor the things in the set. ivs
is the intervals on eachx
, along with the query and contribution coefficient.- If we do a prefix sum on
ansPrev
andansThere
, we don't even need to use a while loop in the second Mo.
Range inversion query, again
$$$O(n\sqrt n)$$$
Consider sweepline mo. Sweepline mo in general is a method used to reduce the number of modifications of the Mo data structure to $$$O(n)$$$, not affecting the query count of $$$O(n\sqrt n)$$$. This means modifications have to be at most $$$O(\sqrt n)$$$ and queries have to be at most $$$O(1)$$$.
For this problem we need to support point increment/decrement in $$$O(\sqrt n)$$$ and prefix sum query in $$$O(1)$$$. Flip it around to get an equivalent problem: suffix increment/decrement in $$$O(\sqrt n)$$$ and point query in $$$O(1)$$$.
Separate this auxillary array into blocks of size $$$O(\sqrt n)$$$. For each block, maintain a value $$$lazy$$$. When we do increment/decrement a suffix, for blocks that intersect with this suffix yet are not completely covered, we change the elements with brute force. Otherwise, we modify the $$$lazy$$$ tag. To query a position, we add the (brute force) value with its corresponding $$$lazy$$$ tag.
For people who think there is not much of a difference between $$$O(n\sqrt n)$$$ and $$$O(n\log n\sqrt n)$$$: When $$$n=100000$$$, the former runs in $$$300ms$$$ while the latter runs in $$$3000ms$$$. Normally time limits are $$$1000ms$$$.
Answering queries on a static array, online
When the data structure for Mo's is small enough, you can use first do block decomposition into size $$$B$$$ and then calculate the data structure for all intervals of blocks, creating $$$O((N/B)^2)$$$ structures.
Then, assuming the transition time for this data structure is still $$$O(T)$$$, querying an arbitrary interval can be done in $$$O(TB)$$$.
Normally intialization of each structure is $$$O(B)$$$ and transition is $$$O(1)$$$, so it would be best to choose a block size of $$$O(\sqrt n)$$$ to get $$$O(n\sqrt n)$$$ precalculation and $$$O(\sqrt n)$$$ query time. Here we look at some examples:
522D - Closest Equals
Statement: Given a static array and queries of the form $$$[l, r]$$$, for each query calculate $$$\min(|i-j|:a[i]=a[j]\wedge i,j\in[l,r])$$$. $$$n\le 500000$$$
Doing Mo's for this problem looks a bit intractable because although insertion is easy using $$$pre$$$ and $$$nxt$$$ arrays, it seems impossible to delete values and "roll back" the $$$\min$$$ value. Hence we consider block decomposition, which would only require insertion.
Decompose this array into contiguous blocks of size $$$O(B)$$$; for each contiguous interval of blocks compute the answer, for a total precomputation complexity of $$$O((N/B)N)$$$.
Note that if we want the minimum distance the only possible $$$j$$$ values when we fix an $$$i$$$ that couuld possibly contribute to the answer are $$$pre[i]$$$ and $$$nxt[i]$$$. As such, when extending a precomputed block, we calculate if $$$pre[i]$$$ and $$$nxt[i]$$$ are in the interval we need to answer, and if so we try to update the minimum, for a query complexity of $$$O(B)$$$.
Total time complexity is $$$O(n^2/B+qB)$$$, selecting $$$B=O(\sqrt n)$$$ is best, getting a total time complexity of $$$O((n+q)\sqrt n)$$$ with tiny constant. Note that input and outout for this problem is a massive portion of the total running time.
617E - XOR and Favorite Number online version
Statement: Given a static array, a number $$$k$$$ and queries of the form $$$[l,r]$$$, for each query calculate the number of $$$(i,j)$$$ such that $$$a[i]\text{ xor }a[j]=k$$$.
$$$a[i]\text{ xor }a[j]=k \iff a[i]\text{ xor }k=a[j]$$$; we precalculate answers for each block and now count the contribution of new elements to the block interval. We want to know how many elements with value $$$a[i]\text{ xor }k$$$ there are in the current interval.
First of all the current interval is block chunk + at most $$$O(B)$$$ elements, so we can directly maintain a population array on these $$$O(B)$$$ elements. For the block chunk, notice that the count "# of $$$x$$$ in blocks from $$$l$$$ to $$$r$$$" is differentiable. Precalculate "# of $$$x$$$ in blocks from $$$1$$$ to $$$r$$$", and then we're done.
But not quite; this method takes $$$O(B\max A)$$$ memory for the chunk count array. Notice that there are at most $$$O(n)$$$ possible values for $$$a[i]$$$ and $$$a[i]\text{ xor }k$$$; hash these values and we get $$$O(nB+(n/B)^2)$$$ memory, $$$O((n+q)B+(n/B)^2)$$$. Square root decomposition gives $$$O((n+q)\sqrt n)$$$.
Range distinct query revisited
Statement: Range distinct query but online
Obviously direct precomputation and utilizing $$$pre[i]$$$ and $$$nxt[i]$$$ arrays lead to an $$$O((n+q)\sqrt n)$$$ running time. But can we do better?
Notice that the bottleneck is in the precomputation $$$O((n/B)n)$$$, because for each block we need to compute the answers for all intervals that have this block as its first block. Yet we only store $$$O((n/B)(n/B))$$$ values. Obviously time complexity will be lower bounded by this if we use decomposition methods. Here we will reach this lower bound.
Consider the definition of the answer for an arbitrary interval $$$[l,r]$$$:
Let's put the points $$$(pre[i]+1, i)$$$ onto the 2D plane. Define a function $$$p(x,y)$$$ as the number of points such that its x-coordinate is smaller than $$$x$$$ and its y-coordinate is smaller than $$$y$$$, or a 2D prefix sum. Our answer is now
but $$$p(x,y)$$$ takes $$$O(n^2)$$$ time to calculate. Consider transforming the $$$A$$$ expression to incorporate decomposition into blocks of size $$$B$$$:
Now this is a prefix sum on the points $$$(pre[i]/B+1,i/B)$$$ where division is rounded down. This can obiously be computed in $$$O((n/B)^2)$$$.
Consider the optimal choice of $$$B$$$. The time complexity is
Selecting $$$B=O(\sqrt[3]{n})$$$ gives us a total time complexity of
This runs faster than the currently best known method of persistent segment trees because of its tiny constant for $$$n=10^6$$$. Its memory usage is also much, much smaller.
More compilicated examples
Optimizing brute force: 911G - Mass Change Queries
Statement: Given an array and operations of the form $$$l,r,x,y$$$, set $$$a[i]=y$$$ for all $$$i$$$ such that $$$i\in[l,r]$$$ and $$$a[i]=x$$$. Here $$$1\le a[i],x,y\le10^5$$$.
We observe that we can apply these operations to the entire array pretty quickly by storing the locations each value occur in something simlar to a vector and do small-to-large merging.
Do block decomposition on this array into blocks of size $$$B$$$. For blocks that are completely contained by an operation, do small-to-larger merging; otherwise, reconstruct this block and modify with brute force.
I don't know how to prove the true time complexity rigorously, but we can upper bound it. If there is no brute force changes, for each block it will take a total of $$$O(B\log B)$$$ time, because whenever an element is moved, the size of the vector that it is in at least doubles. Total time would thus be $$$O(NB\log B)$$$. If brute force changes on a block absolutely breaks amortized time, moving would take $$$O((N+Q)B\log B)$$$ time, for a total of $$$O((N+Q)B\log B+QN/B)$$$. Assume $$$Q=O(N)$$$ to simplify; $$$O(NB\log B+N^2/B)$$$. Selecting $$$B=\sqrt{\frac{N}{\log N}}$$$ gives
Note here that the $$$\log$$$ is inside the square root sign.
Value distance query
Statement: Given a static array and queries of the form $$$x, y$$$, find $$$\min(|i-j|:a[i]=x,a[j]=y)$$$.
There are at most $$$O(\sqrt n)$$$ values that occur more than $$$O(\sqrt n)$$$ times. Consider a pair $$$x,y$$$ such that at least one of $$$x,y$$$ occurs more than $$$O(\sqrt n)$$$ times. We try to precompute answers for all such $$$x, y$$$ as there are at most $$$O(n\sqrt n)$$$ pairs.
For all $$$x$$$ where $$$x$$$ occurs more than $$$O(\sqrt n)$$$ times, we sweep the array using $$$x$$$, once forwards and once backwards. We keep track of the last location where $$$x$$$ occured in the direction that we are sweeping, and then we try to update the answer for $$$x$$$ and $$$a[i]$$$ (i is the index of the sweep pointer.)
As well as that, for each value keep a sorted vector of the indices where said value occurs.
When we answer queries, if we already precomputed the answer, we directly use it; otherwise, we run two-pointers on the corresponding vectors that we stored. The total time complexity is $$$O((n+q)\sqrt n)$$$.
[Ynoi2008] rplexq
Statement: Given a rooted tree and queries of the form $$$(l,r,x)$$$, for each query find the number of $$$i,j$$$ such that $$$l\le i<j\le r$$$ and the LCA of $$$i$$$ and $$$j$$$ is $$$x$$$.
First of all, we answer these queries offline and put the $$$[l,r]$$$ values on the corresponding node. We do dsu on tree, maintaining an implicit segtree of the indices of nodes that belong to some subtree with small-to-large merging.
If we don't even care about the intervals, the amount of pairs that have LCA at node $$$x$$$ is equal to
where $$$j$$$ is the children of $$$x$$$.
When a node has a pretty small amount of children, we can answer the queries with brute force, as prefix sum queries on an implicit segtree has a small enough constant. But when a node has a lot of children AND has a lot of queries on it, this is much too slow.
First off we do a special check for the heavy child, so as to guarantee that the amount of nodes we parse are on the order of $$$O(n\log n)$$$. We then take the light children and condense the indices. Now, we can run Mo's on these indices, where what we store is $$$\sum\binom{colors}2$$$. Due to the ridiculously high initialization constant for Mo's it might be better to run brute force when queries or children count is small enough.
Note that the complexity for Mo's with length $$$n$$$ queries $$$m$$$ is $$$O(n\sqrt m)$$$. Because of
the total time complexity is upper-bounded by $$$O(n\log n\sqrt m)$$$. ODT claims that the time complexity can be $$$O(n\sqrt{m\log n})$$$ to $$$O(n\sqrt m)$$$, but I do not know how to prove this.
If you have other interesting square root decomposition problems, please share them in the comments below and I'll include them in the next edit!