Hi everyone!
Today I want to describe an efficient solution of the following problem:
Composition of Formal Power Series. Given $$$A(x)$$$ and $$$B(x)$$$ of degree $$$n-1$$$ such that $$$B(0) = 0$$$, compute $$$A(B(x)) \pmod{x^n}$$$.
The condition $$$B(0)=0$$$ doesn't decrease the generality of the problem, as $$$A(B(x)) = P(Q(x))$$$, where $$$P(x) = A(x+B(0))$$$ and $$$Q(x) = B(x) - B(0)$$$. Hence you could replace $$$A(x)$$$ and $$$B(x)$$$ with $$$P(x)$$$ and $$$Q(x)$$$ when the condition is not satisfied.
Solutions that I'm going to describe were published in Joris van der Hoeven's article about operations on formal power series. The article also describes a lot of other common algorithms on polynomials. It is worth noting that Joris van der Hoeven and David Harvey are the inventors of breakthrough $$$O(n \log n)$$$ integer multiplication algorithm in the multitape Turing machine model.
Naive methods
An obvious solution would be to evaluate $$$B(x)$$$ in $$$n^2$$$ points, then evaluate $$$A(B(x_i))$$$ in these points and interpolate $$$(A \circ B)(x)$$$. This solution would require $$$O(n^2 \log^2 n)$$$ operations with an enormous constant. Not the best idea for $$$n \leq 8000$$$.
Another possible approach is to go from the definition of the polynomial composition:
You could compute $$$B^2(x), B^3(x), \dots, B^{n-1}(x)$$$ modulo $$$x^n$$$ one by one and sum them up, which would take $$$O(n^2 \log n)$$$ operations.
Divide and conquer
If we write $$$A(x) = A_0(x) + x^t A_1(x)$$$, then
With this decomposition, you can reduce the computation of $$$A(B(x))$$$ to $$$2$$$ recursive calls to $$$A_0(B(x))$$$ and $$$A_1(B(x))$$$ with $$$O(n \log n)$$$ overhead for the multiplication. If you use $$$t = \lfloor \frac{n}{2} \rfloor$$$, the degree of $$$A_0$$$ will be at most $$$\lfloor \frac{n}{2} \rfloor$$$ and the degree of $$$A_1$$$ will be $$$\lceil \frac{n}{2} \rceil$$$.
Considering possible values of $$$t$$$ over all recursion levels they will be at most $$$O(\log n)$$$ different values, hence you can compute necessary powers of $$$B(x)$$$ in advance, which would require $$$O(n \log^2 n)$$$ cumulative time for the pre-processing.
The overall time complexity for this algorithm is still $$$O(n^2 \log n)$$$, but it allows significant constant-time optimizations, which is enough to pass the Library Checker problem. For this one need to carefully consider how many terms do you really need on each level of recursion.
In a more generic situation when $$$\deg A = p$$$ and $$$\deg B = q$$$, it is possible to prove the alternative complexity estimation of $$$O(pq \log^2 n)$$$, which is a bit better when $$$p$$$ or $$$q$$$ are significantly smaller than $$$n$$$.
Square root trick
In 1978, Brent and Kung suggested a faster algorithm to compute $$$A \circ B$$$ assuming that formal power series are taken over the division ring. Instead of decomposing $$$A(x)$$$, it is possible to decompose $$$B(x) = B_0(x) + x^q B_1(x)$$$ and use the Taylor expansion:
For this expression, it generally holds that
which allows to compute $$$A^{(1)}(B_0(x)), A^{(2)}(B_0(x)),\dots$$$ in $$$O(\frac{n^2}{q} \log n)$$$ by successively multiplying them with $$$B_0'(x)$$$ and integrating, starting with $$$A^{(\lfloor n/q \rfloor)}(B_0(x))$$$, which is computed with the divide and conquer algorithm above with the complexity $$$O(nq \log^2 n)$$$.
Optimal value of $$$q$$$ to minimize both complexities is $$$q = \sqrt{\frac{n}{\log n}}$$$, with which the final complexity is