Recenlty I was solving this problem. It was obvious strongly connected components. I wrote them and got TL14: 277474774.
I knew that push_backing to 1e6 vectors is slow due to the dynamic reallocation. I know the only way to handle it: precompute the size of each vector, and reinterpret vector<vector> as a static array. It was awful to implement, but I did it and got TL38: 277476691
Then I tried many things like removing the vector of Euler tour, adding pragmas, but have not passes the test 38. So, there are 2 questions for experienced optimizers:
Is there another, simpler way to speed up vector push_backs, without precomputing each vector's size?
Why does my final solution get TL, and how to speed up it? Constraints up to $$$10^6$$$ are enough to pass in this problem I think.
UPD: #define vector basic_string
is a useful optimization.