Consider a well-known problem: given a static array of size n, answer m queries of the kind "how many numbers on [l, r] have value less than x". The standard solution is to build a segment tree where every node stores a sorted vector of the values in its segment. To answer a query we do a binary search in every node the query decomposes into, achieving O(log² n) time per query.
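A minimal sketch of this structure (not the exact code I benchmarked; the names MergeSortTree, build and query are just for illustration):

```
#include <bits/stdc++.h>
using namespace std;

struct MergeSortTree {
    int n;
    vector<vector<int>> tree;  // tree[v] = sorted values of the segment covered by node v

    MergeSortTree(const vector<int>& a) : n(a.size()), tree(4 * a.size()) {
        build(1, 0, n - 1, a);
    }

    void build(int v, int l, int r, const vector<int>& a) {
        if (l == r) { tree[v] = {a[l]}; return; }
        int m = (l + r) / 2;
        build(2 * v, l, m, a);
        build(2 * v + 1, m + 1, r, a);
        // parent vector is the merge of the two sorted children
        merge(tree[2 * v].begin(), tree[2 * v].end(),
              tree[2 * v + 1].begin(), tree[2 * v + 1].end(),
              back_inserter(tree[v]));
    }

    // how many values on [ql, qr] are < x; a binary search in every visited node
    int query(int v, int l, int r, int ql, int qr, int x) {
        if (qr < l || r < ql) return 0;
        if (ql <= l && r <= qr)
            return lower_bound(tree[v].begin(), tree[v].end(), x) - tree[v].begin();
        int m = (l + r) / 2;
        return query(2 * v, l, m, ql, qr, x) + query(2 * v + 1, m + 1, r, ql, qr, x);
    }

    int query(int l, int r, int x) { return query(1, 0, n - 1, l, r, x); }
};
```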
There is a method, called fractional cascading, to reduce the time complexity to O(log n) per query. Instead of doing a binary search in each node, we do it only in the root and then "push" its result to the children in O(1).
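A sketch of the cascading version (again just illustrative, not my benchmark code): since a node's vector is the merge of its children's vectors, we can record for every prefix of it how many elements came from the left child, and then the root's binary-search position can be pushed down to both children in O(1).

```
#include <bits/stdc++.h>
using namespace std;

struct CascadeTree {
    int n;
    vector<vector<int>> tree;   // sorted values per node
    vector<vector<int>> toLeft; // toLeft[v][i] = how many of the first i values of tree[v]
                                // came from the left child

    CascadeTree(const vector<int>& a)
        : n(a.size()), tree(4 * a.size()), toLeft(4 * a.size()) {
        build(1, 0, n - 1, a);
    }

    void build(int v, int l, int r, const vector<int>& a) {
        if (l == r) { tree[v] = {a[l]}; return; }
        int m = (l + r) / 2;
        build(2 * v, l, m, a);
        build(2 * v + 1, m + 1, r, a);
        const auto &L = tree[2 * v], &R = tree[2 * v + 1];
        toLeft[v].push_back(0);
        size_t i = 0, j = 0;
        while (i < L.size() || j < R.size()) {   // manual merge, recording origins
            if (j == R.size() || (i < L.size() && L[i] <= R[j])) {
                tree[v].push_back(L[i++]);
                toLeft[v].push_back(toLeft[v].back() + 1);
            } else {
                tree[v].push_back(R[j++]);
                toLeft[v].push_back(toLeft[v].back());
            }
        }
    }

    // pos = number of values < x in tree[v]; pushed to the children in O(1)
    int query(int v, int l, int r, int ql, int qr, int pos) {
        if (qr < l || r < ql) return 0;
        if (ql <= l && r <= qr) return pos;
        int m = (l + r) / 2;
        int posL = toLeft[v][pos];   // values < x that belong to the left child
        int posR = pos - posL;       // the rest belong to the right child
        return query(2 * v, l, m, ql, qr, posL) +
               query(2 * v + 1, m + 1, r, ql, qr, posR);
    }

    int query(int l, int r, int x) {
        // the only binary search happens at the root
        int pos = lower_bound(tree[1].begin(), tree[1].end(), x) - tree[1].begin();
        return query(1, 0, n - 1, l, r, pos);
    }
};
```

Both structures answer the same query(l, r, x); the cascading one additionally stores the toLeft arrays, so it uses roughly twice the memory.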
For years I thought that the second approach was blazingly faster than the first one. Today I ran a test: I implemented both approaches in a pretty straightforward way and benchmarked them on random data. The results were quite surprising.
Fractional cascading: 760 ms
Top-down implementation: 670 ms
Bottom-up implementation: 520 ms
The first one is O(log n) per query, the others are O(log² n)! Time is averaged over several consecutive runs. Test data is generated randomly with n = 100000, m = 200000.
Why doesn't fractional cascading give any improvement? Am I implementing it in an improper way? Anyway, this might be worth taking a look at.