Tl;dr — Assuming negligible cache misses, how many simple operations (reading/setting array values, comparing integers, addition, subtraction, multiplication) can you perform in a second on Codeforces?
So recently I was trying 1303E - Удаляем подпоследовательности and I realized that my code 79311246 which appeared to be $$$ N^{4} $$$ ran extremely quickly in custom test for large inputs, so I ended up submitting it and got accepted in just 650ms.
While at first glance it appears as if it the loop would run $$$ N^{4} $$$ times ($$$ 2.5 * 10^{10}$$$ times at worst), you can show it will run much faster than that, actually taking $$$ \sum_{i = 1}^{\left | t \right |} (i + 1)(\left | t \right | - i + 1)(\left | s \right | - \left | t \right | + 1) $$$ operations. Simplyifying this we get $$$ \frac{1}{6}(\left | s \right | + 1 - \left |t \right |)(\left | x \right |^{3} + 6 * \left | x \right |^{2}) $$$. Maximizing $$$ \left | s \right | $$$ and throwing this equation into Wolfram Alpha we get a maxima of around 5e8 operations. A test which generates this maxima is simply $$$ \left | s \right | = 400 $$$ and $$$ \left | t \right | = 300 $$$ with all characters of the strings being the same.
However considering the few if statements with character comparisons and dp array value assignments, as well as the zeroing of the array, the number of simple operations performed easily approaches $$$ 4 * 10^{9} $$$.
So my question is — Assuming a fair amount of cache persistence, has anyone tested how many simple operations we can get to run in a second on Codeforces?
The reason I mention cache persistance is it clearly plays a massive role in solutions like this, swapping the 2nd and 4th loops results in the runtime increasing from $$$750ms$$$ to nearly $$$3000ms$$$ for the max test.
Theoretically the limit would be $$$ \frac{\text{clock speed} * \text{instructions per clock per core}}{\text{average clock cycles per operation}} $$$, which assuming Codeforces is running a fairly modern processor capable of around 8 instructions per clock per core, clocked at 4Ghz (unlikely if its a high core count processor), would yield somewhere around $$$ 8 * 10 ^ {9} $$$ instructions per second, however in practice its likely to be lower.