I know what the problem is
Where we are going, we need all the courage we can get
When I first read a comment saying that our shiny new gcc 6.2 C++14 compiler has slow printf and scanf, I immediately knew what I would do the next weekend. The new MinGW had switched to a custom implementation, and I didn't need anything else to tell me what the problem was. Of course their format string parsing is too slow, they copy data around countless times, no file buffering is used, and they have enough levels of indirection to make finding the real code feel like a treasure hunt. What else could the problem be, after all?
Keeping these thoughts in mind, I patiently waited for a weekend. When I finally opened the MinGW source and searched for the printf function, I was pleasantly surprised. The real implementation was only a couple of hops away, and soon enough I was looking at the heart of the printf code. MinGW did try to conceal it a bit by naming it mingw_pformat.c, but a ~100KB source file was a very clear hint as to where all the action is. The format string parsing was straightforward and located in a single big function. Everything was passed by pointer, and there were hardly any levels of indirection. But the data was clear: MinGW printf is twice as slow for integer numbers compared with msvcrt.dll, and scanf is even worse. Increasing the file buffer size didn't help at all. Reluctantly, I had to admit that the problem was somewhere else. But what could it possibly be?
Locked out of performance
A new wonderful multithreaded world
The journey to performance promised not to be as straightforward as I thought, so I needed a quick way to make changes in the MinGW CRT library and test their performance. Looking through the library symbols, I found that the MinGW versions of printf and scanf live in the mingwex library, so I just needed to find out how to build it. Looking for build files brought bad news: MinGW uses automake and autoconf. I don't know a way to extract a single library from this monstrosity, but no one can stop us from creating our own build files! The list of all source files in the library was clearly visible in the existing build files, and nothing else is really needed. Time to dust off the CMake book on my desk and get to work. A few hours later I had a working build of the mingwex library and a benchmark program that uses it instead of the system one.
With our bases covered, it is time to read through the pformat code searching for possible performance issues. At the bottom of the call stack I invariably found the __pformat_putc function, which in turn calls fputc to write into the file. Aha, there it is! Everybody knows that writing data symbol by symbol is slow. But why is it slow, exactly? Does it flush the buffer after every symbol? The documentation says otherwise. And if fputc is buffered just like everything else, we shouldn't see any slowdown beyond a function call. The only way to find out is to look at the fputc source code. You will not find it in the MinGW folder, though. Unlike printf, fputc doesn't have a MinGW variant, so the call goes straight to the msvcrt.dll library. Its source code is included in the Visual Studio 6.0 installation, but we can use the source of more modern versions as an approximation. So how do you write a symbol to a file? You check the file stream for errors (like end-of-file), create an SEH frame, take a lock, call _fputc_nolock, and release the lock in the SEH finally block. Now it all makes sense. In the multithreaded world we need to take locks on every file operation, and these SEH frames don't look cheap either. Well, printf is a file operation, so it takes locks too. So why do we need to take them again and again? We don't, and replacing fputc with _fputc_nolock solves the printf performance problem.
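To make the idea concrete, here is a minimal sketch (my illustration, not the actual MinGW patch): take the stream lock once for the whole formatted write and emit individual characters with the lock-free variant. The helper name put_chars is made up; _lock_file, _unlock_file, and _fputc_nolock are Microsoft CRT names, though their availability varies between CRT versions.

```c
#include <stdio.h>

/* Take the stream lock once for the whole formatted write, then emit
   characters with the lock-free variant. */
static void put_chars(FILE *stream, const char *buf, size_t len)
{
    _lock_file(stream);                  /* one lock per printf call */
    for (size_t i = 0; i < len; i++)
        _fputc_nolock(buf[i], stream);   /* no per-character locking */
    _unlock_file(stream);
}
```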
The same trick can be done for scanf, but this time we have two functions to worry about: getc and... ungetc. Replacing getc with _getc_nolock (and adding the missing locks to scanf) immediately gives a huge performance boost, but replacing ungetc causes a compile error. It turns out _ungetc_nolock is only present in post-msvcrt runtime libraries. There is still a way to increase performance, though. We can redirect ungetc into an internal buffer and perform all the real ungetc calls at once at the end of the function. This way we don't need to lock the file when an ungetc is followed by a getc. Moreover, this internal buffer already exists for exactly this purpose: it supports the string variants of scanf. You can't ungetc into a user-provided string (it can be in read-only memory), and the authors never noticed that the function only tries to put back exactly the same symbols that were there before.
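A hedged sketch of the deferred-ungetc idea; the scan_input struct and si_* helpers are my names, not MinGW's internals. The point is that a getc immediately following an ungetc never touches the stream, so no lock is taken.

```c
#include <stdio.h>

/* Pushed-back characters go into a local stack first; reads consult
   it before touching the stream, and only the leftovers are returned
   to the stream with a real (locking) ungetc at the very end. */
typedef struct {
    FILE *stream;
    int   pushed[8];    /* small local pushback stack */
    int   count;
} scan_input;

static int si_getc(scan_input *in)
{
    if (in->count > 0)
        return in->pushed[--in->count];   /* replayed locally, no lock */
    return _getc_nolock(in->stream);
}

static void si_ungetc(scan_input *in, int c)
{
    if (in->count < 8)
        in->pushed[in->count++] = c;      /* deferred, no lock taken */
}

static void si_finish(scan_input *in)
{
    while (in->count > 0)                 /* leftovers go back for real */
        ungetc(in->pushed[--in->count], in->stream);
}
```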
Zeroing in
Cutting corners and hopefully not blowing up on the pad
Even with getc and ungetc locking under control, scanf was still 2.3x slower on my integer test. Unfortunately, there is more going on here than in the printf case. The next breakthrough happened when I thought about the internal ungetc buffer for the umpteenth time. It is wastefully large: 1024 elements of type int. It's not that stack space is precious; after all, it takes a single instruction to allocate any amount of it. But we initialize the buffer with zeros. Stop, why are we doing that? Initializing structures with memset is convenient, but what would happen if we initialized only what we need, i.e. everything except this large buffer? With that change, scanf finally got to just below 2x slower.
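Illustratively, the change amounts to replacing a whole-struct memset with member-by-member initialization that skips the big buffer. The struct below is hypothetical, not MinGW's actual scanf state:

```c
/* Hypothetical scanf state, NOT MinGW's actual structure. */
struct scan_state {
    int pos;
    int width;
    int pushback[1024];   /* 4KB that is always written before it is read */
};

static void init_state(struct scan_state *s)
{
    /* was (conceptually): memset(s, 0, sizeof *s); */
    s->pos = 0;                /* zero only the small fields... */
    s->width = 0;
    /* ...and leave s->pushback deliberately uninitialized */
}
```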
The next improvement came from a simple thought: we got rid of file locks, but there might be other locks hiding in function calls. Luckily, there are very few candidates in the scanf code: just malloc/free and localeconv. Both turned out to be a problem. The only reason malloc was used in the integer parsing code was to copy the input into a string in order to call a standard strtoXX function on it. The string buffer is lazily allocated, so every scanf run needs at least one malloc call. Allocating a 256-byte buffer on the stack and using malloc only when a larger buffer is needed helped a lot. The problem with localeconv is that the locale is different for each thread, so the function needs to read a thread-local variable, which is not very fast on Windows. localeconv was called twice, to get the decimal and thousands separators; luckily, one call is enough to get both. After applying both changes, the scanf integer test reached the 1.5x slowdown mark.
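Here is a rough sketch of both fixes under assumed names (parse_long and get_separators are mine, not MinGW's): a stack buffer with a malloc fallback for oversized input, and a single localeconv call that yields both separators.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy the scanned digits into a stack buffer, falling back to
   malloc only for unusually long input. */
static long parse_long(const char *digits, size_t len)
{
    char small[256];                        /* covers typical input */
    char *buf = len < sizeof small ? small : malloc(len + 1);
    if (buf == NULL)
        return 0;                           /* allocation failure */
    memcpy(buf, digits, len);
    buf[len] = '\0';
    long value = strtol(buf, NULL, 10);
    if (buf != small)
        free(buf);                          /* rare path only */
    return value;
}

/* Fetch both separators with a single localeconv() call
   (one thread-local lookup instead of two). */
static void get_separators(char *decimal, char *thousands)
{
    struct lconv *lc = localeconv();
    *decimal   = *lc->decimal_point;
    *thousands = *lc->thousands_sep;   /* may be '\0' in the C locale */
}
```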
It is not the compiler
Everybody lies
With all the easy performance problems gone, it was time to try a profiler. Normally one starts with one, but we are trying to speed up a MinGW library. This means the Visual Studio profiler can't understand its symbols, and the MinGW profiler needs to be compiled first, which turned out to be a long quest in and of itself. Unfortunately, even the MinGW profiler only shows useful per-function information, while its per-line information is complete nonsense. But looking through the per-function information gave an unexpected gift: some short functions were not being inlined. Adding inline and separating code that is good for inlining from code that can stay in a function squeezed out a bit more speed. The final performance on the integer benchmark is 1.3-1.35x slower than the native version.
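A generic sketch of that split (not the actual pformat code; the out_buf names are mine): the short common case stays in a small inline function so the compiler inlines it at every call site, while the rare bulky case moves to a separate non-inline function.

```c
#include <stddef.h>

struct out_buf { char *data; size_t len, cap; };

/* Cold path: rare case, deliberately kept out of line. */
static void flush_and_put(struct out_buf *b, int c)
{
    /* ...flush b->data to its destination here... */
    b->len = 0;
    b->data[b->len++] = (char)c;
}

/* Hot path: small enough that the compiler happily inlines it. */
static inline void buf_putc(struct out_buf *b, int c)
{
    if (b->len < b->cap)
        b->data[b->len++] = (char)c;   /* common case, no call */
    else
        flush_and_put(b, c);           /* rare case, real call */
}
```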
Drowning in floating point numbers
There is no easy road to correct rounding
The performance for floating point numbers is very different: on a floating point test, the MinGW version of scanf is ~1.9x slower while printf is ~1.6x faster. Unfortunately, it is very hard to improve scanf. Both the native version (at least in modern versions of the runtime) and the MinGW version of scanf implement correct rounding and so have to use big integers and very complex algorithms. The profiler shows that a significant amount of time is spent allocating memory for these big integers (which also requires locking, by the way), but there is no easy fix for that. The native version in the modern C/C++ runtime uses fixed-width big integers capable of handling the worst case and allocates them on the stack instead (without zero-initialization, of course). It doesn't handle long double, though, and it has to be more careful about how many big integers it uses.
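For illustration only, a fixed-width big integer of this flavor might look like the following; the limb count is a placeholder, not the real worst-case bound used by the Microsoft runtime.

```c
#include <stdint.h>

/* A fixed-width big integer living entirely on the stack. The limb
   count below is a placeholder, NOT the actual worst-case bound. */
typedef struct {
    uint32_t limb[40];   /* hypothetical worst case for double */
    int      used;       /* limbs holding valid data; rest untouched */
} bigint;

static void bigint_from_u32(bigint *b, uint32_t v)
{
    b->limb[0] = v;          /* only the used limbs are ever written */
    b->used = (v != 0);
}
```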
For now, this performance problem remains unresolved. Any substantial change to the floating point parsing code had better be backed by good tests.
When can I use it on Codeforces?
Making a reality check
The new printf/scanf have been available in the GNU G++14 6.2.0 compiler since yesterday. I ran simple integer and floating point benchmarks on the Codeforces servers via the custom invocation tab to test real-world speed. The integer benchmark behaves similarly to my machine, while the floating point times are quite different (both MinGW printf and scanf are slower, but the performance difference is smaller). In the tables below, mingw means the new C++14 compiler, msvcrt means the new C++14 compiler using msvcrt functions by including stdio.h first (this should be similar to the existing gcc 5.2 C++11 compiler), and vs 2010 means the Visual Studio compiler using the 2010 variant of msvcrt.
Integer benchmark. 10M integers, printf("%d")/scanf("%d")

              printf    scanf
    mingw:    1482ms    1919ms
    msvcrt:   1310ms    1560ms
    vs 2010:  1779ms    1154ms
Floating point benchmark. 5M doubles, printf("%.1lf")/scanf("%lf")

              printf    scanf
    mingw:    2153ms    4977ms
    msvcrt:   1762ms    3198ms
    vs 2010:  3807ms    2558ms
It's definitely a nice story, but I got confused by the results.
What do the benchmark numbers mean? Which of them are "stock" functions, and which are your "new" functions? Do you mean that Codeforces compilers got your "new" functions now, and if so, which ones did? Are the changes visible in Polygon and PBox? Lastly, is the code available somewhere?
The update is for the C++14 gcc 6.2 compiler only. The C++11/C++ gcc 5.3 compiler uses the msvcrt functions, so no update is needed there. I included stdio.h first to use the msvcrt functions on the new C++14 compiler, for an honest performance comparison. I didn't include the "stock" functions because they are not available on Codeforces anymore, and their performance makes me sad.
Thanks! It is clearer now.
So, using the new functions is still slower than the other solution (including stdio.h first)? What's the gain of using the new functions, then (in or out of contests)? And, besides, how did you enter so much data for a custom invocation?
Yes, it is still slower than including stdio.h first. Hopefully it will rarely matter, and new users will not get TLs just because they don't know about the stdio.h trick. On the other hand, the new functions support long double, conform to the C++ standard, and implement POSIX extensions. If you use Linux to participate on Codeforces, you can now be sure that your printf's and scanf's will work on Codeforces too.
I used sprintf to generate data for sscanf, so I did not need to enter any data in custom invocation. I just realized that this is not exactly the same as reading from and writing to a file, though, since no file locking is used. I think this doesn't affect the measurements much, since the new functions only have one extra file lock in scanf compared to the msvcrt ones.
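For the curious, the benchmark I describe has roughly this shape (my reconstruction, not the exact code):

```c
#include <stdio.h>

int main(void)
{
    char buf[16];
    long long sum = 0;           /* checksum so the loop isn't optimized out */
    for (int i = 0; i < 10000000; i++) {
        int x;
        sprintf(buf, "%d", i);   /* format the number... */
        sscanf(buf, "%d", &x);   /* ...and parse it right back */
        sum += x;
    }
    printf("%lld\n", sum);
    return 0;
}
```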
Thanks, that makes sense again. So, the new implementation does conform to the current standard.
Are you planning to propose the patch upstream (to the MinGW branch currently used), then? Ideally, this will either confirm that your version is faster and still implements the standard, in which case all MinGW users can get the benefit, not only Codeforces and friends; or the reviewers will point out the exact place where it contradicts the standard. (Edit: that place may still be irrelevant to programming contests, in which case everything is left as it is now.)
There is one problem with my patch that I didn't mention: I improved scanf but not wscanf (the Unicode version). They are basically the same, but I only ported some of the scanf changes to wscanf. I didn't remove any printf/scanf features, so standard compliance shouldn't be an issue.
Even with the wscanf issue fixed, it is not clear where contributors are supposed to send patches. MinGW is still hosted on SourceForge, which doesn't have a "contribute" button.
"So why do we need to take them again and again? We don't and replacing fputc with _fputc_nolock solves the printf performance problem."
Is this patch only for competitive programming, where multithreaded output is not needed?
Is such a patch contributable at all to a library that should support multithreading?
The lock is already taken at the start of printf and scanf; taking it again in fputc is not necessary. My changes do not break the multithreaded case.
There's a link, but I'm not sure whether it's still current
Oops, here is the link: http://www.mingw.org/wiki/SubmitPatches
Can you please clarify this stdio.h trick for newbies? Where should I include it, and how much does it speed up my program?
If you include stdio.h as the first header in your program, then due to some preprocessor magic the msvcrt versions of the printf and scanf functions are chosen by the gcc 6.2 C++14 compiler. This is what I call msvcrt in the benchmarks; as you can see, the difference is not that big anymore. The same msvcrt functions are currently used by the gcc 5.1 C++11 compiler. If you include <cstdio>, the standard-compliant MinGW versions are chosen by the gcc 6.2 C++14 compiler instead. I don't know what happens if you include both.
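In other words, a minimal program using the trick looks like this:

```c
#include <stdio.h>   /* must be the very first include */
/* ...all other headers go below... */

int main(void)
{
    int x;
    if (scanf("%d", &x) == 1)   /* msvcrt scanf is selected */
        printf("%d\n", x);      /* msvcrt printf is selected */
    return 0;
}
```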
Is it the same if you include cstdio?
Parser ate a part of my comment. I fixed it now.
The change is visible in PBox. I haven't used Polygon yet, so I don't know about it. If you are really interested, you can download the source of my MinGW gcc 6.2 package here: mingw-w64-crt-fast.zip.
I updated the post to clarify the benchmark comparisons.
What was the slowdown before removing locks? The first number in the article is 2.3x after lock removal.
The initial slowdown is a very sad number; I am trying to forget it. If you really, really want it, the current MinGW-w64 pbox package contains both the new and the old libmingwex library. You can revert the change and restore the slow printf and scanf.
Interesting read. Thanks a lot for your work, and even more for your write-up of it.
Interesting reading...
I have one question, though: what exactly do you mean by "native version"? Do you mean the Linux g++ functions?
No, I mean printf/scanf implemented by Microsoft in msvcrt.