| ben_zine ( @ 2008-03-27 03:46:00 |
| Current mood: | a bit lonely |
awk matcher performance - what gives?
I'm tapping this out in T9 on my E50 rather than using one of the two keyboards I usually have in my kit bag[1]. It's not as goofy a setup as it sounds: one board is a ThinkOutside Stowaway that is smaller than the paperbacks I tote, and the other is an Apple Wireless which is thin enough to be a non-issue. I carry two because I keep the former paired with my phone and the latter with my N800, and I use the devices concurrently often enough that re-pairing is more onerous than taking the cargo space hit. Real-world space versus time optimization tradeoff, ladies and gents... But I digress.
I'm tapping away on my phone because there's something I want to get out of my increasingly murky brain before I get so tired that it disappears completely, and there's a good chance that'll happen before I get home.
The thing is this: mikeX, a fellow who hangs out in an IRC channel that I also frequent (#awk on irc.freenode.net), mentioned that he ran a performance comparison between a couple of constructions that ought to be equivalent:
time gawk '/test/{print $2}' file.txttime mawk '/test/{print $2}' file.txttime grep test file.txt | awk '{print $2}'
The file[2] was fifteen mebibytes of short lines with only one line, the last, containing the string "test". The first command turned out to be the slowest by far - about an order of magnitude. The second and third executed at roughly the same speed.
What makes this particularly interesting is that mikeX had just finished skimming through the same paper[3] that I had coincidentally re-read just the night before - one which compares the merits of several classes of regular expression and regex engines - and concluded that the speed difference was due to a deficiency of the type discussed in the paper (basically, that grep, awk and others use a DFA-style matcher which offers sometimes huge speedups with patterns that do not require backreferences, and that Perl and several other languages use a backtracking matcher with horrific worst-case performance even in those situations when the simpler, hugely faster DFA could be used instead).
One the one hand, this is definitely not the case, since the pattern in question is a fixed string and therefore is both the best case for any matcher and devoid of the kind of backtracking that would highlight the bad behaviour described in the paper; on the other hand, something is causing that performance delta. So what gives?
I seem to recall that grep calls into play some kind of fast-path logic when the pattern is a fixed string (something tricky like Boyer-Moore? Something macho like exploiting CPU-specific string-scanning instructions? Something slightly less macho but more insightful, like taking into account cache fill timings?). I know that mawk is reputed to be a very fast version of the tool. I suspect that the matcher in gawk is quite sophisticated. I'm curious to see how Dmitry Zakharov's busybox awk applet ranks on the speed tests.
What I'm getting at, here, is that I'm about to do some code-diving that I think may yield some very educational results. I'll need some sleep first, though.
Hey, here's my stop.
- 1
-
OK, I'm cleaning this up after getting home. HTML editing and linking is way, way too painful to do with T9 and an impoverished text editor.
- 2
-
If you're curious enough to want to duplicate the test, you can generate your own test file with the following pipeline:
time awk 'function r(s,n){while(n-->0)print s};BEGIN{s=ARGV[1];n=ARGV[2];r(s,n);print "test case hello world";r(s,n);exit 0}' "Hello world, how are you" 300000 > test-file.txt That shouldn't take longer than a second or two to run, and should generate a file with the md5sum
9cf078b42b15b6927d04b08c82699e16. - 3
-
Regular Expression Matching Can Be Simple And Fast, Russ Cox, 2007-01