| ben_zine ( @ 2008-04-08 23:14:00 |
| Current mood: | |
| Current music: | Postmaster, from Demographically Unsound by Tettix |
| Entry tags: | code awk |
awk snippet, rep: duplicate a string N times
I'm going to start posting the random bits of awk code that I come up with.
Sorry.
For a project about which I'll be writing more in a bit, I needed the ability to quickly produce long strings - on the order of a few dozen, and in some case a few hundred, mebibytes. I went through several iterations before I came up with a version that I liked. That version is called rep6 in the below source code; in my personal library file I would just call it rep.
#!/bin/gawk -f # by gnomon # sha1sum of printf "%s %s", firstname, lastname: # 1110be40263bc427f7906f4e01c0cc218c6daf6e # Tue Apr 8 21:11:32 EDT 2008 # # posted to #awk in the hopes of garnering some feedback # about style, ideas and performance. # GPL v2 (but it's not finished yet, so why bother?) function rep1(s,n, r) { # O(n) allocate/appends # 2 lines of code # This is the simplest possible solution that will work: # just repeatedly append the input string onto the value # that will be passed back, decrementing the input count # until it reaches zero. while (n-->0) r = r s; return r; } function rep2(s,n, r) { # O(1) + O(n) allocate/appends # 3 lines of code # The idea here is to generate a string of length n with # sprintf and then use gsub to replace everything in the # generated string with the input string. Hopefully this # will take advantage of some smart internal workings of # the regex engine to avoid repeated allocations. # UPDATE: false hope. This is by far the slowest tactic. r = sprintf(("%0"n-1"s"),"0"); gsub(/0/,s,r); return r; } function rep3(s,n, c,n2,r,r2) { # O(log n) allocate/appends UPDATE: this is a lie! # 6 lines of code # Here we trade for faster execution by sacrificing some # code clarity. We keep a temporary string and a counter # that we start off at 1, doubling both until we pass n; # then we start over at 1, and we repeat until exactly n # is reached. This may be ugly but it is faaaaaaast. for (;c < n; c += n2) { r2 = s; #for (n2=1; n2*2<(n-c); n2*=2) { # APPEND++; r2=r2 r2} for (n2=1; n2*2<(n-c); n2*=2) r2=r2 r2; # APPEND++; r = r r2; } return r; } function rep4(s,n) { # O(1) allocate/appends # 2 lines of code # A feat - both stupid and naive! Is consing up a string # to avoid repeated print operations really that bad, or # can good buffering make a dumb idea actually workable? # UPDATE: no, it can't. This still handily beats out the # gsub version, but it's slower than rep1 and way slower # than rep3. Moral: buffer not the unbufferable! while (n-->0) printf s; print ""; } function rep5(s,n, r,x) { # O(log n) allocate/appends - *actually true* this time! # 3 lines of code # Rather than keeping track of a counter like rep3 does, # here we just use the call stack and recurse. Turns out # that I run out of RAM way faster than gawk runs out of # call stack. There may be a good threshold at which the # recursion can halt and we fall back to a simple append # approach like rep1 - but really, why bother? This code # is simple and fast. Also, the shortest possible string # is one character, and 32 recursive calls turn that one # character into 4,294,967,296; that is more than enough # to fill all available RAM on most current machines. If # it doesn't, the performance is likely to be limited by # the speed of malloc() (and therefore sbrk()) anyhow. A # quick test reveals that gawk 3.1.5 on a 32-bit x86 box # with 512 megs of RAM can stack just short of 7e3 calls # before segfaulting; dc -e'2 7000^f' prints a very long # number indeed: 2108 decimal digits, to be precise, and # it will presumably take a good long while before every # normal machine has that many bytes of RAM. We're safe. #DEPTH++; if (n < 2) x = (n == 1); else r = rep5(s,(n-(x=(n%2==1)))/2); return (r r (x?s:"")); } function rep6(str,num, remain,result) { # O(log n) allocate/appends - identical to rep5 # 7 lines of code # This is a much cleaner version of the preceding block. # For some reason I enjoy reading awk code bereft of the # optional semicolons, but I enjoy writing it with; this # is odd, but there it is. It'd be less embarassing if I # showed only the below without any of the above, but it # would be less valuable and less honest. #DEPTH++; # global var for counting recursions if (num < 2) { remain = (num == 1) } else { remain = (num % 2 == 1) result = rep6(str, (num - remain) / 2) } return result result (remain ? str : "") } BEGIN { #NUM = 10000000; STR = "foo"; NUM = ARGV[1] + 0; STR = ARGV[2]; delete ARGV[1]; delete ARGV[2]; if (ARGV[3] ~ /0/) { exit 0; } else if (ARGV[3] ~ /1/) { print rep1(STR,NUM); } else if (ARGV[3] ~ /2/) { print rep2(STR,NUM); } else if (ARGV[3] ~ /3/) { print rep3(STR,NUM); #print APPEND >> "/dev/stderr"; } else if (ARGV[3] ~ /4/) { rep4(STR,NUM); } else if (ARGV[3] ~ /5/) { print rep5(STR,NUM); #print DEPTH >> "/dev/stderr"; } else if (ARGV[3] ~ /6/) { print rep6(STR,NUM); #print DEPTH >> "/dev/stderr"; } delete ARGV[3]; exit 0; } # FIXME: # - investigate whether some tactics work better when # consing small iterations of very large strings; # * done: the version that starts at one, doubles until # overflow and then starts again at one # - see if there's some way of making the gsub approach # work better: it's too fun to abandon outright; # - is there some way to use fewer state variables in # rep3? The current version is really awkward; # - is there any way to collapse or compound the r2 and # r assignments in rep3 to shave a line or two? # * sure, but rep3 is a dead end anyway. Reusing the # value returned by a variable assignment to remove a # reference to it works, as in rep5, but the result # is a mess and a waste of effort. Still fun, though.