Home

Ben's · Beta · LJ

Recent Entries · Archive · Friends · User Info

* * *

I'm going to start posting the random bits of awk code that I come up with.

Sorry.

For a project about which I'll be writing more in a bit, I needed the ability to quickly produce long strings - on the order of a few dozen, and in some case a few hundred, mebibytes. I went through several iterations before I came up with a version that I liked. That version is called rep6 in the below source code; in my personal library file I would just call it rep.

#!/bin/gawk -f 

# by gnomon
# sha1sum of printf "%s %s", firstname, lastname:
# 1110be40263bc427f7906f4e01c0cc218c6daf6e
# Tue Apr  8 21:11:32 EDT 2008
#
# posted to #awk in the hopes of garnering some feedback
# about style, ideas and performance. 

# GPL v2 (but it's not finished yet, so why bother?)

function rep1(s,n,      r) {
# O(n) allocate/appends
# 2 lines of code
# This is the simplest possible solution that will work:
# just repeatedly append the input string onto the value
# that will be passed back, decrementing the input count
# until it reaches zero.
        while (n-->0) r = r s;
        return r;
}

function rep2(s,n,      r) {
# O(1) + O(n) allocate/appends
# 3 lines of code
# The idea here is to generate a string of length n with
# sprintf and then use gsub to replace everything in the
# generated string with the input string. Hopefully this
# will take advantage of some smart internal workings of
# the regex engine to avoid repeated allocations.
# UPDATE: false hope. This is by far the slowest tactic.
        r = sprintf(("%0"n-1"s"),"0");
        gsub(/0/,s,r);
        return r;
}

function rep3(s,n,      c,n2,r,r2) {
# O(log n) allocate/appends UPDATE: this is a lie!
# 6 lines of code
# Here we trade for faster execution by sacrificing some
# code clarity. We keep a temporary string and a counter
# that we start off at 1, doubling both until we pass n;
# then we start over at 1, and we repeat until exactly n
# is reached. This may be ugly but it is faaaaaaast.
        for (;c < n; c += n2) {
                r2 = s;
                #for (n2=1; n2*2<(n-c); n2*=2) {
                #     APPEND++; r2=r2 r2}
                for (n2=1; n2*2<(n-c); n2*=2) r2=r2 r2;
                # APPEND++;
                r  = r r2;
        }
        return r;
}

function rep4(s,n) {
# O(1) allocate/appends
# 2 lines of code
# A feat - both stupid and naive! Is consing up a string
# to avoid repeated print operations really that bad, or
# can good buffering make a dumb idea actually workable?
# UPDATE: no, it can't. This still handily beats out the
# gsub version, but it's slower than rep1 and way slower
# than rep3. Moral: buffer not the unbufferable!
        while (n-->0) printf s;
        print "";
}

function rep5(s,n,      r,x) {
# O(log n) allocate/appends - *actually true* this time!
# 3 lines of code
# Rather than keeping track of a counter like rep3 does,
# here we just use the call stack and recurse. Turns out
# that I run out of RAM way faster than gawk runs out of
# call stack. There may be a good threshold at which the
# recursion can halt and we fall back to a simple append
# approach like rep1 - but really, why bother? This code
# is simple and fast. Also, the shortest possible string
# is one character, and 32 recursive calls turn that one
# character into 4,294,967,296; that is more than enough
# to fill all available RAM on most current machines. If
# it doesn't, the performance is likely to be limited by
# the speed of malloc() (and therefore sbrk()) anyhow. A
# quick test reveals that gawk 3.1.5 on a 32-bit x86 box
# with 512 megs of RAM can stack just short of 7e3 calls
# before segfaulting; dc -e'2 7000^f' prints a very long
# number indeed: 2108 decimal digits, to be precise, and
# it will presumably take a good long while before every
# normal machine has that many bytes of RAM. We're safe.
        #DEPTH++;
        if (n < 2) x = (n == 1);
        else r = rep5(s,(n-(x=(n%2==1)))/2);
        return (r r (x?s:""));
}

function rep6(str,num,  remain,result) {
# O(log n) allocate/appends - identical to rep5
# 7 lines of code
# This is a much cleaner version of the preceding block.
# For some reason I enjoy reading awk code bereft of the
# optional semicolons, but I enjoy writing it with; this
# is odd, but there it is. It'd be less embarassing if I
# showed only the below without any of the above, but it
# would be less valuable and less honest.
        #DEPTH++; # global var for counting recursions
        if (num < 2) {
                remain = (num == 1)
        } else {
                remain = (num % 2 == 1)
                result = rep6(str, (num - remain) / 2)
        }
        return result result (remain ? str : "")
}

BEGIN {
        #NUM = 10000000;     STR = "foo";
        NUM  = ARGV[1] + 0; STR = ARGV[2];
        delete ARGV[1];    delete ARGV[2];
        if        (ARGV[3] ~ /0/) {
                exit 0;
        } else if (ARGV[3] ~ /1/) {
                print rep1(STR,NUM);
        } else if (ARGV[3] ~ /2/) {
                print rep2(STR,NUM);
        } else if (ARGV[3] ~ /3/) {
                print rep3(STR,NUM);
                #print APPEND >> "/dev/stderr";
        } else if (ARGV[3] ~ /4/) {
                rep4(STR,NUM);
        } else if (ARGV[3] ~ /5/) {
                print rep5(STR,NUM);
                #print DEPTH >> "/dev/stderr";
        } else if (ARGV[3] ~ /6/) {
                print rep6(STR,NUM);
                #print DEPTH >> "/dev/stderr";
        }
        delete ARGV[3];
        exit 0;
}

# FIXME:
# - investigate whether some tactics work better when
#   consing small iterations of very large strings;
# * done: the version that starts at one, doubles until
#   overflow and then starts again at one
# - see if there's some way of making the gsub approach
#   work better: it's too fun to abandon outright;
# - is there some way to use fewer state variables in
#   rep3? The current version is really awkward;
# - is there any way to collapse or compound the r2 and
#   r assignments in rep3 to shave a line or two?
# * sure, but rep3 is a dead end anyway. Reusing the
#   value returned by a variable assignment to remove a
#   reference to it works, as in rep5, but the result
#   is a mess and a waste of effort. Still fun, though.
Tags:
Current Mood:
tired tired
Current Music:
Postmaster, from Demographically Unsound by Tettix
* * *

Advertisement