Ben's · Beta · LJ

awk snippet, rep: duplicate a string N times

* * *

I'm going to start posting the random bits of awk code that I come up with.


For a project about which I'll be writing more in a bit, I needed the ability to quickly produce long strings - on the order of a few dozen, and in some cases a few hundred, mebibytes. I went through several iterations before I came up with a version that I liked. That version is called rep6 in the source code below; in my personal library file I would just call it rep.
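For a taste of the scale involved, here's a self-contained sketch using plain awk doubling (not any of the rep variants below); the 10 MiB target is my own illustrative figure, not one from the project:

```shell
# Self-contained sketch, separate from the script below: build 10 MiB of
# "x" by repeated doubling, then count the bytes.
awk 'BEGIN {
        s = "x";
        while (length(s) < 10485760) s = s s;   # double until long enough
        print substr(s, 1, 10485760);           # trim to exactly 10 MiB
}' | wc -c
# → 10485761 (10 MiB of payload plus the trailing newline from print)
```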

#!/bin/gawk -f 

# by gnomon
# sha1sum of printf "%s %s", firstname, lastname:
# 1110be40263bc427f7906f4e01c0cc218c6daf6e
# Tue Apr  8 21:11:32 EDT 2008
# posted to #awk in the hopes of garnering some feedback
# about style, ideas and performance. 

# GPL v2 (but it's not finished yet, so why bother?)

function rep1(s,n,      r) {
# O(n) allocate/appends
# 2 lines of code
# This is the simplest possible solution that will work:
# just repeatedly append the input string onto the value
# that will be passed back, decrementing the input count
# until it reaches zero.
        while (n-->0) r = r s;
        return r;
}

function rep2(s,n,      r) {
# O(1) + O(n) allocate/appends
# 3 lines of code
# The idea here is to generate a string of length n with
# sprintf and then use gsub to replace everything in the
# generated string with the input string. Hopefully this
# will take advantage of some smart internal workings of
# the regex engine to avoid repeated allocations.
# UPDATE: false hope. This is by far the slowest tactic.
        r = sprintf(("%" n "s"), "");
        gsub(/ /, s, r);
        return r;
}

function rep3(s,n,      c,n2,r,r2) {
# O(log n) allocate/appends UPDATE: this is a lie!
# 6 lines of code
# Here we trade for faster execution by sacrificing some
# code clarity. We keep a temporary string and a counter
# that we start off at 1, doubling both until we pass n;
# then we start over at 1, and we repeat until exactly n
# is reached. This may be ugly but it is faaaaaaast.
        for (;c < n; c += n2) {
                r2 = s;
                #for (n2=1; n2*2<(n-c); n2*=2) {
                #     APPEND++; r2=r2 r2}
                for (n2=1; n2*2<(n-c); n2*=2) r2=r2 r2;
                # APPEND++;
                r  = r r2;
        }
        return r;
}

function rep4(s,n) {
# O(1) allocate/appends
# 2 lines of code
# A feat - both stupid and naive! Is consing up a string
# to avoid repeated print operations really that bad, or
# can good buffering make a dumb idea actually workable?
# UPDATE: no, it can't. This still handily beats out the
# gsub version, but it's slower than rep1 and way slower
# than rep3. Moral: buffer not the unbufferable!
        while (n-->0) printf s;
        print "";
}

function rep5(s,n,      r,x) {
# O(log n) allocate/appends - *actually true* this time!
# 3 lines of code
# Rather than keeping track of a counter like rep3 does,
# here we just use the call stack and recurse. Turns out
# that I run out of RAM way faster than gawk runs out of
# call stack. There may be a good threshold at which the
# recursion can halt and we fall back to a simple append
# approach like rep1 - but really, why bother? This code
# is simple and fast. Also, the shortest possible string
# is one character, and 32 recursive calls turn that one
# character into 4,294,967,296; that is more than enough
# to fill all available RAM on most current machines. If
# it doesn't, the performance is likely to be limited by
# the speed of malloc() (and therefore sbrk()) anyhow. A
# quick test reveals that gawk 3.1.5 on a 32-bit x86 box
# with 512 megs of RAM can stack just short of 7e3 calls
# before segfaulting; dc -e'2 7000^f' prints a very long
# number indeed: 2108 decimal digits, to be precise, and
# it will presumably take a good long while before every
# normal machine has that many bytes of RAM. We're safe.
        if (n < 2) x = (n == 1);
        else r = rep5(s,(n-(x=(n%2==1)))/2);
        return (r r (x?s:""));
}

function rep6(str,num,  remain,result) {
# O(log n) allocate/appends - identical to rep5
# 7 lines of code
# This is a much cleaner version of the preceding block.
# For some reason I enjoy reading awk code bereft of the
# optional semicolons, but I enjoy writing it with; this
# is odd, but there it is. It'd be less embarrassing if I
# showed only the below without any of the above, but it
# would be less valuable and less honest.
        #DEPTH++; # global var for counting recursions
        if (num < 2) {
                remain = (num == 1)
        } else {
                remain = (num % 2 == 1)
                result = rep6(str, (num - remain) / 2)
        }
        return result result (remain ? str : "")
}

BEGIN {
        #NUM = 10000000;     STR = "foo";
        NUM  = ARGV[1] + 0; STR = ARGV[2];
        delete ARGV[1];    delete ARGV[2];
        if        (ARGV[3] ~ /0/) {
                exit 0;
        } else if (ARGV[3] ~ /1/) {
                print rep1(STR,NUM);
        } else if (ARGV[3] ~ /2/) {
                print rep2(STR,NUM);
        } else if (ARGV[3] ~ /3/) {
                print rep3(STR,NUM);
                #print APPEND >> "/dev/stderr";
        } else if (ARGV[3] ~ /4/) {
                rep4(STR,NUM);
        } else if (ARGV[3] ~ /5/) {
                print rep5(STR,NUM);
                #print DEPTH >> "/dev/stderr";
        } else if (ARGV[3] ~ /6/) {
                print rep6(STR,NUM);
                #print DEPTH >> "/dev/stderr";
        }
        delete ARGV[3];
        exit 0;
}

# - investigate whether some tactics work better when
#   consing small iterations of very large strings;
# * done: the version that starts at one, doubles until
#   overflow and then starts again at one
# - see if there's some way of making the gsub approach
#   work better: it's too fun to abandon outright;
# - is there some way to use fewer state variables in
#   rep3? The current version is really awkward;
# - is there any way to collapse or compound the r2 and
#   r assignments in rep3 to shave a line or two?
# * sure, but rep3 is a dead end anyway. Reusing the
#   value returned by a variable assignment to remove a
#   reference to it works, as in rep5, but the result
#   is a mess and a waste of effort. Still fun, though.
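The doubling recursion at the heart of rep5/rep6 also condenses nicely into a standalone one-liner. This is my own hedged restatement - the function name and arithmetic are mine, equivalent in spirit but not copied from the script above:

```shell
# The rep5/rep6 recursion restated as a self-contained one-liner; "rep"
# here is my own condensed version, not the script's.
awk 'function rep(s, n,    half, r) {
        if (n < 2) return (n == 1) ? s : "";
        half = rep(s, int(n / 2));      # build floor(n/2) copies once
        r = half half;                  # double it: 2*floor(n/2) copies
        return (n % 2) ? r s : r;       # an odd n needs one more copy
}
BEGIN { print rep("ab", 5) }'
# → ababababab
```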
Current Mood: tired
Current Music: Postmaster, from Demographically Unsound by Tettix
* * *
On April 9th, 2008 02:13 am (UTC), moiread commented:
Lj-cut? :D
On April 9th, 2008 04:48 pm (UTC), ben_zine replied:
LJ cuts suck
Ah, the LJ cut. Let me first explain why I hate it and have avoided it, and then explain why I'll start using it.

For starters, I moved to LiveJournal from my original Blosxom-based weblog purely for the comments. Anyone at all can build a static site and publish a set of HTML files, and anything with a processor in it ought to be able to serve that up fast enough to saturate the link. I really liked to use Blosxom for this because it was tiny, simple, fast and fun. I especially liked how its well-defined template system allowed me to do very interesting things with metadata on the back end and then choose how to display it later in the rendering process, if at all. Sometimes well over half the byte length of my posts went to ancillary information like click paths, attribution, extensive link content descriptions, extra writing in the comments and so on. I really enjoyed exploring and pushing the edges of the HTML weblog format.

Unfortunately Blosxom's comment ("writeback") system fell over again and again under spam load, and I had to kill it. I can't port my old content into the current incarnation of LiveJournal without abandoning a lot of the extra work and depth that I put into the original posts. That's a shame, but it's not too surprising. I can, though, deal with the very minimal spam load. That's an enormous plus.

Cuts, though, are evil. There are any number of ways of marking up a piece of HTML as collapsible - a div with an appropriate class declaration is the simplest solution I can imagine. Instead, you've got to dump completely non-standard markup into what would otherwise be clean and correct HTML. It's irksome. It also commits you either to sticking with LiveJournal rather than retaining your ability to grab your data and port it someplace else, or to spending a bunch of effort filtering out its non-standard crap when you do, simply because they couldn't be bothered to play by the rules.

That reminds me: I owe you a writeup about how to install and use the Perl LJ export tool. I'll get on that.

As for the utility of hiding text, I agree; but this is a task that belongs at the rendering level. The largest "normal" TCP/IP packet that gets shuffled around is just under 1500 bytes - the usual Ethernet MTU. Factor in the overhead of the normal HTTP header and your LJ authentication cookie, and the typical size of the text behind a cut, and you're looking at the bandwidth equivalent of shipping a pencil in a shoebox. It's just silly.
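To put rough numbers on the shoebox - both figures below are my own illustrative assumptions, not measurements of LJ traffic:

```shell
# Back-of-envelope only; both figures are assumptions, not measurements.
payload=1000     # bytes of text behind a typical cut (assumed)
overhead=700     # request/response headers plus LJ auth cookie (assumed)
echo "$(( overhead * 100 / payload ))% overhead"
# → 70% overhead
```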

Furthermore, a simple Greasemonkey script - or an arbitrarily complex and sophisticated one - can easily and quickly take on this kind of reformatting task. And yes, there are similar facilities in Internet Explorer, Opera and Safari; I believe that even elinks includes some kind of optional programmatic post-processing. Client code can easily conceal extra information, but fetching and patching it into an already-rendered document is a far more onerous task. In general, splitting up an otherwise atomic unit like a full post across several transmissions complicates both the upstream and the downstream components, for little - I would argue no - gain. It's pure waste!

That said, it's presumably less onerous for me to grub through my data programmatically in the future than it is to force everyone to scroll past several pages of my self-indulgent code crap. Sorry about that. I'll put them behind cuts in the future.
On April 9th, 2008 04:50 pm (UTC), ben_zine replied:

Gacck, I owe you an apology! That big rant up above was totally aimed at my annoyance at LJ cuts, not at you. Sorry for venting in your direction.
* * *
