Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Programming > Awk > Submitted for y...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 1 of 13 Topic 2118 of 2317
Post > Topic >>

Submitted for your review: random sampling filters in gawk

by "steven.huwig@[EMAIL PROTECTED] " <steven.huwig@[EMAIL PROTECTED] > Jan 2, 2008 at 08:44 PM

After becoming a little curious about random sampling and how to do it
without knowing the number of records in advance, I found Waterman's
Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z
at <http://doi.acm.org/10.1145/3147.3165>.

The nice features of these algorithms are that they take space
pro****tional to the sample size, and that they work without scanning
the entire file to determine how many records are present. I think
they are appropriate to be implemented as Unix stdin/stdout pipeline
filters.

I wrote both versions up in gawk and figured I'd post them somewhere
where people can poke holes in them, use them, or change them if they
like.

I know that there are a lot of useless comparisons because I used
patterns instead of explicit loops and getline for flow control, but
it's just so darn readable for me this way. :-)

On the other hand, the variable names are not readable. I took them
directly from the algorithm pseudocode to make checking the
implementation a bit easier.

My implementation preserves the input order when it outputs the random
sample. I think this behavior is more useful in shell, where you work
with things like uniq or join that require the inputs to have some
sort of order.

Here's a comparison timing on my system.

scully:~/bin steve$ time bzcat words.bz2 | vitter-sample.awk -v
n=10000 > /dev/null

real	0m54.739s
user	0m28.725s
sys	0m10.722s
scully:~/bin steve$ time bzcat words.bz2 | waterman-sample.awk -v
n=10000 > /dev/null

real	1m35.501s
user	0m50.242s
sys	0m19.479s
scully:~/bin steve$ bzcat words.bz2 | wc -l
 11746850

Comments are welcome.

-- Steve


# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()
}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next
}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }
}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" > "/dev/
stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        newpool[ordered[i]]
}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR
}



# Vitter's Algorithm Z for random sampling
# http://doi.acm.org/10.1145/3147.3165
#

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    T = 10                      # tuneable per (Vitter 85)
    srand()
    nplusone = n + 1
}

NR == nplusone {
    pool[NR] = $0
    places[NR - 1] = NR
    t = n
    thresh = T*n
    num = 0
}

NR > n && t > thresh {
    if (!W) {
        W = exp(-log(rand())/n)
        term = t - n + 1
    }
    while (1) {
        U = rand()
        X = t*(W - 1.0)
        S = int(X)
        lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
        rhs = (((t + X)/(term + S))*term)/t
        if (lhs <= rhs) {
            W = rhs/lhs
            break
        }
        y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
        if (n < S) {
            denom = t
            numer_lim = term + S
        } else {
            denom = t - n + S
            numer_lim = t + 1
        }
        for (numer = t + S; numer > numer_lim; numer--) {
            y = (y*numer)/denom
            denom--
        }
        W = exp(-log(rand())/n)
        if (exp(log(y)/n) <= (t + X)/t)
            break
    }
    if (SKIP_RECORDS(S))
        READ_NEXT_RECORD(int(n*rand()))
    else
        exit
    t = t + S + 1
    term = term + S + 1
    next
}

NR > n && t <= thresh {
    V = rand()
    S = 0
    t++
    num++
    quot = num/t
    while (quot > V) {
        S++
        t++
        num++
        quot = (quot*num)/t
    }
    if (SKIP_RECORDS(S))
        READ_NEXT_RECORD(int(n * rand()))
    else
        exit
    next
}

NR <= n {
    pool[NR] = $0
    places[NR - 1] = NR
    next
}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" > "/dev/
stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        newpool[ordered[i]]
}

function SKIP_RECORDS(skip) {
    retval = 1
    for (i = 0; i < skip; i++)
        retval = getline
    return retval > 0
}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR
}
 




 13 Posts in Topic:
Submitted for your review: random sampling filters in gawk
"steven.huwig@[EMAIL  2008-01-02 20:44:01 
Re: Submitted for your review: random sampling filters in gawk
"steven.huwig@[EMAIL  2008-01-02 21:00:41 
Re: Submitted for your review: random sampling filters in gawk
"Luuk" <luuk  2008-01-03 10:28:56 
Re: Submitted for your review: random sampling filters in gawk
Ed Morton <morton@[EMA  2008-01-04 15:45:48 
Re: Submitted for your review: random sampling filters in gawk
William James <w_a_x_m  2008-01-03 00:25:25 
Re: Submitted for your review: random sampling filters in gawk
Janis Papanagnou <Jani  2008-01-03 10:37:38 
Re: Submitted for your review: random sampling filters in gawk
"Anton Treuenfels&qu  2008-01-03 23:07:19 
Re: Submitted for your review: random sampling filters in gawk
"steven.huwig@[EMAIL  2008-01-03 03:25:09 
Re: Submitted for your review: random sampling filters in gawk
"Luuk" <luuk  2008-01-03 14:35:06 
Re: Submitted for your review: random sampling filters in gawk
mjc <mjcohen@[EMAIL PR  2008-01-04 12:12:40 
Re: Submitted for your review: random sampling filters in gawk
Ed Morton <morton@[EMA  2008-01-04 14:26:25 
Re: Submitted for your review: random sampling filters in gawk
gazelle@[EMAIL PROTECTED]  2008-01-04 21:54:46 
Re: Submitted for your review: random sampling filters in gawk
"steven.huwig@[EMAIL  2008-01-05 20:49:05 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Fri Jul 25 14:58:31 CDT 2008.