After becoming a little curious about random sampling and how to do it
without knowing the number of records in advance, I found Waterman's
Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z
at <http://doi.acm.org/10.1145/3147.3165>.
The nice features of these algorithms are that they take space
pro****tional to the sample size, and that they work without scanning
the entire file to determine how many records are present. I think
they are appropriate to be implemented as Unix stdin/stdout pipeline
filters.
I wrote both versions up in gawk and figured I'd post them somewhere
where people can poke holes in them, use them, or change them if they
like.
I know that there are a lot of useless comparisons because I used
patterns instead of explicit loops and getline for flow control, but
it's just so darn readable for me this way. :-)
On the other hand, the variable names are not readable. I took them
directly from the algorithm pseudocode to make checking the
implementation a bit easier.
My implementation preserves the input order when it outputs the random
sample. I think this behavior is more useful in shell, where you work
with things like uniq or join that require the inputs to have some
sort of order.
Here's a comparison timing on my system.
scully:~/bin steve$ time bzcat words.bz2 | vitter-sample.awk -v
n=10000 > /dev/null
real 0m54.739s
user 0m28.725s
sys 0m10.722s
scully:~/bin steve$ time bzcat words.bz2 | waterman-sample.awk -v
n=10000 > /dev/null
real 1m35.501s
user 0m50.242s
sys 0m19.479s
scully:~/bin steve$ bzcat words.bz2 | wc -l
11746850
Comments are welcome.
-- Steve
# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2
BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}
NR <= n {
pool[NR] = $0
places[NR] = NR
next
}
NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}
END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}
function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}
# Vitter's Algorithm Z for random sampling
# http://doi.acm.org/10.1145/3147.3165
#
BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
T = 10 # tuneable per (Vitter 85)
srand()
nplusone = n + 1
}
NR == nplusone {
pool[NR] = $0
places[NR - 1] = NR
t = n
thresh = T*n
num = 0
}
NR > n && t > thresh {
if (!W) {
W = exp(-log(rand())/n)
term = t - n + 1
}
while (1) {
U = rand()
X = t*(W - 1.0)
S = int(X)
lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
rhs = (((t + X)/(term + S))*term)/t
if (lhs <= rhs) {
W = rhs/lhs
break
}
y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
if (n < S) {
denom = t
numer_lim = term + S
} else {
denom = t - n + S
numer_lim = t + 1
}
for (numer = t + S; numer > numer_lim; numer--) {
y = (y*numer)/denom
denom--
}
W = exp(-log(rand())/n)
if (exp(log(y)/n) <= (t + X)/t)
break
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n*rand()))
else
exit
t = t + S + 1
term = term + S + 1
next
}
NR > n && t <= thresh {
V = rand()
S = 0
t++
num++
quot = num/t
while (quot > V) {
S++
t++
num++
quot = (quot*num)/t
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n * rand()))
else
exit
next
}
NR <= n {
pool[NR] = $0
places[NR - 1] = NR
next
}
END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}
function SKIP_RECORDS(skip) {
retval = 1
for (i = 0; i < skip; i++)
retval = getline
return retval > 0
}
function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}


|