detecting duplicate pdf files: a word count approach

by Rahul <nospam@[EMAIL PROTECTED] > May 12, 2008 at 10:40 PM

Are there any tools (based on awk etc.) to quantify the "difference"
between two pdf files? I want to detect duplicate journal articles that
I've inadvertently downloaded.

A simple diff (or md5sum etc.) does not work because: (1) Some are scanned
pdf's (essentially images), so each scan is not an "exact" replica of the
other. (2) Even for "text" pdf's some small amount of text changes (e.g.
imprints of IP address, date downloaded, header page etc. that the journal
providers inject into the pdf).

I have found no tools so far. I was thinking of writing my own (using awk,
etc.) based on this approach (works only for text-based pdf files): make a
dictionary of the 1000 most common words. Get a word count on each article.
Compute the difference between two articles in terms of frequency in this
"word-space". Decide based on a "fuzziness" threshold which articles are
the "same".

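A minimal sketch of the word-count idea, written in Python rather than awk
for brevity. It assumes the text has already been extracted from each pdf
(for example with a tool such as pdftotext) and is passed in as a dict of
name -> text. The function names, the use of cosine distance as the
"difference", and the default fuzziness of 0.05 are all assumptions made
for illustration, not a tested recipe.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in one article's extracted text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def vocabulary(all_counts, size=1000):
    """Pick the `size` most common words across all articles."""
    total = Counter()
    for counts in all_counts:
        total.update(counts)
    return [word for word, _ in total.most_common(size)]


def frequency_vector(counts, vocab):
    """Relative frequency of each vocabulary word in one article."""
    n = sum(counts[w] for w in vocab) or 1
    return [counts[w] / n for w in vocab]


def distance(u, v):
    """Cosine distance (an assumed metric): near 0.0 means same word mix."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def likely_duplicates(texts, fuzziness=0.05, vocab_size=1000):
    """Return (name_a, name_b, distance) for pairs closer than `fuzziness`."""
    names = sorted(texts)
    counts = {name: word_counts(texts[name]) for name in names}
    vocab = vocabulary(counts.values(), vocab_size)
    vectors = {name: frequency_vector(counts[name], vocab) for name in names}
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = distance(vectors[a], vectors[b])
            if d < fuzziness:
                pairs.append((a, b, d))
    return pairs
```

Calling likely_duplicates({"a.txt": text_a, "b.txt": text_b, ...}) returns
the candidate pairs. The fuzziness value would need tuning on real articles,
and comparing every pair is quadratic in the number of files.
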
I don't want to reinvent the wheel. Any ideas / flaws etc. that people 
might come up with are very welcome.

-- 
Rahul
 



