Are there any tools (based on awk etc.) to quantify the "difference"
between two PDF files? I want to detect duplicate journal articles that
I've inadvertently downloaded.
A simple diff (or md5sum etc.) does not work because: (1) Some are scanned
PDFs (essentially images), so each scan is not an "exact" replica of the
other. (2) Even for "text" PDFs, some small amount of text changes (e.g.
imprints of IP address, date downloaded, header page etc. that the journal
providers inject into the PDF).
I have found no such tools so far. I was thinking of writing my own (using
awk, etc.) based on this approach (which works only for text-based PDF
files): Make a dictionary of the 1000 most common words. Count how often
each of those words occurs in each article. Compute the difference between
two articles in terms of word frequencies in this "word-space". Decide,
based on a "fuzziness" threshold, which articles are the "same".
I don't want to reinvent the wheel. Any ideas / flaws etc. that people
might come up with are very welcome.
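
To make the word-frequency idea concrete, here is a rough sketch in Python
rather than awk. It assumes the text has already been extracted from each
PDF into a string (for instance with a tool such as pdftotext), and it uses
cosine similarity between the word-count vectors as the "difference"
measure. The function names, the cosine measure and the 0.95 threshold are
all illustrative assumptions, not a tested recipe.

```python
import math
import re
from collections import Counter


def word_counts(text, vocabulary=None):
    """Count lowercase alphabetic words in text.

    If vocabulary (a set of words, e.g. the 1000 most common ones) is
    given, only words in that set are kept.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if vocabulary is not None:
        counts = Counter({w: c for w, c in counts.items() if w in vocabulary})
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as matching nothing
    return dot / (norm_a * norm_b)


def probably_same(text_a, text_b, threshold=0.95, vocabulary=None):
    """Guess whether two texts are the same article.

    threshold is the "fuzziness" knob; 0.95 is an arbitrary starting
    value that would need tuning on real articles.
    """
    sim = cosine_similarity(word_counts(text_a, vocabulary),
                            word_counts(text_b, vocabulary))
    return sim >= threshold
```

Two copies of an article that differ only by an injected download stamp
should score close to 1.0, while unrelated articles should score much
lower. This does nothing for scanned PDFs unless they are run through OCR
first.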
--
Rahul