On 1/29/2008 4:09 PM, Seb wrote:
> Hi,
>
> I would like to check if some simple text files have been corrupted. A
> manual/visual check of a few of the files shows that some of them
> contain "garbage" characters in them. I can't directly "see" what those
> characters are, but they can be found at any part of the file. The
> information I'm after is simply the name of the file that is corrupted.
> So I thought the following would do:
>
> awk '/[^[:alnum:]]/ {print FILENAME}' *
>
> Since, if IIUC, [:alnum:] represents all alphabet letters (upper and
> lower case) and all digits, punctuation marks and symbols,
alnum = ALpha NUMeric, i.e. alphabetic and numeric characters: no
punctuation marks, symbols or anything else.
> which are
> part of the uncorrupted files. Basically, print the file name a line
> does NOT match any of these characters. Is this a good way to spot
> those corrupted files?
Beats me, since I don't know what your files can legally contain and so
don't know what it means for them to be "corrupted", but to detect control
characters you'd use the "[:cntrl:]" character class:
awk '/[[:cntrl:]]/ {print FILENAME}' *
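
As written, that command prints the file name once for every matching
line. A minimal sketch that reports each file only once, assuming an awk
that supports POSIX character classes (the function name
list_files_with_cntrl is hypothetical, not a standard tool):

```shell
# Hypothetical helper: print the name of each file given as an argument
# that has at least one line containing a control character, once per file.
list_files_with_cntrl() {
    awk '
        FNR == 1 { reported = 0 }       # first line of a new file: reset the flag
        !reported && /[[:cntrl:]]/ {    # first matching line in this file
            print FILENAME
            reported = 1
        }
    ' "$@"
}

# Usage: list_files_with_cntrl *
```

The flag is used instead of gawk's "nextfile" only to keep the sketch
portable; with gawk, "nextfile" would also avoid reading the rest of the file.
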
Note, though, that the presence of control characters doesn't always mean a
file's been corrupted (e.g. people use control-Ls to separate functions in
source code to force form-feed to printers). Maybe some other character
class or combination of such would be more appropriate. See
http://www.gnu.org/software/gawk/manual/gawk.html#table_002dchar_002dclasses
for the list.
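
As one possible combination, here is a sketch that assumes the files are
supposed to hold only printable ASCII, spaces and tabs. That is purely an
assumption, since the legal contents aren't known. Tabs count as control
characters, so "[:cntrl:]" alone would flag any file that contains one;
this version flags anything that is neither printable nor blank. The
function name list_files_with_unexpected_bytes is hypothetical.

```shell
# Hypothetical helper: print each file name once if any line holds a
# character that is neither printable nor a space/tab.
# ASSUMPTION: the files are meant to be plain printable ASCII text.
list_files_with_unexpected_bytes() {
    # LC_ALL=C makes the character classes byte-oriented.
    LC_ALL=C awk '
        FNR == 1 { reported = 0 }       # reset the flag for each new file
        !reported && /[^[:print:][:blank:]]/ {
            print FILENAME
            reported = 1
        }
    ' "$@"
}

# Usage: list_files_with_unexpected_bytes *
```

With this sketch, carriage returns (DOS line endings) would be flagged,
and in the C locale any byte outside the ASCII range should be flagged too,
so it isn't suitable for files that legitimately contain UTF-8 text.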
Ed.

