Re: Search pattern for non-ASCII alphabetic characters
by Hermann Peifer <peifer@[EMAIL PROTECTED]>
Feb 3, 2008 at 09:42 PM
Janis Papanagnou wrote:
> Hermann Peifer wrote:
>> Hi,
>>
>> Occasionally, I'd like to search for non-ASCII alphabetic characters
>> in UTF-8 encoded text documents.
>>
>> In the absence of an appropriate character class (at least I wouldn't
>> know of any), I do something like:
>>
>> awk '/[ÀÁÂÃÄÅ ...and so on... ŸŹźŻżŽž]/{ action }'
>>
>> This is perhaps not the smartest solution. Any better idea?
>>
>> TIA. Hermann
>
> I can't tell if it is a smarter solution but you could use the inverse
> logic based on the existing character classes...
>
> LANG=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'
>
> (Note: there's also the ANSI character class [:ascii:] but my GNU awk
> seems to not support it.)
>
> Janis
Thanks for the hint. This pattern also finds "N°1", which does not
exactly contain a non-ASCII *alphabetic* character, but it's still
better than the long character list I was using. I can filter out such
false positives afterwards.
Hermann
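
For readers who want to get rid of the "N°1" kind of false positive, here is a minimal sketch in Python rather than awk. It is not anything proposed in the thread. It assumes the input has already been decoded from UTF-8 into Python strings, and that the character in "N°1" is the degree sign U+00B0. The function names are made up for illustration. has_non_ascii() roughly mirrors what the inverse-logic pattern flags under LANG=C (any character outside ASCII), while has_non_ascii_letter() additionally requires a Unicode letter category.

```python
import unicodedata


def has_non_ascii(text):
    """True if text contains any character outside the ASCII range.

    Roughly what the negated character-class pattern flags under LANG=C.
    """
    return any(ord(ch) > 127 for ch in text)


def non_ascii_letters(text):
    """Return the non-ASCII characters whose Unicode category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return [
        ch for ch in text
        if ord(ch) > 127 and unicodedata.category(ch).startswith("L")
    ]


def has_non_ascii_letter(text):
    """True if text contains at least one non-ASCII alphabetic character."""
    return bool(non_ascii_letters(text))


# Hypothetical sample lines, written with escapes so the file encoding does not matter.
samples = ["plain ASCII", "N\u00b01", "\u00c0 la carte"]
for line in samples:
    print(repr(line), has_non_ascii(line), has_non_ascii_letter(line))
```

With these definitions, "N°1" counts as non-ASCII but not as containing a non-ASCII letter, because the degree sign is in a symbol category. One caveat: a letter written as a plain ASCII base character plus a combining accent (decomposed form) is not reported, since the combining mark is not in a letter category. Normalizing to NFC first with unicodedata.normalize() would be one way around that.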