Ed Morton wrote:
> On 5/7/2008 12:50 PM, Hermann Peifer wrote:
>> Ed Morton wrote:
>>
>>> On 5/6/2008 6:16 AM, Hermann Peifer wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I am somewhat puzzled with match() results for numbers in scientific
>>>> notation. See below.
>>>>
>>>> $ cat testdata
>>>> 100
>>>> 100e-3
>>>> 100E3
>>>>
>>>> I am wondering what kind of uppercase character is matched in record
>>>> 2:
>>>>
>>>> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
>>> There may not be an uppercase character matching. [A-Z] represents the
>>> list of characters in between the character A and the character Z in your
>>> locale - that does NOT mean it has to be upper case characters. For
>>> example, your locale might consider characters ordered as:
>>>
>>> aAbBcCdDeEfF....zZ
>>
>> You are right: in my locale en_GB.UTF-8, [A-Z] matches all upper and
>> lower case letters (including accented letters), except lower case a. In
>> return, [a-z] matches all upper/lower case letters, except upper case Z.
>>
>>
>>
>>> so "e" would sit between "A" and "Z". That's why you should use character
>>> classes instead of specific ranges, e.g.:
>>>
>>> gawk '{print $1,match($1,/[[:upper:]]/)}' testdata
>>>
>>
>> I will do so. Thanks, Hermann
>
> As the other part of this thread continues over at comp.unix.shell, I came
> up with this script which you can run to see which characters are contained
> in which character lists (REs actually):
>
> $ cat rechars.awk
> # Prints every character that matches a given RE.
> # Originally created to print all characters in a given character list.
> #
> # usage:
> # LC_ALL=C awk -v re="[a-z]" -f rechars.awk
> # LC_ALL=en_GB awk -v re="[a-z]" -f rechars.awk
> # awk -v re="[[:upper:]]" -f rechars.awk
> #
> BEGIN{
>     for (i=0;i<=1000;i++)
>         chars[sprintf("%c",i)]
>     for (c in chars)
>         if (c ~ re)
>             s=s c
>     print re"="s
> }
> $ awk -v re="[A-Z]" -f rechars.awk
> [A-Z]=ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> so you can play with that if you're curious about which characters match in
> specific locales...
>
> Ed.
>
Thanks for this one. However, I do not think that sprintf("%c",i) makes
much sense with i > 255. As far as I can see, for values between 0 and
255 printf prints the (control) characters from ASCII and ISO-8859-1,
but with i > 255 the same series of characters is printed again. See
here:
$ awk 'BEGIN{printf "%c\n",65}'
A
$ awk 'BEGIN{printf "%c\n",65+256}'
A
$ awk 'BEGIN{printf "%c\n",65+256+256}'
A
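
Following on from that, here is a minimal sketch of the same idea restricted to the codes 1 to 255 and appending the characters in code order instead of going through an array. The function name rechars256 and its optional second argument (a locale name, defaulting to C) are my own inventions for illustration, not part of the script quoted above. I start at 1 rather than 0 on the assumption that not every awk copes well with a NUL character in a string.

```shell
# rechars256 RE [LOCALE]
# Hypothetical helper: prints every character with code 1..255 that matches RE.
# LOCALE defaults to C so that the result is reproducible.
rechars256() {
    LC_ALL="${2:-C}" awk -v re="$1" 'BEGIN {
        for (i = 1; i <= 255; i++) {
            c = sprintf("%c", i)   # character with numeric code i
            if (c ~ re)            # dynamic regex taken from -v re=...
                s = s c            # append in code order
        }
        print re "=" s
    }'
}

rechars256 '[A-Z]'
# in the C locale this prints: [A-Z]=ABCDEFGHIJKLMNOPQRSTUVWXYZ
```

To compare locales one could call it as, e.g., rechars256 '[A-Z]' en_GB.UTF-8 (assuming that locale is installed). How codes above 127 are treated in a multibyte locale may well differ between awk implementations, so take those results with a grain of salt.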
Hermann

