On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>
>>
>> Is there a way to explicitly print out that information (or, better, the
>> entire collating sequence in use)? I've been looking for a method to do
>> that for a long time, but I have found no complete answer.
>>
>
> I expect you could use the ord() and chr() functions described here:
>
> http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
>
> to do something like:
>
> for (i=ord("a");i<=ord("z");i++) {
> print chr(i)
> }
Take this scenario:
$ cat file
100e3
$ echo $LC_ALL
en_GB
$ awk '/[A-Z]/' file
100e3
$ LC_ALL=C awk '/[A-Z]/' file
$
(or, perhaps more elegantly,
$ awk '/[[:upper:]]/' file
$ )
It seems that the functions you point out use the mere numeric character
values and don't take the locale into account. Using the proposed code for
the ord() and chr() functions, a loop to print the sequence from "A" to "Z"
always yields
A
B
C
....
Z
under many different locales, even en_GB which, as seen above, clearly
expands [A-Z] differently.
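
The same effect can be seen without the helper functions from the manual. What follows is only a minimal sketch: letters_by_code is a made-up name, and it assumes an awk whose printf "%c" prints the character with the given numeric code. The loop walks character codes, so the collating sequence is never consulted, whatever LC_ALL is set to.

```shell
# letters_by_code LOCALE   (hypothetical helper, for illustration only)
# Prints the characters with codes 65..90, i.e. "A" to "Z" in ASCII,
# while awk runs under LOCALE.
letters_by_code() {
    LC_ALL="$1" awk 'BEGIN {
        # Iterate over numeric codes, not over a collation order.
        for (i = 65; i <= 90; i++) printf "%c", i
        print ""
    }'
}

letters_by_code C
letters_by_code en_GB
```

Both calls should print the same 26 uppercase letters, which is exactly why this kind of loop cannot reveal how [A-Z] is expanded.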
In fact, my question is not awk-specific, and is generically about how
collating sequences affect the interpretation of bracket expressions, and
thus influence how programs like grep, sort, awk, etc. work.
What I'm looking for is a command which, ideally, behaves as follows:
$ LC_ALL=C <command> '[A-C]'
ABC
$ LC_ALL=en_GB <command> '[A-C]'
AaBbCc # or whatever it's expanded to
and, ideally, also something like
$ <command> -a
# prints the entire current collating sequence, according to the
# current locale
Of course, I don't know whether such a command exists, or even whether
it's possible to gather that information in some other way.
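
A brute-force approximation of the first form can at least be sketched. This is not the command asked for, only an illustration under several assumptions: expand_bracket is a hypothetical name, it only tries the printable ASCII characters (codes 33 to 126), so accented letters and multi-character collating elements are ignored, and it relies on grep interpreting the range according to LC_ALL.

```shell
# expand_bracket LOCALE BRACKET_EXPR   (hypothetical helper, not a real command)
# Prints the printable ASCII characters that BRACKET_EXPR matches when
# grep runs under LOCALE.
expand_bracket() {
    # 1. awk emits the characters with codes 33..126, one per line.
    # 2. grep keeps the lines (single characters) the expression matches.
    # 3. tr joins the surviving characters onto one line.
    awk 'BEGIN { for (i = 33; i <= 126; i++) printf "%c\n", i }' |
        LC_ALL="$1" grep -e "$2" |
        tr -d '\n'
    echo
}

expand_bracket C '[A-C]'
```

In the C locale this prints ABC. Under another locale the result depends on the installed locale data and on the grep version, so no output is claimed here. Note that the characters come out in character-code order, not in collation order, so this says which characters a range covers but not how the locale orders them.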
I'm setting the followup for this discussion to comp.unix.shell, since
this is not awk-specific anymore.
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

