On Thursday 8 May 2008 15:58, Hermann Peifer wrote:
> On May 8, 11:25 am, pk <p...@[EMAIL PROTECTED]> wrote:
>> What still stumps me is that in my locale (en_GB.utf8) the
>> expressions '[a-z]', '[[:lower:]]', and '[[:alpha:]]' all DO match
>> lowercase accented characters; nonetheless, using either of them with
>> your script does not show accented characters.
>
> The collating_chars.sh script uses sprintf("%c", i).
>
> This function works fine for i = 0 ... 127.
>
> For locales with a codeset of ISO-8859-something, it will also work
> fine for i = 128 ... 255.
>
> Your locale's codeset is utf8, so '[a-z]', '[[:lower:]]', and
> '[[:alpha:]]' DO include accented characters. The script is simply not
> working properly, because of sprintf() limitations. At least this is
> how I understand the issue.
Yes, this is also my understanding. Non-utf8 locales work fine because they
have useful characters in the 0-255 range, while utf8 uses a different
encoding in which the values 128-255 are not complete characters on their
own. Moreover, the tests show that the characters repeat every 256 values
(or perhaps it is %c that is limited to that range).
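To make the byte-level difference concrete, here is a minimal sketch. The
byte_count helper is just a name made up for illustration, and the example
only looks at the encodings themselves; it makes no claim about how any
particular awk implements %c.

```shell
#!/bin/sh
# byte_count ESCAPES: count the bytes produced by a string of printf octal
# escapes. (Hypothetical helper; the argument is deliberately used as the
# printf format so that the escapes are interpreted.)
byte_count() {
    printf "$1" | wc -c | tr -d ' '
}

# "e acute" as the single ISO-8859-1 byte 0xE9 (octal 351) and as the
# UTF-8 byte pair 0xC3 0xA9 (octal 303 251).
latin1_bytes=$(byte_count '\351')
utf8_bytes=$(byte_count '\303\251')

echo "ISO-8859-1: $latin1_bytes byte(s), UTF-8: $utf8_bytes byte(s)"
# prints: ISO-8859-1: 1 byte(s), UTF-8: 2 byte(s)
```

So a lone byte in the 128-255 range can be a whole character in an
ISO-8859-something codeset, but in utf8 it is at best one piece of a
multibyte sequence.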
This is why I'm looking for a solution that can gather the information by
making effective use of the sources where localized features are defined.
For example, under Linux it seems that each locale is defined by a series of
files under /usr/share/i18n/locales (on my system at least).
A quick look at those files reveals that they are where the various
LC_COLLATE, LC_NUMERIC etc. values are defined (I must say that I'm still a
bit puzzled about the syntax used, but that can be solved by carefully
reading the docs and the standard, I hope).
As an example, the file /usr/share/i18n/locales/en_GB (could not find an
en_GB.utf8 file) has inside it:
....
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
....
The file /usr/share/i18n/locales/iso14651_t1 includes the file
iso14651_t1_common, which finally has (some comments removed):
LC_COLLATE
# Déclaration des systèmes d'écriture / Declaration of scripts
script <SPECIAL>
script <LATIN>
script <TIFINAGH>
script <ARABINT>
script <ARABFOR>
script <HEBREU>
script <GREC>
script <CYRIL>
script <ARMENIAN>
# Déclaration des symboles internes / Declaration of internal symbols
#
# SYMB N° Expl.
#
collating-symbol <RES-1>
#
# <ARABINT>/<ARABFOR>
#
#
collating-symbol <ANO> # 2 normal --> voir/see <MIN>
collating-symbol <AIS> # 3 isol.
collating-symbol <AFI> # 4 final
collating-symbol <AII> # 5 initial
collating-symbol <AME> # 6 medial/m<e'>dian
....[snip]...
# Ordre des symboles internes / Order of internal symbols
#
# SYMB. N°
#
<RES-1>
....[snip]...
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère
# spécial et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB. N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !
<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡
<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?
<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿
<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /
....
....etc.etc.
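As a rough idea of what reading these entries could look like, the sketch
below lists the <Uxxxx> entries between order_start and order_end in file
order. The function name list_collation_entries is made up, and the
assumptions are large: it ignores the weights, symbolic names,
collating-elements and everything else in the real syntax, so the numbering
is only the position in the file, not a real collation order.

```shell
#!/bin/sh
# list_collation_entries: read a locale source fragment on stdin and print
# a running number plus the code point of every <Uxxxx> entry that appears
# between order_start and order_end. (Hypothetical helper.)
list_collation_entries() {
    awk '
        /^order_start/ { in_order = 1; next }
        /^order_end/   { in_order = 0 }
        in_order && /^<U[0-9A-Fa-f]+>/ {
            cp = $1
            gsub(/[<>U]/, "", cp)      # "<U0020>" -> "0020"
            print ++n, "U+" cp
        }
    '
}

# A few lines taken from the excerpt above, used as sample input.
list_collation_entries <<'EOF'
order_start <SPECIAL>;forward;backward;forward;forward,position
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
order_end
EOF
```

Lines that do not start with a <Uxxxx> key, such as the commented-out range
line, are skipped.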
As far as I understand, these files are read by the commands "locale-gen"
and "localedef" and used to generate a binary form of locale information,
located (on my system) under /usr/lib/locale, which is what is used by the
various locale-sensitive programs (perhaps through libc) at runtime.
Either way, my point is that, since the localization info is defined
somewhere, there should be a way to extract and display that information.
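For the source files, a first step could be something like the sketch
below, which pulls one category section out of a file and reports which
other files it copies from. Both function names (print_category and
copy_targets) are made up for illustration, and the sketch assumes that the
section markers and the copy directive each sit on their own line, as in
the en_GB excerpt above; I have not checked that against the full syntax.

```shell
#!/bin/sh
# print_category NAME: print the lines found between "NAME" and "END NAME"
# in the locale source read from stdin. (Hypothetical helper.)
print_category() {
    awk -v name="$1" '
        NF == 1 && $1 == name                { inside = 1; next }
        NF == 2 && $1 == "END" && $2 == name { inside = 0 }
        inside
    '
}

# copy_targets: print the file named by each copy "..." directive on stdin.
copy_targets() {
    awk '$1 == "copy" { gsub(/"/, "", $2); print $2 }'
}

# Sample input: the LC_COLLATE section of en_GB as quoted above.
print_category LC_COLLATE <<'EOF' | copy_targets
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
EOF
# prints: iso14651_t1
```

On a real system the input would be the file itself, for example
print_category LC_COLLATE < /usr/share/i18n/locales/en_GB | copy_targets,
and the same two functions could then be applied to each file it names.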
The command "locale", with its various options, seems unable to provide the
kind of information I'm interested in (unless I overlooked something, which
is entirely possible).
$ locale -c LC_COLLATE -k
LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="UTF-8"
To anybody who might reply: if you feel that the discussion is OT on
comp.lang.awk, feel free to reply on comp.unix.shell or whatever group you
deem appropriate. Thanks.
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

