Re: Gawk match() and numbers in scientific notation

by pk <pk@[EMAIL PROTECTED] > May 8, 2008 at 04:22 PM

On Thursday 8 May 2008 15:58, Hermann Peifer wrote:

> On May 8, 11:25 am, pk <p...@[EMAIL PROTECTED]> wrote:
>> What still stumps me is that in my locale (en_GB.utf8) the
>> expressions '[a-z]', '[[:lower:]]', and '[[:alpha:]]' all DO match
>> lowercase accented characters; nonetheless, using either of them with
>> your script does not show accented characters.
> 
> The collating_chars.sh script uses sprintf("%c", i).
> 
> This function works fine for: i = 0 ... 127
> 
> For locales with a codeset of ISO-8859-something, it will also work
> fine for i = 128 ... 255.
> 
> Your locale's codeset is utf8, so '[a-z]', '[[:lower:]]', and
> '[[:alpha:]]' DO include accented characters. The script is simply not
> working properly, because of sprintf() limitations. At least this is
> how I understand the issue.

Yes, this is also my understanding. Non-utf8 locales work fine because they
have useful characters in the 0-255 range, while utf8 encodes every
character above 127 as a multi-byte sequence, so single byte values in the
128-255 range are not characters on their own.
Moreover, the tests show that characters repeat themselves every 256 (or,
maybe, it is %c that is limited to the values 0-255).
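
To make the encoding difference concrete, here is a small shell sketch. It
assumes od and iconv are available; the helper names (hex_bytes, latin1_hex,
utf8_hex) are made up for this example. It dumps the bytes of the character
with code 233 (octal 351, e-acute), first as a single ISO-8859-1 byte and
then re-encoded as UTF-8:

```sh
# Hex dump of stdin with no separators.
hex_bytes() { od -An -tx1 | tr -d ' \n'; }

# $1 = a byte value as three octal digits, e.g. 351 for decimal 233.
# latin1_hex shows the raw byte; utf8_hex shows it after conversion to UTF-8.
latin1_hex() { printf "\\$1" | hex_bytes; }
utf8_hex()   { printf "\\$1" | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes; }

latin1_hex 351; echo    # prints e9: one byte is the whole character
utf8_hex 351; echo      # prints c3a9: the same character takes two bytes
```

So in a utf8 locale a lone byte from the 128-255 range is only a fragment of
a character, which would be consistent with a loop over single byte values
never showing the accented letters.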

This is why I'm looking for a solution that can gather the information by
making effective use of the sources where localized features are defined.

For example, under Linux it seems that each locale is defined by a series of
files under /usr/share/i18n/locales (on my system at least).
A quick look at those files reveals that they are where the various
LC_COLLATE, LC_NUMERIC etc. values are defined (I must say that I'm still a
bit puzzled about the syntax used, but that can be solved by carefully
reading the docs and the standard, I hope).
As an example, the file /usr/share/i18n/locales/en_GB (could not find an
en_GB.utf8 file) has inside it:

....
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
....
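
To look at just one category in such a source file, a short awk filter is
probably enough. This is a sketch only: the function name locale_section is
made up, and it assumes that each category starts with a line whose first
word is the category name and ends with a matching END line, as in the
excerpt above. It does not follow "copy" directives.

```sh
# Print one category block (e.g. LC_COLLATE) from a locale source file.
# $1 = category name, $2 = path to the locale source file
locale_section() {
    awk -v name="$1" '
        $1 == name                 { inside = 1 }   # opening line, e.g. LC_COLLATE
        inside                     { print }
        $1 == "END" && $2 == name  { inside = 0 }   # closing line, e.g. END LC_COLLATE
    ' "$2"
}

# Example call (path taken from the text above):
# locale_section LC_COLLATE /usr/share/i18n/locales/en_GB
```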

The file /usr/share/i18n/locales/iso14651_t1 includes the file
iso14651_t1_common, which finally has (some comments removed):

LC_COLLATE

# Déclaration des systèmes d'écriture / Declaration of scripts
script <SPECIAL>
script <LATIN>
script <TIFINAGH>
script <ARABINT>
script <ARABFOR>
script <HEBREU>
script <GREC>
script <CYRIL>
script <ARMENIAN>

# Déclaration des symboles internes / Declaration of internal symbols
#
# SYMB N° Expl.
#
collating-symbol <RES-1>
#
# <ARABINT>/<ARABFOR>
#
#
collating-symbol <ANO> # 2 normal --> voir/see <MIN>
collating-symbol <AIS> # 3 isol.
collating-symbol <AFI> # 4 final
collating-symbol <AII> # 5 initial
collating-symbol <AME> # 6 medial/m<e'>dian
....[snip]...
# Ordre des symboles internes / Order of internal symbols
#
# SYMB. N°
#
<RES-1>
....[snip]...
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère
# spécial et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB.                                N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !
<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡
<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?
<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿
<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /
....
....etc.etc.
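
As a first step towards extracting this information, the explicit <Uxxxx>
entries can be listed in the order they appear. This is a rough sketch under
clear assumptions: the function name list_entries is made up, only lines
whose first field is a single <Uxxxx> symbol are considered, and "copy",
"reorder-after", ranges and the weights themselves are ignored. The result
is therefore the file order of the entries, not necessarily the effective
collating order.

```sh
# List the <Uxxxx> entries of a locale source file in file order.
# $1 = path to the file; output is one "position codepoint" pair per line
list_entries() {
    awk '
        # keep only lines like: <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
        NF > 1 && $1 ~ /^<U[0-9A-Fa-f]+>$/ {
            cp = substr($1, 3, length($1) - 3)   # strip "<U" and ">"
            print ++n, cp
        }
    ' "$1"
}

# Example call (file name taken from the text above):
# list_entries /usr/share/i18n/locales/iso14651_t1_common
```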

As far as I understand, these files are read by the commands "locale-gen"
and "localedef" and used to generate a binary form of locale information,
located (on my system) under /usr/lib/locale, which is what is used by the
various locale-sensitive programs (perhaps through libc) at runtime.

Either way, my point is that, since the localization info is defined
somewhere, there should be a way to extract and display that information.
The command "locale", with its various options, seems unable to provide the
kind of information I'm interested in (unless I overlooked something, which
is entirely possible).

$ locale -c LC_COLLATE -k
LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="UTF-8"

To anybody who might reply: if you feel that the discussion is OT on
comp.lang.awk, feel free to reply on comp.unix.shell or whatever group you
deem appropriate. Thanks.

-- 
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.



