Re: Determining possible encodings of a given text
by richard@[EMAIL PROTECTED] (Richard Tobin)
May 6, 2008 at 11:14 AM
In article <e065097e-3238-459f-8b2d-f432210000a7@[EMAIL PROTECTED]>,
Nordlöw <per.nordlow@[EMAIL PROTECTED]> wrote:
>How do I efficiently determine which possible encoding(s) a given text
>is in? Can I use the iconv.h api somehow?
What do you need to know?
If it doesn't contain any bytes above 127, it's probably ASCII. If it
contains lots of zeros in the even or odd positions, it's probably
UTF-16. If it contains bytes above 127 *and* they're consistent with
UTF-8, then it's almost certainly UTF-8. If it contains a small
proportion of bytes above 127, it's quite likely some ISO-Latin-N
encoding. I don't know much about Far Eastern encodings.
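
The heuristics above could be sketched in C roughly as follows. This is
only an illustration, not a tested detector: the function names
(guess_encoding, is_valid_utf8) and the two thresholds (a quarter of the
bytes being zero for "lots of zeros", a tenth being above 127 for "a small
proportion") are assumptions made up for the example. The zero-byte test
runs first because UTF-16 text made of ASCII characters also has no bytes
above 127. The UTF-8 test checks only lead and continuation byte structure;
it does not reject every overlong form or surrogate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Structural UTF-8 check: each lead byte must be followed by the right
   number of continuation bytes (10xxxxxx). Simplified on purpose. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80) { i++; continue; }               /* plain ASCII byte */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;    /* 2-byte sequence  */
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;    /* 3-byte sequence  */
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;    /* 4-byte sequence  */
        else return 0;                                 /* invalid lead byte */
        if (n - i <= extra) return 0;                  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;   /* not a continuation */
        i += extra + 1;
    }
    return 1;
}

/* Returns a guess as a static string. Thresholds are illustrative
   assumptions, not values from the post. */
const char *guess_encoding(const unsigned char *s, size_t n)
{
    size_t high = 0, zero_even = 0, zero_odd = 0;

    assert(s != NULL || n == 0);
    if (n == 0) return "unknown";

    for (size_t i = 0; i < n; i++) {
        if (s[i] > 127) high++;
        if (s[i] == 0) {
            if (i % 2 == 0) zero_even++;
            else zero_odd++;
        }
    }

    /* "Lots of zeros in the even or odd positions": assumed >= 25% of bytes. */
    if (zero_even * 4 >= n || zero_odd * 4 >= n) return "UTF-16";
    if (high == 0) return "ASCII";
    if (is_valid_utf8(s, n)) return "UTF-8";
    /* "A small proportion of bytes above 127": assumed <= 10% of bytes. */
    if (high * 10 <= n) return "ISO-8859-N";
    return "unknown";
}
```

Call it with the raw bytes and their length, for example
guess_encoding(buf, len) after reading a file into buf. The result is a
guess, not a proof: any 8-bit text that happens to be valid UTF-8 will be
reported as UTF-8, and the sketch cannot tell the ISO-Latin variants apart.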
You might look at http://jchardet.sourceforge.net/
-- Richard
--
:wq