[nycphp-talk] Character set issues revisited
tedd
tedd at sperling.com
Thu Oct 25 10:49:21 EDT 2007
At 11:12 AM -0400 10/23/07, Michael B Allen wrote:
>On 10/23/07, tedd <tedd at sperling.com> wrote:
>> At 7:21 PM -0400 10/21/07, John Campbell wrote:
>> >The first thing to understand about character encoding is the overlap
>> >between UTF-8 and 8859-1. Below is a sample
>> >a - lower case a (Same in 8859-1 & UTF-8)
>> >à - a grave (Available in 8859-1 & UTF-8 but different values.)
>> >éí - Chinese character (Not in 8859-1, in UTF-8)
>>
>> A small clarification -- it's not really overlap,
>> but rather UTF-8 is a super-set containing 8859-1
>> like both contain ASCII.
>
>Well if you want to be pedantic about it, "overlap" is more accurate.
>UTF-8 is a multibyte encoding of the Unicode charset. ISO-8859-1 is a
>single byte encoding of the ISO-8859-1 charset. So yes, Unicode is a
>superset of ISO-8859-1 but the UTF-8 encoding of values above 0x7f are
>not the same.
>
>Mike
You are free to call it what you want.
True, the code-points for the ISO-8859-1 charset
above 0x7F (the M$ spin) are not the same as
UTF-* et al, but the glyphs are still included in
UTF-8 regardless of encoding differences -- is
that not true?
If this is true, then the term "overlap" would be
less correct than "super-set" because the two
sets do not overlap with respect to all
code-points -- but the larger one still contains
all the glyphs that the smaller one does (with the
exception of Apple's spin on that set, which
included adding their logo).
That's the reason I'm free to call one a super-set of the other.
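For anyone who wants to check what does and does not differ
between the two, here is a minimal sketch in Python 3. The helper
name describe() is only illustrative and not from this thread. It
prints the Unicode code point of a character together with the
bytes used for it in ISO-8859-1 and in UTF-8. As far as I can
tell, the code point is the same in both cases and only the byte
sequence changes once you go above 0x7F.

```python
# Compare how one character is represented in ISO-8859-1 and in UTF-8.
# describe() is a hypothetical helper written for this illustration.

def describe(ch: str) -> dict:
    """Return the Unicode code point and the encoded bytes (as hex)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "latin1": ch.encode("iso-8859-1").hex(),
        "utf8": ch.encode("utf-8").hex(),
    }

# "a" is plain ASCII; "\u00e9" is e with an acute accent (above 0x7F).
for ch in ("a", "\u00e9"):
    print(describe(ch))

# Decode every possible single byte as ISO-8859-1 and check that it
# lands on the Unicode code point with the same numeric value.
all_match = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)
print(all_match)
```

Under those assumptions, "a" comes out as the single byte 61 in
both encodings, while e-acute is the byte e9 in ISO-8859-1 and the
two bytes c3 a9 in UTF-8, even though both stand for U+00E9.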
I believe it's easier to explain char-sets and
code-points in terms of current Unicode standards
than it is to point out historical differences
that are diminishing in importance as more people
convert.
Cheers,
tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com