[nycphp-talk] Character set issues revisited

Michael B Allen ioplex at gmail.com
Fri Oct 19 17:54:58 EDT 2007


On 10/19/07, Cliff Hirsch <cliff at pinestream.com> wrote:
> Thanks. This is helpful. Here's another interesting puzzle. Why does the
> page info in FireFox say encoding: UTF-8 while the Content-Type is
> charset=iso-8859-1.

Note that even if you set the Content-Type header to ISO-8859-1 it's
still very possible to generate page content in some other encoding.
For example, databases pretty much put out what you put in, so if you
have an "admin.php" putting data in as UTF-8 and all your other pages
are doing ISO-8859-1 with data from the DB, the data from the DB will
still be in UTF-8 and non-ASCII characters will not be rendered
properly.
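
To make the mismatch concrete, here is a rough sketch in Python (not
PHP, just because it's easy to run anywhere). The word "café" is only
an illustrative value; the point is what happens when UTF-8 bytes are
read back as ISO-8859-1.

```python
# Text as an admin page might have stored it: encoded as UTF-8.
stored = "café".encode("utf-8")      # b'caf\xc3\xa9' (the é takes two bytes)

# A page declared as ISO-8859-1 makes the browser decode those same
# bytes one byte per character.
shown = stored.decode("iso-8859-1")

print(shown)  # cafÃ©
```

The two bytes of the UTF-8 "é" show up as two separate Latin-1
characters, which is the classic symptom of this kind of mix-up.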

> Ah, I think I see it. The encoding is how the page was saved. And as usual,
> Microsoft butchers everything.

Actually if you right click on the page in FF and select "page info" I
think it should tell you the real encoding emitted by the server (the
content-type header).

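Note that the META content-type tag in a page is basically ignored
when the server sends a charset in the Content-Type header. It only
really matters when there is no such header, such as when the page is
read from disk.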
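
As a simplified model of that precedence (real browsers also look at
byte order marks and do some sniffing, which this ignores), here is a
Python sketch. The function names header_charset and effective_charset
are made up for illustration.

```python
from email.message import Message


def header_charset(content_type):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def effective_charset(content_type_header, meta_charset):
    """HTTP header charset wins; META is only a fallback (simplified)."""
    if content_type_header is not None:
        charset = header_charset(content_type_header)
        if charset:
            return charset
    return meta_charset.lower() if meta_charset else None


# Served over HTTP with a charset: the META value is not consulted.
print(effective_charset("text/html; charset=ISO-8859-1", "UTF-8"))
# Read from disk (no header at all): the META value is used.
print(effective_charset(None, "UTF-8"))
```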

Also, note that if you're serving static content, the web server has a
default Content-Type charset, which is often UTF-8. So if you have
some pages that are encoded in ISO-8859-1 and the web server default
is UTF-8, the web server will send Content-Type UTF-8 but the page
will actually still be encoded in ISO-8859-1 and will not be rendered
properly.
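
The reverse mismatch looks different on screen. A small Python sketch,
again using "café" as a made-up sample, of ISO-8859-1 bytes being
decoded as UTF-8:

```python
# A static page saved as ISO-8859-1: the é is the single byte 0xE9.
page = "café".encode("iso-8859-1")   # b'caf\xe9'

# Strict UTF-8 decoding rejects that byte sequence outright.
try:
    page.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# A browser does not give up; it substitutes U+FFFD, the replacement
# character, for the bytes it cannot decode.
rendered = page.decode("utf-8", errors="replace")

print(is_valid_utf8)   # False
print(rendered)
```

So in this direction you typically get replacement characters rather
than the doubled-up characters you get when UTF-8 is read as
ISO-8859-1.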

> But this is php -- the page is dynamically generated. So is the encoding
> picked up from my php script, index.php, or the template file index.tpl?

The browser will interpret the page based on the Content-Type
encoding. Period. But it's up to *you* to make sure the page is really
encoded in that encoding. It sounds like the script files are actually
in the wrong encoding or contain funky characters. The easiest way to
determine if that is the case is to run hexdump or a hex editor and
look at one of the script files with a non-ASCII character in it. If
that character is encoded with one byte, then the encoding is probably
ISO-8859-x. If the character is encoded with two or more bytes, it's
probably UTF-8.
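
If you'd rather not eyeball hex output, the same one-byte versus
multi-byte check can be roughed out in Python. This is only a
heuristic sketch: guess_encoding and hex_bytes are hypothetical helper
names, and a single-byte result can't tell ISO-8859-1 apart from other
single-byte encodings such as Windows-1252.

```python
def hex_bytes(data):
    """Show the bytes the way a hex dump would, e.g. 'c3 a9'."""
    return " ".join(f"{b:02x}" for b in data)


def guess_encoding(data):
    """Heuristic guess based on how the non-ASCII bytes are laid out."""
    if all(b < 0x80 for b in data):
        return "ascii"            # nothing non-ASCII to judge by
    try:
        data.decode("utf-8")      # valid multi-byte sequences?
    except UnicodeDecodeError:
        return "single-byte (probably ISO-8859-x)"
    return "utf-8"


for sample in ("é".encode("utf-8"), "é".encode("iso-8859-1"), b"plain"):
    print(hex_bytes(sample), "->", guess_encoding(sample))
```

To check a real script, read it in binary mode (for example
open("index.php", "rb").read(), with the filename taken from the
question above) and pass the bytes to guess_encoding.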

Note that if you have a lot of pages in the wrong encoding you might
want to look into the iconv utility found on *nix machines.
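
With iconv itself the usual form is
"iconv -f ISO-8859-1 -t UTF-8 old.php > new.php" (the filenames are
placeholders). If iconv isn't available, roughly the same conversion
can be sketched in Python. convert_file is a hypothetical helper, and
it assumes you already know the source encoding; it does not detect
it.

```python
from pathlib import Path


def convert_file(src, dst, from_enc="iso-8859-1", to_enc="utf-8"):
    """Re-encode one file, roughly like: iconv -f from_enc -t to_enc."""
    # Decode the raw bytes using the encoding the file is really in...
    text = Path(src).read_bytes().decode(from_enc)
    # ...then write the same text back out in the target encoding.
    Path(dst).write_bytes(text.encode(to_enc))
```

Writing to a separate destination file keeps the original around in
case the assumed source encoding turns out to be wrong.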

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


