[nycphp-talk] enforcing Latin-1 input

Mikko Rantalainen mikko.rantalainen at peda.net
Wed Nov 23 10:39:07 EST 2005


Allen Shaw wrote:
> Mikko Rantalainen wrote:
> 
>>The problem is that you cannot accurately tell different 8-bit 
>>encodings apart. Latin-1 (iso-8859-1) and Latin-9 (iso-8859-15) 
>>text may contain identical byte sequences yet carry different 
>>content, so you have no way to know which one the user intended 
>>to use.
> 
>>Some 8-bit encodings have different *probabilities* for different 
>>byte sequences, so you could make an educated guess about which 
>>encoding the user agent really used. That would still be just a guess.
>>
>>The way I do it is that I send the HTML with UTF-8 encoding (I also 
>>have <form accept-charset="UTF-8" ...> in case some user agent 
>>supports that; most user agents just use the same encoding as the 
>>page containing the form) and I check that the user input is a 
>>valid UTF-8 byte sequence. [snip...]
> 
> I'm very curious how you test this.

I'm using the following function:

function isValidUTF8String($Str)
{
	# In a correct UTF-8 stream every single-byte character starts
	# with a zero bit; the first byte of a multi-byte character has
	# <length of sequence> high bits set, and all following bytes
	# have their highest bits set to 10.
	$len = strlen($Str);
	for ($i = 0; $i < $len; $i++)
	{
		$c = ord($Str[$i]);
		if ($c < 0x80) continue;              # 0bbbbbbb
		else if (($c & 0xE0) == 0xC0) $n = 1; # 110bbbbb
		else if (($c & 0xF0) == 0xE0) $n = 2; # 1110bbbb
		else if (($c & 0xF8) == 0xF0) $n = 3; # 11110bbb
		else if (($c & 0xFC) == 0xF8) $n = 4; # 111110bb
		else if (($c & 0xFE) == 0xFC) $n = 5; # 1111110b
		else return false;                    # invalid lead byte
		# Verify that $n continuation bytes of the form 10bbbbbb
		# follow; failing this test means the sequence is truncated
		# or malformed. (Note: this check alone does not reject
		# "overlong" UTF-8 encodings of a character.)
		for ($j = 0; $j < $n; $j++)
			if ((++$i == $len)
			    || ((ord($Str[$i]) & 0xC0) != 0x80))
				return false;
	}
	# Couldn't find any errors, so it's probably valid UTF-8 data.
	return true;
}

I put all user input through that function, and if the input isn't 
a valid UTF-8 string, then all of it goes through a Latin-1 to 
UTF-8 conversion (utf8_encode() helps here).
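
For example, a minimal sketch of that normalization step (the 
helper name normalizeToUTF8() and the form field name are mine, 
purely for illustration):

function normalizeToUTF8($input)
{
	# If the bytes already form valid UTF-8, keep them; otherwise
	# assume Latin-1 and convert. utf8_encode() maps every
	# iso-8859-1 byte to its UTF-8 form.
	return isValidUTF8String($input) ? $input : utf8_encode($input);
}

$comment = normalizeToUTF8($_POST['comment']);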

You could do additional checking for encodings other than Latin-1 
when the UTF-8 test fails, but I don't think it's worth the effort.
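
If you did want that extra guessing, the mbstring extension's 
mb_detect_encoding() would be one way to sketch it (requires 
ext/mbstring, and the result is still only an educated guess):

# Since iso-8859-1 assigns a character to every byte value, it
# matches any input; in practice this mostly tells you "valid
# UTF-8 or not", which is exactly why such guessing is weak.
$guess = mb_detect_encoding($input, array('UTF-8', 'ISO-8859-1'), true);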

> Also, I'm continuing to read more on all of this (and cripes, there's a 
> lot to read...), but just so I don't lose momentum here, I want to ask 
> what you think of this half-baked idea:
> 
> A form on a document with iso-8859-1 encoding will apparently (according 
> to a few quick tests) encode its user input into Latin-1 also.  If I put 
> something else in there, say that Japanese string I gave you, it gets 
> encoded into 
> "&#22823;&#38442;&#24066;&#28010;&#36895;&#21306;&#12398;&#12510;
> &#12531;&#12471;&#12519;&#12531;"

You cannot trust that behavior. The specification only says (IIRC) 
that the user agent MUST NOT send characters outside iso-8859-1 on 
such a form. MSIE is known to automagically convert from its 
internal character mapping to this SGML entity representation, but 
there's one major problem with it: it doesn't differentiate in any 
way whether the user typed such data verbatim or whether the result 
came from the user agent's automatic conversion. The only rationale 
is that numeric entities use only US-ASCII characters, which should 
always be safe to transmit.

The fact that you cannot differentiate between user input and the 
automatic conversion is just one reason why this conversion is not 
a good idea. Another is that the user agent has no idea whether the 
input is going to be rendered as HTML. If it goes into a database 
and then gets printed on a ticket, for example, the user will be 
really surprised to see such code inside his surname.

User agents that behave according to the spec are expected to send 
a literal "?" for every character that cannot be sent in the 
current encoding, or to prompt the user to decide what to do, or to 
disallow input of any character that cannot be transferred at all.

The only reasonably safe way to get the input to the server as the 
user intended is to use a UTF-8 encoded form. If somebody knows a 
still better way, I'd be interested to hear about it, too.


As a side note, I'd like to add that some sources have suggested 
embedding a hidden field that contains a known payload; the server 
then examines how that payload was encoded. For example:
<input type="hidden" name="test" value="x&#12531;x&#xE4;x&#28010;x" />

Note that the user agent is *supposed* to convert the above numeric 
character references to real characters and then submit those 
characters in the encoding it *really* uses. The server then checks 
whether the value of test is "xンxäx浪x". If not, it can test 
whether the value matches any other (incorrect) encoding of that 
string that it knows how to fix.
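
A minimal sketch of that server-side check; the expected value is 
simply the UTF-8 byte sequence for "xンxäx浪x":

# UTF-8 bytes: ン = E3 83 B3, ä = C3 A4, 浪 = E6 B5 AA
$expected = "x\xE3\x83\xB3x\xC3\xA4x\xE6\xB5\xAAx";
if (isset($_POST['test']) && $_POST['test'] === $expected) {
	# The user agent really submitted UTF-8; trust the other fields.
} else {
	# Compare against the byte sequences that other suspected
	# encodings would produce for this payload, and convert the
	# submission accordingly if one matches.
}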

In the real world, though, user agents that don't support UTF-8 
usually cannot represent characters outside iso-8859-1 at all, not 
even incorrectly.


I guess what I'm trying to tell you is that to *force* iso-8859-1 
input only, you're going to have to use UTF-8 for the form and 
UTF-8 internally. That's the only way to really end up with, in 
iso-8859-1 encoding, the same data the user actually tried to 
input.
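
To make that concrete, here's a minimal sketch of the whole 
pipeline, assuming the isValidUTF8String() function above and a 
hypothetical form field named surname:

header('Content-Type: text/html; charset=UTF-8'); # serve form as UTF-8

$raw = isset($_POST['surname']) ? $_POST['surname'] : '';
if (!isValidUTF8String($raw))
	$raw = utf8_encode($raw);  # broken input: assume Latin-1
$latin1 = utf8_decode($raw);       # UTF-8 -> iso-8859-1
# utf8_decode() turns characters outside Latin-1 into "?", so a "?"
# that wasn't in the raw input means the data cannot be represented
# in iso-8859-1 at all.
if (strpos($latin1, '?') !== false && strpos($raw, '?') === false)
	die('Input contains characters outside iso-8859-1.');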

-- 
Mikko


