09 January, 2007

Strength of Character(s)

When did it come to be that you need a degree in cryptography and computer science in order to build a simple form on a webpage?

I've been running into some issues lately upgrading our web software so that it can be used on international sites, and hence with international languages and character sets. It sounds easy enough, really. Translate the sites and they should be good, right? Well, not quite. As it turns out, there are some major conflicts between the character sets that get thrown around on the web. By default, a lot of things use the "Latin-1" or "ISO-8859-1" character encoding. It uses one byte per character, giving you a possibility of 256 different characters. This is all well and good for languages that fit inside those 256 characters. This isn't so good, however, when you get into other languages with characters outside that set, or when text shows up in a different encoding than the one you assumed... like, say, accented characters typed into a form.
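To make the "one byte, 256 characters" limit concrete, here is a small sketch in Python (not PHP, purely for illustration). The helper name fits_in_latin1 and the sample characters are made up for this example. It suggests that Latin-1 does have room for the common Western European accented letters, and that the trouble starts with anything outside its 256 slots.

```python
# Latin-1 (ISO-8859-1) maps every character to exactly one byte,
# so it can represent at most 256 distinct characters.

def fits_in_latin1(ch: str) -> bool:
    """Return True if the character can be encoded as a single Latin-1 byte."""
    try:
        return len(ch.encode("latin-1")) == 1
    except UnicodeEncodeError:
        # The character has no slot in the 256-entry Latin-1 table.
        return False

# Hypothetical sample characters, written as escapes to keep the file ASCII-safe.
samples = {
    "a": "a",
    "e with acute accent": "\u00e9",
    "n with tilde": "\u00f1",
    "euro sign": "\u20ac",
    "Cyrillic Zhe": "\u0416",
}

for label, ch in samples.items():
    print(f"{label}: fits in Latin-1? {fits_in_latin1(ch)}")
```

The first three report True and the last two report False, which is roughly where a Latin-1-only site starts to fall over.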

"But wait!" you might say, "Spanish uses accented characters!" Yes, that is correct. That's actually what got me started on this whole train of research today. A Spanish speaking user input an accented character on a form, which apparently we weren't set up to expect, it got put into an XML document as an "unknown character", the XML ended up failing validation because of this one little accent and things blew up.

So I started reading a bit. I know that there are a number of different character sets that can be used... and I know that "UTF-8" is the one that should be used for things of this nature... it uses a variable number of bytes per character (anywhere from one to four), allowing for a much wider range of characters to be represented. Perfect. Surely PHP will support this natively, right? PHP is mature enough that strings should just automagically be UTF-8 and it should just work. Right? Wrong. To quote PHP's strings page:

In PHP, a character is the same as a byte, that is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some Unicode support.

Lovely. Apparently this isn't so easy. Now I'm stuck reading an entire treatise on character sets in web forms. What the hell? What makes this all so ridiculously difficult? It's obviously something that every web developer is going to have to deal with. Why not just make it WORK, out of the box, no effort? What is so bloody difficult about that? Well, from what I've been reading, this document I've found is apparently the one that should help me figure all of it out. Then I'll need to read up on the UTF-8 support in PHP, as well as its multibyte string functions... at least I think I'll need those.
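The "a character is the same as a byte" problem from the PHP quote is easy to see in miniature. The sketch below uses Python rather than PHP, purely as an illustration. The latin1_to_utf8 helper is a made-up stand-in for the kind of conversion PHP's utf8_encode() performs (ISO-8859-1 bytes in, UTF-8 bytes out) and is not PHP's implementation.

```python
text = "se\u00f1or"  # "senor" with a tilde on the n: five characters

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# A byte-oriented length function counts bytes, not characters.
print("characters:", len(text))
print("bytes as UTF-8:", len(utf8_bytes))
print("bytes as Latin-1:", len(latin1_bytes))

def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode that text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

converted = latin1_to_utf8(latin1_bytes)
print("converted matches UTF-8 encoding:", converted == utf8_bytes)
```

The tilde-n takes one byte in Latin-1 and two in UTF-8, so anything that measures or slices strings by bytes will be off by one here. That mismatch is the gap multibyte-aware string functions are meant to cover.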

This is just stupid.
