{"id":94,"date":"2005-11-09T15:28:09","date_gmt":"2005-11-09T15:28:09","guid":{"rendered":"http:\/\/lachy.id.au\/log\/2005\/11\/handling-character-encodings"},"modified":"2006-04-30T23:28:33","modified_gmt":"2006-04-30T23:28:33","slug":"handling-character-encodings","status":"publish","type":"post","link":"https:\/\/lachy.id.au\/log\/2005\/11\/handling-character-encodings","title":{"rendered":"Handling Character Encodings"},"content":{"rendered":"<p>Anyone who\u2019s ever written a form for user input and actually cares about ensuring\r\n\tthe correct character encoding is submitted has had trouble with users submitting\r\n\tWindows-1252 where ISO-8859-1 was expected.  Even if you were intelligent and\r\n\twere using a Unicode encoding like UTF-8 and accepting such input from your\r\n\tforms, there\u2019s still a problem with Trackbacks, since you have no control\r\n\tover what encoding they\u2019re sent in.<\/p>\r\n<p>This is commonly ignored by implementations, which results in invalid characters\r\n\tbeing used within HTML, and you end up with a few question marks (commonly shown as a U+FFFD\r\n\tReplacement Character by browsers) scattered around the text.<\/p>\r\n<p>Now there is a solution.  I\u2019ve written some PHP to first detect the most likely\r\n\tencoding as being either UTF-8, ISO-8859-1 or Windows-1252.  If it is UTF-8,\r\n\tnothing needs to be done with it.  If it\u2019s ISO-8859-1 or Windows-1252, we need\r\n\tto convert it to UTF-8.<\/p>\r\n\r\n<h3 id=\"handlecharencode-determine\">Determining the Encoding<\/h3>\r\n<p>The first three functions I\u2019ve written will allow you to determine what character\r\n\tencoding is used.  These are isUTF8(), isISO88591() and isCP1252(), and each returns\r\n\ttrue if the string validates as the respective encoding.  These work by using\r\n\ta regular expression that matches valid octet sequences for the encoding. 
\r\n\tThe regular expression for UTF-8 was adapted from the Perl code provided\r\n\tby the W3C in an <a href=\"http:\/\/www.w3.org\/International\/questions\/qa-forms-utf-8\">article\r\n\tabout multilingual forms<\/a>.<\/p>\r\n<p>My version is a little more restrictive than that, in that it will reject\r\n\tany character with a code point from 128 to 159 (U+0080 to U+009F).  Although these code points\r\n\tare valid in XML and can be validly encoded in UTF-8, they are Unicode control\r\n\tcharacters and they are invalid within HTML 4.  Additionally, the chances of\r\n\ta user legitimately submitting those characters are slim to nil, so it\u2019s better\r\n\tto reject them than try to convert them to something else.<\/p>\r\n<p>The ISO-8859-1 function works in the same way.  It too rejects characters\r\n\twith those code points, as it is far more likely that the user has submitted\r\n\tWindows-1252 than the control characters.<\/p>\r\n\r\n<h3 id=\"handlecharencode-convert\">Converting to UTF-8<\/h3>\r\n<p>In PHP, the utf8_encode() function can be used to convert from ISO-8859-1\r\n\tto UTF-8.  However, the real world forces us to handle ISO-8859-1 as Windows-1252,\r\n\tyet utf8_encode() assumes ISO-8859-1 and so maps the bytes 0x80 to 0x9F to\r\n\tcontrol characters rather than to the characters Windows-1252 assigns to them.<\/p>\r\n<p>Since Windows-1252 is effectively a superset of ISO-8859-1 (the two differ only\r\n\tin the range 0x80 to 0x9F), these can both be handled\r\n\tby the same function: utf8FromCP1252().  Internally, this makes use of the pre-existing\r\n\tutf8_encode() function.  
Afterwards, it searches the newly encoded UTF-8 string\r\n\tfor characters in the offending range of code points, remaps each one to its correct\r\n\tUnicode code point and re-encodes it.<\/p>\r\n<p>To do this, a second function is used, which accepts the Windows-1252 encoded\r\n\tcharacter, determines the code point, uses a lookup table in an array to find\r\n\tthe Unicode code point and then calls a third function to generate the UTF-8\r\n\tencoded character from that code point.<\/p>\r\n<p>The third function has been adapted from Anne van Kesteren\u2019s <a href=\"http:\/\/annevankesteren.nl\/2005\/05\/character-references\">Character\r\n\t\treferences to UTF-8 converter<\/a>, which was originally adapted from Henri Sivonen\u2019s <a href=\"http:\/\/hsivonen.iki.fi\/php-utf8\/\">UTF-8\r\n\tto Code Point Array Converter<\/a>.  The main difference with my version is that\r\n\tI renamed it and changed the variable names used to something a little more\r\n\tsensible.<\/p>\r\n\r\n<h3 id=\"handlecharencode-code\">Code and Demo<\/h3>\r\n<p>You can see it all in action on <a href=\"http:\/\/lachy.id.au\/dev\/2005\/11\/encoding-test\">the\r\n\t\tdemonstration page<\/a>. Enter some characters\r\n\tin the UTF-8 and the ISO-8859-1 forms and see how it flawlessly handles\r\n\tthe detection and conversion of your input into valid UTF-8 output. 
<a href=\"http:\/\/lachy.id.au\/dev\/2005\/11\/encoding-functions-source\">The\r\n\tsource code<\/a> is also available.<\/p>","protected":false},"excerpt":{"rendered":"PHP functions to automatically detect UTF-8, ISO-8859-1 and Windows-1252 input and convert to UTF-8 encoded output.","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[14,8],"tags":[],"_links":{"self":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/94"}],"collection":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/comments?post=94"}],"version-history":[{"count":0,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/94\/revisions"}],"wp:attachment":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/media?parent=94"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/categories?post=94"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/tags?post=94"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}