{"id":57,"date":"2005-01-03T16:29:42","date_gmt":"2005-01-03T16:29:42","guid":{"rendered":"http:\/\/lachy.id.au\/log\/2005\/01\/guide-to-unicode-part-3"},"modified":"2006-04-30T23:49:59","modified_gmt":"2006-04-30T23:49:59","slug":"guide-to-unicode-part-3","status":"publish","type":"post","link":"https:\/\/lachy.id.au\/log\/2005\/01\/guide-to-unicode-part-3","title":{"rendered":"Guide to Unicode, Part 3"},"content":{"rendered":"<p>In the beginning we discussed character repertoires, code points and <abbr title=\"HyperText Markup Language\">HTML<\/abbr>\r\n\tcharacter references; together with their relationship to the Unicode standard.\r\n\tWe then looked at character encodings, examining the differences between\r\n\tsingle- and multiple-octet encodings, and how to create files in <code>UTF-8<\/code>. If you\r\n\tare unfamiliar with those concepts, I recommend that you read parts\r\n\t<a href=\"http:\/\/lachy.id.au\/blogs\/log\/2004\/12\/guide-to-unicode-part-1\" title=\"Guide to Unicode, Part 1\">1<\/a>\r\n\tand <a href=\"http:\/\/lachy.id.au\/blogs\/log\/2004\/12\/guide-to-unicode-part-2\" title=\"Guide to Unicode, Part 2\">2<\/a> \r\nof this guide first and then return to this section when you are ready.<\/p>\r\n<p>The saga continues in this third and final thrilling chapter, where we will\r\n\tlook at some of the problems encountered with the use of Unicode and, in particular,\r\n\t<code>UTF-8<\/code> and the <abbr title=\"Byte Order Mark\">BOM<\/abbr>. This will involve taking a look at the debugging of the most\r\n\tcommon problems and how the tools we\u2019ve looked at previously can help you with\r\n\tthis. Following this I will discuss the importance and practicalities of ensuring\r\n\tthat the character encoding is correctly declared. 
Finally I will discuss the\r\n\tpurpose of the <abbr title=\"Byte Order Mark\">BOM<\/abbr> and show the difference between the <code>UTF-16<\/code> and <code>UTF-32<\/code> variants:\r\n\t<em>Little Endian<\/em> and <em>Big Endian<\/em>.<\/p>\r\n<p>The biggest problem with using Unicode on the web is that not all editors\r\n\tsupport Unicode, although some authors don\u2019t realise this and declare the encoding\r\n\tas <code>UTF-8<\/code> anyway. This only becomes a problem when characters outside the <code>US-ASCII<\/code>\tsubset are used. However, since we have already covered the creation and editing\r\n\tof Unicode files, there is no need to discuss this problem further.<\/p>\r\n<p>The next major issue encountered is that user agents may not always display\r\n\tthe characters correctly. Ignoring the availability of Unicode fonts for now,\r\n\tpeople will often attempt to use <code>UTF-8<\/code>, yet find that some characters that they\r\n\thave used get turned into 2 or 3 seemingly random characters. After that they\r\n\ttend to give up and revert to <code>ISO-8859-1<\/code> (or <code>Windows-1252<\/code>) and use character\r\n\treferences for the characters outside of these repertoires, or <code><a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/www\/windows-chars.html#subst\">US-ASCII<\/a><\/code><a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/www\/windows-chars.html#subst\"> substitutes<\/a>.<\/p>\r\n<p>Those of you who have read part 2 should now recognise that the\r\n\tdisplayed characters are not random, but are in fact the multi-octet <code>UTF-8<\/code> encoded\r\n\tcharacters interpreted as a single-octet encoding. This error is usually caused\r\n\tby incorrectly declaring a single-octet encoding \u2014 most often <code>ISO-8859-1<\/code>. 
For\r\n\texample, the <abbr title=\"Byte Order Mark\">BOM<\/abbr> (if present) may be displayed as <samp>\u00ef\u00bb\u00bf<\/samp> when\r\n\ta <code>UTF-8<\/code> file is incorrectly declared as <code>ISO-8859-1<\/code> by the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> response headers\r\n\t(similar to the demonstration in part 2 where the character encoding was manually\r\n\toverridden).<\/p>\r\n<p>For any sequence of characters appearing incorrectly like this, you may use\r\n\t<a href=\"http:\/\/www.hixie.ch\/\">Ian Hickson<\/a>\u2019s <code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/utf8-decoder\">UTF-8<\/a><\/code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/utf8-decoder\"> decoder<\/a> to\r\n\tdetermine what the character is. Conversely, you may also use the <a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/utf8-encoder\"><code>UTF-8<\/code> encoder<\/a>\tto reveal both the <code>ISO-8859-1<\/code> representation\r\n\tas well as the hexadecimal octet values for any Unicode character.<\/p>\r\n<p>If you suspect that you are experiencing this kind of error, then the first\r\n\tthing to check is which encoding is being used by the user agent. Most popular\r\n\tUAs will allow you to view the character encoding, such as in Mozilla\u2019s Page\r\n\tInfo dialog or Opera\u2019s Info panel. The <a href=\"http:\/\/validator.w3.org\/detailed.html\" title=\"Extended Interface\">W3C\u2019s\r\n\tMarkUp Validator<\/a> in verbose mode\r\n\twill also let you know this information. If you find that the character encoding\r\n\tis incorrect, then it is then necessary to determine whence the user agent is\r\n\tacquiring this information and to correct the error. 
If, however, you find that\r\n\tthe character encoding is correct, then the problem is caused elsewhere and\r\n\twill be discussed later.<\/p>\r\n<p>Depending on the file format, the character encoding may be declared on a\r\n\tfile-by-file basis through, for example, the use of the\r\n\t<a href=\"http:\/\/www.w3.org\/TR\/html401\/struct\/global.html#edef-META\"><code>&lt;meta&gt;<\/code> element<\/a>\r\n\tin <abbr title=\"HyperText Markup Language\">HTML<\/abbr>,\tthe\r\n\t<a href=\"http:\/\/www.w3.org\/TR\/REC-xml\/#NT-XMLDecl\"><code>&lt;?xml?&gt;<\/code> declaration<\/a>\r\n\tin <abbr title=\"Extensible Markup Language\">XML<\/abbr> documents, the <a href=\"http:\/\/www.w3.org\/TR\/CSS21\/syndata.html#q23\"\r\n\ttitle=\"CSS 2.1 Recommendation: 4.4 CSS style sheet representation\"><code>@charset<\/code>\r\n\trule<\/a> in\r\n\tCSS or indicated by the presence of the <abbr title=\"Byte Order Mark\">BOM<\/abbr>.\r\n\tOn a server-wide basis it depends upon the server configuration and may be\r\n\tdeclared by the <code>charset<\/code> parameter of\r\n\tthe <code>Content-Type<\/code> field in the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> response\r\n\theaders. It may also be indicated by a referencing document using, for example,\r\n\tthe <a href=\"http:\/\/www.w3.org\/TR\/html401\/struct\/links.html#adef-charset\"><code>charset<\/code>\r\n\tattribute<\/a> in <abbr title=\"HyperText Markup Language\">HTML<\/abbr>\r\n\tor the encoding of the referencing document itself. Finally, in the absence\r\n\tof any of these indications, some specifications define the default encoding\r\n\tthat should be used \u2013 commonly <code>UTF-8<\/code>, <code>UTF-16<\/code>, <code>ISO-8859-1<\/code> or <code>US-ASCII<\/code>.<\/p>\r\n<p>The order of precedence for each applicable method is defined in the relevant\r\n\tspecification for the language being used (eg. 
<a href=\"http:\/\/www.w3.org\/TR\/html401\/charset.html#h-5.2.2\">Specifying\r\n\ta Character Encoding in <abbr title=\"HyperText Markup Language\">HTML<\/abbr> 4.01<\/a>),\r\n\toften beginning with the <code>Content-Type<\/code> header field having the\r\n\thighest precedence and ending with the charset attribute or encoding of the\r\n\treferencing document, having the lowest. Of course, it can be fun <a href=\"http:\/\/ln.hixie.ch\/?start=1037398795&amp;count=1\">when\r\n\tspecifications collide<\/a> on this issue, which illustrates why it is not only important to\r\n\tdeclare the character encoding appropriately, but to ensure that it is declared\r\n\tcorrectly.<\/p>\r\n<p>Most of those may be easily checked simply by opening the file and seeing\r\n\twhich are included. The only one you may have difficulty with is the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> headers\r\n\tsince these are not so readily viewable without the right tool. Thankfully,\r\n\tthere are several to choose from, ranging from <a href=\"http:\/\/www.mozilla.org\/\">Mozilla<\/a>, <a href=\"http:\/\/getfirefox.com\/\">Firefox<\/a>, or other browser\r\n\textensions like <a href=\"http:\/\/livehttpheaders.mozdev.org\/\">Live <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> Headers<\/a> to online tools like the <a href=\"http:\/\/cgi.w3.org\/cgi-bin\/headers\">W3C\u2019s <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> <code>HEAD<\/code> Service<\/a>.<\/p>\r\n<p>Once you have determined the method by which the encoding is being indicated,\r\n\tit is a simple matter of correcting the error (Correcting <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> Headers will\r\n\tbe discussed later). If more than one method is used (eg. 
the <abbr title=\"Extensible Markup Language\">XML<\/abbr> declaration,\r\n\tthe <code>meta<\/code> element and\/or the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> <code>Content-Type<\/code> header field), you should\r\n\talso ensure that all of them indicate the same encoding. For most document formats,\r\n\tthe recommended method is the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> headers. For <abbr title=\"Extensible Markup Language\">XML<\/abbr> documents (served\r\n\tas <abbr title=\"Extensible Markup Language\">XML<\/abbr>) however, since they are self-describing file formats, it is recommended\r\n\tthat the server be configured to omit the <code>charset<\/code> parameter from the <code>Content-Type<\/code>\r\n\theader and that the <code>&lt;?xml?&gt;<\/code> declaration be used instead.<\/p>\r\n<p>One thing to note: even if the character encoding is being correctly declared,\r\n\tthe use of the <abbr title=\"Byte Order Mark\">BOM<\/abbr> may still cause problems. Although many applications do support\r\n\tthe use of the <abbr title=\"Byte Order Mark\">BOM<\/abbr> (namely those that support <code>UTF-8<\/code> properly) there are still\r\n\tmany that don\u2019t (in particular, older web browsers) and require that the file\r\n\tbe saved with the <abbr title=\"Byte Order Mark\">BOM<\/abbr> omitted. The problem is that not every editor that supports\r\n\t<code>UTF-8<\/code> has an option to control the output of the <abbr title=\"Byte Order Mark\">BOM<\/abbr>. When the file is read\r\n\tby an application that does not support <code>UTF-8<\/code>, or the <abbr title=\"Byte Order Mark\">BOM<\/abbr>, these 3 bytes may\r\n\tbe interpreted as single-octet characters. For this reason, the W3C MarkUp\r\n\tValidator will issue a warning about its use, and it is recommended that the\r\n\t<abbr title=\"Byte Order Mark\">BOM<\/abbr> be omitted from <abbr title=\"HyperText Markup Language\">HTML<\/abbr> documents. 
(Note: this does not apply to <abbr title=\"Extensible Markup Language\">XML<\/abbr> or <abbr title=\"Extensible HyperText Markup Language\">XHTML<\/abbr>\r\n\tdocuments served as <abbr title=\"Extensible Markup Language\">XML<\/abbr>, since <abbr title=\"Extensible Markup Language\">XML<\/abbr> user agents are required to support <code>UTF-8<\/code>\r\n\tand <code>UTF-16<\/code> fully, including the <abbr title=\"Byte Order Mark\">BOM<\/abbr>.)<\/p>\r\n<p>To modify the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> headers,\r\n\tit is necessary to edit a server configuration file. For the <a href=\"http:\/\/httpd.apache.org\/\">Apache\r\n\tHTTPd server<\/a>, most web hosting providers will allow web content authors to use\r\n\t<a href=\"http:\/\/httpd.apache.org\/docs-2.0\/howto\/htaccess.html\">.htaccess files<\/a> for this and many other purposes. Other web servers may also\r\n\toffer similar abilities, though the method may be very different. Consult your\r\n\tserver\u2019s documentation or contact your web host for information about how\r\n\tto configure the character encoding correctly. However, for those of you\r\n\tusing an Apache server, it is simply a matter of including the appropriate\r\n\tdirectives in your .htaccess files or, if you have access to it (e.g. you\r\n\tare the server administrator), in the <a href=\"http:\/\/httpd.apache.org\/docs-2.0\/configuring.html\" title=\"Apache Configuration Files\">httpd.conf<\/a> file. (However, most web\r\n\thosting providers do not allow general access to that file.)<\/p>\r\n<p>The <code><a href=\"http:\/\/httpd.apache.org\/docs-2.0\/mod\/core.html#adddefaultcharset\">AddDefaultCharset<\/a><\/code> and\/or <code><a href=\"http:\/\/httpd.apache.org\/docs-2.0\/mod\/mod_mime.html#addcharset\">AddCharset<\/a><\/code> directives may be used. For example,\r\n\tyou may set the default charset to <code>UTF-8<\/code>, but still wish that some files use\r\n\t<code>ISO-8859-1<\/code>. 
These directives in either .htaccess or httpd.conf will accomplish\r\n\tthis:<\/p>\r\n\r\n<pre><code>AddDefaultCharset UTF-8\r\nAddCharset ISO-8859-1 .latin1<\/code><\/pre>\r\n\r\n<p>The <code>AddCharset<\/code> directive takes both a character encoding\r\n\tname and a file extension. For files encoded as <code>ISO-8859-1<\/code>, it\r\n\tsimply requires that the file contain a\r\n\t<code>.latin1<\/code> file extension. This feature is most useful when <a href=\"http:\/\/httpd.apache.org\/docs-2.0\/mod\/mod_negotiation.html\">Content\r\n\tNegotiation<\/a> is enabled, so that the <code>.latin1<\/code> file extension need not be included\r\n\tin the <abbr title=\"Uniform Resource Identifier\">URI<\/abbr>, but still takes effect to send the correct encoding.<\/p>\r\n<p>Besides the fact that some user agents do not support the <abbr title=\"Byte Order Mark\">BOM<\/abbr>,\r\n\tyou may still encounter problems with its use in files intended to be processed\r\n\tby some server-side technology, such as <abbr title=\"Recursive Acronym for PHP HyperText Processor\">PHP<\/abbr> or <abbr title=\"Java Server Pages\">JSP<\/abbr>.\r\n\tFor example, <a href=\"http:\/\/www.latext.com\/pm\/members\/profile_view_ind.php?id=1\">Pierre\r\n\tIgot<\/a> described the <a href=\"http:\/\/www.latext.com\/pm\/comments\/1278_0_1_0_C\/\" title=\"Unicode, WordPress, Panther Server and BBEdit: UTF-8 with or without BOM\">problem\r\n\the encountered<\/a> with the presence of the <abbr title=\"Byte Order Mark\">BOM<\/abbr> in\r\n\tWordPress <abbr title=\"Recursive Acronym for PHP HyperText Processor\">PHP<\/abbr> files.\r\n\tThis issue occurs because upon encountering non-whitespace character data,\r\n\tthe processor will assume that the content has begun, send out all <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> headers\r\n\tand begin to output the resultant file. 
If, for example, you then attempt\r\n\tto use the <a href=\"http:\/\/au2.php.net\/manual\/en\/function.header.php\"><abbr title=\"Recursive Acronym for PHP HyperText Processor\">PHP<\/abbr>\r\n\t<code>header()<\/code> function<\/a>, an error will be received indicating that\r\n\tit is too late to modify headers because the content has already begun.<\/p>\r\n<p>Other errors may not cause any server-side errors, and are thus harder to\r\n\tcatch, but they may result in invalid markup being transmitted by the server\r\n\tdue to <a href=\"http:\/\/www.w3.org\/TR\/REC-xml\/#dt-chardata\" title=\"Definition of Character Data in XML\"><dfn>character\r\n\tdata<\/dfn><\/a> appearing where character data is not allowed. For example,\r\n\tconsider an include file containing a set of <a href=\"http:\/\/www.w3.org\/TR\/html401\/struct\/links.html#edef-LINK\"><code>link<\/code> elements<\/a>\tto be included within the <a href=\"http:\/\/www.w3.org\/TR\/html401\/struct\/global.html#edef-HEAD\"><code>head<\/code> element<\/a>s of several <abbr title=\"(Extensible) HyperText Markup Language\">(X)HTML<\/abbr>\r\n\tfiles using some scripting language or <a href=\"http:\/\/httpd.apache.org\/docs-2.0\/howto\/ssi.html\"><abbr title=\"Server Side Includes\">SSI<\/abbr><\/a>:\r\n\tif the include file is encoded as <code>UTF-8<\/code> and\r\n\tbegins with the\r\n\t<abbr title=\"Byte Order Mark\">BOM<\/abbr>, yet the scripting language processor\r\n\tdoes not support <code>UTF-8<\/code> and therefore\r\n\tneither recognises the <abbr title=\"Byte Order Mark\">BOM<\/abbr> nor strips it\r\n\tfrom the output, it will end up being included within the <code>head<\/code> elements.\r\n\tSince a <code>head<\/code> element may\r\n\tnot contain character data, a markup validator will detect this and issue\r\n\tan appropriate error message.<\/p>\r\n<p>The exact error will differ depending on whether you are using <abbr title=\"HyperText Markup Language\">HTML<\/abbr> or <abbr title=\"Extensible 
HyperText Markup Language\">XHTML<\/abbr>,\r\n\tthough the cause will be the same. For <abbr title=\"HyperText Markup Language\">HTML<\/abbr>, the <abbr title=\"Byte Order Mark\">BOM<\/abbr> will\r\n\timplicitly end the <code>head<\/code> element,\r\n\tand begin the <a href=\"http:\/\/www.w3.org\/TR\/html401\/struct\/global.html#edef-BODY\"><code>body<\/code>\telement<\/a>. Thus, anything following will be treated\r\n\tas being within the <code>body<\/code> element, not the <code>head<\/code>, which may cause several errors\r\n\tdepending on the content. For <abbr title=\"Extensible HyperText Markup Language\">XHTML<\/abbr>, however, the <abbr title=\"Byte Order Mark\">BOM<\/abbr> will simply be treated\r\n\tas character data appearing within the head element, where no character data\r\n\tis allowed.<\/p>\r\n<p>The final problem that I have encountered is slightly more complicated and\r\n\tproved quite difficult to solve. While using <abbr title=\"Java Server Pages\">JSP<\/abbr> to\r\n\tdevelop a web site, I had encoded all files as <code>UTF-8<\/code>, omitting\r\n\tthe <abbr title=\"Byte Order Mark\">BOM<\/abbr>. The issue was that one include\r\n\tfile included the copyright symbol (<a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier?characters=&#169;\">U+00A9<\/a>),\r\n\twhich is encoded in <code>UTF-8<\/code> as the\r\n\toctets <code>C2 A9<\/code>. 
At first, it appeared as though the UA was interpreting the\r\n\tcharacter as a single octet encoding, thus displaying the characters <samp>\u00c2\u00a9<\/samp>.\r\n\tHowever, the <abbr title=\"HyperText Markup Language\">HTML<\/abbr> document was\r\n\tbeing correctly interpreted as <code>UTF-8<\/code>, since the\r\n\tcharacter encoding was being declared by the <abbr title=\"HyperText Transfer Protocol\">HTTP<\/abbr> <code>Content-Type<\/code> header\r\n\tfield.<\/p>\r\n<p>After much investigation, I found that if I encoded\r\n\tthe include file as <code>ISO-8859-1<\/code>, but the main file as <code>UTF-8<\/code>, the desired <code>UTF-8<\/code>\r\n\toutput was received. It turned out that the <abbr title=\"Java Server Pages\">JSP<\/abbr> processor, Apache Tomcat with\r\n\tJBoss, thought that the include file was to be interpreted as <code>ISO-8859-1<\/code> (the\r\n\tdefault for <abbr title=\"Java Server Pages\">JSP<\/abbr>); however the output was required to be <code>UTF-8<\/code>. Because of this,\r\n\tthe <abbr title=\"Java Server Pages\">JSP<\/abbr> processor was attempting to convert the character encoding of the include\r\n\tfile into <code>UTF-8<\/code> on the fly.<\/p>\r\n<p>Thus, when it encountered the octets <code>C2 A9<\/code>, it interpreted them as <code>ISO-8859-1<\/code>\r\n\tcharacters, which map to the Unicode characters: U+00C2 and U+00A9. These characters,\r\n\twhen encoded as <code>UTF-8<\/code>, form the octets <code>C3 82<\/code> and <code>C2\r\n\tA9<\/code>, respectively, which\r\n\tis the output I was receiving in the <abbr title=\"HyperText Markup Language\">HTML<\/abbr> document. 
I ended up solving this\r\n\tproblem by correctly informing the <abbr title=\"Java Server Pages\">JSP<\/abbr> processor that the include files were\r\n\talso encoded as <code>UTF-8<\/code>, and not the default <code>ISO-8859-1<\/code>.<\/p>\r\n<p>Up until now, we have looked at the <abbr title=\"Byte Order Mark\">BOM<\/abbr>, discussed how it is encoded in <code>UTF-8<\/code>\r\n\tand some of the problems it may cause; however we have not looked at its purpose.\r\n\tAs mentioned, the <abbr title=\"Byte Order Mark\">BOM<\/abbr> is optional in <code>UTF-8<\/code>; but it is required in <code>UTF-16<\/code> and\r\n\t<code>UTF-32<\/code> to <a href=\"http:\/\/www.w3.org\/TR\/REC-xml\/#sec-guessing-no-ext-info\" title=\"XML 1.0: F.1 Detection Without External Encoding Information\">indicate\r\n\tthe order of octets<\/a> for each character as either <a href=\"http:\/\/foldoc.doc.ic.ac.uk\/foldoc\/foldoc.cgi?little-endian\"><dfn>Little\r\n\tEndian<\/dfn><\/a>,\r\n\twhere the least significant byte appears first, and <a href=\"http:\/\/foldoc.doc.ic.ac.uk\/foldoc\/foldoc.cgi?big-endian\"><dfn>Big\r\n\tEndian<\/dfn><\/a> where the most\r\n\tsignificant byte appears first. For this guide, we will only look at <code>UTF-16<\/code>,\r\n\tbut a similar technique still applies to <code>UTF-32<\/code> documents.<\/p>\r\n<p> In <code>UTF-16LE<\/code> the <abbr title=\"Byte Order Mark\">BOM<\/abbr> (<a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier?characters=&#239;&#187;&#191;\">U+FEFF<\/a>) will appear in the file as the octet sequence\r\n\t<code>FF FE<\/code> (least significant byte first), but in <code>UTF-16BE<\/code> it appears as <code>FE FF<\/code> (most\r\n\tsignificant byte first). 
For <code>UTF-16<\/code>, because the reversed sequence U+FFFE is defined never\r\n\tto be a character in Unicode, that sequence of octets (<code>FF FE<\/code>) can be safely\r\n\tused to detect the encoding as <code>UTF-16LE<\/code>, and not a <code>UTF-16BE<\/code> file\r\n\tstarting with the character U+FFFE, or vice versa.<\/p>\r\n<p>Now that you have a rather more in-depth understanding of Unicode (including:\r\n\tthe character repertoire, code points and the various encoding forms; the\r\n\tability to create and edit Unicode-encoded files; and an understanding\r\n\tof several of the problems that may be encountered) it is time to go forth\r\n\tand prosper \u2014 to make use of Unicode to its full potential, to simplify the\r\n\tuse of non-<code>US-ASCII<\/code> characters and to help promote the <a href=\"http:\/\/www.w3.org\/International\/\"><abbr title=\"Internationalisation\">i18n<\/abbr>\r\n\tof the web<\/a>.  However,\r\n\tas with <em>everything else<\/em>, there\u2019s always more to be learned. Although\r\n\tit may seem that I have covered much, I\u2019m sure you\r\n\twill find that I have only just scratched the surface. 
So, to help you out,\r\n\tI\u2019ve compiled a short but comprehensive list of additional resources that\r\n\twill provide further information:<\/p>\r\n\t<ul>\r\n\t\t<li><a href=\"http:\/\/www.w3.org\/International\/questions\/#chars\"><abbr title=\"World Wide Web Consortium\">W3C<\/abbr> <abbr title=\"Internationalisation\">I18N<\/abbr> <abbr title=\"Frequently Asked Questions\">FAQ<\/abbr>:\r\n\t\t\tCharacters &amp; Encodings<\/a><\/li>\r\n\t\t<li><a href=\"http:\/\/www.w3.org\/International\/questions\/#chars\"><abbr title=\"World Wide Web Consortium\">W3C<\/abbr> <abbr title=\"Internationalisation\">I18N<\/abbr> Tutorial:\r\n\t\t\tCharacters &amp; Encodings<\/a><\/li>\t\r\n\t\t<li><a href=\"http:\/\/www.cs.tut.fi\/~jkorpela\/www.html#char\">Jukka Korpela: Character problems in Web authoring<\/a><\/li>\r\n\t<li><a href=\"http:\/\/www.cs.tut.fi\/~jkorpela\/chars\/\">Jukka Korpela: Characters and Encodings<\/a><\/li>\r\n\t<li><a href=\"http:\/\/ln.hixie.ch\/?start=1064324988&amp;count=1\">Ian Hickson: \r\nA crash course in UTF-8 mathematics<\/a><\/li>\r\n\t<li><a href=\"http:\/\/ln.hixie.ch\/?start=1066145333&amp;count=1\">Ian Hickson:\r\n\t\t\t The Absolute Minimum Every <q><cite>The Absolute Minimum Every\r\n\t\tSoftware Developer Absolutely, Positively Must Know About Unicode and Character\r\n\t\tSets (No Excuses!)<\/cite><\/q> Author Absolutely, Positively Must Know About Unicode\r\n\t\tand Character Sets (No Excuses!)<\/a><\/li>\r\n\t<\/ul>\r\n","protected":false},"excerpt":{"rendered":"In the beginning we discussed character repertoires, code points and HTML character references; together with their relationship to the Unicode standard. We then looked at character encodings, examining the differences between single- and multiple-octet encodings, and how to create files in UTF-8. 
If you are unfamiliar with those concepts, I recommend that you read parts &hellip; <a href=\"https:\/\/lachy.id.au\/log\/2005\/01\/guide-to-unicode-part-3\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Guide to Unicode, Part 3<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[14,7],"tags":[],"_links":{"self":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/57"}],"collection":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/comments?post=57"}],"version-history":[{"count":0,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/57\/revisions"}],"wp:attachment":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/media?parent=57"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/categories?post=57"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/tags?post=57"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}