{"id":45,"date":"2004-12-19T21:40:21","date_gmt":"2004-12-19T21:40:21","guid":{"rendered":"http:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-1"},"modified":"2006-04-30T23:35:52","modified_gmt":"2006-04-30T23:35:52","slug":"guide-to-unicode-part-1","status":"publish","type":"post","link":"https:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-1","title":{"rendered":"Guide to Unicode, Part 1"},"content":{"rendered":"<p><a href=\"http:\/\/www.unicode.org\/\">Unicode<\/a>, as some of you may know, is\r\n\ta universal character set comprising most of the world\u2019s characters. Since\r\n\tversion 1.1, the Unicode standard has remained fully compatible with <abbr title=\"International Organisation for Standardisation\">ISO<\/abbr>\/<abbr title=\"International Electrotechnical Commission\">IEC<\/abbr> 10646: <cite>Universal\r\n\tMultiple-Octet Coded Character Set<\/cite>. The <abbr title=\"International Organisation for Standardisation\">ISO<\/abbr>\/<abbr title=\"International Electrotechnical Commission\">IEC<\/abbr> 10646\r\n\tstandard defines a <a href=\"http:\/\/www.cs.tut.fi\/~jkorpela\/chars.html#repertoire\"><dfn>character\r\n\trepertoire<\/dfn><\/a> and <a href=\"http:\/\/www.cs.tut.fi\/~jkorpela\/chars.html#code\"><dfn>character\r\n\tcode points<\/dfn><\/a> (or code positions), as well as two character encodings,\r\n\t<abbr title=\"Universal Character Set\">UCS<\/abbr>-2 and <abbr title=\"Universal Character Set\">UCS<\/abbr>-4,\r\n\tallowing for up to 2<sup>31<\/sup> code points. The Unicode standard\r\n\timposes further restrictions, though, so the total number of\r\n\tcode points is only 1,114,112. 
The details of why this is the case\r\n\twill not be covered in this guide.<\/p>\r\n<p> The Unicode standard further defines character encodings (<code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-8<\/code>, <code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-16<\/code>\r\n\tand <code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-32<\/code>), and is a restricted subset of the <abbr title=\"International Organisation for Standardisation\">ISO<\/abbr>\/<abbr title=\"International Electrotechnical Commission\">IEC<\/abbr> 10646 standard. As\r\n\ta result, any conformant implementation of Unicode is also conformant with\r\n\t<abbr title=\"International Organisation for Standardisation\">ISO<\/abbr>\/<abbr title=\"International Electrotechnical Commission\">IEC<\/abbr> 10646. However, due to the additional restrictions imposed by Unicode,\r\n\tthe same is not necessarily true the other way around. Despite these differences,\r\n\tthe most important point, at least for the purposes of this guide, is that\r\n\tthe character sets defined by both standards are, code for code, identical\r\n\tin every way. <\/p>\r\n<p> One thing that neither the Unicode standard nor <abbr title=\"International Organisation for Standardisation\">ISO<\/abbr>\/<abbr title=\"International Electrotechnical Commission\">IEC<\/abbr> 10646 defines is the\r\n\tglyph (the visual representation) for each character. Although\r\n\tthe Unicode specification does provide example glyphs for every character,\r\n\tit is expected that the glyphs from different fonts may look very different.<\/p>\r\n<p> Before I get into the details about the practical use of Unicode, there is\r\n\tan important distinction that must be made in relation to <abbr title=\"HyperText Markup Language\">HTML<\/abbr> and character\r\n\tsets. 
The <abbr title=\"HyperText Markup Language\">HTML<\/abbr> 4.01 specification states, in <a href=\"http:\/\/www.w3.org\/TR\/html401\/charset.html#h-5.1\">section\r\n\t5.1: The Document Character Set<\/a>:<\/p>\r\n\r\n<blockquote>\r\n\t<p>To promote interoperability, SGML requires that each application (including\r\n\t\t<abbr title=\"HyperText Markup Language\">HTML<\/abbr>) specify its document character set. A document character set consists\r\n\t\tof:<\/p>\r\n\t<ul>\r\n        <li>A Repertoire: A set of abstract characters, such as the Latin letter &#8220;A&#8221;,\r\n        \tthe Cyrillic letter &#8220;I&#8221;, the Chinese character meaning &#8220;water&#8221;, etc.<\/li>\r\n        <li>Code positions: A set of integer references to characters in the\r\n        \trepertoire.<\/li>\r\n    <\/ul>\r\n<\/blockquote>\r\n\r\n\r\n<p>The document character set is different from the character encoding of the\r\n\tfile, and, in HTML, is defined to be ISO 10646, which (for the purposes of HTML)\r\n\tis equivalent to Unicode. 
The document character set is used for decoding numeric\r\n\tcharacter references, and the code point given refers to the Unicode code point\r\n\tfor the character, not the code point within the document\u2019s character encoding\r\n\t(unless the character encoding also happens to be a Unicode variant).<\/p>\r\n<p>Confusing the two is a common mistake, most often seen with character\r\n\treferences made to Windows-1252 code points in the range from 128 to 159\r\n\t(e.g. <code>&amp;#147;<\/code>\r\n\tfor a left double quotation mark). In Unicode, these code points are assigned to\r\n\tcontrol characters, and references to them are invalid in HTML. On the other hand, the <a href=\"http:\/\/www.w3.org\/TR\/html401\/charset.html#h-5.2\">character\r\n\tencoding<\/a> refers to the actual encoding of the characters in the file.\r\n\tThis is most often <code>ISO-8859-1<\/code>, or (sadly) <code>Windows-1252<\/code> (though the latter is often\r\n\tincorrectly declared as <code>ISO-8859-1<\/code> anyway).<\/p>\r\n<p>Regardless of the character encoding of the file, because the document character\r\n\tset is Unicode, it is always possible to include any character you wish in\r\n\tyour document using numeric character references, as long as it exists in the\r\n\tUnicode character repertoire. 
To do so, it is only necessary to know the code\r\n\tpoint of the character, and to use either the decimal or hexadecimal numeric\r\n\tcharacter reference. To actually view the character, the user agent must have\r\n\taccess to a font that contains the appropriate glyph.<\/p>\r\n<p>For example, characters such as the em dash (<samp>\u2014<\/samp>) or the left (<samp>\u201c<\/samp>)\r\n\tand right (<samp>\u201d<\/samp>) double quotation marks may be written in hexadecimal\r\n\tas <code>&amp;#x2014;<\/code>, <code>&amp;#x201C;<\/code> and <code>&amp;#x201D;<\/code>, or in decimal as <code>&amp;#8212;<\/code>, <code>&amp;#8220;<\/code>\r\n\tand <code>&amp;#8221;<\/code>, respectively.<\/p>\r\n<p>It is also possible to use the named character entity references defined in\r\n\tthe <abbr title=\"HyperText Markup Language\">HTML<\/abbr> <abbr title=\"Document Type Definition\">DTD<\/abbr>, which are also mapped to their respective characters in the Unicode\r\n\tcharacter repertoire, but for the purposes of this guide, they will be ignored.<\/p>\r\n<p>So, one question you\u2019re probably asking (assuming you\u2019re one of the many who\r\n\tdon\u2019t already know the answer) is: how do I find the character I want,\r\n\tand what is its code point? Well, that\u2019s easy, since all the characters in the\r\n\tUnicode character repertoire are listed in the <a href=\"http:\/\/www.unicode.org\/charts\/\">Unicode\r\n\tCode Charts<\/a>, grouped\r\n\tinto 124 categories, and ordered by code point value. The only problem is\r\n\tthat they\u2019re PDF files, which may take a while to load, but never fear, there\r\n\tare easier ways, which will be discussed later. 
However, first things first\u2026<\/p>\r\n<p>The category names do not always make it obvious which\r\n\tcharacters each group contains, but if you know what character you\u2019re looking for,\r\n\tit\u2019s usually possible to narrow down the field to around 2 or 3 possibilities.\r\n\tTake, for example, looking for the Greek letter and mathematical symbol Pi (\u03c0),\r\n\tused to represent the number 3.1415926535897932384626433832795\u2026 Take a look\r\n\tat the names of the <a href=\"http:\/\/www.unicode.org\/charts\/\">code charts<\/a>, and narrow it down to a few possibilities.<\/p>\r\n<p>For those of you who didn\u2019t bother to look for yourselves, and to verify the\r\n\tguesses of those of you who did, I think it can be reasonably assumed that\r\n\tthe character we\u2019re looking for will exist in one of <a href=\"http:\/\/www.unicode.org\/charts\/PDF\/U0370.pdf\">Greek\r\n\tand Coptic<\/a>, <a href=\"http:\/\/www.unicode.org\/charts\/PDF\/U1F00.pdf\">Greek\r\n\tExtended<\/a>, <a href=\"http:\/\/www.unicode.org\/charts\/PDF\/U27C0.pdf\">Miscellaneous\r\n\tMathematical Symbols-A<\/a> or <a href=\"http:\/\/www.unicode.org\/charts\/PDF\/U2980.pdf\">Miscellaneous\r\n\tMathematical Symbols-B<\/a>. Before reading the next paragraph, take a look through each of\r\n\tthem to see if you can find the character. Skim through both the table showing\r\n\tall the glyphs for the characters, and the list of names and descriptions following\r\n\tthe table.<\/p>\r\n<p>If you followed instructions, then you may have found the characters for Pi\r\n\tin the Greek and Coptic category, but which one are we interested in? 
There\r\n\tare both a capital letter Pi (<code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier?characters=%CE%A0\">U+03A0<\/a><\/code> \u2013 \u03a0), and a small (lowercase) letter\r\n\tPi (<code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier?characters=%CF%80;\">U+03C0<\/a><\/code> \u2013 \u03c0). If you read the descriptions, you should have noticed\r\n\tthe bullet points following the character name. The description for the lowercase\r\n\tletter mentions the math constant we are interested in, and therefore that\r\n\tis the character we are after.<\/p>\r\n<p>Having found the character, all that is left is to write the character reference\r\n\tin the <abbr title=\"HyperText Markup Language\">HTML<\/abbr> file using either hexadecimal (<code>&amp;#x03C0;<\/code>) or decimal (<code>&amp;#960;<\/code>)\r\n\tformat. If you create a small <abbr title=\"HyperText Markup Language\">HTML<\/abbr> file containing that character reference,\r\n\tthen you should (assuming your computer has a font with the glyph available)\r\n\tsee the character displayed like this: <samp>&#x03C0;<\/samp>.  If not, you will see a question\r\n\tmark, box or other placeholder that your user agent uses. 
Try this with any\r\n\tcharacter you like to get a feel for finding characters and writing the character\r\n\treferences for them.<\/p>\r\n<p>As you\u2019ve probably already figured out, searching through the PDF files all\r\n\tthe time is very time-consuming, and the inquisitive minds among you\r\n\twill have noticed the <a href=\"http:\/\/www.unicode.org\/charts\/charindex.html\">character\r\n\tnames index<\/a> provided, which I\u2019ll leave for\r\n\tyou to explore in your own time \u2013 it\u2019s too boring for me to walk you through\r\n\tit.<\/p>\r\n<p>A much faster way, as I\u2019m sure anyone can guess, is to use a search engine.\r\n\tWell, thanks to <a href=\"http:\/\/www.hixie.ch\/\" title=\"Ian Hickson\">Hixie<\/a>,\r\n\tyou can do just that with his <a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier\">Character\r\n\tFinder<\/a>. Another\r\n\tuseful feature is that it also calculates the decimal, octal and binary representations\r\n\tof the code point for you, though it\u2019s not hard to do for yourself with a\r\n\tcalculator anyway.<\/p>\r\n<p> The Windows Character Map tool also provides some simple\r\n\tsearch facilities, and is good for finding some characters quickly,\r\n\tbut it\u2019s not perfect either, and only searches within the currently selected\r\n\tfont. 
If you&rsquo;re using Windows, I&rsquo;ll leave the character map for\r\n\tyou to explore in your own time for now (though it will be revisited\r\n\tin a future part of this guide).<\/p>\r\n<p>So, now that you have a brief understanding of Unicode, the character\r\n\trepertoire and code points, and also know how to use those characters with\r\n\tcharacter references, the next thing to learn about is <a href=\"http:\/\/www.cs.tut.fi\/~jkorpela\/chars.html#encoding\"><dfn>character\r\n\tencodings<\/dfn><\/a>, and in particular\r\n\tusing <code>UTF-8<\/code>, <code>UTF-16<\/code> or <code>UTF-32<\/code>, and inserting\r\n\tthe characters directly into your file without having to use a character reference.\r\n\tAll that and more will be explained in the <a href=\"http:\/\/lachy.id.au\/blogs\/log\/2004\/12\/guide-to-unicode-part-2\">Guide to Unicode, Part 2<\/a>.<\/p>\r\n","protected":false},"excerpt":{"rendered":"Unicode, as some of you may know, is a universal character set comprising most of the world\u2019s characters. Since version 1.1, the Unicode standard has remained fully compatible with ISO\/IEC 10646: Universal Multiple-Octet Coded Character Set. 
The ISO\/IEC 10646 standard defines a character repertoire and character code points (or code positions), as well as two &hellip; <a href=\"https:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-1\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Guide to Unicode, Part 1<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[14,2,7],"tags":[],"_links":{"self":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/45"}],"collection":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/comments?post=45"}],"version-history":[{"count":0,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/45\/revisions"}],"wp:attachment":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/media?parent=45"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/categories?post=45"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/tags?post=45"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}