{"id":44,"date":"2004-12-24T14:15:27","date_gmt":"2004-12-24T14:15:27","guid":{"rendered":"http:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-2"},"modified":"2006-04-30T23:35:47","modified_gmt":"2006-04-30T23:35:47","slug":"guide-to-unicode-part-2","status":"publish","type":"post","link":"https:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-2","title":{"rendered":"Guide to Unicode, Part 2"},"content":{"rendered":"<p>When you write a document in one of the Unicode character encodings\r\n\t(<code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-8<\/code>,\r\n\t<code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-16<\/code> or\r\n\t<code><abbr title=\"UCS Transformation Format\">UTF<\/abbr>-32<\/code>),\r\n\tyou can use any character from any language that exists in the Unicode character\r\n\trepertoire all in the same file with no need to use <a href=\"http:\/\/www.w3.org\/TR\/html401\/charset.html#h-5.3\"\r\n\ttitle=\"HTML 4.01 Recommendation: HTML Document Representation - Character References\">HTML\r\n\tcharacter references<\/a>\tor other special escape sequences. This chapter assumes\r\n\tyou have read the <a href=\"http:\/\/lachy.id.au\/blogs\/log\/2004\/12\/guide-to-unicode-part-1\">Guide\r\n\tto Unicode, Part 1<\/a>; or you are at least familiar with\r\n\tthe concepts of character repertoires, code points, looking up Unicode characters\r\n\tand writing numeric character references for them in HTML. If not, take a look\r\n\tat part 1 and come back when you\u2019re ready.<\/p>\r\n<p>In part 1, I mentioned character encodings; but I didn\u2019t really discuss\r\n\twhat they are and how they relate to the character repertoire and the code\r\n\tpoints. 
A <a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/chars.html#encoding\"><dfn>character\r\n\tencoding<\/dfn><\/a> is a method of representing code points\r\nas a sequence of octets (or bytes).<\/p>\r\n<p>In the simplest case of encoding, each octet maps to an integer from 0 to\r\n\t255, which translates to a code point in the character repertoire for that encoding,\r\n\tas is the case for single-octet encodings like US-ASCII or the ISO-8859 series.\r\n\tHowever, for more complex character repertoires, such as Unicode, it is impossible\r\n\tto represent all the characters with only the 256 values available in a single\r\n\toctet, so a multiple-octet encoding is required.<\/p>\r\n<p>Some multi-octet encodings assign a fixed number of octets to every character,\r\n\twhile others use more complex algorithms to assign a variable number. For example,\r\n\t<code>UTF-32<\/code> assigns 4 octets (32 bits) to every character, while <code>UTF-8<\/code> assigns anywhere\r\n\tfrom 1 to 4 (the original design allowed sequences of up to 6 octets, but RFC 3629\r\n\trestricts <code>UTF-8<\/code> to a maximum of 4). The advantages and disadvantages of these different encoding methods\r\n\tare discussed in <a href=\"http:\/\/www.unicode.org\/versions\/Unicode4.0.0\/ch02.pdf\">section\r\n\t2.5, encoding forms<\/a> of the Unicode specification.<\/p>\r\n<p>The names of the many character encodings are registered with <a href=\"http:\/\/www.iana.org\/\">IANA<\/a>. Some of\r\n\tthe common character-encoding names include <code>ISO-8859-1<\/code>, <code>Windows-1252<\/code>, <code>Shift_JIS<\/code>\r\n\tand <code>Big5<\/code>. Many of the encodings also have various aliases and other information\r\n\tabout them, which can be looked up in the <a href=\"http:\/\/www.iana.org\/assignments\/character-sets\">IANA character set assignment list<\/a>.<\/p>\r\n<p>When the Unicode character repertoire was designed, the characters from many\r\n\tof the major character sets were incorporated and mapped to the Unicode code\r\n\tpoints. 
The <a href=\"http:\/\/www.unicode.org\/Public\/MAPPINGS\/\" title=\"Unicode Public Mappings\">mappings<\/a> for some of these character sets are available, and they map each character to\r\n\tand from its Unicode code point. This is important, as you will see later,\r\n\tbecause it means that text in those other character encodings can be converted to a\r\nUnicode encoding and back without any loss of information.<\/p>\r\n<p>To use any character encoding, it\u2019s not necessary to understand the algorithm\r\n\tused to encode and decode characters because that is the job of the editor \u2013\r\n\tbut when learning Unicode, it does help to have a basic understanding of the\r\n\tconcepts of multi-octet versus single-octet encodings, especially when debugging\r\n\tcharacter encoding problems, which will be discussed later in part 3.<\/p>\r\n<p>As mentioned previously, encoding a file using one of the Unicode encodings makes\r\n\tit possible to use any character without the need for character references\r\n\tor other special escape sequences. Using the real characters instead of character\r\n\treferences makes the file easier to read and can also significantly reduce\r\n\tthe size of the file, especially in cases where a lot of character references\r\n\twere needed (since it generally takes more octets to encode the character\r\n\treference than the <code>UTF-8<\/code> encoded character). There are many other reasons\r\n\tfor choosing Unicode, which I will discuss in part 3. But for now, it\u2019s time\r\n\tto start using Unicode.<\/p>\r\n<p>The first thing you\u2019ll need is an editor that supports Unicode character encodings\r\n\t\u2013 in particular, <code>UTF-8<\/code>. If you\u2019re using Windows 2000 or XP, then Notepad will\r\n\tdo the trick for most of these exercises. 
If not, or if you would like a slightly\r\n\tfancier editor anyway, then I find editors like <a href=\"http:\/\/www.wolosoft.com\/en\/superedi\/\">SuperEdi<\/a> or <a href=\"http:\/\/www.macromedia.com\/software\/dreamweaver\/\">Macromedia\r\n\tDreamweaver<\/a>\tto be quite good. If you\u2019re using a Mac or Linux, I\u2019m sure there are many choices\r\n\tavailable, though I am unfamiliar with those platforms and the editors available\r\n\tfor them. Take a look through the settings and\/or documentation for your editor\r\n\tand ensure that your file is being saved as <code>UTF-8<\/code> (not <code>UTF-16<\/code> or <code>UTF-32<\/code> at this\r\n\tstage). For Notepad users, this setting is in the Save As\u2026 dialog. For others,\r\n\tit may be there also, or in the Options\/Preferences\/Settings dialog. Note: If\r\n\tyour editor provides an option for whether or not to output the Byte Order Mark\r\n\t(<abbr title=\"Byte Order Mark\">BOM<\/abbr>), leave it enabled for now so that the <abbr title=\"Byte Order Mark\">BOM<\/abbr> is written. The <abbr title=\"Byte Order Mark\">BOM<\/abbr> will be discussed later,\r\n\tand the problems it can cause will be discussed in part 3.<\/p>\r\n<p>The first issue you\u2019re probably asking about is how to enter characters that\r\n\tdon\u2019t appear on your keyboard into the editor. It\u2019s a common question, and one\r\n\tthat I struggled with while I was still learning about Unicode. However, those\r\n\tof you with intuitive minds who have read part 1 of this guide have probably\r\n\tjust figured out why I went to so much effort to teach you about looking up\r\n\tcode points and writing character references in HTML as a method of outputting\r\n\tthe characters. While the main reason was to teach you about code points, it\u2019s\r\n\talso because one way to enter the characters that will work for all editors\r\nand platforms is to copy and paste them from your browser (or other source).<\/p>\r\n<p>Try it now.  
You may look up a few characters in Unicode that don\u2019t appear\r\n\ton your keyboard, create a small HTML file and generate them using character\r\n\treferences. Be sure to include random characters, including some from the <code>US-ASCII<\/code>\r\n\tsubset (from <code>U+0000<\/code> to <code>U+007F<\/code>) and others outside that range. Afterwards, open\r\n\tthe page in your browser and then copy and paste them into a new, plain text\r\n\t(not HTML) file in your editor. However, to save you some time and effort, here\r\n\tare some characters for you to copy and paste: <kbd>\u2018 \u2019 \u2014 \u03c0 \u00d7 { } \u00a9 \u4f48 \u0431<\/kbd>.\r\n\t(Include the spaces between the characters.)<\/p>\r\n<p>When you open the file in your browser, if the <abbr title=\"Byte Order Mark\">BOM<\/abbr> is present, the file will\r\n\tbe automatically detected as <code>UTF-8<\/code> in modern browsers and the characters will\r\n\tbe displayed correctly. Confirm that the browser is interpreting the file as\r\n\t<code>UTF-8<\/code> by looking at the character encoding options, which are commonly available\r\n\tfrom within the View menu. Configure your browser to interpret the file as <code>Windows-1252<\/code>\r\n\tor <code>ISO-8859-1<\/code> and you will notice that the string of characters you entered\r\n\twill become a mess of seemingly random characters. For example, using the characters\r\n\tI provided earlier, you should see: <samp>\u00ef\u00bb\u00bf\u00e2\u20ac\u02dc \u00e2\u20ac\u2122 \u00e2\u20ac\u201d \u00cf\u20ac \u00c3\u2014 { } \u00c2\u00a9 \u00e4\u00bd\u02c6 \u00d0\u00b1<\/samp><\/p>\r\n<p>This output represents the <code>UTF-8<\/code> encoded characters when interpreted as a\r\n\tsingle-octet encoding, thus each character in the output represents 1 octet\r\nin the file.<\/p>\r\n<p>Notice the first three characters: <samp>\u00ef\u00bb\u00bf<\/samp>.  These characters form\r\n\tthe <code>UTF-8<\/code> <abbr title=\"Byte Order Mark\">BOM<\/abbr>. 
If your attempt\r\n\tdid not show these characters, but the rest is the same, never mind \u2013 it just\r\n\tmeans that your editor omitted the <abbr title=\"Byte Order Mark\">BOM<\/abbr>. The <abbr title=\"Byte Order Mark\">BOM<\/abbr> is\r\n\tthe character <a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier?characters=&#239;&#187;&#191;\"><code>U+FEFF<\/code><\/a> \u2013\r\n\tthe <code>ZERO WIDTH NO-BREAK SPACE<\/code> (ZWNBSP).\r\n\t In <code>UTF-8<\/code>, the <abbr title=\"Byte Order Mark\">BOM<\/abbr> is optional\r\n\t (which is why some editors allow you to decide whether or not to output it). In <code>UTF-16<\/code>,\r\n\t however, it is needed so that the user agent can accurately determine the\r\n\t order of octets for each character. This will be discussed in more detail later\r\n\t in part 3.<\/p>\r\n<p>Because each character was separated by a space, you should be able to easily\r\n\tnotice that the number of octets used for each character in the file varied\r\n\tfrom 1 to 3 in this example. The characters from the <code>US-ASCII<\/code> subset appeared\r\n\tas single octets, but characters outside of this range appeared as 2 or more.\r\n\tThis is part of the design of <code>UTF-8<\/code> to help ensure compatibility with older\r\n\teditors and text processing software. Thus it is possible to view and edit <code>UTF-8<\/code>\r\n\tfiles relatively easily with editors that don\u2019t support <code>UTF-8<\/code>, especially where\r\n\tthe file consists mainly of characters from the <code>US-ASCII<\/code> subset. For\r\n\tobvious reasons, though, it becomes much harder where the file consists mainly of characters\r\noutside that range.<\/p>\r\n<p>If you would like to know exactly which characters were chosen, <a href=\"http:\/\/www.hixie.ch\/\">Ian\r\n\t\tHickson<\/a>\thas provided two tools to help you out. 
The first is the <a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/character-identifier\">character\r\n\t\tidentifier<\/a>.\r\n\tYou will have noticed this form when you looked at the character finder in part\r\n\t1. Copy and paste the first set of characters that I provided into the form\r\n\tand submit. The results provide the character names, code\r\n\tpoints and various other useful pieces of information. As you become more experienced\r\n\twith Unicode, and use it more often, I\u2019m sure you will find this tool\r\n\tinvaluable; I will leave it for you to explore and understand all\r\n\tthe useful information it provides, in your own time.<\/p>\r\n<p>The second is the <code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/utf8-decoder\">UTF-8<\/a><\/code><a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/utf8-decoder\"> Decoder<\/a>. This tool will decode encoded characters,\r\n\tsuch as the <code>Windows-1252<\/code> output I provided earlier. The results indicate which\r\n\tcharacters are represented. If you copy the sample <code>Windows-1252<\/code> output into\r\n\tthe <code>UTF-8<\/code> Decoder, and select \u201cUTF-8 interpreted as <code>Windows-1252<\/code>\u201d from the Input\r\n\tType list, then submit, the characters will be decoded for you and lots of useful\r\n\tinformation will be provided, much like the character identifier you looked\r\n\tat previously. To verify the characters were decoded correctly, compare the\r\n\tresults of the <code>UTF-8<\/code> decoder with those from the character identifier. 
Both lists\r\n\tof identified characters should contain the same character names, except\r\n\tfor the addition of the <abbr title=\"Byte Order Mark\">BOM<\/abbr> in the <code>Windows-1252<\/code> encoded form.<\/p>\r\n<p>As I mentioned in part 1, creating an HTML file, looking up the character\r\n\tand then writing the character reference can become very time-consuming, and\r\n\tthere are much faster and more convenient ways to generate the characters. Firstly,\r\n\tfor Windows users, the Character Map (usually available under Accessories or\r\n\tSystem Tools in the Start menu) provides a somewhat useful interface for browsing\r\n\tcharacters and fonts. In Windows 2000 and XP versions, the Character Map provides\r\n\tboth the character name and the Unicode code point for every character available\r\n\tin the selected font. In all versions of Windows, it also provides the <code>Windows-1252<\/code>\r\n\tcode point for those characters that exist in the <a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/www\/windows-chars.html\"\r\n\ttitle=\"Jukka Korpela: On the use of some MS Windows characters in HTML\"><code>Windows-1252<\/code> character\r\n\trepertoire<\/a>.<\/p>\r\n<p>The <code>Windows-1252<\/code> code point is used for the keystroke that takes the form <kbd>Alt+0<var>###<\/var><\/kbd>\r\n\t(where <var>###<\/var> represents the code point as a decimal number that needs\r\n\tto be entered on the numeric keypad of your keyboard). While it is obviously\r\n\tpossible to copy and paste characters from the Character Map, it is also possible,\r\n\tfor the characters in the <code>Windows-1252<\/code> character repertoire, to enter them using\r\n\tthe given keystroke without the need to even open the Character Map. 
This useful\r\n\tfeature will save you a lot of time when entering commonly used characters that\r\n\tdon\u2019t appear on your keyboard, but do appear in the <code>Windows-1252<\/code> character repertoire.<\/p>\r\n<p>Even though a character is entered using its <code>Windows-1252<\/code> code\r\n\tpoint, it is mapped to the Unicode code point using the mappings\r\n\tI mentioned previously. For example, the code point for the left single quotation\r\n\tmark in\r\n\t<code>Windows-1252<\/code> is 0x91 (decimal 145), which maps to the Unicode code\r\n\tpoint U+2018. This and all other <code>Windows-1252<\/code> characters are listed\r\n\tin the <code><a href=\"http:\/\/www.unicode.org\/Public\/MAPPINGS\/VENDORS\/MICSFT\/WINDOWS\/CP1252.TXT\">Windows-1252<\/a><\/code><a href=\"http:\/\/www.unicode.org\/Public\/MAPPINGS\/VENDORS\/MICSFT\/WINDOWS\/CP1252.TXT\"> mapping<\/a> from\r\n\tthe Unicode website.<\/p>\r\n<p><a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/\">Jukka Korpela<\/a> also provides a useful JavaScript application called <a href=\"http:\/\/www.cs.tut.fi\/%7Ejkorpela\/gwrite\/\">gwrite<\/a>\t\u2013 a virtual Unicode keyboard, from which you can select and copy many characters.\r\n\tFinally, I have reproduced Ian Hickson\u2019s very useful <a href=\"http:\/\/software.hixie.ch\/utilities\/cgi\/unicode-decoder\/\">Unicode\r\n\ttools<\/a> in my copy\r\n\tof the <a href=\"http:\/\/lachy.id.au\/blogs\/log\/2004\/10\/devedge-sidebar\">DevEdge\r\n\tSidebar<\/a>. 
I also added a character generator that generates\r\n\ta character when you type in its Unicode code point in decimal, hexadecimal\r\n\tor octal.<\/p>\r\n<p>Next, in <a href=\"http:\/\/lachy.id.au\/blogs\/log\/2005\/01\/guide-to-unicode-part-3\">part 3<\/a>, we will look at some of the issues caused by the <abbr title=\"Byte Order Mark\">BOM<\/abbr> and\r\n\tother difficulties with Unicode, as well as debugging some of the common problems.\r\n\tWe will also take a closer look at how the octets are encoded in <code>UTF-8<\/code>, and\r\n\thow to determine the exact octets used, which is useful when using a binary\r\n\teditor. In addition, we will look at <code>UTF-16<\/code> and <code>UTF-32<\/code> and discuss their advantages\r\n\tand disadvantages in relation to the web.<\/p>\r\n","protected":false},"excerpt":{"rendered":"When you write a document in one of the Unicode character encodings (UTF-8, UTF-16 or UTF-32), you can use any character from any language that exists in the Unicode character repertoire all in the same file with no need to use HTML character references or other special escape sequences. 
This chapter assumes you have read &hellip; <a href=\"https:\/\/lachy.id.au\/log\/2004\/12\/guide-to-unicode-part-2\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Guide to Unicode, Part 2<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[14,2,7],"tags":[],"_links":{"self":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/44"}],"collection":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/comments?post=44"}],"version-history":[{"count":0,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/posts\/44\/revisions"}],"wp:attachment":[{"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/media?parent=44"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/categories?post=44"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lachy.id.au\/log\/wp-json\/wp\/v2\/tags?post=44"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}