Guide to Unicode, Part 1

Unicode, as some of you may know, is a universal character set comprising most of the world’s characters. Since version 1.1, the Unicode standard has remained fully compatible with ISO/IEC 10646: Universal Multiple-Octet Coded Character Set. The ISO/IEC 10646 standard defines a character repertoire and character code points (or code positions), as well as two character encodings, UCS-2 and UCS-4, the latter allowing for up to 2³¹ code points. The Unicode standard imposes further restrictions, however, so the total number of code points is only 1,114,112. The details of why this is the case will not be covered in this guide.

The Unicode standard further defines character encodings (UTF-8, UTF-16 and UTF-32), and is a restricted subset of the ISO/IEC 10646 standard. As a result, any conformant implementation of Unicode is also conformant with ISO/IEC 10646. However, due to the additional restrictions imposed by Unicode, the same is not necessarily true the other way around. Despite these differences, the most important point, at least for the purposes of this guide, is that the character sets defined by both standards are, code for code, identical in every way.

One thing that neither the Unicode standard nor ISO/IEC 10646 defines is the glyph (the visual representation) for each character. Although the Unicode specification does provide example glyphs for every character, glyphs from different fonts are expected to look very different.

Before I get into the details about the practical use of Unicode, there is an important distinction that must be made, in relation to HTML and character sets. The HTML 4.01 specification states, in section 5.1: The Document Character Set:

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

A Repertoire: A set of abstract characters, such as the Latin letter “A”, the Cyrillic letter “I”, the Chinese character meaning “water”, etc.

Code positions: A set of integer references to characters in the repertoire.

The document character set is different from the character encoding of the file and, in HTML, is defined to be ISO 10646, which (for the purposes of HTML) is equivalent to Unicode. The document character set is used for decoding numeric character references: the code point given refers to the Unicode code point for the character, not the code point within the document’s character encoding (unless the character encoding also happens to be a Unicode variant).
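To make that distinction a little more concrete, here is a rough sketch in Python (my choice of language for illustration only; the variable names are made up). It stores a numeric character reference in a file encoded as ISO-8859-1, an encoding that has no Greek letters at all, and shows that the reference still resolves to the Unicode code point. The standard library's html.unescape is used here as a stand-in for what a user agent does when it resolves references.

```python
import html

# The markup as typed by the author: only ASCII characters appear in the file.
markup = "<p>Pi is written as &#960;</p>"

# Character encoding: how the file's characters are stored as bytes.
# ISO-8859-1 cannot represent the Greek letter pi, yet the markup encodes fine
# because the reference itself is made of plain ASCII characters.
file_bytes = markup.encode("iso-8859-1")

# A user agent first decodes the bytes using the declared character encoding...
decoded = file_bytes.decode("iso-8859-1")

# ...and then resolves numeric references against the document character set
# (Unicode), not against the file's character encoding.
resolved = html.unescape(decoded)

pi = html.unescape("&#960;")
print(resolved)
print(f"U+{ord(pi):04X}")
```

The point of the sketch is only that the number in the reference is looked up in Unicode, whatever encoding the bytes of the file happen to use.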

Confusing the two is a common mistake, most often seen with character references to Windows-1252 code points in the range from 128 to 159 (e.g. &#147; for a left double quotation mark). In Unicode, these code points are reserved for control characters, and references to them are invalid in HTML. The character encoding, on the other hand, refers to the actual encoding of the characters in the file. This is most often ISO-8859-1 or (sadly) Windows-1252 (though it is often, and usually incorrectly, declared as ISO-8859-1 anyway).
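If you have a page full of such references, something along these lines could work out the correct replacements. This is a minimal sketch in Python, assuming the number in the reference was meant as a Windows-1252 byte value; corrected_reference is a hypothetical helper name of my own, not part of any library.

```python
import unicodedata


def corrected_reference(win1252_value: int) -> str:
    """Return the Unicode decimal reference for a Windows-1252 byte value."""
    # Decode the single byte as Windows-1252 to find the intended character.
    # Some values in 128-159 are unassigned in Windows-1252, and decoding
    # them with Python's cp1252 codec raises UnicodeDecodeError.
    character = bytes([win1252_value]).decode("cp1252")
    return f"&#{ord(character)};"


# 147 is the left double quotation mark in Windows-1252...
print(corrected_reference(147))

# ...but in Unicode, code point 147 (U+0093) is a control character ("Cc").
print(unicodedata.category(chr(147)))
```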

Regardless of the character encoding of the file, because the document character set is Unicode, it is always possible to include any character you wish in your document using numeric character references, as long as it exists in the Unicode character repertoire. To do so, it is only necessary to know the code point of the character, and to use either the decimal or hexadecimal numeric character reference. To actually view the character, the user agent must have access to a font containing the appropriate glyph.

For example, characters such as the em-dash (—) or the left (“) and right (”) double quotation marks may be encoded in hexadecimal as &#x2014;, &#x201C; and &#x201D;, or in decimal as &#8212;, &#8220; and &#8221;, respectively.
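Converting a character into its two reference forms is simple enough to script. The following Python sketch shows one way to do it; numeric_references is a hypothetical helper of my own naming, and the three characters are the ones from the example above.

```python
def numeric_references(character: str) -> tuple[str, str]:
    """Return the (hexadecimal, decimal) numeric character references."""
    code_point = ord(character)
    return f"&#x{code_point:X};", f"&#{code_point};"


# Em-dash, left double quotation mark, right double quotation mark.
for ch in ("\u2014", "\u201C", "\u201D"):
    hex_ref, dec_ref = numeric_references(ch)
    print(f"U+{ord(ch):04X}  {hex_ref}  {dec_ref}")
```

Both forms refer to the same code point, so which one you use is purely a matter of taste.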

It is also possible to use the named character entity references defined in the HTML DTD, which are also mapped to their respective characters in the Unicode character repertoire, but for the purposes of this guide, they will be ignored.
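Just to show that mapping once before moving on: Python's standard library happens to ship the HTML 4 entity table as html.entities.name2codepoint, so a quick sketch like this (the choice of the three names is mine) lists the code point behind each name.

```python
from html.entities import name2codepoint

# Each named entity is just another way of writing a Unicode code point.
for name in ("mdash", "ldquo", "rdquo"):
    code_point = name2codepoint[name]
    print(f"&{name};  ->  U+{code_point:04X}  (&#{code_point};)")
```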

So, one question you’re probably asking (assuming you’re one of the many who don’t already know the answer) is: how do I find the character I want, and what is its code point? Well, that’s easy, since all the characters in the Unicode character repertoire are listed in the Unicode Code Charts, grouped into 124 categories and ordered by code point value. The only problem is that they’re PDF files, which may take a while to load, but never fear: there are easier ways, which will be discussed later. However, first things first…

Some of the category names may not make it obvious which characters the group contains, but if you know what character you’re looking for, it’s usually possible to narrow the field down to 2 or 3 possibilities. Take, for example, the Greek letter and mathematical symbol Pi (π), used to represent the number 3.1415926535897932384626433832795… Take a look at the names of the code charts, and narrow it down to a few possibilities.

For those of you that didn’t bother to look for yourself, or to verify your guesses for those of you that did, I think it can be reasonably assumed that the character we’re looking for will exist in either Greek and Coptic, Greek Extended, Miscellaneous Mathematical Symbols-A or Miscellaneous Mathematical Symbols-B. Before reading the next paragraph, take a look through each of them to see if you can find the character. Skim through both the table showing all the glyphs for the characters, and the list of names and descriptions following the table.

If you followed instructions, then you may have found the characters for Pi in the Greek and Coptic category, but which one are we interested in? There is both a capital letter Pi (U+03A0, Π) and a small, lowercase letter Pi (U+03C0, π). If you read the descriptions, you should have noticed the bullet points following the character name. The description for the lowercase letter mentions the math constant we are interested in, and therefore, that is the character we are after.
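The character names from the code charts are also available programmatically. As a small sketch, Python's unicodedata module can confirm the names of the two candidates and look a character up by its exact name. Note that it only knows the formal character names, not the explanatory bullet points printed in the charts.

```python
import unicodedata

# Print the formal Unicode name of each candidate character.
for character in ("\u03A0", "\u03C0"):
    print(f"U+{ord(character):04X}  {unicodedata.name(character)}")

# Going the other way: find the character from its exact name.
small_pi = unicodedata.lookup("GREEK SMALL LETTER PI")
print(f"U+{ord(small_pi):04X}")
```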

Having found the character, all that is left is to write the character reference in the HTML file using either hexadecimal (&#x3C0;) or decimal (&#960;) format. If you create a small HTML file containing that character reference, then you should (assuming your computer has a font with the glyph available) see the character displayed like this: π. If not, you will see a question mark, box or other placeholder that your user agent uses. Try this with any character you like, to get a feel for finding characters and writing the character references for them.
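If you'd rather not type the test file by hand each time, a sketch like the following generates one. It is only an illustration under my own assumptions: build_test_page is a hypothetical helper, and the HTML 4.01 Strict doctype is simply the one I picked for the example.

```python
def build_test_page(character: str) -> str:
    """Build a small HTML page showing both reference forms for a character."""
    code_point = ord(character)
    return (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">\n'
        "<html>\n"
        "<head><title>Character reference test</title></head>\n"
        "<body>\n"
        f"<p>Hexadecimal: &#x{code_point:X};</p>\n"
        f"<p>Decimal: &#{code_point};</p>\n"
        "</body>\n"
        "</html>\n"
    )


# The page itself contains only ASCII, so it can be saved in any
# ASCII-compatible character encoding and opened in a browser.
page = build_test_page("\u03C0")
print(page)
```

Save the output to a file, open it in your browser, and both paragraphs should show the same character.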

As you’ve probably already figured out, searching through the PDF files all the time is very time-consuming. The inquisitive minds among you will have noticed the character names index provided, which I’ll leave for you to explore in your own time. It’s too boring for me to walk you through it.

A much faster way, as I’m sure anyone can guess, is to use a search engine. Well, thanks to Hixie, you can do just that with his Character Finder. Another useful feature is that it also calculates the decimal, octal and binary representations of the code point for you, though it’s not hard to do for yourself with a calculator anyway.
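For the curious, both features are easy to approximate locally. The Python sketch below is my own rough stand-in and says nothing about how the Character Finder actually works: find_characters and representations are hypothetical helpers, and the search simply scans the character names known to Python's unicodedata module.

```python
import sys
import unicodedata


def find_characters(fragment: str) -> list[int]:
    """Return code points whose Unicode name contains the given fragment."""
    fragment = fragment.upper()
    matches = []
    for code_point in range(sys.maxunicode + 1):
        # Unnamed code points (controls, unassigned, surrogates) yield "".
        name = unicodedata.name(chr(code_point), "")
        if fragment in name:
            matches.append(code_point)
    return matches


def representations(code_point: int) -> dict[str, str]:
    """Return the code point written in several number bases."""
    return {
        "hexadecimal": format(code_point, "X"),
        "decimal": format(code_point, "d"),
        "octal": format(code_point, "o"),
        "binary": format(code_point, "b"),
    }


for cp in find_characters("greek small letter pi"):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), representations(cp))
```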

The Windows Character Map tool also provides some simple search facilities, and is also good for finding some characters quickly, but it’s not perfect either, and only searches within the currently selected font. If you’re using Windows, I’ll leave the character map for you to explore in your own time, for now (though, it will be revisited later in a future part of this guide).

So, now that you have a brief understanding of Unicode, the character repertoire and code points, and also know how to use those characters with character references, the next thing to learn about is character encodings: in particular, using UTF-8, UTF-16 or UTF-32, and inserting the characters directly into your file without having to use a character reference. All that and more will be explained in the Guide to Unicode, Part 2.

2 thoughts on “Guide to Unicode, Part 1”

matthom says:

2004-12-22 at 06:38

Excellent. I am very interested in learning more about unicode, and character encodings. Looking forward to more articles pertaining to this.

ssp says:

2005-01-04 at 06:39

For Mac users interested in exploring Unicode, I can recommend (our own and free) utility UnicodeChecker. Besides giving you the different encodings, it also has a ‘find’ feature which lets you search for characters by their names just as it is described in the text.


Lachy’s Log

If I start now, I'll be finished later!
