Category Archives: Standards

Standards, protocols, recommendations and guidelines.

Guide to Unicode, Part 3

In the beginning we discussed character repertoires, code points and HTML character references, together with their relationship to the Unicode standard. We then looked at character encodings, examining the differences between single- and multiple-octet encodings, and how to create files in UTF-8. If you are unfamiliar with those concepts, I recommend that you read parts 1 and 2 of this guide first and return to this section when you are ready.

The saga continues in this third and final thrilling chapter, where we will look at some of the problems encountered with the use of Unicode and, in particular, UTF-8 and the BOM. This involves debugging the most common problems and seeing how the tools we looked at previously can help. Following that, I will discuss the importance and practicalities of ensuring that the character encoding is correctly declared. Finally, I will explain the purpose of the BOM and show the difference between the two byte orders used by UTF-16 and UTF-32: Little Endian and Big Endian.

The biggest problem with using Unicode on the web is that not all editors support Unicode, although some authors don’t realise this and declare the encoding as UTF-8 anyway. This only becomes a problem when characters outside the US-ASCII subset are used. However, since we have already covered the creation and editing of Unicode files, there is no need to discuss this problem further.

The next major issue encountered is that user agents may not always display the characters correctly. Ignoring the availability of Unicode fonts for now, people will often attempt to use UTF-8, yet find that some of the characters they have used turn into 2 or 3 seemingly random characters. After that, they tend to give up and revert to ISO-8859-1 (or Windows-1252), using character references or US-ASCII substitutes for the characters outside those repertoires.

For those of you that have read part 2, you should now recognise that the displayed characters are not random, but are in fact the multi-octet UTF-8 encoded characters interpreted as a single-octet encoding. This error is usually caused by incorrectly declaring a single-octet encoding — most often ISO-8859-1. For example, the BOM (if present) may be displayed as ï»¿ when a UTF-8 file is incorrectly declared as ISO-8859-1 by the HTTP response headers (similar to the demonstration in part 2 where the character encoding was manually overridden).

For any sequence of characters appearing incorrectly like this, you may use Ian Hickson’s UTF-8 decoder to determine what the character is. Conversely, you may also use the UTF-8 encoder to reveal both the ISO-8859-1 representation as well as the hexadecimal octet values for any Unicode character.

If you suspect that you are experiencing this kind of error, the first thing to check is which encoding is being used by the user agent. Most popular UAs will let you view the character encoding, such as in Mozilla’s Page Info dialog or Opera’s Info panel. The W3C’s MarkUp Validator in verbose mode will also report this information. If you find that the character encoding is incorrect, it is then necessary to determine whence the user agent is acquiring this information and to correct the error. If, however, you find that the character encoding is correct, then the problem lies elsewhere and will be discussed later.

Depending on the file format, the character encoding may be declared on a file-by-file basis through, for example, the use of the <meta> element in HTML, the <?xml?> declaration in XML documents, the @charset rule in CSS or indicated by the presence of the BOM. On a server-wide basis it depends upon the server configuration and may be declared by the charset parameter of the Content-Type field in the HTTP response headers. It may also be indicated by a referencing document using, for example, the charset attribute in HTML or the encoding of the referencing document itself. Finally, in the absence of any of these indications, some specifications define the default encoding that should be used – commonly UTF-8, UTF-16, ISO-8859-1 or US-ASCII.
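
To illustrate, these are the typical forms each of those declarations takes for UTF-8:

Content-Type: text/html; charset=UTF-8                                 (HTTP header field)
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">    (HTML)
<?xml version="1.0" encoding="UTF-8"?>                                  (XML declaration)
@charset "UTF-8";                                                       (CSS)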

The order of precedence for each applicable method is defined in the relevant specification for the language being used (eg. Specifying a Character Encoding in HTML 4.01), often beginning with the Content-Type header field having the highest precedence and ending with the charset attribute or encoding of the referencing document, having the lowest. Of course, it can be fun when specifications collide on this issue, which illustrates why it is not only important to declare the character encoding appropriately, but to ensure that it is declared correctly.

Most of those may be checked simply by opening the file and seeing which are present. The only one you may have difficulty with is the HTTP headers, since these are not so readily viewable without the right tool. Thankfully, there are several to choose from, ranging from browser extensions for Mozilla, Firefox and others, such as Live HTTP Headers, to online tools like the W3C’s HTTP HEAD Service.
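
Whichever tool you use, the field to look for is Content-Type. A response declaring ISO-8859-1, for example, would include a line like the second one below (the status line and other header fields will, of course, vary):

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1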

Once you have determined the method by which the encoding is being indicated, it is a simple matter of correcting the error (correcting HTTP headers will be discussed later). If more than one method is used (eg. the XML declaration, the meta element and/or the HTTP Content-Type header field), you should also ensure that all of them indicate the same encoding. For most document formats, the recommended method is the HTTP headers. For XML documents (served as XML), however, since XML is a self-describing format, it is recommended that the server be configured to omit the charset parameter from the Content-Type header and that the <?xml?> declaration be used instead.

One thing to note: even if the character encoding is being correctly declared, the use of the BOM may still cause problems. Although many applications do support the use of the BOM (namely those that support UTF-8 properly) there are still many that don’t (in particular, older web browsers) and require that the file be saved with the BOM omitted. The problem is that not every editor that supports UTF-8 has an option to control the output of the BOM. When the file is read by an application that does not support UTF-8, or the BOM, these 3 octets may be interpreted as single-octet characters. For this reason, the W3C MarkUp Validator will issue a warning about its use, and it is recommended that the BOM be omitted from HTML documents. (Note: this does not apply to XML or XHTML documents served as XML, since XML user agents are required to support UTF-8 and UTF-16 fully, including the BOM.)

To modify the HTTP headers, it is necessary to edit a server configuration file. For the Apache HTTPd server, most web hosting providers will allow web content authors to use .htaccess files for this and many other purposes. Other web servers may also offer similar abilities, though the method may be very different. Consult your server’s documentation or contact your web host for information about how to configure the character encoding correctly. However, for those of you using an Apache server, it is simply a matter of including the appropriate directives in your .htaccess files or, if you have access to it (e.g. you are the server administrator), in the httpd.conf file. (However, most web hosting providers do not allow general access to that file.)

The AddDefaultCharset and/or AddCharset directives may be used. For example, you may want to set the default charset to UTF-8, but still have some files served as ISO-8859-1. These directives, in either .htaccess or httpd.conf, will accomplish this:

AddDefaultCharset UTF-8
AddCharset ISO-8859-1 .latin1

The AddCharset directive takes both a character encoding name and a file extension. For files encoded as ISO-8859-1, it simply requires that the file have a .latin1 file extension. This feature is most useful when Content Negotiation is enabled, so that the .latin1 extension need not be included in the URI, yet still takes effect and causes the correct encoding to be sent.
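
As a rough sketch (assuming your host permits these directives and the MultiViews option in .htaccess), a directory could be configured like this:

Options +MultiViews
AddDefaultCharset UTF-8
AddCharset ISO-8859-1 .latin1

With this in place, a hypothetical file saved as legacy.html.latin1 can be requested simply as legacy.html: content negotiation finds the file and the server sends charset=ISO-8859-1 for it, while everything else defaults to UTF-8.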

Besides the fact that some user agents do not support the BOM, you may still encounter problems with its use in files intended to be processed by some server-side technology, such as PHP or JSP. For example, Pierre Igot described the problem he encountered with the presence of the BOM in WordPress PHP files. This issue occurs because upon encountering non-whitespace character data, the processor will assume that the content has begun, send out all HTTP headers and begin to output the resultant file. If, for example, you then attempt to use the PHP header() function, an error will be received indicating that it is too late to modify headers because the content has already begun.
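
As a minimal sketch of how this happens, consider a PHP file saved as UTF-8 with the BOM. The three BOM octets sit before the opening PHP tag (shown below as the characters they resemble in a single-octet encoding) and are treated as ordinary output, so the headers are sent before the script proper runs:

ï»¿<?php
// The BOM octets above have already been output, and with them the headers,
// so this call fails with:
// Warning: Cannot modify header information - headers already sent
header('Content-Type: text/html; charset=UTF-8');
?>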

Other errors may not cause any server-side errors, and are thus harder to catch, but they may result in invalid markup being transmitted by the server due to character data appearing where character data is not allowed. For example, consider an include file containing a set of link elements to be included within the head elements of several (X)HTML files using some scripting language or SSI: if the include file is encoded as UTF-8 and begins with the BOM, yet the scripting language processor does not support UTF-8 and therefore neither recognises the BOM nor strips it from the output, it will end up being included within the head elements. Since a head element may not contain character data, a markup validator will detect this and issue an appropriate error message.

The exact error will differ depending on whether you are using HTML or XHTML, though the cause will be the same. For HTML, the BOM will implicitly end the head element, and begin the body element. Thus, anything following will be treated as being within the body element, not the head, which may cause several errors depending on the content. For XHTML, however, the BOM will simply be treated as character data appearing within the head element, where no character data is allowed.
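
As a rough illustration, the start of the served document might look like this to a validator, with the stray BOM octets again shown as the three characters they resemble in a single-octet encoding (the title and stylesheet name are, of course, hypothetical):

<head>
<title>Example page</title>
ï»¿<link rel="stylesheet" type="text/css" href="style.css">
</head>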

The final problem that I have encountered is slightly more complicated and proved quite difficult to solve. While using JSP to develop a web site, I had encoded all files as UTF-8, omitting the BOM. The issue was that one include file included the copyright symbol (U+00A9), which is encoded in UTF-8 as the octets C2 A9. At first, it appeared as though the UA was interpreting the character as a single-octet encoding, thus displaying the characters Â©. However, the HTML document was being correctly interpreted as UTF-8, since the character encoding was being declared by the HTTP Content-Type header field.

After much investigation, I found that if I encoded the include file as ISO-8859-1, but the main file as UTF-8, the desired UTF-8 output was received. It turned out that the JSP processor, Apache Tomcat with JBoss, thought that the include file was to be interpreted as ISO-8859-1 (the default for JSP); however the output was required to be UTF-8. Because of this, the JSP processor was attempting to convert the character encoding of the include file into UTF-8 on the fly.

Thus, when it encountered the octets C2 A9, it interpreted them as ISO-8859-1 characters, which map to the Unicode characters: U+00C2 and U+00A9. These characters, when encoded as UTF-8, form the octets C3 82 and C2 A9, respectively, which is the output I was receiving in the HTML document. I ended up solving this problem by correctly informing the JSP processor that the include files were also encoded as UTF-8, and not the default ISO-8859-1.
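
To summarise the round trip that produced the garbage:

the copyright symbol (U+00A9), encoded in the include file as UTF-8:  C2 A9
C2 A9 read by the JSP processor as ISO-8859-1:                        Â (U+00C2) and © (U+00A9)
those two characters re-encoded as UTF-8 for the output:              C3 82 C2 A9
C3 82 C2 A9 decoded as UTF-8 by the browser:                          Â©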

Up until now, we have looked at the BOM, discussed how it is encoded in UTF-8 and some of the problems it may cause; however we have not looked at its purpose. As mentioned, the BOM is optional in UTF-8; but it is required in UTF-16 and UTF-32 to indicate the order of octets for each character as either Little Endian, where the least significant byte appears first, or Big Endian, where the most significant byte appears first. For this guide, we will only look at UTF-16, but a similar technique applies to UTF-32 as well.

In UTF-16LE the BOM (U+FEFF) will appear in the file as the octet sequence FF FE (least significant byte first), but in UTF-16BE it appears as FE FF (most significant byte first). Because the code point U+FFFE is defined never to be a character in Unicode, a file beginning with the octets FF FE can safely be detected as UTF-16LE, rather than as a UTF-16BE file starting with the character U+FFFE, and vice versa.
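
As a small illustration, a UTF-16 file containing just the BOM followed by the letter A (U+0041) would begin with these octets:

UTF-16LE: FF FE 41 00
UTF-16BE: FE FF 00 41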

Now that you have a rather more in-depth understanding of Unicode (including: the character repertoire, code points and the various encoding forms; the ability to create and edit Unicode encoded files; and an understanding of several of the problems that may be encountered) it is time to go forth and prosper — to make use of Unicode to its full potential, to simplify the use of non-US-ASCII characters and to help promote the i18n of the web. However, as with everything else, there’s always more to be learned. Although it may seem that I have covered much, I’m sure you will find that I have only just scratched the surface. So, to help you out, I’ve compiled a short but comprehensive list of additional resources that will provide further information:

Guide to Unicode, Part 2

When you write a document in one of the Unicode character encodings (UTF-8, UTF-16 or UTF-32), you can use any character from any language that exists in the Unicode character repertoire, all in the same file, with no need to use HTML character references or other special escape sequences. This chapter assumes you have read the Guide to Unicode, Part 1, or that you are at least familiar with the concepts of character repertoires, code points, looking up Unicode characters and writing numeric character references for them in HTML. If not, take a look at part 1 and come back when you’re ready.

In part 1, I mentioned character encodings; but I didn’t really discuss what they are and how they relate to the character repertoire and the code points. A character encoding is basically a method of representing code points as a sequence of octets (or bytes).

In the simplest case of encoding, each octet maps to an integer from 0 to 255 which translates to a code point in the character repertoire for that encoding, as is the case for single-octet encodings like US-ASCII or the ISO-8859 series. However, for more complex character repertoires, such as Unicode, it is impossible to represent all the characters with only the 256 values available in a single octet and, therefore, requires a multiple-octet encoding.

Some multi-octet encodings assign a fixed number of octets to every character, while others use more complex algorithms to assign a variable number. For example, UTF-32 assigns 4 octets (32 bits) to every character, while UTF-8 assigns anywhere from 1 to 4 (the original design allowed for up to 6, but the range has since been restricted to match Unicode’s code space). The advantages and disadvantages of these different encoding methods are discussed in section 2.5, encoding forms, of the Unicode specification.
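
To make the variable length concrete, here are a few characters and the octets UTF-8 uses for each (the last is a character from outside the range most commonly used on the web):

A (U+0041)    →  41            (1 octet)
© (U+00A9)    →  C2 A9         (2 octets)
— (U+2014)    →  E2 80 94      (3 octets)
𝄞 (U+1D11E)   →  F0 9D 84 9E   (4 octets)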

The names of the many character encodings are registered with IANA. Some of the common character-encoding names include ISO-8859-1, Windows-1252, Shift_JIS and Big5. Many of the encodings also have various aliases and other information about them, which can be looked up in the IANA character set assignment list.

When the Unicode character repertoire was designed, the characters from many of the major character sets were incorporated and mapped to Unicode code points. Mapping tables for many of these character sets are published, defining the correspondence with the Unicode code points in both directions. This is important, as you will see later, because it means that other character encodings can be converted to and from the Unicode encodings without any loss of information.

To use any character encoding, it’s not necessary to understand the algorithm used to encode and decode characters because that is the job of the editor – but when learning Unicode, it does help to have a basic understanding of the concepts of multi-octet versus single-octet encodings, especially when debugging character encoding problems, which will be discussed later in part 3.

As mentioned previously, encoding a file using one of the Unicode encodings makes it possible to use any character without the need for character references or other special escape sequences. Using the real characters instead of character references makes the file easier to read and can also significantly reduce the size of the file, especially in cases where a lot of character references would otherwise be needed, since it generally takes more octets to encode the character reference than the UTF-8 encoded character itself (the reference &#8212; for an em dash, for example, takes 7 octets, whereas the UTF-8 encoded em dash takes only 3). There are many other reasons for choosing Unicode, which I will discuss in part 3. But for now, it’s time to start using Unicode.

The first thing you’ll need is an editor that supports Unicode character encodings – in particular, UTF-8. If you’re using Windows 2000 or XP, then Notepad will do the trick for most of these exercises. If not, or if you would like a slightly fancier editor anyway, I find editors like SuperEdi or Macromedia Dreamweaver to be quite good. If you’re using a Mac or Linux, I’m sure there are many choices available, though I am unfamiliar with those platforms and the editors available for them. Take a look through the settings and/or documentation for your editor and ensure that your file is being saved as UTF-8 (not UTF-16 or UTF-32 at this stage). For Notepad users, this setting is in the Save As… dialog. For others, it may be there also, or in the Options/Preferences/Settings dialog. Note: if your editor provides an option for whether or not to output the Byte Order Mark (BOM), leave it enabled for now. The BOM will be discussed later, and the problems it can cause will be discussed in part 3.

The first issue you’re probably asking about is how to enter characters that don’t appear on your keyboard into the editor. It’s a common question, and one that I struggled with while I was still learning about Unicode. However, those of you with intuitive minds, who have read part 1 of this guide, have probably just figured out why I went to so much effort to teach you about looking up code points and writing character references in HTML as a method of outputting the characters. While the main reason was to teach you about code points, it’s also because one way to enter such characters, which works in all editors and on all platforms, is to copy and paste them from your browser (or another source).

Try it now. You may look up a few characters in Unicode that don’t appear on your keyboard, create a small HTML file and generate them using character references. Be sure to include random characters, including some from the US-ASCII subset (from U+0000 to U+007F) and others outside that range. Afterwards, open the page in your browser and then copy and paste them into a new, plain text (not HTML) file in your editor. However, to save you some time and effort, here are some characters for you to copy and paste: ‘ ’ — π × { } © 佈 б. (Include the spaces between the characters.)
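
If you would rather generate those characters yourself, a small test page (purely illustrative) need contain nothing more than a paragraph of numeric character references for them:

<p>&#x2018; &#x2019; &#x2014; &#x3C0; &#xD7; { } &#xA9; &#x4F48; &#x431;</p>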

When you open the file in your browser, if the BOM is present, the file will be automatically detected as UTF-8 in modern browsers and the characters will be displayed correctly. Confirm that the browser is interpreting the file as UTF-8 by looking at the character encoding options, which are commonly available from within the View menu. Configure your browser to interpret the file as Windows-1252 or ISO-8859-1 and you will notice that the string of characters you entered becomes a mess of seemingly random characters. For example, using the characters I provided earlier, you should see: ï»¿‘ ’ — Ï€ × { } © ä½ˆ Ð±

This output represents the UTF-8 encoded characters when interpreted as a single-octet encoding, thus each character in the output represents 1 octet in the file.

Notice the first three characters: ï»¿. These characters form the UTF-8 BOM. If your attempt did not show these characters, but the rest is the same, never mind – it just means that your editor omitted it. The BOM is the character U+FEFF – the ZERO WIDTH NO-BREAK SPACE (ZWNBSP). In UTF-8, the BOM is optional (which is why some editors allow you to decide whether or not to output it). In UTF-16, however, it is required so that the user agent can accurately determine the order of octets for each character. This will be discussed in more detail in part 3.

Because each character was separated by a space, you should be able to easily notice that the number of octets used for each character in the file varied from 1 to 3 in this example. The characters from the US-ASCII subset appeared as single octets, but characters outside of this range appeared as 2 or more. This is part of the design of UTF-8, which helps ensure compatibility with older editors and text-processing software. Thus it is possible to view and edit UTF-8 files relatively easily with editors that don’t support UTF-8, especially where the file consists mainly of characters from the US-ASCII subset. Though, for obvious reasons, it becomes much harder where the file consists mainly of characters outside that range.

If you would like to know exactly which characters were chosen, Ian Hickson has provided two tools to help you out. The first is the character identifier. You will have noticed this form when you looked at the character finder in part 1. Copy and paste the first set of characters that I provided into the form and submit it. The results provide information such as the character names, code points and various other useful details. As you become more experienced with Unicode, and use it more often, I’m sure you will find this tool invaluable; I will leave it for you to explore all the useful information it provides in your own time.

The second is the UTF-8 Decoder. This tool will decode encoded characters, such as the Windows-1252 output I provided earlier. The results indicate which characters are represented. If you copy the sample Windows-1252 output into the UTF-8 Decoder, select “UTF-8 interpreted as Windows-1252” from the Input Type list, then submit, the characters will be decoded for you and lots of useful information will be provided, much like the character identifier you looked at previously. To verify that the characters were decoded correctly, compare the results of the UTF-8 Decoder with those from the character identifier. Both lists of identified characters should contain the same character names, except for the addition of the BOM in the Windows-1252 encoded form.

As I mentioned in part 1, creating an HTML file, looking up the character and then writing the character reference can become very time consuming, and there are much faster and more convenient ways to generate the characters. Firstly, for Windows users, the Character Map (usually available under Accessories or System Tools in the Start menu) provides a somewhat useful interface for browsing characters and fonts. In the Windows 2000 and XP versions, the Character Map provides both the character name and the Unicode code point for every character available in the selected font. In all versions of Windows, it also provides the Windows-1252 code point for those characters that exist in the Windows-1252 character repertoire.

The Windows-1252 code point is used for the keystroke of the form Alt+0### (where ### represents the code point as a decimal number, entered on the numeric keypad of your keyboard). While it is obviously possible to copy and paste characters from the Character Map, it is also possible, for the characters in the Windows-1252 character repertoire, to enter them using the given keystroke without even opening the Character Map. For example, holding Alt and typing 0169 on the numeric keypad produces the copyright symbol (©). This useful feature will save you a lot of time when entering commonly used characters that don’t appear on your keyboard, but do appear in the Windows-1252 character repertoire.

Even though the character is entered using the Windows-1252 code point, the characters are mapped to the Unicode code points using the mappings I mentioned previously. For example, the code point for the left single quotation mark in Windows-1252 is 0x91 (decimal 145), which maps to the Unicode code point U+2018. This and all other Windows-1252 characters are listed in the Windows-1252 mapping table on the Unicode website.

Jukka Korpela also provides a useful JavaScript application called gwrite – a virtual Unicode keyboard, from which you can select and copy many characters. Finally, I have reproduced Ian Hickson’s very useful Unicode tools in my copy of the DevEdge Sidebar. I also added a character generator that generates a character from its Unicode code point, entered in decimal, hexadecimal or octal.

Next, in part 3, we will look at some of the issues caused by the BOM and other difficulties with Unicode, as well as debugging some of the common problems. We will also take a closer look at how the octets are encoded in UTF-8, and how to determine the exact octets used, which is useful when using a binary editor. In addition, we will look at UTF-16 and UTF-32 and discuss their advantages and disadvantages in relation to the web.

Guide to Unicode, Part 1

Unicode, as some of you may know, is a universal character set comprising most of the world’s characters. Since version 1.1, the Unicode standard has remained fully compatible with ISO/IEC 10646: Universal Multiple-Octet Coded Character Set. The ISO/IEC 10646 standard defines a character repertoire and character code points (or code positions), as well as two character encodings, UCS-2 and UCS-4, allowing for up to 2³¹ code points. The Unicode standard, however, imposes additional restrictions, limiting the total number of code points to 1,114,112; the details of why this is the case will not be covered in this guide.

The Unicode standard further defines the character encodings UTF-8, UTF-16 and UTF-32, and is a restricted subset of the ISO/IEC 10646 standard. As a result, any conformant implementation of Unicode is also conformant with ISO/IEC 10646. However, due to the additional restrictions imposed by Unicode, the same is not necessarily true the other way around. Despite these differences, the most important point, at least for the purposes of this guide, is that the character sets defined by both standards are, code for code, identical in every way.

One thing that neither the Unicode standard nor ISO/IEC 10646 defines is the glyphs (the visual representations) of each character. Although the Unicode specification does provide example glyphs for every character, it is expected that the glyphs from different fonts may look very different.

Before I get into the details about the practical use of Unicode, there is an important distinction that must be made, in relation to HTML and character sets. The HTML 4.01 specification states, in section 5.1: The Document Character Set:

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

  • A Repertoire: A set of abstract characters, such as the Latin letter “A”, the Cyrillic letter “I”, the Chinese character meaning “water”, etc.
  • Code positions: A set of integer references to characters in the repertoire.

The document character set is different from the character encoding of the file and, in HTML, is defined to be ISO 10646, which (for the purposes of HTML) is equivalent to Unicode. The document character set is used for decoding numeric character references: the code point given refers to the Unicode code point for the character, not the code point within the document’s character encoding (unless the character encoding also happens to be a Unicode encoding).

This is a common mistake made by many, and is most often seen with character references made to Windows-1252 code points in the range from 128 to 159 (eg. &#147; for a left double quotation mark). In Unicode, these code points are reserved for control characters, so such references are invalid. The character encoding, on the other hand, refers to the actual encoding of the characters in the file. This is most often ISO-8859-1 or (sadly) Windows-1252 (though the latter is often, and incorrectly, declared as ISO-8859-1 anyway).

Regardless of the character encoding of the file, due to the document character set being Unicode, it is always possible to include any character you wish in your document using numeric character references, as long as it exists in the Unicode character repertoire. To do so, it is only necessary to know the code point of the character, and to use either the decimal or hexadecimal numeric character reference. To actually view the character, the user must have a font available to the user agent from which it can use the appropriate glyph.

For example, to use a character such as the em dash (—) or the left (“) and right (”) double quotation marks, they may be written as the hexadecimal references &#x2014;, &#x201C; and &#x201D; or the decimal references &#8212;, &#8220; and &#8221; respectively.

It is also possible to use the named character entity references defined in the HTML DTD, which are also mapped to their respective characters in the Unicode character repertoire, but for the purposes of this guide, they will be ignored.

So, one question you’re probably asking (assuming you’re one of the many that don’t already know the answer) is: how do I find the character I want, and what is its code point? Well, that’s easy, since all the characters in the Unicode character repertoire are listed in the Unicode Code Charts, grouped into 124 categories and ordered by code point value. The only problem is that they’re PDF files, which may take a while to load; but never fear, there are easier ways, which will be discussed later. However, first things first…

Some of the category names may not always make it obvious to you, as to which characters the group contains, but knowing what character you’re looking for, it’s usually possible to narrow down the field to around 2 or 3 possibilities. Take, for example, looking for the Greek letter/Mathematical symbol for Pi (π), used to represent the number 3.1415926535897932384626433832795… Take a look at the names of the code charts, and narrow it down to a few possibilities.

For those of you that didn’t bother to look for yourself, or to verify your guesses for those of you that did, I think it can be reasonably assumed that the character we’re looking for will exist in either Greek and Coptic, Greek Extended, Miscellaneous Mathematical Symbols-A or Miscellaneous Mathematical Symbols-B. Before reading the next paragraph, take a look through each of them to see if you can find the character. Skim through both the table showing all the glyphs for the characters, and the list of names and descriptions following the table.

If you followed instructions, then you may have found the characters for Pi in the Greek and Coptic category, but which one are we interested in? There is both a capital letter Pi (U+03A0 – Π), and a small, lowercase letter Pi (U+03C0 – π). If you read the descriptions, you should have noticed the bullet points following the character name. The description for the lowercase letter mentions the math constant we are interested in, and therefore, that is the character we are after.

Having found the character, all that is left is to write the character reference in the HTML file using either the hexadecimal (&#x03C0;) or decimal (&#960;) format. If you create a small HTML file containing that character reference, then you should (assuming your computer has a font with the glyph available) see the character displayed like this: π. If not, you will see a question mark, box or other placeholder that your user agent uses. Try this with any character you like, to get a feel for finding characters and writing the character references for them.

As you’ve probably already figured out, searching through the PDF files all the time is very time consuming, and those of you with inquisitive minds will have noticed the character names index provided, which I’ll leave for you to explore in your own time – it’s too boring for me to walk you through it.

A much faster way, as I’m sure anyone can guess, is to use a search engine. Well, thanks to Hixie, you can do just that with his Character Finder. Another useful feature is that it also calculates the decimal, octal and binary representations of the code point for you, though it’s not hard to do for yourself with a calculator anyway.

The Windows Character Map tool also provides some simple search facilities, and is also good for finding some characters quickly, but it’s not perfect either, and only searches within the currently selected font. If you’re using Windows, I’ll leave the character map for you to explore in your own time, for now (though, it will be revisited later in a future part of this guide).

So, now that you have a brief understanding of Unicode, the character repertoire and code points, and also know how to use those characters with character references, the next thing to learn about is character encodings and, in particular, using UTF-8, UTF-16 or UTF-32 and inserting the characters directly into your file without having to use a character reference. All that and more will be explained in the Guide to Unicode, Part 2.