In the beginning we discussed character repertoires, code points and HTML
character references, together with their relationship to the Unicode standard.
We then looked at character encodings, examining the differences between
single- and multiple-octet encodings, and how to create files in UTF-8. If you
are unfamiliar with those concepts, I recommend that you read parts
1
and 2
of this guide first and then return to this section when you are ready.
The saga continues in this third and final thrilling chapter, where we will
look at some of the problems encountered with the use of Unicode and, in particular,
UTF-8
and the BOM. This will involve debugging the most common problems and seeing how the tools we’ve looked at previously can help. Following this, I will discuss the importance and practicalities of ensuring
that the character encoding is correctly declared. Finally I will discuss the
purpose of the BOM and show the difference between the UTF-16
and UTF-32
variants:
Little Endian and Big Endian.
The biggest problem with using Unicode on the web is that not all editors
support Unicode, although some authors don’t realise this and declare the encoding
as UTF-8
anyway. This only becomes a problem when characters outside the US-ASCII
subset are used. However, since we have already covered the creation and editing
of Unicode files, there is no need to discuss this problem further.
The next major issue encountered is that user agents may not always display
the characters correctly. Ignoring the availability of Unicode fonts for now,
people will often attempt to use UTF-8, only to find that some characters they have used get turned into two or three seemingly random characters. After that, they tend to give up and revert to ISO-8859-1 (or Windows-1252) and use character references for the characters outside of these repertoires, or US-ASCII substitutes.
For those of you who have read part 2, you should now recognise that the displayed characters are not random, but are in fact the multi-octet UTF-8 encoded characters interpreted as a single-octet encoding. This error is usually caused by incorrectly declaring a single-octet encoding (most often ISO-8859-1). For example, the BOM (if present) may be displayed as  when a UTF-8 file is incorrectly declared as ISO-8859-1
by the HTTP response headers
(similar to the demonstration in part 2 where the character encoding was manually
overridden).
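If you want to see this for yourself, a quick check in a Python interpreter reproduces the effect (purely illustrative; any language with decent encoding support will do):
>>> b'\xef\xbb\xbf'.decode('utf-8')        # the three BOM octets, decoded correctly
'\ufeff'
>>> b'\xef\xbb\xbf'.decode('iso-8859-1')   # the same octets misread as ISO-8859-1
'ï»¿'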
For any sequence of characters appearing incorrectly like this, you may use
Ian Hickson’s UTF-8
decoder to
determine what the character is. Conversely, you may also use the UTF-8
encoder to reveal both the ISO-8859-1 representation and the hexadecimal octet values for any Unicode character.
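If you prefer to work locally, the same conversions can be reproduced in the Python interpreter used above, here with the character é purely as an example:
>>> 'é'.encode('utf-8').hex()                    # encoder: the hexadecimal octet values
'c3a9'
>>> bytes.fromhex('c3a9').decode('utf-8')        # decoder: the octets back to the character
'é'
>>> bytes.fromhex('c3a9').decode('iso-8859-1')   # the ISO-8859-1 representation
'Ã©'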
If you suspect that you are experiencing this kind of error, then the first thing to check is which encoding is being used by the user agent. Most popular UAs will allow you to view the character encoding, such as in Mozilla’s Page Info dialog or Opera’s Info panel. The W3C’s Markup Validator in verbose mode will also report this information. If you find that the character encoding is incorrect, it is then necessary to determine whence the user agent is acquiring this information and to correct the error. If, however, you find that the character encoding is correct, then the problem is caused elsewhere and will be discussed later.
Depending on the file format, the character encoding may be declared on a
file-by-file basis through, for example, the use of the
<meta>
element
in HTML, the
<?xml?>
declaration
in XML documents, the @charset
rule in
CSS or indicated by the presence of the BOM.
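For reference, typical file-by-file declarations look like this, using UTF-8 purely as the example encoding:
In HTML: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
In XML: <?xml version="1.0" encoding="UTF-8"?>
In CSS: @charset "UTF-8";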
On a server-wide basis it depends upon the server configuration and may be
declared by the charset
parameter of
the Content-Type
field in the HTTP response
headers. It may also be indicated by a referencing document using, for example,
the charset
attribute in HTML
or the encoding of the referencing document itself. Finally, in the absence
of any of these indications, some specifications define the default encoding
that should be used – commonly UTF-8, UTF-16, ISO-8859-1 or US-ASCII.
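As an illustration, the header field and a referencing element might look like this (the file name menu.js is hypothetical):
Content-Type: text/html; charset=UTF-8
<script type="text/javascript" src="menu.js" charset="UTF-8"></script>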
The order of precedence for each applicable method is defined in the relevant
specification for the language being used (e.g. Specifying a Character Encoding in HTML 4.01), often beginning with the Content-Type header field, which has the highest precedence, and ending with the charset attribute or the encoding of the referencing document, which have the lowest. Of course, it can be fun when specifications collide on this issue, which illustrates why it is important not only to declare the character encoding appropriately, but also to ensure that it is declared correctly.
Most of these may be easily checked simply by opening the file and seeing which are present. The only one you may have difficulty with is the HTTP headers, since these are not so readily viewable without the right tool. Thankfully, there are several to choose from, ranging from browser extensions for Mozilla, Firefox and other browsers, such as Live HTTP Headers, to online tools like the W3C’s HTTP HEAD Service.
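If you have command-line access, you can also request just the headers and look for the Content-Type field in the output; for example, with curl (the URL is only a placeholder):
curl -I http://www.example.com/page.html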
Once you have determined the method by which the encoding is being indicated,
it is a simple matter of correcting the error (correcting HTTP headers will be discussed later). If more than one method is used (e.g. the XML declaration, the meta element and/or the HTTP Content-Type header field), you should also ensure that all of them indicate the same encoding. For most document formats, the recommended method to use is the HTTP headers. For XML documents (served as XML), however, since XML is a self-describing format, it is recommended
that the server be configured to omit the charset
parameter from the Content-Type
header and that the <?xml?>
declaration be used instead.
One thing to note: even if the character encoding is being correctly declared,
the use of the BOM may still cause problems. Although many applications do support
the use of the BOM (namely those that support UTF-8
properly) there are still
many that don’t (in particular, older web browsers) and require that the file
be saved with the BOM omitted. The problem is that not every editor that supports
UTF-8
has an option to control the output of the BOM. When the file is read
by an application that does not support UTF-8, or the BOM, these three octets may be interpreted as single-octet characters. For this reason, the W3C Markup Validator will issue a warning about its use, and it is recommended that the
BOM be omitted from HTML documents. (Note: this does not apply to XML or XHTML
documents served as XML, since XML user agents are required to support UTF-8
and UTF-16
fully, including the BOM.)
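If your editor offers no control over the BOM, it is easy to check for and remove it yourself; a minimal sketch in Python (the file name index.html is just a placeholder):
# Remove a leading UTF-8 BOM (the octets EF BB BF) from a file, if present.
# 'index.html' is a placeholder file name.
with open('index.html', 'rb') as f:
    data = f.read()
if data.startswith(b'\xef\xbb\xbf'):
    with open('index.html', 'wb') as f:
        f.write(data[3:])   # write the file back without the first three octets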
To modify the HTTP headers, it is necessary to edit a server configuration file. For the Apache HTTPd server, most web hosting providers will allow web content authors to use .htaccess files for this, and many other purposes. Other web servers may also offer similar abilities, though the method may be very different. Consult your server’s documentation or contact your web host for information about how to configure the character encoding correctly. However, for those of you using an Apache server, it is simply a matter of including the appropriate directives in your .htaccess files or, if you have access to it (e.g. you are the server administrator), in the httpd.conf file. (However, most web hosting providers do not allow general access to that file.)
The AddDefaultCharset
and/or AddCharset
directives may be used. For example,
you may set the default charset to UTF-8, but still wish that some files use ISO-8859-1. These directives in either .htaccess or httpd.conf will accomplish this:
AddDefaultCharset UTF-8
AddCharset ISO-8859-1 .latin1
The AddCharset
directive takes both a character encoding
name and a file extension. For files encoded as ISO-8859-1, it simply requires that the file name include a .latin1 extension. This feature is most useful when Content
Negotiation is enabled, so that the .latin1
file extension need not be included
in the URI, but still takes effect to send the correct encoding.
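For example, assuming MultiViews content negotiation is enabled, a file saved under the hypothetical name page.html.latin1 can be requested simply as page.html, and Apache will both locate the file and send charset=ISO-8859-1 with it. Where the server permits it, the following directive in a .htaccess file enables this behaviour:
Options +MultiViews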
Besides the fact that some user agents do not support the BOM,
you may still encounter problems with its use in files intended to be processed
by some server-side technology, such as PHP or JSP.
For example, Pierre
Igot described the problem
he encountered with the presence of the BOM in
WordPress PHP files.
This issue occurs because, upon encountering non-whitespace character data, the processor will assume that the content has begun, send out all HTTP headers and begin to output the resulting file. If, for example, you then attempt
to use the PHP
header()
function, an error will be received indicating that
it is too late to modify headers because the content has already begun.
Other cases may not cause any server-side errors, and are thus harder to catch, but they may result in invalid markup being transmitted by the server due to character data appearing where character data is not allowed. For example, consider an include file containing a set of link elements to be included within the head elements of several (X)HTML files using some scripting language or SSI: if the include file is encoded as UTF-8 and begins with the BOM, yet the scripting language processor does not support UTF-8 and therefore neither recognises the BOM nor strips it from the output, the BOM will end up being included within the head elements. Since a head element may not contain character data, a markup validator will detect this and issue an appropriate error message.
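As an illustration, suppose the include file (the file names here are hypothetical) contains only the following line, but is saved as UTF-8 with a leading BOM:
<link rel="stylesheet" type="text/css" href="style.css">
If the processor passes the file through untouched, the three BOM octets end up inside the head element as character data immediately before the link element, even though nothing visibly wrong appears in the markup itself.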
The exact error will differ depending on whether you are using HTML or XHTML,
though the cause will be the same. For HTML, the BOM will
implicitly end the head
element,
and begin the body
element. Thus, anything following will be treated
as being within the body
element, not the head, which may cause several errors
depending on the content. For XHTML, however, the BOM will simply be treated
as character data appearing within the head element, where no character data
is allowed.
The final problem that I have encountered is slightly more complicated and
proved quite difficult to solve. While using JSP to
develop a web site, I had encoded all files as UTF-8, omitting the BOM. The issue was that one include file contained the copyright symbol (U+00A9), which is encoded in UTF-8 as the octets C2 A9. At first, it appeared as though the UA was interpreting the octets as a single-octet encoding, thus displaying the characters Â©.
However, the HTML document was
being correctly interpreted as UTF-8, since the character encoding was being declared by the HTTP Content-Type
header
field.
After much investigation, I found that if I encoded
the include file as ISO-8859-1, but the main file as UTF-8, the desired UTF-8 output was received. It turned out that the JSP processor, Apache Tomcat with JBoss, thought that the include file was to be interpreted as ISO-8859-1 (the default for JSP); however, the output was required to be UTF-8. Because of this,
the JSP processor was attempting to convert the character encoding of the include
file into UTF-8
on the fly.
Thus, when it encountered the octets C2 A9, it interpreted them as ISO-8859-1 characters, which map to the Unicode characters U+00C2 and U+00A9. These characters, when encoded as UTF-8, form the octets C3 82 and C2 A9, respectively, which is the output I was receiving in the HTML document. I ended up solving this problem by correctly informing the JSP processor that the include files were also encoded as UTF-8, and not the default ISO-8859-1.
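This kind of double encoding is easy to reproduce; a small, purely illustrative Python session shows the same transformation:
>>> '©'.encode('utf-8')                                # the copyright sign as UTF-8 octets
b'\xc2\xa9'
>>> '©'.encode('utf-8').decode('iso-8859-1')           # those octets misread as ISO-8859-1
'Â©'
>>> '©'.encode('utf-8').decode('iso-8859-1').encode('utf-8')   # and then re-encoded as UTF-8
b'\xc3\x82\xc2\xa9'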
Up until now, we have looked at the BOM and discussed how it is encoded in UTF-8 and some of the problems it may cause; however, we have not looked at its purpose. As mentioned, the BOM is optional in UTF-8, but it is required in UTF-16 and UTF-32 to indicate the order of octets for each character as either Little Endian, where the least significant byte appears first, or Big Endian, where the most significant byte appears first. For this guide, we will only look at UTF-16, but a similar technique still applies to UTF-32 documents.
In UTF-16LE
the BOM (U+FEFF) will appear in the file as the octet sequence
FF FE
(least significant byte first), but in UTF-16BE
it appears as FE FF
(most
significant byte first). For UTF-16, because U+FFFE (the BOM with its octets reversed) is defined never to be a character in Unicode, the octet sequence FF FE can safely be used to detect the encoding as UTF-16LE, rather than as a UTF-16BE file starting with the character U+FFFE, and vice versa.
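The two octet orders are easy to confirm; one last illustrative Python session:
>>> '\ufeff'.encode('utf-16-le').hex()     # Little Endian: FF FE
'fffe'
>>> '\ufeff'.encode('utf-16-be').hex()     # Big Endian: FE FF
'feff'
>>> b'\xff\xfe\x41\x00'.decode('utf-16')   # the leading FF FE marks the rest as Little Endian
'A'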
Now that you have a rather more in-depth understanding of Unicode (including the character repertoire, code points and the various encoding forms; the ability to create and edit Unicode encoded files; and an understanding of several of the problems that may be encountered) it is time to go forth and prosper — to make use of Unicode to its full potential, to simplify the use of non-US-ASCII characters and to help promote the i18n of the web. However,
as with everything else, there’s always more to be learned. Although
it may seem that I have covered much, I’m sure you
will find that I have only just scratched the surface. So, to help you out,
I’ve compiled a short but comprehensive list of additional resources that
will provide further information:
- W3C I18N FAQ: Characters & Encodings
- W3C I18N Tutorial: Characters & Encodings
- Jukka Korpela: Character problems in Web authoring
- Jukka Korpela: Characters and Encodings
- Ian Hickson: A crash course in UTF-8 mathematics
- Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)