When you write a document in one of the Unicode character encodings (UTF-8, UTF-16 or UTF-32), you can use any character from any language that exists in the Unicode character repertoire, all in the same file, with no need for HTML character references or other special escape sequences. This chapter assumes you have read the Guide to Unicode, Part 1, or that you are at least familiar with the concepts of character repertoires, code points, looking up Unicode characters and writing numeric character references for them in HTML. If not, take a look at part 1 and come back when you’re ready.
In part 1, I mentioned character encodings, but I didn’t really discuss what they are or how they relate to the character repertoire and its code points. A character encoding is, essentially, a method of representing code points as a sequence of octets (bytes).
In the simplest case, each octet maps to an integer from 0 to 255, which translates to a code point in that encoding’s character repertoire; this is how single-octet encodings like US-ASCII and the ISO-8859 series work. For a larger repertoire such as Unicode, however, it is impossible to represent every character with only the 256 values available in a single octet, so a multiple-octet encoding is required.
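To make the distinction concrete, here is a small Python sketch (my own illustration, not from the original article) showing a character that fits in one octet under a single-octet encoding, and one that does not:

```python
# "é" (U+00E9) fits in one octet in the single-octet encoding ISO-8859-1,
# but "π" (U+03C0) is outside that 256-value repertoire and needs UTF-8.
print(ord("A"))                      # code point 65 (U+0041)
print("é".encode("iso-8859-1"))      # b'\xe9' -- a single octet

try:
    "π".encode("iso-8859-1")
except UnicodeEncodeError:
    print("U+03C0 is not representable in this single-octet encoding")

print("π".encode("utf-8"))           # b'\xcf\x80' -- two octets
```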
Some multi-octet encodings assign a fixed number of octets to every character, while others use more complex algorithms to assign a variable number. For example, UTF-32 assigns 4 octets (32 bits) to every character, while UTF-8 assigns anywhere from 1 to 4 (the original design allowed up to 6, before RFC 3629 restricted UTF-8 to the sequences reachable in 4 octets). The advantages and disadvantages of these different encoding methods are discussed in section 2.5, Encoding Forms, of the Unicode specification.
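A quick way to see the fixed-versus-variable trade-off for yourself (a Python sketch, not part of the original text):

```python
# UTF-32 always spends 4 octets per character; UTF-8 spends 1 to 4.
# ("utf-32-be" is used here to avoid the codec's automatic BOM.)
for ch in ("A", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
# A: 1 vs 4, é: 2 vs 4, €: 3 vs 4, 𝄞: 4 vs 4 octets
```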
The names of the many character encodings are registered with IANA. Some of the common character-encoding names include ISO-8859-1, Windows-1252, Shift_JIS and Big5. Many of the encodings also have various aliases and other associated information, which can be looked up in the IANA character set assignment list.
When the Unicode character repertoire was designed, the characters from many of the major character sets were incorporated and assigned Unicode code points. Mapping tables are published for these character sets, giving each character a mapping to and from its Unicode code point. This is important, as you will see later, because it means that other character encodings can be converted to and from the Unicode encodings without any loss of information.
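For instance, this lossless round trip can be demonstrated in a few lines of Python (my own sketch, using the Windows-1252 curly-quote characters as an example):

```python
# Two Windows-1252 octets mapped to Unicode and back without loss.
data = bytes([0x91, 0x92])            # cp1252: LEFT/RIGHT SINGLE QUOTATION MARK
text = data.decode("cp1252")          # -> '\u2018\u2019'
assert text == "\u2018\u2019"
assert text.encode("cp1252") == data  # round trip: no information lost
print(text)
```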
To use a character encoding, it isn’t necessary to understand the algorithm used to encode and decode characters – that is the editor’s job. But when learning Unicode, it does help to have a basic understanding of multi-octet versus single-octet encodings, especially when debugging character-encoding problems, which will be covered in part 3.
As mentioned previously, encoding a file using one of the Unicode encodings makes it possible to use any character without the need for character references or other special escape sequences. Using the real characters instead of character references makes the file easier to read and can also significantly reduce its size, especially in cases where many character references would otherwise be needed (since it generally takes more octets to encode the character reference than the UTF-8 encoded character itself). There are many other reasons for choosing Unicode, which I will discuss in part 3. But for now, it’s time to start using Unicode.
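The size difference is easy to measure; here is a hedged Python example of my own, using 佈 (U+4F48) for illustration:

```python
# An HTML character reference costs more octets than the character itself.
ref = "&#x4F48;"              # reference for 佈 (U+4F48): 8 ASCII octets
raw = "佈".encode("utf-8")    # the character itself, UTF-8 encoded: 3 octets
print(len(ref.encode("ascii")), len(raw))  # 8 3
```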
The first thing you’ll need is an editor that supports Unicode character encodings – in particular, UTF-8. If you’re using Windows 2000 or XP, then Notepad will do the trick for most of these exercises. If not, or if you would like a slightly fancier editor anyway, I find editors like SuperEdi or Macromedia Dreamweaver to be quite good. If you’re using a Mac or Linux, I’m sure there are many choices available, though I am unfamiliar with those platforms and the editors available for them. Take a look through the settings and/or documentation for your editor and ensure that your file is being saved as UTF-8 (not UTF-16 or UTF-32 at this stage). For Notepad users, this setting is in the Save As… dialog. For others, it may be there also, or in the Options/Preferences/Settings dialog. Note: if your editor provides an option for whether or not to output the Byte Order Mark (BOM), leave it enabled for now. The BOM will be discussed later, and the problems it can cause will be covered in part 3.
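If you want to check what your editor actually wrote, the same behaviour can be reproduced in Python, whose "utf-8-sig" codec adds the BOM while plain "utf-8" omits it (a sketch of mine; the file name is arbitrary):

```python
# "utf-8-sig" writes the BOM octets (EF BB BF) at the start of the file.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("π")
with open(path, "rb") as f:
    print(f.read())   # b'\xef\xbb\xbf\xcf\x80' -- BOM, then the character
```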
The first question you’re probably asking is how to enter characters that don’t appear on your keyboard. It’s a common question, and one that I struggled with while I was still learning about Unicode. Those of you with intuitive minds who have read part 1 of this guide have probably just worked out why I went to so much effort to teach you about looking up code points and writing character references in HTML as a method of outputting the characters. While the main reason was to teach you about code points, it’s also because one way to enter such characters – one that works in every editor and on every platform – is to copy and paste them from your browser (or another source).
Try it now: look up a few characters in Unicode that don’t appear on your keyboard, create a small HTML file and generate them using character references. Be sure to include a mix of characters, some from the US-ASCII subset (U+0000 to U+007F) and others outside that range. Afterwards, open the page in your browser and copy and paste the characters into a new, plain-text (not HTML) file in your editor. To save you some time and effort, here are some characters for you to copy and paste: ‘ ’ — π × { } © 佈 б. (Include the spaces between the characters.)
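Generating the references themselves can also be automated; here is a small Python helper of my own (the function name is mine, not from the article) that emits a hexadecimal numeric character reference for anything outside US-ASCII:

```python
# Emit a numeric character reference for any character outside US-ASCII.
def char_ref(ch: str) -> str:
    return ch if ord(ch) < 0x80 else f"&#x{ord(ch):04X};"

print("".join(char_ref(c) for c in "π × 佈"))
# -> &#x03C0; &#x00D7; &#x4F48;
```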
When you open the file in your browser, if the BOM is present, modern browsers will automatically detect the file as UTF-8 and the characters will be displayed correctly. Confirm that the browser is interpreting the file as UTF-8 by looking at the character encoding options, which are commonly available from within the View menu. Configure your browser to interpret the file as Windows-1252 or ISO-8859-1 instead, and you will notice that the string of characters you entered becomes a mess of seemingly random characters. For example, using the characters I provided earlier, you should see: ï»¿â€˜ â€™ â€” Ï€ Ã— { } Â© ä½ˆ Ð±
This output represents the UTF-8
encoded characters when interpreted as a
single-octet encoding, thus each character in the output represents 1 octet
in the file.
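You can reproduce this octet-for-octet in Python (my own sketch; any characters outside US-ASCII will do):

```python
# Encode as UTF-8, then (mis)interpret the resulting octets as Windows-1252.
for ch in ("©", "π", "佈"):
    print(ch, "->", ch.encode("utf-8").decode("cp1252"))
# © -> Â©    π -> Ï€    佈 -> ä½ˆ
```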
Notice the first three characters: ï»¿. These characters form the UTF-8 BOM. If your attempt did not show these characters, but the rest is the same, never mind – it just means that your editor omitted it. The BOM is the character U+FEFF – the ZERO WIDTH NO-BREAK SPACE (ZWNBSP). In UTF-8, the BOM is optional (which is why some editors let you decide whether or not to output it). In UTF-16, however, it is required so that the user agent can accurately determine the order of octets for each character. This will be discussed in more detail in part 3.
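The difference is visible in the octets themselves. In this Python sketch (mine, not the article's), the UTF-8 BOM is always the same three octets, while the two UTF-16 byte orders produce mirror-image BOMs, which is exactly how a reader tells them apart:

```python
import codecs

print(codecs.BOM_UTF8)               # b'\xef\xbb\xbf' -- order-independent
print("\ufeff".encode("utf-16-le"))  # b'\xff\xfe' -- little-endian
print("\ufeff".encode("utf-16-be"))  # b'\xfe\xff' -- big-endian
```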
Because each character was separated by a space, you should be able to see easily that the number of octets used per character varied from 1 to 3 in this example. The characters from the US-ASCII subset appeared as single octets, while characters outside that range appeared as 2 or more. This is part of the design of UTF-8, which helps ensure compatibility with older editors and text-processing software. It is therefore possible to view and edit UTF-8 files relatively easily in editors that don’t support UTF-8, especially where the file consists mainly of characters from the US-ASCII subset; for obvious reasons, it becomes much harder where the file consists mainly of characters outside that range.
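You can confirm the per-character octet counts directly (a Python sketch of my own, using a sample similar to the one above):

```python
# Octet counts per character: the US-ASCII subset stays at 1 octet.
for ch in ("a", "{", "©", "π", "佈"):
    print(ch, len(ch.encode("utf-8")))
# a:1  {:1  ©:2  π:2  佈:3
```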
If you would like to know exactly which characters were chosen, Ian Hickson has provided two tools to help you out. The first is the character identifier; you will have noticed this form when you looked at the character finder in part 1. Copy and paste the first set of characters that I provided into the form and submit it. The results provide the character names, code points and various other useful pieces of information. As you become more experienced with Unicode, and use it more often, I’m quite sure you will find this tool invaluable; I will leave it for you to explore in your own time.
The second is the UTF-8 Decoder. This tool will decode encoded characters, such as the Windows-1252 output I provided earlier, and the results indicate which characters are represented. If you copy the sample Windows-1252 output into the UTF-8 Decoder, select “UTF-8 interpreted as Windows-1252” from the Input Type list and submit, the characters will be decoded for you and lots of useful information will be provided, much like the character identifier you looked at previously. To verify that the characters were decoded correctly, compare the results of the UTF-8 Decoder with those from the character identifier: both lists of identified characters should contain the same character names, except for the addition of the BOM in the Windows-1252 encoded form.
As I mentioned in part 1, creating an HTML file, looking up the character and then writing the character reference can become very time consuming, and there are much faster and more convenient ways to generate the characters. Firstly, for Windows users, the Character Map (usually available under Accessories or System Tools in the Start menu) provides a somewhat useful interface for browsing characters and fonts. In Windows 2000 and XP, the Character Map provides both the character name and the Unicode code point for every character available in the selected font. In all versions of Windows, it also provides the Windows-1252 code point for those characters that exist in the Windows-1252 character repertoire.
The Windows-1252 code point is used for the keystroke of the form Alt+0### (where ### represents the code point as a decimal number, entered on the numeric keypad of your keyboard). While it is obviously possible to copy and paste characters from the Character Map, characters in the Windows-1252 repertoire can also be entered using that keystroke without even opening the Character Map. This useful feature will save you a lot of time when entering commonly used characters that don’t appear on your keyboard but do appear in the Windows-1252 character repertoire.
Even though the character is entered using the Windows-1252 code point, it is mapped to the corresponding Unicode code point using the mappings I mentioned previously. For example, the code point for the left single quotation mark in Windows-1252 is 0x91 (decimal 145), which maps to the Unicode code point U+2018. This and all other Windows-1252 characters are listed in the Windows-1252 mapping from the Unicode website.
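That mapping is easy to verify in Python (a sketch of mine; both curly single quotes are shown):

```python
import unicodedata

# Windows-1252 octets 0x91/0x92 map to the Unicode curly single quotes.
for octet in (0x91, 0x92):
    ch = bytes([octet]).decode("cp1252")
    print(hex(octet), "->", f"U+{ord(ch):04X}", unicodedata.name(ch))
# 0x91 -> U+2018 LEFT SINGLE QUOTATION MARK
# 0x92 -> U+2019 RIGHT SINGLE QUOTATION MARK
```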
Jukka Korpela also provides a useful JavaScript application called gwrite – a virtual Unicode keyboard from which you can select and copy many characters. Finally, I have reproduced Ian Hickson’s very useful Unicode tools in my copy of the DevEdge Sidebar, and I have added a character generator that will generate a character from its Unicode code point entered in decimal, hexadecimal or octal.
Next, in part 3, we will look at some of the issues caused by the BOM and other difficulties with Unicode, as well as debugging some of the common problems. We will also take a closer look at how the octets are encoded in UTF-8, and how to determine the exact octets used, which is useful when working with a binary editor. In addition, we will look at UTF-16 and UTF-32 and discuss their advantages and disadvantages in relation to the web.
UTF-8 no longer supports sequences longer than 4 octets.
( http://www.ietf.org/rfc/rfc3629.txt )
I found this other editor to be really easy to use (and free, of course). I use it so much I can’t avoid recommending it:
http://www.esperanto.mv.ru/UniRed/ENG/
I noted a funny thing: trying to print your article, it only prints two pages – after the copyright sign the printer stops. This is the same for both FF and IE, so it is not browser specific. Could this be a nice way to make pages non-printable 😉
So a fix for this would be to remove the ? sign. I copied the text to Word, removed it, and then it printed fine.