It seems that many people aren’t as well informed about character references
as they should be, so I’m going to clearly explain everything you need to
know (and some things you probably don’t) about them, primarily for HTML
and XML. There are two types of character references: numeric
character references and character entity references.
Numeric Character References
There are two forms of numeric character references: decimal and hexadecimal.
Syntax
Decimal references take the form &#nnnn;, where nnnn is the Unicode code point
of the character in decimal notation [0-9].
Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in
hexadecimal notation [0-9a-fA-F].
For hexadecimal character references in HTML, the x is case insensitive, but in
XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid
in XML.
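As a rough illustration of the case rule, the following Python sketch decodes the same reference with the standard library's html.unescape (which follows HTML5 parsing rules rather than the SGML-based rules of HTML 4) and with the expat-based xml.etree.ElementTree parser. The helper name xml_accepts is made up for this example.

```python
import html
import xml.etree.ElementTree as ET

def xml_accepts(fragment):
    """Return True if the fragment parses as element content in XML."""
    try:
        ET.fromstring(f"<r>{fragment}</r>")
        return True
    except ET.ParseError:
        return False

# HTML: the x may be written in either case.
print(html.unescape("&#xA0;") == "\u00a0")  # True
print(html.unescape("&#XA0;") == "\u00a0")  # True

# XML: only the lower-case x is well formed.
print(xml_accepts("&#xA0;"))  # True
print(xml_accepts("&#XA0;"))  # False
```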
In HTML, the reference end is either:
- a reference close (REFC) delimiter (a semicolon by default)
- a record end (RE) function code
- a character, such as a space, which is not part of a valid entity name.
That basically means that the semicolon may be omitted from the end in certain
circumstances. However, remembering exactly when it can and can’t be omitted
is difficult, so to avoid the chance of error it’s very good practice to always
include it. In XML, the semicolon is required.
These same rules apply to character entity references as well.
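To see why always writing the semicolon is the safer habit, here is a small Python sketch. It assumes html.unescape as a stand-in for an HTML parser (it implements HTML5 rules, which are not identical to the SGML rules listed above) and uses the expat-based ElementTree parser for XML.

```python
import html
import xml.etree.ElementTree as ET

# HTML: the reference ends at the space, so the missing semicolon is tolerated.
print(html.unescape("&#8217 x") == "\u2019 x")  # True

# But if the next character is a digit, it is swallowed into the number:
# this is read as code point 82171, not U+2019 followed by "1".
print(html.unescape("&#82171") == "\u20191")    # False
print(html.unescape("&#8217;1") == "\u20191")   # True

# XML: the semicolon is mandatory, so the parser rejects the document.
try:
    ET.fromstring("<r>&#8217 x</r>")
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print(xml_ok)  # False
```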
Character Repertoire
It is most important to note that, regardless of the document’s character
encoding, numeric character references always refer to the code position in the
Unicode character repertoire.
In SGML terms, this is referred to as the Document Character Set (DCS) and, for
HTML, is defined in the SGML Declaration. For both HTML and XML the DCS is
defined as ISO-10646 (or Unicode). You should note that there is a difference
between the DCS and the file’s character encoding, which can be anything,
including ISO-8859-1, UTF-8, UTF-16, Shift_JIS or whatever else. The encoding
does not affect numeric character references in any way; they always refer to a
character within the DCS which is, as stated, defined as Unicode.
For example: if you want to include the right single quotation mark (’) using
a numeric character reference, you need to know the Unicode code point. In
this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal
character references will be, respectively, &#x2019; and &#8217;.
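The same lookup can be sketched in Python, where ord() returns a character's Unicode code point. The function name numeric_references is hypothetical, and the choice of upper-case hex digits is just a convention. The loop at the end shows the point about encodings: the bytes on disk change, the references do not.

```python
def numeric_references(char):
    """Return the (hexadecimal, decimal) references for a single character."""
    code_point = ord(char)
    return f"&#x{code_point:X};", f"&#{code_point};"

quote = "\u2019"  # right single quotation mark
print(numeric_references(quote))  # ('&#x2019;', '&#8217;')

# The encoded bytes differ from one encoding to the next,
# but the code point, and so the reference, stays the same.
for encoding in ("utf-8", "utf-16-be", "cp1252"):
    print(encoding, quote.encode(encoding).hex())
```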
Common Mistakes
A very common mistake is to use the code point from the Windows-1252
character repertoire instead. You should be aware that this is not correct,
even though browsers have been forced to support such references simply because
IE does. The problematic code points range from 128 to 159. In Windows-1252, the
right single quotation mark falls within this range at code position 0x92,
more commonly known in decimal as 146. However, you cannot include the
character using this code point, as in &#146; or &#x92;, because the number is
always interpreted as a Unicode code point, not a Windows-1252 one, and Unicode
defines the characters in this range as control characters.
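A minimal Python sketch of the mismatch, and of one way to repair such references, follows. The function names cp1252_to_unicode and repair_references are invented for this example, and the repair only handles decimal references with a trailing semicolon. It relies on Python's cp1252 codec, which leaves a few positions (such as 0x81) unassigned.

```python
import html
import re
import unicodedata

def cp1252_to_unicode(number):
    """Map a Windows-1252 code position to its Unicode code point.

    Returns None for the few positions Windows-1252 leaves unassigned.
    """
    try:
        return ord(bytes([number]).decode("cp1252"))
    except UnicodeDecodeError:
        return None

def repair_references(markup):
    """Rewrite decimal references in the 128-159 range to Unicode code points."""
    def fix(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            code_point = cp1252_to_unicode(number)
            if code_point is not None:
                return f"&#{code_point};"
        return match.group(0)  # leave everything else untouched
    return re.sub(r"&#([0-9]+);", fix, markup)

print(unicodedata.category(chr(146)))  # Cc: U+0092 is a control character
print(cp1252_to_unicode(146))          # 8217
print(repair_references("It&#146;s"))  # It&#8217;s

# HTML5-style decoders remap the range for compatibility, as browsers do.
print(html.unescape("&#146;") == "\u2019")  # True
```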
These control characters are defined as UNUSED in the SGML Declaration for
HTML. In confusing and obscure SGML terms, that makes them non-SGML
characters. According to section 13.1.1 of the SGML handbook, that simply
means that no meaning is assigned to that character, but section 9.2
explicitly states that a non-SGML character can be entered as a data character
within an SGML entity by using a character reference.
Validation Issues
Strictly speaking, although these characters cannot be included
as data characters within an HTML document, it is not invalid to
refer to these characters with character references. The problem is a
combination of the fact that the meaning of a non-SGML character is nothing
short of obscure and, in Unicode, these characters are non-printable control
characters.
Because of this, the validator will only issue a warning. Although the
reference is still technically valid, it should be treated as an error, as it
almost certainly does not mean what the author intended; in fact, its meaning
is undefined.
To clarify the validation issues a little more, compare the results of using a
non-SGML character directly within the markup and a reference to a non-SGML
character. The first will fail validation with an error, while the second will
pass, but with a warning, even though they’re essentially using the same
character.
You should note that this is not the case for XML (including XHTML). Technically,
this range (from 128 to 159) is perfectly valid according to the production
for Char in XML. However, these references still refer to Unicode control
characters whose meaning is undefined in the context of the document, and thus
they should not be used. Although the W3C Validator will issue the same error
and warning for equivalent XHTML documents, this is a symptom of its
origin as an SGML validator patched to work in a sort-of XML mode.
Validating with a true XML validator (like that provided by Page Valet) will
not result in any errors or warnings at all.
It is important to realise that, in XML, using or referring to a character
that does not match the production for Char violates a well-formedness constraint.
For example, using a control character in the range from 0 to 31 (except
for tab, newline and carriage return), either directly or with a numeric
character reference, results in a well-formedness error.
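The Char production of XML 1.0 is short enough to write out as a predicate. The sketch below does that and compares the result with what the expat-based ElementTree parser accepts; is_xml_char and parses are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def is_xml_char(code_point):
    """True if the code point matches the Char production of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

def parses(document):
    """True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(is_xml_char(1), parses("<r>&#1;</r>"))      # False False
print(is_xml_char(9), parses("<r>&#9;</r>"))      # True True
print(is_xml_char(146), parses("<r>&#146;</r>"))  # True True
print(parses("<r>\x01</r>"))                      # False: the direct form fails too
```

Note that code point 146 passes both checks: it is well formed in XML even though it is still a control character with no useful meaning.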
Character Entity References
Character entity references use symbolic names instead of numbers, and take
the form &name;. All entity references are case sensitive. So, for example,
&aring; and &Aring; refer to two separate characters in HTML: å and Å,
respectively. The rules for the reference end are the same as for numeric
character references (discussed above).
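Case sensitivity is easy to check against the HTML 4 entity table that ships with Python as html.entities.name2codepoint. This is only a sketch using that table and html.unescape, not a full HTML parser.

```python
import html
from html.entities import name2codepoint

# Two different names, two different code points.
print(name2codepoint["aring"], name2codepoint["Aring"])  # 229 197
print(html.unescape("&aring; &Aring;"))                  # å Å

# A name in the wrong case is simply not an entity, so it is left alone.
print("ARING" in name2codepoint)  # False
print(html.unescape("&ARING;"))   # &ARING;
```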
Some of the well known entity references in HTML include &amp;, &lt;, &gt;
and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this
was later realised to be a mistake and it was added back again in HTML 4.
Predefined Entity References in XML
In XML, those are 4 of the 5 predefined entity references that can be used in
any XML document without needing to be defined in a DTD. The 5th predefined
entity reference in XML is &apos;, but the reason I mention it separately from
the others is that it is not defined in HTML and, as a result, it is also not
supported in IE for HTML. However, it is rare that one actually needs to use
it, as it is only required within an attribute value delimited by single
quotes ('), rather than the more conventional double quotes ("). In such cases,
a numeric character reference can always be used in its place.
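The following Python sketch shows the five predefined references working in XML without any DTD, confirms that apos is missing from Python's HTML 4 entity table, and escapes a single-quoted attribute value with a numeric reference instead. The helper single_quoted_attribute is a hypothetical name, and it deliberately handles only the characters that matter for this case.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# All five predefined references resolve with no DOCTYPE at all.
element = ET.fromstring("<r>&amp;&lt;&gt;&quot;&apos;</r>")
print(element.text)  # &<>"'

# apos is not part of the HTML 4 entity set.
print("apos" in name2codepoint)  # False

def single_quoted_attribute(value):
    """Quote an attribute value with single quotes, using &#39; for apostrophes."""
    escaped = value.replace("&", "&amp;").replace("<", "&lt;").replace("'", "&#39;")
    return f"'{escaped}'"

attribute = single_quoted_attribute("it's")
print(attribute)  # 'it&#39;s'
print(ET.fromstring(f"<r title={attribute}/>").get("title"))  # it's
```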
External Entity References
HTML 4, XHTML 1.x and MathML define many other character entity references
in their respective DTDs. These are called external entity references. In
HTML, they are divided into three groups: ISO-8859-1 characters; symbols,
mathematical symbols, and Greek letters; and markup-significant and
internationalization characters. Digital Media Minute have provided a useful
character entity reference chart containing all of these. If you’re interested
in the MathML entities, see chapter 6 of MathML 2.0.
Because these are defined in the DTD, technically none of them can be used in
an HTML document without an appropriate DOCTYPE declaration referencing an
appropriate HTML DTD; although, since browsers don’t read the DTD anyway, they
will support them regardless. However, in XHTML and MathML (served with an XML
MIME type), the DOCTYPE is required for practical reasons in order to use any
entity other than the 5 predefined ones.
For example, &nbsp; and &rsquo; are defined in the XHTML DTD; they are not
predefined in XML and so require the DTD to be used. Without it, their use
violates a well-formedness constraint. It should also be noted that using
externally defined entities is unsafe in XML because it requires a validating
XML parser to read the DTD. The Mozilla Web Developer FAQ notes:
In older versions of Mozilla as well as in old Mozilla-based products, there
is no pseudo-DTD catalog and the use of externally defined character entities
(other than the five pre-defined ones) leads to an XML parsing error. There
are also other XHTML user agents that do not support externally defined character
entities (other than the five pre-defined ones). Since non-validating XML processors
are not required to support externally defined character entities (other than
the five pre-defined ones), the use of externally defined character entities
(other than the five pre-defined ones) is inherently unsafe in XML documents
intended for the Web. The best practice is to use straight UTF-8 instead of
entities. (Numeric character references are safe, too.).
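The parsing error the FAQ describes can be reproduced with a non-validating parser such as the expat-based one behind Python's ElementTree. The sketch below also shows one possible workaround: rewriting named entities as numeric references using the HTML 4 table in html.entities. The function to_numeric_references is hypothetical and uses a plain regular expression, so it assumes the markup contains no CDATA sections or comments with entity-like text.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def to_numeric_references(markup):
    """Replace known HTML 4 entity references with hexadecimal references."""
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined and unknown names as they are
        return f"&#x{name2codepoint[name]:X};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

source = "<p>It&rsquo;s&nbsp;here &amp; there</p>"

try:
    ET.fromstring(source)
    accepted = True
except ET.ParseError as error:
    accepted = False
    print("rejected:", error)

safe = to_numeric_references(source)
print(safe)  # <p>It&#x2019;s&#xA0;here &amp; there</p>
print(ET.fromstring(safe).text == "It\u2019s\u00a0here & there")  # True
```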
As noted, the alternative is to just use a numeric character reference instead,
but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16,
and enter the real character (see my Guide to Unicode for more information).
Arguably, if you’re using a Unicode encoding, one of the only times when it is
useful to use a character reference instead of the real character is for
non-printable characters, such as non-breaking space (&nbsp; or, preferably,
&#xA0;), em space, en space, zero-width characters, etc. The main reason for
that is to be able to clearly identify them when you’re reading/editing the
source code.
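As a small sketch of that last idea, the Python function below leaves ordinary text alone and rewrites only a handful of invisible characters as hexadecimal references. The name reveal_invisible and the particular set of characters are assumptions for this example; extend the set to suit your own content.

```python
INVISIBLE = {
    "\u00a0",  # no-break space
    "\u2002",  # en space
    "\u2003",  # em space
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
}

def reveal_invisible(text):
    """Replace hard-to-see characters with hexadecimal character references."""
    return "".join(
        f"&#x{ord(char):X};" if char in INVISIBLE else char
        for char in text
    )

print(reveal_invisible("10\u00a0km"))    # 10&#xA0;km
print(reveal_invisible("caf\u00e9 \u2019"))  # unchanged: café ’
```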
Summary
Numeric character references, both decimal and hexadecimal, can be safely
used in (X)HTML and XML, but you need to be careful that you’re referencing
the character’s code point from the Unicode character repertoire, not
Windows-1252 (especially in the range from 128 to 159).
Character entity references can be used in HTML and in XML; but in XML, all
entities other than the 5 predefined ones need to be defined in a DTD (such as
with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;,
&gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined
in HTML. The use of other entities in XML requires a validating parser, which
makes them inherently unsafe for use on the web. It is recommended that you
stick with the 5 predefined entity references and numeric character references,
or use a Unicode encoding.