It seems that many people aren’t as well informed about character references as they should be, so I’m going to clearly explain everything you need to know (and some things you probably don’t) about them, primarily for HTML and XML. There are two types of character references: numeric character references and character entity references.
Numeric Character References
There are two forms of numeric character references: decimal and hexadecimal.
Syntax
Decimal references take the form &#nnnn;, where nnnn is the Unicode code point of the character in decimal notation [0-9].
Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in hexadecimal notation [0-9a-fA-F].
For hexadecimal character references in HTML, the x is case insensitive, but in XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid in XML.
In HTML, the reference end is either:
- a reference close (REFC) delimiter (a semicolon by default)
- a record end (RE) function code
- a character, such as a space, which is not part of a valid entity name.
That basically means that the semicolon may be omitted from the end in certain circumstances. However, remembering exactly when it can and can’t be omitted is difficult, so to avoid errors it’s very good practice to always include it. In XML, the semicolon is required. These same rules apply to character entity references as well.
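To see these syntax rules in action, here is a minimal sketch in Python. It rests on two assumptions: that Python's html.unescape (which follows the HTML5 rules rather than the SGML-based HTML 4 rules described above) is a fair stand-in for a lenient HTML parser, and that the expat-based xml.etree.ElementTree is a fair stand-in for a strict XML parser. The helper name xml_accepts is made up for this illustration.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment is well-formed as XML element content."""
    try:
        ET.fromstring(f"<r>{fragment}</r>")
        return True
    except ET.ParseError:
        return False


NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
REFERENCES = ("&#160;", "&#xA0;", "&#XA0;", "&#160")

# Lenient HTML-style decoding: an upper-case X and a missing semicolon
# are both tolerated.
html_results = {ref: html.unescape(ref) == NBSP for ref in REFERENCES}

# Strict XML parsing: the x must be lower case and the semicolon is required.
xml_results = {ref: xml_accepts(ref) for ref in REFERENCES}

print(html_results)
print(xml_results)
```

Only the first two spellings are accepted by both parsers, which is a good reason to always write the lower-case x and the semicolon.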
Character Repertoire
It is most important to note that, regardless of the document’s character encoding, numeric character references always refer to the code position in the Unicode character repertoire. In SGML terms, this is referred to as the Document Character Set (DCS) and, for HTML, is defined in the SGML Declaration. For both HTML and XML the DCS is defined as ISO-10646 (or Unicode). You should note that there is a difference between the DCS and the file’s character encoding, which can be anything, including ISO-8859-1, UTF-8, UTF-16, Shift_JIS or whatever else. The encoding does not affect numeric character references in any way: they always refer to a character within the DCS which is, as stated, defined as Unicode.
For example: if you want to include the right single quotation mark (’) using a numeric character reference, you need to know the Unicode code point. In this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal character references will be, respectively, &#x2019; and &#8217;.
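The same calculation can be sketched in a few lines of Python. The function name numeric_references is hypothetical, chosen only for this example; the point is that the reference is derived from the code point alone, while the bytes on disk differ from encoding to encoding.

```python
def numeric_references(char: str) -> tuple[str, str]:
    """Return the (decimal, hexadecimal) character references for one character."""
    code_point = ord(char)  # the Unicode code point, independent of any encoding
    return f"&#{code_point};", f"&#x{code_point:X};"


RIGHT_SINGLE_QUOTE = "\u2019"
decimal_ref, hex_ref = numeric_references(RIGHT_SINGLE_QUOTE)
print(decimal_ref, hex_ref)

# The encoded bytes vary, but the references above never change.
encoded = {
    name: RIGHT_SINGLE_QUOTE.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "cp1252")
}
print(encoded)
```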
Common Mistakes
A very common mistake is to use the code point from the Windows-1252 character repertoire instead. You should be aware that this is not correct, even though browsers have been forced to support such references simply because IE does. The problematic code points range from 128 to 159. In Windows-1252, the right single quotation mark falls within this range at the code position 0x92 or, as it is more commonly known in decimal, 146. However, you cannot include the character using this code point, as in &#x92; or &#146;, because doing so would actually refer to the Unicode code point, not the Windows-1252 one, and the Unicode characters in this range are defined as control characters.
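The mismatch is easy to demonstrate. The following is a rough Python sketch, not a definitive repair tool: fix_cp1252_reference is a hypothetical helper that guesses what the author probably meant, and it relies on Python's cp1252 codec, which leaves a handful of bytes in this range unassigned.

```python
import html
import unicodedata

# What &#146; literally refers to: a Unicode control character (category Cc).
literal = chr(146)
literal_category = unicodedata.category(literal)

# What the author almost certainly meant: byte 0x92 read as Windows-1252.
intended = bytes([146]).decode("cp1252")

# Python's html module follows HTML5, which codified the browser workaround,
# so it quietly maps the reference to the intended character.
browser_like = html.unescape("&#146;")


def fix_cp1252_reference(number: int) -> str:
    """Rewrite a decimal reference in 128-159 to the likely intended code point."""
    if 128 <= number <= 159:
        try:
            number = ord(bytes([number]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # byte is unassigned in Windows-1252; leave the number alone
    return f"&#{number};"


print(literal_category, f"U+{ord(intended):04X}", fix_cp1252_reference(146))
```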
These control characters are defined as UNUSED in the SGML Declaration for HTML. In confusing and obscure SGML terms, that makes them non-SGML characters. According to the SGML handbook in section 13.1.1, that simply means that “no meaning is assigned to that character”, but section 9.2 explicitly states that “a non-SGML character can be entered as a data character within an SGML entity by using a character reference”.
Validation Issues
Strictly speaking, although these characters cannot be included as data characters within an HTML document, it is not invalid to refer to these characters with character references. The problem is a combination of the fact that the meaning of a non-SGML character is nothing short of obscure and, in Unicode, these characters are non-printable control characters.
Because of this, the validator will only issue a warning; but, although its use is still technically valid, it should be treated as an error as it almost certainly does not mean what the author intended — in fact, its meaning is undefined. To clarify the validation issues a little more, compare the results of using a non-SGML character directly within the markup and a reference to a non-SGML character. The first will fail validation with an error; while the second will pass, but with a warning, even though they’re essentially using the same character.
You should note that this is not the case for XML (including XHTML). Technically, this range (from 128 to 159) is perfectly valid according to the production for Char in XML. However, these code points still refer to Unicode control characters whose meaning is undefined in the context of the document, and thus they should not be used. Although the W3C Validator will issue the same error and warning for equivalent XHTML documents, this is a symptom of its origin as an SGML validator patched to work in a sort-of XML mode. Validating with a true XML validator (like that provided by Page Valet) will not result in any errors or warnings at all.
It is important to realise that, in XML, using or referring to a character that does not match the production for Char violates a well-formedness constraint. For example, using a control character in the range from 0 to 31 (except for tab, newline and carriage return) either directly or with a numeric character reference results in a well-formedness error.
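As an illustration, the Char production from XML 1.0 can be written as a small predicate and compared against a real parser. This is a sketch assuming Python's expat-based xml.etree.ElementTree behaves as a conforming XML 1.0 processor; the names is_xml_char and reference_is_well_formed are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_xml_char(code_point: int) -> bool:
    """Check a code point against the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, newline, carriage return
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def reference_is_well_formed(code_point: int) -> bool:
    """Ask the parser whether a numeric reference to this code point is allowed."""
    try:
        ET.fromstring(f"<r>&#{code_point};</r>")
        return True
    except ET.ParseError:
        return False


# 146 is a legal (if meaningless) control character; 1 is not a Char at all.
for cp in (9, 146, 1):
    print(cp, is_xml_char(cp), reference_is_well_formed(cp))
```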
Character Entity References
Character entity references use symbolic names instead of numbers, and take the form &name;. All entity references are case sensitive. So, for example, &aring; and &Aring; refer to two separate characters in HTML: å and Å, respectively. The rules for the reference end are the same as for numeric character references (discussed above).
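The case sensitivity can be checked with Python's standard library, which ships a table of the HTML 4 entity names. This is only a quick sketch; name2codepoint and html.unescape are real standard-library names, while the variable names are mine.

```python
import html
from html.entities import name2codepoint

# The two names differ only in case but map to different code points.
lower = name2codepoint["aring"]  # U+00E5, å
upper = name2codepoint["Aring"]  # U+00C5, Å

print(f"&aring; -> U+{lower:04X} {html.unescape('&aring;')}")
print(f"&Aring; -> U+{upper:04X} {html.unescape('&Aring;')}")
```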
Some of the well known entity references in HTML include &amp;, &lt;, &gt; and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this was later realised to be a mistake and it was added back again in HTML 4.
Predefined Entity References in XML
In XML, &amp;, &lt;, &gt; and &quot; are 4 of the 5 predefined entity references that can be used in any XML document without needing to be defined in a DTD. The 5th predefined entity reference in XML is &apos;, but the reason I mention it separately from the others is that it is not defined in HTML and, as a result, it is also not supported in IE for HTML. However, it is rare that one actually needs to use it, as it is only required within an attribute value delimited by single quotes ('), rather than the more conventional double quotes ("). In such cases, a numeric character reference can always be used in its place.
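A short Python sketch shows both points: an XML parser resolves all five predefined references without any DTD, and the HTML 4 entity table in the standard library (html.entities.name2codepoint) has no entry for apos. The sample document and variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

PREDEFINED = ("amp", "lt", "gt", "quot", "apos")

# No DOCTYPE at all, yet every predefined reference resolves.
# Attribute b uses the numeric alternative to &apos;.
doc = "<r a='it&apos;s' b='it&#39;s'>&amp;&lt;&gt;&quot;&apos;</r>"
root = ET.fromstring(doc)
print(root.get("a"), root.get("b"), root.text)

# Which of the five are missing from the HTML 4 entity table?
not_in_html4 = [name for name in PREDEFINED if name not in name2codepoint]
print(not_in_html4)
```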
External Entity References
HTML 4, XHTML 1.x and MathML define many other character entity references in their respective DTDs. These are called external entity references. In HTML, they are divided into three groups: ISO-8859-1 characters; symbols, mathematical symbols and Greek letters; and markup-significant and internationalization characters. Digital Media Minute have provided a useful character entity reference chart containing all of these. If you’re interested in the MathML entities, see chapter 6 of MathML 2.0.
Because these are defined in the DTD, technically none of them can be used in an HTML document without an appropriate DOCTYPE declaration referencing an appropriate HTML DTD; although, since browsers don’t read the DTD anyway, browsers will support them regardless. However, in XHTML and MathML (served with an XML MIME type), the DOCTYPE is required for practical reasons to use any entity other than the 5 predefined ones.
For example, &nbsp; and &rsquo; are defined in the XHTML DTD; they are not predefined in XML and so require the DTD to be used. Without it, their use violates a well-formedness constraint. It should also be noted that using externally defined entities is unsafe in XML because it requires a validating XML parser to read the DTD. The Mozilla Web Developer FAQ notes:
In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of externally defined character entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support externally defined character entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support externally defined character entities (other than the five pre-defined ones), the use of externally defined character entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.).
As noted, the alternative is to just use a numeric character reference instead, but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16, and enter the real character (see my Guide to Unicode for more information). Arguably, if you’re using a Unicode encoding, one of the only times when it is useful to use a character reference instead of the real character is for non-printable characters, such as non-breaking space (&nbsp; or, preferably, &#xA0;), Em-space, En-space, zero-width characters, etc. The main reason for that is to be able to clearly identify them when you’re reading/editing the source code.
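To make the advice concrete, here is a minimal Python sketch that rewrites HTML 4 named references as numeric ones, leaving the five XML predefined entities alone. The function name to_numeric_references and the sample markup are hypothetical, and the regular expression is a simplification that ignores CDATA sections and comments.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace known HTML 4 entity references with decimal numeric references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # safe already, or unknown: leave untouched
        return f"&#{name2codepoint[name]};"

    return _ENTITY.sub(replace, markup)


source = "<p>Fish&nbsp;&amp;&nbsp;chips: it&rsquo;s tasty</p>"
converted = to_numeric_references(source)
print(converted)

# Without a DTD, the original is rejected but the converted markup parses.
try:
    ET.fromstring(source)
    source_parses = True
except ET.ParseError:
    source_parses = False
parsed_text = ET.fromstring(converted).text
```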
Summary
Numeric character references, both decimal and hexadecimal, can be safely used in (X)HTML and XML, but you need to be careful that you’re referencing the character’s code point from the Unicode character repertoire, not Windows-1252 (especially in the range from 128 to 159).
Character entity references can be used in HTML and in XML; but for XML, any entity other than the 5 predefined ones needs to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.
I’m sorry for being overly pedantic here Lachlan 🙂
Actually, there is nothing called character entity references. Entity references are either general entity references or parameter entity references.
Also, what you call decimal character reference is called numeric character reference in the standard.
There are also hex character references as you say, but they are not a “subgroup” of numeric character reference, they are a subgroup of character reference.
There is also a third subgroup of character reference: named character references, like &#TAB; and &#SPACE;, but I’ve never seen them used, and as far as I remember from testing them, no browser supports them (which isn’t a surprise).
IOW:
Entity references consist of general entity references and parameter entity references.
Character references consist of named character references, numeric character references and hex character references.
As I said, sorry for being overly pedantic here.
David: For simplicity, I was using the terminology from the HTML 4 rec, which calls them numeric character references and character entity references, respectively, rather than the more confusing, though technically more accurate, SGML terminology.
Confusingly, HTTP refers to the character encoding as the character set, so when you see Content-Type: text/html;charset=UTF-8 it’s actually defining the character encoding, and the character set remains unchanged.
Thank you Lachlan for taking the time to gather and organize all this useful information.
It’s very practical and focused, which is no easy feat considering the possible ramifications.
PS : I really hate the little Voight-Kampf test for the comments.