Character References Explained

It seems that many people aren’t as well informed about character references as they should be, so I’m going to clearly explain everything you need to know (and some things you probably don’t) about them, primarily for HTML and XML. There are two types of character references: numeric character references and character entity references.

Numeric Character References

There are two forms of numeric character references: decimal and hexadecimal.

Syntax

Decimal references take the form &#nnnn;, where nnnn is the Unicode code point of the character in decimal notation [0-9]. Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in hexadecimal notation [0-9a-fA-F].

For hexadecimal character references in HTML, the x is case insensitive, but in XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid in XML.
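To make the two forms a little more concrete, here is a minimal sketch using Python's standard library. It assumes that html.unescape is an acceptable stand-in for HTML parsing rules and that xml.etree.ElementTree is an acceptable stand-in for a non-validating XML parser; the helper name parses_as_xml and the sample strings are my own.

```python
import html
import xml.etree.ElementTree as ET

# Decimal and hexadecimal forms of the same code point (U+00A0, no-break space).
decimal = html.unescape("&#160;")
hexadecimal = html.unescape("&#xA0;")
upper_x = html.unescape("&#XA0;")  # upper-case X is accepted under HTML rules

print(decimal == hexadecimal == upper_x == "\u00a0")


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml("<p>&#xA0;</p>"))  # lower-case x
print(parses_as_xml("<p>&#XA0;</p>"))  # upper-case X
```

The first two lines printed should be True and the last False, since the XML parser refuses the upper-case X.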

In HTML, the reference end is either:

  • a reference close (REFC) delimiter (a semicolon by default)
  • a record end (RE) function code
  • a character, such as a space, which is not part of a valid entity name.

That basically means that the semicolon may be omitted in certain circumstances. However, remembering exactly when it can and can’t be omitted is difficult, so to avoid errors it’s good practice to always include it. In XML, the semicolon is required. These same rules apply to character entity references as well.
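The following sketch shows why leaving the semicolon off is risky. Note the assumption: Python's html.unescape follows the HTML5 parsing rules, which are not identical to the SGML rules listed above, so treat it as an approximation of what browsers do rather than a statement of the HTML 4 rules. The sample strings are mine.

```python
import html
import xml.etree.ElementTree as ET

# Under HTML parsing rules the semicolon can sometimes be left off...
lenient = html.unescape("caf&#233 au lait")  # the space ends the reference
print(lenient)

# ...but the result can be surprising when a longer name was intended.
surprise = html.unescape("&notit;")  # read as &not followed by the text "it;"
print(surprise)

# XML has no such leniency: a missing semicolon is a well-formedness error.
try:
    ET.fromstring("<p>caf&#233 au lait</p>")
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print(xml_ok)
```

Always writing the semicolon avoids both the surprise in HTML and the parsing error in XML.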

Character Repertoire

It is most important to note that, regardless of the document’s character encoding, numeric character references always refer to the code position in the Unicode character repertoire. In SGML terms, this is referred to as the Document Character Set (DCS) and, for HTML, is defined in the SGML Declaration. For both HTML and XML, the DCS is defined as ISO 10646 (or Unicode). You should note that there is a difference between the DCS and the file’s character encoding, which can be anything, including ISO-8859-1, UTF-8, UTF-16 or Shift_JIS. The encoding does not affect numeric character references in any way: they always refer to a character within the DCS which is, as stated, defined as Unicode.

For example: if you want to include the right single quotation mark (’) using a numeric character reference, you need to know the Unicode code point. In this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal character references will be, respectively, &#x2019; and &#8217;.
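If you would rather not look the code point up by hand, a few lines of Python will do it. This is only a sketch; char_refs is a name I made up for illustration. It also shows that the bytes used to store the character change with the encoding while the code point stays the same.

```python
def char_refs(ch: str) -> tuple[str, str]:
    """Return the (hexadecimal, decimal) character references for one character."""
    code_point = ord(ch)  # Unicode code point, independent of any file encoding
    return f"&#x{code_point:X};", f"&#{code_point};"


quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
print(char_refs(quote))  # ('&#x2019;', '&#8217;')

# The stored bytes differ from encoding to encoding, but the code point does not.
for encoding in ("utf-8", "utf-16-be", "cp1252"):
    print(encoding, quote.encode(encoding).hex())
```

The single byte 92 printed for cp1252 is the Windows-1252 position of the character, which is not its Unicode code point.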

Common Mistakes

A very common mistake is to use the code point from the Windows-1252 character repertoire instead. You should be aware that this is not correct, even though browsers have been forced to support such references simply because IE does. The problematic code points range from 128 to 159. In Windows-1252, the right single quotation mark falls within this range at code position 0x92, more commonly known in decimal as 146. However, you cannot include the character using this code point, as in &#x92; or &#146;, because doing so would actually refer to the Unicode code point, not the Windows-1252 one, and the Unicode characters in this range are defined as control characters.
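One way to clean up existing markup that makes this mistake is sketched below. It assumes that every reference in the 128 to 159 range was meant as a Windows-1252 byte, which is a guess about the author's intent; repair_cp1252_refs and the regular expression are my own names, not part of any library. The last two lines compare how Python's html.unescape (HTML5 rules) and an XML parser treat the same reference.

```python
import html
import re
import xml.etree.ElementTree as ET

NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def repair_cp1252_refs(text: str) -> str:
    """Rewrite references in the 128-159 range as if they were Windows-1252 bytes."""
    def fix(match: re.Match) -> str:
        number = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        if not 128 <= number <= 159:
            return match.group(0)  # outside the problem range: keep as written
        try:
            char = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte unassigned in Windows-1252: keep
        return f"&#{ord(char)};"
    return NUMERIC_REF.sub(fix, text)


print(repair_cp1252_refs("it&#146;s and it&#x92;s"))

print(ascii(html.unescape("&#146;")))              # HTML5 rules remap it to U+2019
print(ascii(ET.fromstring("<p>&#146;</p>").text))  # XML keeps the control character
```

The difference between the last two results is exactly the trap: the browser-style remapping hides the error, while an XML parser hands you the control character you actually asked for.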

These control characters are defined as UNUSED in the SGML Declaration for HTML. In confusing and obscure SGML terms, that makes them non-SGML characters. According to section 13.1.1 of the SGML Handbook, that simply means that no meaning is assigned to the character, but section 9.2 explicitly states that a non-SGML character can be entered as a data character within an SGML entity by using a character reference.

Validation Issues

Strictly speaking, although these characters cannot be included as data characters within an HTML document, it is not invalid to refer to these characters with character references. The problem is a combination of the fact that the meaning of a non-SGML character is nothing short of obscure and, in Unicode, these characters are non-printable control characters.

Because of this, the validator will only issue a warning; but, although its use is still technically valid, it should be treated as an error as it almost certainly does not mean what the author intended — in fact, its meaning is undefined. To clarify the validation issues a little more, compare the results of using a non-SGML character directly within the markup and a reference to a non-SGML character. The first will fail validation with an error; while the second will pass, but with a warning, even though they’re essentially using the same character.

You should note that this is not the case for XML (including XHTML). Technically, this range (from 128 to 159) is perfectly valid according to the production for Char in XML. However, these code points still refer to Unicode control characters whose meaning is undefined in the context of the document, and thus they should not be used. Although the W3C Validator will issue the same error and warning for equivalent XHTML documents, this is a symptom of its origin as an SGML validator patched to work in a sort-of XML mode. Validating with a true XML validator (like that provided by Page Valet) will not result in any errors or warnings at all.

It is important to realise that, in XML, using or referring to a character that does not match the production for Char violates a well-formedness constraint. For example, using a control character in the range from 0 to 31 (except for tab, newline and carriage return) either directly or with a numeric character reference results in a well-formedness error.
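The Char production is short enough to write out as a function. The sketch below assumes XML 1.0 (the ranges are different in XML 1.1) and uses xml.etree.ElementTree as the parser; is_xml_char and well_formed are names I picked for illustration.

```python
import xml.etree.ElementTree as ET


def is_xml_char(code_point: int) -> bool:
    """Check a code point against the Char production of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def well_formed(fragment: str) -> bool:
    """Return True if the parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# 1 is a C0 control, 9 is tab, 146 is in the 128-159 range discussed earlier.
for n in (1, 9, 146):
    print(n, is_xml_char(n), well_formed(f"<p>&#{n};</p>"))
```

For each of the three code points, the parser's verdict should agree with the production: 1 is rejected, while 9 and 146 are accepted.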

Character Entity References

Character entity references use symbolic names instead of numbers, and take the form &name;. All entity references are case sensitive. So, for example, &aring; and &Aring; refer to two separate characters in HTML: å and Å, respectively. The rules for the reference end are the same as for numeric character references (discussed above).
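A quick way to see the case sensitivity is to look the two names up in Python's html.entities table, which lists the HTML 4 entity names. This is just a small illustration, not a complete entity resolver.

```python
import html
from html.entities import name2codepoint

# The same letters in a different case name two different characters.
for name in ("aring", "Aring"):
    code_point = name2codepoint[name]
    print(f"&{name};", f"U+{code_point:04X}", html.unescape(f"&{name};"))
```

The lower-case name maps to U+00E5 and the capitalised one to U+00C5.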

Some of the well known entity references in HTML include &amp;, &lt;, &gt; and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this was later realised to be a mistake and it was added back in HTML 4.

Predefined Entity References in XML

In XML, those are 4 of the 5 predefined entity references that can be used in any XML document without needing to be defined in a DTD. The 5th predefined entity reference in XML is &apos;, but the reason I mention it separately from the others is that it is not defined in HTML and, as a result, it is also not supported in IE for HTML. However, it is rare that one actually needs to use it, as it is only required within an attribute value delimited by single quotes ('), rather than the more conventional double quotes ("). In such cases, a numeric character reference can always be used in its place.
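Python's own html.escape happens to take the numeric route for the apostrophe, which makes it a handy illustration. The sketch below also checks, using xml.etree.ElementTree, that the named and numeric spellings give the same result inside a single-quoted XML attribute; the element and attribute names are made up.

```python
import html
import xml.etree.ElementTree as ET

value = "it's"
escaped = html.escape(value, quote=True)
print(escaped)  # the apostrophe becomes a numeric reference, not the named one

# All three spellings are well-formed inside a single-quoted XML attribute.
for ref in ("&apos;", "&#x27;", "&#39;"):
    element = ET.fromstring(f"<a title='it{ref}s'/>")
    print(ref, element.get("title"))
```

Because the numeric forms work in both HTML and XML, they are the safer choice if the same markup might end up in either.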

External Entity References

HTML 4, XHTML 1.x and MathML define many other character entity references in their respective DTDs. These are called external entity references. In HTML, they are divided into three groups: ISO-8859-1 characters; symbols, mathematical symbols, and Greek letters; and markup-significant and internationalization characters. Digital Media Minute have provided a useful character entity reference chart containing all of these. If you’re interested in the MathML entities, see chapter 6 of MathML 2.0.

Because these are defined in the DTD, technically none of them can be used in an HTML document without an appropriate DOCTYPE declaration referencing an appropriate HTML DTD; although since browsers don’t read the DTD anyway, browsers will support them, regardless. However, in XHTML and MathML (served with an XML MIME type), the DOCTYPE is required for practical reasons to use any entity, other than the 5 predefined ones.

For example, &nbsp; and &rsquo; are defined in the XHTML DTD; they are not predefined in XML and so require the DTD in order to be used. Without it, their use violates a well-formedness constraint. It should also be noted that using externally defined entities is unsafe in XML, because it relies on the XML parser reading the DTD, which only a validating parser is required to do. The Mozilla Web Developer FAQ notes:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of externally defined character entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support externally defined character entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support externally defined character entities (other than the five pre-defined ones), the use of externally defined character entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.).
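You can see the problem for yourself with any XML parser that does not fetch the DTD. The sketch below assumes xml.etree.ElementTree behaves that way for a fragment with no DOCTYPE, and then rewrites the named references as numeric ones using the HTML 4 table in html.entities. The function named_to_numeric is a hypothetical helper of mine, and its regular expression is naive (it does not skip CDATA sections or comments).

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Replace HTML 4 named entity references with hexadecimal references."""
    def swap(match: re.Match) -> str:
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # predefined in XML, or unknown: leave alone
        return f"&#x{name2codepoint[name]:X};"
    return NAMED_REF.sub(swap, text)


source = "<p>Tom&rsquo;s&nbsp;&amp;&nbsp;Jerry</p>"
try:
    ET.fromstring(source)
except ET.ParseError as error:
    print("without a DTD:", error)  # the parser has no definition for &rsquo;

safe = named_to_numeric(source)
print(safe)
print(ascii(ET.fromstring(safe).text))
```

After the conversion only &amp; remains as a named reference, and the fragment parses without any DTD.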

As noted, the alternative is to just use a numeric character reference instead, but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16, and enter the real character (see my Guide to Unicode for more information). Arguably, if you’re using a Unicode encoding, one of the only times when it is useful to use a character reference instead of the real character is for non-printable characters, such as non-breaking space (&nbsp; or, preferably, &#xA0;), Em-space, En-space, zero-width characters, etc. The main reason for that is to be able to clearly identify them when you’re reading/editing the source code.
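As a rough sketch of that idea, the helper below swaps a handful of invisible characters for hexadecimal references and leaves everything else as real characters. The selection in INVISIBLE is my own and certainly not exhaustive, and reveal_invisible is a made-up name.

```python
# A small, non-exhaustive selection of characters that are hard to see in source.
INVISIBLE = {
    0x00A0: "no-break space",
    0x2002: "en space",
    0x2003: "em space",
    0x200B: "zero width space",
    0x200C: "zero width non-joiner",
    0x200D: "zero width joiner",
}


def reveal_invisible(text: str) -> str:
    """Replace the selected invisible characters with hexadecimal references."""
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) in INVISIBLE else ch
        for ch in text
    )


print(reveal_invisible("10\u00a0km\u200b/h"))  # 10&#xA0;km&#x200B;/h
```

Everything else, including characters such as ’ or å, is left as the real character, on the assumption that the file is saved in a Unicode encoding.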

Summary

Numeric character references, both decimal and hexadecimal, can be safely used in (X)HTML and XML, but you need to be careful that you’re referencing the character’s code point from the Unicode character repertoire, not Windows-1252 (especially in the range from 128 to 159).

Character entity references can be used in HTML and in XML; but in XML, any entity other than the 5 predefined ones needs to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.

5 thoughts on “Character References Explained”

  1. I’m sorry for being overly pedantic here Lachlan :-)

    Actually, there is nothing called character entity references. Entity references are either general entity references or parameter entity references.

    Also, what you call decimal character reference is called numeric character reference in the standard.
    There are also hex character references as you say, but they are not a “subgroup” of numeric character reference, they are a subgroup of character reference.

    There is also a third subgroup of character reference: named character references, like &#TAB; and &#SPACE;, but I’ve never seen them used, and as far as I remember from testing them, no browser supports them (which isn’t a surprise).

    IOW:
    Entity references consist of general entity references and parameter entity references.
    Character references consist of named character references, numeric character references and hex character references.

    As I said, sorry for being overly pedantic here.

  2. Confusingly, HTTP refers to the character encoding as the character set, so when you see Content-Type: text/html;charset=UTF-8 it’s actually defining the character encoding, and the character set remains unchanged.

  3. Thank you Lachlan for taking the time to gather and organize all this useful information.

    It’s very practical and focused, which is no easy feat considering the possible ramifications.

    PS: I really hate the little Voight-Kampff test for the comments.
