Category Archives: Characters

Character encodings, repertoires and related issues, including Unicode.

Handling Character Encodings

Anyone who’s ever written a form for user input and actually cares about ensuring the correct character encoding is submitted has had trouble with users submitting Windows-1252 where ISO-8859-1 was expected. Even if you were intelligent and were using a Unicode encoding like UTF-8 and accepting such input from your forms, there’s still a problem with Trackbacks, since you have no control over what encoding they’re sent in.

This is commonly ignored by implementations, which results in invalid characters being used within the HTML, and you end up with a few question marks (commonly shown as a U+FFFD Replacement Character by browsers) scattered around the text.

Now there is a solution. I’ve written some PHP to first detect the most likely encoding as either being UTF-8, ISO-8859-1 or Windows-1252. If it is UTF-8, nothing needs to be done with it. If it’s ISO-8859-1 or Windows-1252, we need to convert it to UTF-8.

Determining the Encoding

The first 3 functions I’ve written will allow you to determine what character encoding is used. These are isUTF8(), isISO88591() and isCP1252(), and each returns true if the string validates as the respective encoding. They work by using a regular expression that matches valid octet sequences for the encoding. The regular expression for UTF-8 was adapted from the Perl code provided by the W3C in an article about multilingual forms.

My version is a little more restrictive than that, in that it will reject any character with a code point from 128 to 159. Although these code points are valid in XML and can be validly encoded in UTF-8, they are Unicode control characters and they are invalid within HTML 4. Additionally, the chances of a user legitimately submitting those characters are slim to nil, so it’s better to reject them than try to convert them to something else.

The ISO-8859-1 function works in the same way. It too rejects characters with those code points, as it is far more likely that the user has submitted Windows-1252 than the control characters.
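The PHP source is not reproduced here, so the following is only a minimal Python sketch of the detection idea described above, not the actual implementation. The names is_utf8, is_iso88591, is_cp1252 and detect_encoding are hypothetical Python renderings. The UTF-8 pattern follows the structure of the well-known W3C expression with the 128 to 159 range carved out. Rejecting the five bytes that Windows-1252 leaves unassigned is an assumption of this sketch.

```python
import re

# Valid UTF-8 octet sequences, minus the C1 controls
# (U+0080..U+009F, which UTF-8 encodes as C2 80..C2 9F).
UTF8_RE = re.compile(
    rb"""(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII: tab, LF, CR, printable
      | \xC2[\xA0-\xBF]                     # U+00A0..U+00BF, C1 controls excluded
      | [\xC3-\xDF][\x80-\xBF]              # other 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*""",
    re.VERBOSE,
)

# ISO-8859-1 without the C0/C1 control ranges (tab, LF and CR are kept).
ISO88591_RE = re.compile(rb"[\x09\x0A\x0D\x20-\x7E\xA0-\xFF]*")

# Windows-1252: as above plus 0x80..0x9F, skipping the unassigned
# bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D (an assumption of this sketch).
CP1252_RE = re.compile(
    rb"[\x09\x0A\x0D\x20-\x7E\x80\x82-\x8C\x8E\x91-\x9C\x9E-\xFF]*"
)


def is_utf8(data: bytes) -> bool:
    return UTF8_RE.fullmatch(data) is not None


def is_iso88591(data: bytes) -> bool:
    return ISO88591_RE.fullmatch(data) is not None


def is_cp1252(data: bytes) -> bool:
    return CP1252_RE.fullmatch(data) is not None


def detect_encoding(data: bytes):
    """Return the most likely encoding name, or None if nothing matches."""
    # UTF-8 is tested first: valid UTF-8 is usually also valid ISO-8859-1.
    if is_utf8(data):
        return "utf-8"
    if is_iso88591(data):
        return "iso-8859-1"
    if is_cp1252(data):
        return "windows-1252"
    return None
```

The order of the checks matters: the octets C3 A9 validate both as UTF-8 (é) and as ISO-8859-1 (Ã©), so the stricter UTF-8 test has to win.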

Converting to UTF-8

In PHP, the utf8_encode() function can be used to convert from ISO-8859-1 to UTF-8. However, the real world forces us to handle ISO-8859-1 as Windows-1252, yet the utf8_encode() function will not handle that as well as we would like.

Since Windows-1252 is a superset of ISO-8859-1, these can both be handled by the same function: utf8FromCP1252(). Internally, this makes use of the pre-existing utf8_encode() function. Afterwards, it searches the newly encoded UTF-8 string for characters in the offending code points and remaps them to their correct Unicode code points and encodes them.

To do this, a second function is used which accepts the Windows-1252 encoded character, determines the code point, uses a lookup table in an array to find the Unicode code point and then calls a third function to generate the UTF-8 encoded character from that code point.
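The conversion steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the PHP: utf8_from_cp1252 and CP1252_REMAP are hypothetical names, the lookup table is built from Python's own cp1252 codec rather than written out by hand, and the remapping is applied to the decoded string instead of searching the encoded UTF-8 octets.

```python
# Lookup table: code points 0x80..0x9F -> the Unicode code points that
# Windows-1252 assigns to those bytes. Built from Python's cp1252 codec.
CP1252_REMAP = {}
for byte in range(0x80, 0xA0):
    try:
        CP1252_REMAP[byte] = ord(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Windows-1252.
        pass


def utf8_from_cp1252(data: bytes) -> bytes:
    # Step 1: read every byte as ISO-8859-1, which is what utf8_encode() assumes.
    text = data.decode("iso-8859-1")
    # Step 2: remap the C1 control code points to what Windows-1252 meant.
    text = text.translate(CP1252_REMAP)
    # Step 3: emit UTF-8.
    return text.encode("utf-8")
```

Because Windows-1252 agrees with ISO-8859-1 everywhere outside 0x80 to 0x9F, plain ISO-8859-1 input passes through step 2 unchanged, which is why one function can serve both encodings.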

The third function has been adapted from Anne Van Kesteren’s Character references to UTF-8 converter, who originally adapted it from Henri Sivonen’s UTF-8 to Code Point Array Converter. The main difference with my version is that I renamed it and changed the variable names used to something a little more sensible.

Code and Demo

You can see it all in action on the demonstration page. Enter some characters in the UTF-8 and the ISO-8859-1 forms and see how it flawlessly handles the detection and conversion of your input into valid UTF-8 output. The source code is also available.

Character References Explained

It seems that many people aren’t as well informed about character references as they should be, so I’m going to clearly explain everything you need to know (and some things you probably don’t) about them, primarily for HTML and XML. There are two types of character references: numeric character references and character entity references.

Numeric Character References

There are two forms of numeric character references: decimal and hexadecimal.

Syntax

Decimal references take the form &#nnnn;, where nnnn is the Unicode code point for the character in decimal notation [0-9]. Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in hexadecimal notation [0-9a-fA-F].

For hexadecimal character references in HTML, the x is case insensitive but in XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid in XML.

In HTML, the reference end is either:

  • a reference close (REFC) delimiter (a semicolon by default)
  • a record end (RE) function code
  • a character, such as a space, which is not part of a valid entity name.

That basically means that the semi-colon may be omitted from the end in certain circumstances. However, remembering exactly when it can and can’t be omitted is difficult, and to avoid the chance of error as much as possible, it’s very good practice to always include it. In XML, the semi-colon is required. These same rules apply to character entity references as well.

Character Repertoire

It is most important to note that regardless of the document’s character encoding, numeric character references always refer to the code position in the Unicode character repertoire. In SGML terms, this is referred to as the Document Character Set (DCS) and, for HTML, is defined in the SGML Declaration. For both HTML and XML the DCS is defined as ISO-10646 (or Unicode). You should note that there is a difference between the DCS and the file’s character encoding, which can be anything, including ISO-8859-1, UTF-8, UTF-16, Shift_JIS, or whatever else. The encoding does not affect numeric character references in any way: they always refer to a character within the DCS which is, as stated, defined as Unicode.

For example: if you want to include the right single quotation mark (’) using a numeric character reference, you need to know the Unicode code point. In this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal character references will be, respectively, &#x2019; and &#8217;.
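As a small illustration, both forms can be derived from the character itself. This Python sketch uses a hypothetical helper name, numeric_references, and relies only on the built-in ord(), which returns the Unicode code point regardless of how the source file is encoded.

```python
def numeric_references(char: str) -> tuple[str, str]:
    """Return the (hexadecimal, decimal) character references for one character."""
    code_point = ord(char)  # always the Unicode code point
    return f"&#x{code_point:X};", f"&#{code_point};"


# Right single quotation mark, U+2019
hex_ref, dec_ref = numeric_references("\u2019")
print(hex_ref, dec_ref)
```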

Common Mistakes

A very common mistake is to use the code point from the Windows-1252 character repertoire instead. You should be aware that this is not correct, even though browsers have been forced to support them simply because IE does. The problematic code points range from 128 to 159. In Windows-1252, the right single quotation mark falls within this range at the code position 0x92, or, more commonly known in decimal as 146. However, you cannot include the character using this code point, as in &#x92; or &#146;, because doing so would actually refer to the Unicode code point, not Windows-1252, and characters in this range are defined as control characters.
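The mismatch can be seen with Python's standard library. This is a sketch for illustration only: html.unescape() follows the HTML5 parsing rules, which codify the browser workaround of treating references in this range as Windows-1252, while unicodedata reports what the code point really is in Unicode.

```python
import html
import unicodedata

# What Unicode says about code point 146 (0x92): a control character.
category = unicodedata.category(chr(146))  # "Cc" means "Other, control"

# What byte 0x92 means in Windows-1252: right single quotation mark.
cp1252_char = bytes([0x92]).decode("cp1252")

# HTML5-style error recovery remaps the reference to the Windows-1252
# character instead of the control character the reference names.
recovered = html.unescape("&#146;")

print(category, hex(ord(cp1252_char)), hex(ord(recovered)))
```

The recovered character is what the author intended, but only because the parser is repairing the mistake. The correct reference remains the one built from the Unicode code point.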

These control characters are defined as UNUSED in the SGML Declaration for HTML. In confusing and obscure SGML terms, that makes them non-SGML characters. According to the SGML handbook in section 13.1.1 that simply means that no meaning is assigned to that character, but explicitly states in section 9.2 that a non-SGML character can be entered as a data character within an SGML entity by using a character reference.

Validation Issues

Strictly speaking, although these characters cannot be included as data characters within an HTML document, it is not invalid to refer to these characters with character references. The problem is a combination of the fact that the meaning of a non-SGML character is nothing short of obscure and, in Unicode, these characters are non-printable control characters.

Because of this, the validator will only issue a warning; but, although its use is still technically valid, it should be treated as an error as it almost certainly does not mean what the author intended — in fact, its meaning is undefined. To clarify the validation issues a little more, compare the results of using a non-SGML character directly within the markup and a reference to a non-SGML character. The first will fail validation with an error; while the second will pass, but with a warning, even though they’re essentially using the same character.

You should note that this is not the case for XML (including XHTML). Technically, this range (from 128 to 159) is perfectly valid according to the production for Char in XML, but these code points still refer to Unicode control characters, so their meaning is undefined in the context of the document and they should not be used. Although the W3C Validator will issue the same error and warning for equivalent XHTML documents, this is a symptom of its origin as an SGML validator patched to work in a sort-of XML mode. However, validating with a true XML validator (like that provided by Page Valet) will not result in any errors or warnings at all.

It is important to realise that, in XML, using or referring to a character that does not match the production for Char violates a well-formedness constraint. For example, using a control character in the range from 0 to 31 (except for tab, newline and carriage return) either directly or with a numeric character reference results in a well-formedness error.
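A quick way to check this behaviour is to feed small documents to a non-validating XML parser. The sketch below assumes Python's xml.etree.ElementTree (which uses Expat and XML 1.0 rules); is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A reference into the 128 to 159 range matches Char, so it is accepted.
print(is_well_formed("<r>&#146;</r>"))
# Tab (9) is one of the permitted characters below 32.
print(is_well_formed("<r>&#9;</r>"))
# Code point 1 does not match Char: a well-formedness error,
# whether referenced or entered directly.
print(is_well_formed("<r>&#1;</r>"))
print(is_well_formed("<r>\x01</r>"))
```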

Character Entity References

Character entity references use symbolic names instead of numbers, and take the form &name;. All entity references are case sensitive. So, for example, &aring; and &Aring; refer to two separate characters in HTML: å and Å, respectively. The rules for the reference end are the same as for numeric character references (discussed above).
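The case sensitivity is easy to confirm against a table of the HTML 4.01 entity names. As an illustration, Python ships such a table as html.entities.name2codepoint, and the two names map to different code points:

```python
from html.entities import name2codepoint

lower = name2codepoint["aring"]  # U+00E5, LATIN SMALL LETTER A WITH RING ABOVE
upper = name2codepoint["Aring"]  # U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE

print(hex(lower), chr(lower))
print(hex(upper), chr(upper))
```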

Some of the well known entity references in HTML include &amp;, &lt;, &gt; and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this was later realised to be a mistake and added back again in HTML 4.

Predefined Entity References in XML

In XML, those are 4 of the 5 predefined entity references that can be used in any XML document, without needing them to be defined in a DTD. The 5th predefined entity reference in XML is &apos;, but the reason I mention it separately from the others is that it is not defined in HTML and, as a result, it is also not supported in IE for HTML. However, it is rare that one actually needs to use it, as it is only required within an attribute value delimited by single quotes ('), rather than the more conventional double quotes ("). In such cases, a numeric character reference can always be used in its place.
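To make the single-quoted attribute case concrete, here is a minimal Python sketch. The helper name escape_single_quoted_attr is hypothetical, and the choice of &#39; (the decimal reference for the apostrophe, U+0027) is this sketch's way of avoiding &apos; so the output works in both HTML and XML. The membership checks use Python's table of HTML 4.01 entity names.

```python
from html.entities import name2codepoint

# &quot; is an HTML 4.01 entity; &apos; is not.
has_quot = "quot" in name2codepoint
has_apos = "apos" in name2codepoint


def escape_single_quoted_attr(value: str) -> str:
    """Escape text for use inside an attribute delimited by single quotes."""
    return (
        value.replace("&", "&amp;")  # must come first
        .replace("<", "&lt;")
        .replace("'", "&#39;")       # numeric reference instead of &apos;
    )


print("<a title='" + escape_single_quoted_attr("Tom's page") + "'>")
```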

External Entity References

HTML 4, XHTML 1.x and MathML define many other character entity references in their respective DTDs. These are called external entity references. In HTML, they are divided into three groups: ISO-8859-1 characters; symbols, mathematical symbols, and Greek letters; and markup-significant and internationalization characters. Digital Media Minute have provided a useful character entity reference chart containing all of these. If you’re interested in the MathML entities, see chapter 6 of MathML 2.0.

Because these are defined in the DTD, technically none of them can be used in an HTML document without an appropriate DOCTYPE declaration referencing an appropriate HTML DTD; although since browsers don’t read the DTD anyway, browsers will support them, regardless. However, in XHTML and MathML (served with an XML MIME type), the DOCTYPE is required for practical reasons to use any entity, other than the 5 predefined ones.

For example, &nbsp; and &rsquo; are defined in the XHTML DTD, but they are not predefined in XML and so require the DTD to be used. Without it, their use violates a well-formedness constraint. It should also be noted that using externally defined entities is unsafe in XML because it requires a validating XML parser to read the DTD. The Mozilla Web Developer FAQ notes:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of externally defined character entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support externally defined character entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support externally defined character entities (other than the five pre-defined ones), the use of externally defined character entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.).

As noted, the alternative is to just use a numeric character reference instead, but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16, and enter the real character (see my Guide to Unicode for more information). Arguably, if you’re using a Unicode encoding, one of the only times when it is useful to use a character reference instead of the real character is for non-printable characters, such as non-breaking space (&nbsp; or, preferably, &#xA0;), Em-space, En-space, zero-width characters, etc. The main reason for that is to be able to clearly identify them when you’re reading/editing the source code.
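The difference between the three options shows up immediately with a non-validating parser and no DTD. The following Python sketch assumes xml.etree.ElementTree as the parser; parse_text is a hypothetical helper that returns the element's text, or None when the document is rejected.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_text: str):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# Externally defined entity with no DTD: rejected as an undefined entity.
print(parse_text("<p>&rsquo;</p>"))
# Numeric character reference: always safe.
print(parse_text("<p>&#8217;</p>"))
# The five predefined entities need no DTD.
print(parse_text("<p>&amp;&lt;&gt;&quot;&apos;</p>"))
# The real character in a Unicode string: no reference needed at all.
print(parse_text("<p>\u2019</p>"))
```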

Summary

Numeric character references, both decimal and hexadecimal, can be safely used in (X)HTML and XML, but you need to be careful that you’re referencing the character’s code point from the Unicode character repertoire, not Windows-1252 (especially in the range from 128 to 159).

Character entity references can be used in HTML and in XML; but for XML, any other than the 5 predefined entities need to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.

Web Developer Quiz Answers

These are the answers to last week’s Web Developer Quiz. If you have not attempted the quiz yourself, I recommend you do so before reading the following answers. All the responses to the quiz from earlier this week were made public earlier today.

Validation

There is only one error within the sample document. Validate it and see for yourself:

Line 4, column 11: there is no attribute “ALIGN”

The align attribute is not valid in HTML 4.01 Strict because it is deprecated. It is valid in HTML 4.01 Transitional. For information about why line 7 isn’t an error, refer to the validation quiz and associated answers I published earlier.

Elements in the DOM

There are 3 p elements within the document. The syntax <> is an empty start-tag, an unsupported SHORTTAG feature from SGML. It basically means to open the most recent unclosed element. Similarly, </> is an empty end-tag which ends the most recent open element.

The em element will not be present because, despite appearances to the contrary, it is actually commented out. The head and body elements will still be present, even though their start- and end-tags have been omitted.

Validate it and look at the Parse Tree to confirm these answers.

Semantics

The unordered list (option 3) is the most semantically correct. A stylesheet may be used to style it in any way desired.

The <h1> element without the style attribute or the class attribute with a presentational class name is the most appropriate markup for a document title. An external stylesheet may be used, and is the recommended way, to horizontally centre it in a visual medium using a large, bold font. The use of the style attribute or the presentational class name is not recommended because it fails to separate the markup from the presentation.

Everyone got these 2 questions correct. Well done. In hindsight, I wish I had made these more difficult, but since semantics is not an exact science, I found that (in general) the more complicated the question, the less specific the answer could be. So, I settled for relatively easy questions for things that beginners tend to mark up poorly.

Character References

  • For an HTML 4.01 document: the numeric character reference: &#146; and the character entity reference: &apos; are invalid.
  • For an XHTML 1.0 document: technically, none of them are invalid; however the numeric character reference: &#146;, while it is not prohibited in XML, refers to a Unicode control character and should not be used anyway.
  • For a generic XML document with no DTD, only the character entity reference: &rsquo; is invalid. &apos; is valid because it is one of the 5 predefined entities in XML.

Since few people correctly answered these questions, I will be providing more information about this in tomorrow’s post.

Media Types (MIME)

An XHTML 1.1 document SHOULD NOT be served with the text/html MIME type. See the XHTML Media Types Note for more information.

An XHTML 1.0 document MAY be served as text/html when the document conforms to the Appendix C HTML Compatibility Guidelines in the XHTML 1.0 Recommendation. Those who pointed out that this is ludicrous get a bonus point.

If any of you have any questions or comments regarding this quiz, please feel free to let me know. The feedback I have received, or will receive, regarding this quiz will help me a lot with the next one I’m planning, which will most likely be a CSS quiz of some kind, possibly followed by a JavaScript/DOM quiz if I have time. Beyond that, well, you’ll have to wait and see.