XHTML is not for Beginners

As web standards advocates, many of us participate in numerous online communities such as mailing lists, forums, newsgroups and even blogs (both our own and comments on others). In these communities, we often encounter beginners who are either just starting out with HTML, or have been doing HTML for a while, but are new to the concept of developing with standards.

Invariably, such beginners face the eternal question of HTML or XHTML, and today I intend to answer this question (as it applies to beginners) once and for all. For experienced users the answer may be different; this applies only to beginners and to those of us teaching them.

I don’t particularly want to start up the XHTML vs. HTML debate again, nor simply reiterate that XHTML as text/html is extremely harmful; and I must stress that both HTML and XHTML have their uses and it’s important to use the right tool for the job. But for beginners, there needs to be a clear answer with a clear learning path, and those of us teaching them need to be united in our position. For if beginners are hearing different answers from different parties, only confusion will result and we may end up losing them to the dark side of the force forever.

Let me start off by saying that XHTML is not for beginners. We must start with HTML and have a clear learning path towards the future with XHTML. It has been argued that since the future lies with XHTML (although that is yet to be seen), we should be teaching XHTML from the ground up. That sounds nice in theory, but the reality is that we’re still teaching in a predominantly text/html environment, and the fact is: trying to teach XHTML under HTML (tag-soup) conditions is like trying to teach a child to swim by throwing them in the deep end and not realising they’re drowning until it’s too late. When it comes to XHTML, there is so much for a beginner to learn, not to mention the significant issues of browser support, that we must simply accept that they’re not ready and teach them HTML instead.

XHTML is not merely HTML 4 in XML syntax; it comes packaged with all the XML handling requirements as well, with great big “Fragile” and “Handle with Care” stickers on the front of the box. Despite all the myths surrounding the ability to use XHTML as text/html and then simply make the switch to XML when browser support improves, there is significant evidence to show that XHTML developed in a text/html environment will not survive the transition to XML.

The sheer number of tag-soup pages claiming to be XHTML is a direct result of pushing it upon newcomers while leaving out all the extremely important details, most of which they won’t understand yet anyway, but do actually need to learn before using it. I won’t go into the details here, but the issues with XHTML include, among others, the following. I guarantee that if you ask a beginner (who learned XHTML under HTML conditions) about any of them, they’ll look at you blankly, without a clue what you’re talking about.

General Markup Issues

  • Internet Explorer 7 and below do not support XHTML at all, not even limited support. Anyone who says otherwise is either ignorant or lying. (It is expected, but not guaranteed, that IE8 will finally support it).
  • Well-formedness errors are fatal.
  • The namespace (xmlns attribute) must be declared in the root element, despite the validator not issuing an error if it’s omitted.
  • Use of named entity references may be fatal for non-validating parsers (except for amp, lt, gt, quot and apos).
  • Use xml:lang instead of lang.
  • The XML empty element syntax has a different meaning in SGML and HTML, though browsers don’t support that meaning.
  • DTDs do not support validation of mixed namespace documents very well.
  • When served as XML, the DOCTYPE is not required to trigger standards mode in browsers.
  • The XML declaration will trigger quirks mode in IE6 when served as text/html, so it should be omitted in such cases (but see the next few points).
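
To give a feel for what “fatal” means in practice, here is a minimal sketch that uses Python’s standard xml.etree.ElementTree parser as a stand-in for a browser’s XML parser. Browsers differ in how they report the failure, and the helper name is_well_formed and the sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def is_well_formed(markup: str) -> bool:
    """Hypothetical helper: True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # An XML parser stops at the first well-formedness error.
        return False

# Well-formed XHTML: namespace declared on the root, empty element closed.
good = f'<html xmlns="{XHTML_NS}"><body><p>Hello<br/>world</p></body></html>'
# HTML habits that are tolerated as text/html but fatal as XML.
unclosed = f'<html xmlns="{XHTML_NS}"><body><p>Hello<br>world</p></body></html>'
mixed_case = f'<html xmlns="{XHTML_NS}"><body><P>Hello</p></body></html>'

print(is_well_formed(good))        # True
print(is_well_formed(unclosed))    # False
print(is_well_formed(mixed_case))  # False

# The xmlns declaration becomes part of every element's identity.
print(ET.fromstring(good).tag)     # {http://www.w3.org/1999/xhtml}html
```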

MIME and Encoding

  • MIME type must be declared appropriately in the HTTP headers (application/xhtml+xml (preferred), application/xml (acceptable) or text/xml (not recommended)).
  • Encoding should be declared within the XML declaration, rather than the HTTP headers, since XML is a self-describing format. (This does not apply to text/xml).
  • For text/xml, unless specified at the protocol level, US-ASCII must be used.
  • When the XML declaration is omitted, UTF-8 or UTF-16 must be used, unless specified in a higher level protocol.
  • The meta element is useless for specifying the character encoding and MIME type.
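
The encoding rules can be tried out without a server. The sketch below feeds raw bytes to Python’s xml.etree.ElementTree parser, so there is no HTTP layer and therefore no higher level protocol to supply an encoding; the sample content is invented for the example.

```python
import xml.etree.ElementTree as ET

# "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
latin1_body = "<p>caf\u00e9</p>".encode("iso-8859-1")

# With an XML declaration, the document describes its own encoding.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
# Without one, the parser has to assume UTF-8 (or UTF-16 if it sees a BOM).
undeclared = latin1_body

print(ET.fromstring(declared).text)  # café

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    # 0xE9 followed by "<" is not a valid UTF-8 sequence, so parsing stops.
    undeclared_ok = False
print(undeclared_ok)  # False
```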

Scripts and Stylesheets

  • script and style elements are parsed differently; the traditional HTML comment-like syntax within script and style elements must not be used for the purpose of hiding their content from obsolete browsers.
  • document.write() and document.writeln() do not work.
  • innerHTML (non-standard property) is not supported by some XHTML UAs.
  • DOM requires the use of namespace aware methods, where applicable.
  • DOM methods are case sensitive.
  • Element and attribute names from DOM methods are exposed case sensitively (lowercase), compared with uppercase in HTML.
  • XML rules for CSS Stylesheets are applied and they differ significantly from HTML rules. e.g. No special treatment for the body element.
  • Case sensitivity of CSS selectors depends on the markup language; selectors are thus case sensitive for XHTML.
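
The DOM points in this list can be approximated outside a browser. The following sketch uses Python’s xml.dom.minidom, which is only an analogy for a browser’s XML DOM, to show case-sensitive and namespace-aware lookups on a small invented XHTML document.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
doc = minidom.parseString(
    f'<html xmlns="{XHTML_NS}"><body><p>Hello</p></body></html>'
)

# Lookups are case sensitive: "P" does not match the <p> element.
lower = doc.getElementsByTagName("p")
upper = doc.getElementsByTagName("P")
# The namespace-aware method matches on namespace URI plus local name.
namespaced = doc.getElementsByTagNameNS(XHTML_NS, "p")

print(len(lower), len(upper), len(namespaced))  # 1 0 1
print(lower[0].tagName)       # p (lowercase, exactly as written in the source)
print(lower[0].namespaceURI)  # http://www.w3.org/1999/xhtml
```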

I’m quite sure that isn’t a complete list of differences between HTML and XHTML, but each and every one of them (plus any that I’ve missed) needs to be learned by anyone who is learning XHTML properly.

The vast majority of those do not apply, or are at least not exposed well, under HTML conditions. Therefore, because of all of this and the fact that most beginners will be learning under HTML conditions, XHTML is not safe for beginners to learn. By teaching XHTML to beginners, we’re really only teaching a new form of tag soup under the guise of “standards based development” and it is doing significantly more harm than good.

Experienced users who are competent enough to understand all of these issues and make an informed decision about whether to use HTML or XHTML may do so, but we cannot expect the same from beginners. So, let me reiterate that we must be united on this issue and we must encourage beginners to start with HTML, not XHTML.

Character References Explained

It seems that many people aren’t as well informed about character references as they should be, so I’m going to clearly explain everything you need to know (and some things you probably don’t) about them, primarily for HTML and XML. There are two types of character references: numeric character references and character entity references.

Numeric Character References

There are two forms of numeric character references: decimal and hexadecimal.

Syntax

Decimal references take the form &#nnnn;, where nnnn is the Unicode code point of the character in decimal notation [0-9]. Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in hexadecimal notation [0-9a-fA-F].

For hexadecimal character references in HTML, the x is case insensitive, but in XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid in XML.

In HTML, the reference end is either:

  • a reference close (REFC) delimiter (a semicolon by default)
  • a record end (RE) function code
  • a character, such as a space, which is not part of a valid entity name.

That basically means that the semi-colon may be omitted from the end in certain circumstances. However, remembering exactly when it can and can’t be omitted is difficult; to avoid the chance of error as much as possible, it’s very good practice to always include it. In XML, the semi-colon is required. These same rules apply to character entity references as well.
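
The difference between the two sets of syntax rules can be seen by running the same references through an HTML-style decoder and an XML parser. This is a sketch only: Python’s html.unescape follows HTML5 parsing rules rather than the SGML rules of HTML 4, although the two agree on the cases shown, and the helper xml_accepts is a name invented for the example.

```python
import html
import xml.etree.ElementTree as ET

def xml_accepts(fragment: str) -> bool:
    """Hypothetical helper: True if the fragment is well-formed inside <p>."""
    try:
        ET.fromstring(f"<p>{fragment}</p>")
        return True
    except ET.ParseError:
        return False

# Lowercase x, uppercase X, decimal, and decimal without the semi-colon.
samples = ["&#xA0;", "&#XA0;", "&#160;", "&#160"]

for sample in samples:
    decoded = html.unescape(sample)  # HTML-style decoding
    print(sample, repr(decoded), xml_accepts(sample))

# The HTML-style decoder turns all four into a no-break space (U+00A0).
# The XML parser accepts only "&#xA0;" and "&#160;".
```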

Character Repertoire

It is most important to note that regardless of the document’s character encoding, numeric character references always refer to the code position in the Unicode character repertoire. In SGML terms, this is referred to as the Document Character Set (DCS) and, for HTML, is defined in the SGML Declaration. For both HTML and XML the DCS is defined as ISO/IEC 10646 (or Unicode). You should note that there is a difference between the DCS and the file’s character encoding, which can be anything, including ISO-8859-1, UTF-8, UTF-16, Shift_JIS, or whatever else. The encoding does not affect numeric character references in any way; they always refer to a character within the DCS which is, as stated, defined as Unicode.

For example: if you want to include the right single quotation mark (’) using a numeric character reference, you need to know the Unicode code point. In this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal character references will be, respectively, &#x2019; and &#8217;.
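
As a small worked example of the point above, the following sketch derives both references from the character itself and shows that the bytes on disk change with the encoding while the code point does not. The function name numeric_references is made up for this illustration.

```python
def numeric_references(char: str) -> tuple[str, str]:
    """Return the (hexadecimal, decimal) numeric character references."""
    code_point = ord(char)  # always the Unicode code point
    return f"&#x{code_point:X};", f"&#{code_point};"

quote = "\u2019"  # right single quotation mark
print(numeric_references(quote))  # ('&#x2019;', '&#8217;')

# The encoded bytes differ from one encoding to the next,
# but the references above stay the same.
for encoding in ("utf-8", "utf-16-be", "cp1252"):
    print(encoding, quote.encode(encoding).hex())
# utf-8 e28099
# utf-16-be 2019
# cp1252 92
```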

Common Mistakes

A very common mistake is to use the code point from the Windows-1252 character repertoire instead. You should be aware that this is not correct, even though browsers have been forced to support them simply because IE does. The problematic code points range from 128 to 159. In Windows-1252, the right single quotation mark falls within this range at the code position 0x92, or, more commonly known in decimal as 146. However, you cannot include the character using this code point, as in &#x92; or &#146;, because doing so would actually refer to the Unicode code point, not Windows-1252, and characters in this range are defined as control characters.
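
If you have inherited markup containing such references, the mapping back to the intended Unicode code point is mechanical. The sketch below assumes the author really did mean the Windows-1252 character; the function name fix_cp1252_reference is hypothetical.

```python
import unicodedata

def fix_cp1252_reference(code: int) -> int:
    """Map a code mistakenly taken from Windows-1252 (128-159) to Unicode."""
    if 128 <= code <= 159:
        try:
            return ord(bytes([code]).decode("cp1252"))
        except UnicodeDecodeError:
            # A few positions in this range are unassigned in Windows-1252.
            return code
    return code

# In Unicode, code point 146 is a control character, not a quotation mark.
print(unicodedata.category(chr(146)))  # Cc

# The intended character lives at U+2019 (decimal 8217).
print(fix_cp1252_reference(146))       # 8217
print(f"&#{fix_cp1252_reference(146)};")  # &#8217;
```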

These control characters are defined as UNUSED in the SGML Declaration for HTML. In confusing and obscure SGML terms, that makes them non-SGML characters. According to section 13.1.1 of the SGML Handbook, that simply means that no meaning is assigned to the character; section 9.2, however, explicitly states that a non-SGML character can be entered as a data character within an SGML entity by using a character reference.

Validation Issues

Strictly speaking, although these characters cannot be included as data characters within an HTML document, it is not invalid to refer to these characters with character references. The problem is a combination of the fact that the meaning of a non-SGML character is nothing short of obscure and, in Unicode, these characters are non-printable control characters.

Because of this, the validator will only issue a warning; but, although its use is still technically valid, it should be treated as an error as it almost certainly does not mean what the author intended — in fact, its meaning is undefined. To clarify the validation issues a little more, compare the results of using a non-SGML character directly within the markup and a reference to a non-SGML character. The first will fail validation with an error; while the second will pass, but with a warning, even though they’re essentially using the same character.

You should note that this is not the case for XML (including XHTML). Technically, characters in this range (from 128 to 159) are perfectly valid according to the production for Char in XML, but they still refer to Unicode control characters; their meaning is undefined in the context of the document and thus they should not be used. Although the W3C Validator will issue the same error and warning for equivalent XHTML documents, this is a symptom of its origin as an SGML validator patched to work in a sort-of XML mode. However, validating with a true XML validator (like that provided by Page Valet) will not result in any errors or warnings at all.

It is important to realise that, in XML, using or referring to a character that does not match the production for Char violates a well-formedness constraint. For example, using a control character in the range from 0 to 31 (except for tab, newline and carriage return) either directly or with a numeric character reference results in a well-formedness error.
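
The Char production from XML 1.0 is short enough to write out directly. The sketch below implements it as a predicate and compares the result with what Python’s xml.etree.ElementTree parser does with the matching numeric character reference; the helper names is_xml_char and parses are invented for the example.

```python
import xml.etree.ElementTree as ET

def is_xml_char(code_point: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

def parses(code_point: int) -> bool:
    """Hypothetical helper: True if a reference to the code point is accepted."""
    try:
        ET.fromstring(f"<p>&#{code_point};</p>")
        return True
    except ET.ParseError:
        return False

# 1 and 31 are rejected; tab (9) and the 128-159 range (e.g. 146) are accepted.
for code_point in (1, 9, 31, 146):
    print(code_point, is_xml_char(code_point), parses(code_point))
```

Note that 146 passes both checks: it is well-formed XML, which is exactly why a true XML validator stays silent about it even though it is almost certainly not what the author meant.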

Character Entity References

Character entity references use symbolic names instead of numbers, and take the form &name;. All entity references are case sensitive. So, for example, &aring; and &Aring; refer to two separate characters in HTML: å and Å, respectively. The rules for the reference end are the same as for numeric character references (discussed above).
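
Python’s standard library ships the HTML 4 entity table as html.entities.name2codepoint, which makes the case sensitivity easy to check. This is just a lookup against that table, not a statement about any particular browser.

```python
import html
from html.entities import name2codepoint

# Two names differing only in case map to two different characters.
for name in ("aring", "Aring"):
    code_point = name2codepoint[name]
    print(f"&{name};", f"U+{code_point:04X}", html.unescape(f"&{name};"))
# &aring; U+00E5 å
# &Aring; U+00C5 Å

# A name with the wrong case is simply not defined.
print("ARING" in name2codepoint)  # False
```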

Some of the well known entity references in HTML include &amp;, &lt;, &gt; and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this was later realised to be a mistake and added back again in HTML 4.

Predefined Entity References in XML

In XML, those are 4 of the 5 predefined entity references that can be used in any XML document, without needing to be defined in a DTD. The 5th predefined entity reference in XML is &apos;, but the reason I mention it separately from the others is that it is not defined in HTML and, as a result, it is also not supported in IE for HTML. However, it is rare that one actually needs to use it, as it is only required within an attribute value delimited by single quotes ('), rather than the more conventional double quotes ("). In such cases, a numeric character reference can always be used in its place.
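
The status of the five names can be checked against the HTML 4 entity table in Python’s html.entities module, and the numeric alternative can be confirmed with an XML parser. A brief sketch, using an invented title attribute:

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # the HTML 4 entity names

XML_PREDEFINED = ("amp", "lt", "gt", "quot", "apos")

# Which of XML's predefined names does HTML 4 also define?
in_html4 = {name: name in name2codepoint for name in XML_PREDEFINED}
print(in_html4)
# {'amp': True, 'lt': True, 'gt': True, 'quot': True, 'apos': False}

# Inside a single-quoted attribute value, &apos; and &#39; are equivalent in XML.
named = ET.fromstring("<p title='it&apos;s'/>").get("title")
numeric = ET.fromstring("<p title='it&#39;s'/>").get("title")
print(named == numeric == "it's")  # True
```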

External Entity References

HTML 4, XHTML 1.x and MathML define many other character entity references in their respective DTDs. These are called external entity references. In HTML, they are divided into three groups: ISO-8859-1 characters; symbols, mathematical symbols and Greek letters; and markup-significant and internationalization characters. Digital Media Minute have provided a useful character entity reference chart containing all of these. If you’re interested in the MathML entities, see chapter 6 of MathML 2.0.

Because these are defined in the DTD, technically none of them can be used in an HTML document without an appropriate DOCTYPE declaration referencing an appropriate HTML DTD; although since browsers don’t read the DTD anyway, browsers will support them, regardless. However, in XHTML and MathML (served with an XML MIME type), the DOCTYPE is required for practical reasons to use any entity, other than the 5 predefined ones.

For example, &nbsp; and &rsquo; are defined in the XHTML DTD; they are not predefined in XML and so require the DTD to be used. Without it, their use violates a well-formedness constraint. It should also be noted that using externally defined entities is unsafe in XML because it requires a validating XML parser to read the DTD. The Mozilla Web Developer FAQ notes:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of externally defined character entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support externally defined character entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support externally defined character entities (other than the five pre-defined ones), the use of externally defined character entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.).

As noted, the alternative is to just use a numeric character reference instead, but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16, and enter the real character (see my Guide to Unicode for more information). Arguably, if you’re using a Unicode encoding, one of the only times when it is useful to use a character reference instead of the real character is for non-printable characters, such as non-breaking space (&nbsp; or, preferably, &#xA0;), Em-space, En-space, zero-width characters, etc. The main reason for that is to be able to clearly identify them when you’re reading/editing the source code.
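
Converting named references to numeric ones is easy to automate. The following is a minimal sketch, assuming the names come from the HTML 4 entity set that Python exposes as html.entities.name2codepoint; the function named_to_numeric is hypothetical, and the simple regular expression does not skip comments or CDATA sections, so treat it as an illustration rather than a finished tool.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(markup: str) -> str:
    """Rewrite HTML 4 entity references as decimal numeric references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Leave the five predefined names and any unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

source = "<p>Fish&nbsp;&amp;&nbsp;chips, the chef&rsquo;s special</p>"
safe = named_to_numeric(source)
print(safe)
# <p>Fish&#160;&amp;&#160;chips, the chef&#8217;s special</p>

# A non-validating parser with no DTD rejects the original...
try:
    ET.fromstring(source)
    original_ok = True
except ET.ParseError:
    original_ok = False
print(original_ok)  # False

# ...but accepts the converted version.
print(ET.fromstring(safe).text == "Fish\u00a0&\u00a0chips, the chef\u2019s special")  # True
```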

Summary

Numeric character references, both decimal and hexadecimal, can be safely used in (X)HTML and XML, but you need to be careful that you’re referencing the character’s code point from the Unicode character repertoire, not Windows-1252 (especially in the range from 128 to 159).

Character entity references can be used in HTML and in XML; but in XML, entities other than the 5 predefined ones need to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.