Category Archives: MarkUp

SGML, (X)HTML, XML and other markup languages.

Validation Quiz

Let’s say you’ve been writing HTML and XHTML for years. Being a standards activist, you always write well formed, valid markup. You meticulously validate every document you write. Not only that, but you’ve installed the web developer toolbar in Firefox or Mozilla and, as a hobby, you run the validator on every site you visit. With years of experience under your belt, you think you can handle any error the validator throws at you, and you’re confident you can fix whatever it is in under a minute.

If that description fits you, then I hereby challenge you to find the one and only real validation error within the following sample HTML or XHTML document (I’m not telling you which, you figure it out). Do you think the validator will help? Go ahead and test it! I’ve exploited some known bugs in the validator to ensure you can’t cheat quite so easily. The validator will, in its current state, issue 80 errors; none of which are real!

There is one, and only one, true validation error within this document. The first person to comment with the correct answer and explanation will be featured in a follow-up post, to give them some recognition for their hard work. Feel free to discuss and ask questions here in the comments (or wherever else you like). This is designed to be a fun exercise for you to realise just how much you really don’t know about HTML and/or XHTML.

Do you think you’re ready to take the quiz? Do you think this will be a walk in the park, and you’ll be the first across the line with the right answer? Ok, here it is, and remember, have fun!

Assume the HTTP headers contain: Content-Type: text/html;charset=UTF-8

<?xml version="1.0" comment="Find the Error!" ?>
<!-- -- -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- -- -->
<!doctype html public "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile   CDATA "?" -- U+263A WHITE SMILING FACE -->
]>
<html lang="en">
<title/validation quiz/
</head>
<p>In this document, there &exist;s a single validation error.  It makes
use of some <strong<em/very/</strong> uncommon & unsupported markup techiques
designed to fool the faint hearted.
<>This exploits some known bugs in <a href=http://validator.w3.org/
to both help prevent cheaters and confuse even the most experienced
authors.</>
<form method="get" action="http://validator.w3.org/check"
<table
<tr
<td<input text checked id=uri name=uri size=40/>
<><label for=uri>Is this test too hard?</label></>
<><td<button button>Don't Cheat!</>
</tbody
></table>
<ul/
<li><![CDATA[
<li Oops<!-- ?]]> -->
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ---->
<blink>I'll bet this is &#147;annoying&#148;!</blink>
<p align="right">Remeber, it's a Strict DOCTYPE!
<!-- ------ Don't give up now! ----- >
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<p>Is the error here --><li>?/
<p/>The question is: Is this<br>HTML or<br/>XHTML
served as text/html? &smile</></></>

HTML Comments in Scripts

It is common practice, when including inline scripts within HTML markup, to surround the entire script with what appears to be an HTML comment, like this:

Example 1:

<script type="text/javascript"><!--
    ...
//--></script> 

This technique is discussed in HTML 4.01, section 18.3.2 Hiding script data from user agents. However, few people really understand its purpose, what it really is, the problems it creates within XHTML documents, or the theoretical problems it creates for HTML.

In this article, the term legacy user agents, legacy UAs (or equivalent) is used to refer only to user agents implemented before the script and style elements were introduced as placeholders in HTML 3.2.

Note: Although this document will be primarily focussing on the script element, the concepts presented apply equally to the style element.

Purpose of the Comment

The purpose of the comment is to allow documents to degrade gracefully in legacy user agents by preventing them from outputting the script as content for the user to read. Since HTML user agents output the content of any unknown element, and it would be unwise for them to do so with a script, the comment is designed to ensure that this does not occur in legacy user agents.

It should be noted that there are no user agents in use today that don’t support the script element (regardless of whether they support the actual script or not), so using this technique on the web today seems rather superfluous. Yet it’s interesting to note that so many sites still make use of this old technique that was designed for accessibility reasons, despite the fact that so many of these sites choke in many other ways in browsers with scripts disabled or unsupported.

It’s Not Really a Comment

The content model of the script element in HTML is declared as CDATA, which stands for character data and means that the content within the element is not processed as markup, but as plain text. The only piece of markup which is recognised is the end-tag open (ETAGO) delimiter: </. Where the ETAGO occurs, it must be for the element’s end-tag ( </script> in this case). This is actually the cause of a really common validation error in scripts for people that use the document.write() function or the innerHTML property.
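As a hedged sketch of that common error (the element name below is illustrative): a literal end-tag written inside a script is seen by an HTML parser as an ETAGO, prematurely ending the element, so the conventional fix is to escape the slash, which is invisible to the script engine:

```javascript
// document.write("</p>") written literally would be seen by an HTML parser
// as containing an ETAGO, prematurely ending the script element's content.
// Escaping the slash hides it from the parser; to the JavaScript engine,
// "\/" is simply "/", so both strings are identical.
var unsafe = "</p>";   // an HTML parser would spot the ETAGO here
var safe = "<\/p>";    // safe inside CDATA content; same string to JavaScript
console.log(safe === unsafe);  // true
```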

Because no other markup is recognised, the comment declaration is not really a comment; but rather plain text that looks like a comment. It’s designed for legacy UAs that don’t read the DTD and are, therefore, unaware that the script element actually contains CDATA. Because it is designed for backwards compatibility with legacy UAs processing it as markup, there are certain considerations that should be made as a result of this.

There are two small theoretical problems of which most people are unaware but, since no browser has ever contained a strictly conforming SGML parser, neither is of any practical concern. As you will see, it is in fact the bugs in the legacy UAs for which this technique was designed that ensure it always works as intended.

A legacy browser encountering the unknown script element may or may not know the content model, depending on whether or not it has read the DTD or obtained the information from elsewhere.

In the case of the unknown content model, the parser would treat the element’s content as markup and hide the comment as expected, in most cases. However, there may be a problem caused by the presence of two hyphens within the script that is contained within the comment. Consider the following example:

Example 2:

<script type="text/javascript"><!--
 var i;
 for (i = 10; i > 0; i--) {
 // do something
 }
 //--></script>

Although that is perfectly valid HTML 4.01 (given that it is not really a comment), it creates a problem for a legacy user agent that processes it as a comment. A comment delimiter in SGML is a pair of hyphens (--), and only white space may occur between the closing comment delimiter and the markup declaration close (MDC) delimiter: >. See the WDG’s explanation of HTML comments for more information.
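To make the rule concrete, here is a simplified sketch of the comment-declaration grammar (in no way a full SGML parser, and the helper name is my own): a pair of hyphens toggles comment data on and off, and only white space may appear outside comment data before the MDC:

```javascript
// Checks the text between "<!" and ">" of a declaration: "--" toggles
// comment data on/off; outside comment data only white space is allowed,
// and comment data must be closed before the declaration ends.
function validCommentDeclaration(body) {
    var inComment = false;
    for (var i = 0; i < body.length; i++) {
        if (body[i] === '-' && body[i + 1] === '-') {
            inComment = !inComment;
            i++;  // skip the second hyphen of the pair
        } else if (!inComment && !/\s/.test(body[i])) {
            return false;  // non-whitespace between comment data and the MDC
        }
    }
    return !inComment;  // an unclosed comment is also invalid
}

console.log(validCommentDeclaration("-- a real comment --"));            // true
console.log(validCommentDeclaration("-- for (i = 10; i > 0; i--) --"));  // false
```

Run against the script in example 2, the -- in i-- is what closes the comment data early, leaving the rest of the script where only white space is allowed.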

For an SGML parser treating the element’s content as markup, the invalid comment syntax may potentially cause a problem and result in part of the script being output. As we will see later, this example will actually cause a fatal error in XHTML documents.

In the case of the known content model, since the parser is aware that the content model is CDATA, yet the script element is still an unknown element for the UA, it would process it as such and actually end up outputting the entire content of the element as text, thus defeating the purpose of attempting to hide the script.

Backwards compatibility thus depends on legacy UAs processing the element’s content as markup in order to hide it; for the above reasons, this would not work in all cases with a hypothetical legacy user agent using a strictly conforming SGML parser. However, no browsers actually read the DTD and, due to the bugs in all real legacy browsers, neither of these issues has ever caused any real-world problems in HTML. Nor will they cause any problems in the future, since all future implementations will support the script element.

XHTML Problems

This technique does in fact cause real problems for XHTML documents that many authors are unaware of. It seems that Microsoft have fallen into this trap with their new Visual Web Developer 2005 Express application, as discussed by Charl van Niekerk in ASP.NET 2.0 – Part 2. It also seems that the Movable Type developers have made this mistake too, as Jacques Distler pointed out recently.

In XHTML, the content model of the script element is declared as #PCDATA (parsed character data), not CDATA, thus the content of the script element is supposed to be parsed as markup and the comment declaration really is a comment. Because of this, XHTML UAs will (when the document is served as XML) ignore the content of the comment, and thus ignore the script entirely.

However, because the HTML script element contains CDATA, and user agents treat XHTML documents served as text/html as HTML, they also treat the content of the script element as CDATA; thus the comment is not treated as a comment by HTML 4 UAs. This is one of the many problems with serving XHTML as text/html.

As I mentioned earlier, example 2, which contains the extra pair of hyphens, will actually cause a fatal error in XHTML. In SGML, if it were a real comment, it would also be invalid, yet it would not be fatal since UAs employ error handling techniques to continue processing the document. In XML, however, it is a well-formedness error, which is fatal. Thus not only would the script be ignored because of the comment, but the entire document would be rendered totally useless for the user.

However, because most authors do incorrectly serve their XHTML documents as text/html, and the UAs parse it as HTML, authors are generally not aware of these issues. There are many other problems with using scripts for both HTML and XHTML, but those issues are out of scope for this article and best left for another day.

The Correct Method for XHTML

The correct way to use an inline script within XHTML is to escape it as character data using a CDATA section:

Example 3:

<script type="text/javascript"><![CDATA[
    var i = 5;
    if (i < 10) {
        // do something
    }
//]]></script> 

The CDATA section is necessary to ensure that special characters such as < and & are not treated as markup, which would otherwise result in a well-formedness error. The alternative is to encode such characters as character references like &lt; and &amp;; however, that reduces the readability of the script and is not backwards compatible with HTML UAs when the XHTML is incorrectly served as text/html.
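For comparison, here is a hypothetical entity-escaped equivalent of example 3 (a sketch only): the condition becomes harder to read and, because an HTML UA treats the content as CDATA, it would pass the literal characters &lt; through to the script engine, causing a syntax error.

```html
<script type="text/javascript">
    var i = 5;
    if (i &lt; 10) {
        // do something
    }
</script>
```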

If your script doesn’t make use of either of those special characters, then the CDATA section is not necessary, but it’s a good habit to always include it anyway.

Backwards Compatibility

Ignoring the fact that XHTML should not be served as text/html and accepting that it does happen in the real world, for an HTML 4 UA processing the above XHTML script element, the CDATA section markup results in a JavaScript syntax error since the only markup-like syntax that scripting engines allow is an SGML comment declaration at the beginning of the first line. In order to allow the script to be correctly recognised and processed by current HTML and XHTML UAs, while still hiding from legacy UAs, a clever combination of comments and a CDATA section needs to be used.

Example 4:

<script type="text/javascript"><!--//--><![CDATA[//><!--
    ...
//--><!]]></script> 

An HTML UA which correctly treats the script element as CDATA is supposed to ignore everything following the comment declaration open delimiter on the first line. i.e. everything after the first “<!--” should be ignored. An XHTML UA will treat it as a comment followed by a CDATA section to escape the entire script.

In summary, an HTML 4 UA will pass the entire content of the script element off to the JavaScript engine, which will quite happily ignore the first line of markup, whereas an XHTML UA will only pass everything between <![CDATA[ and ]]> (not inclusive). The end result is that both HTML and XHTML UAs treat the script as CDATA, as intended, with neither of them ignoring any of the actual script, while providing some level of backwards compatibility with legacy user agents.

This does, however, open up a whole new theoretical problem for any hypothetical legacy UA that is unaware of the script element and processes it as markup with a conforming SGML parser. When example 4 is processed as markup within SGML, it should be parsed identically to XML. So the unknown script element in a legacy user agent with a strictly conforming SGML parser would output the content of the script (within the CDATA section) as text. However, the only user agent I know of that supports CDATA sections in HTML documents is the new Opera 8. Thus, the above syntax is really only backwards compatible with the non-conformant parsing behaviour of real-world legacy HTML UAs, which does not cause any practical problems.

Avoiding All of These Problems

To avoid all of these problems with scripts, and the decision of whether or not to include the pseudo-comment declaration in HTML documents at all, the best solution is to always include scripts as external files. This has the advantages of not being unintentionally ignored by XHTML UAs, and of not being erroneously processed in any way by legacy HTML/SGML UAs, regardless of whether they are conforming or not. It also helps to better separate the markup from the script, and makes the script more easily reusable in multiple documents.
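As a minimal sketch (the filename is hypothetical), an external script is referenced the same way in both HTML and XHTML; note that the explicit end tag, rather than a self-closing <script/>, keeps it safe when XHTML is served as text/html:

```html
<script type="text/javascript" src="example.js"></script>
```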

However, you must remember that simply solving all of these markup issues for your scripts in an XHTML document doesn’t necessarily mean that the script will work correctly for both XHTML and HTML when served with the correct MIME types. There are more issues, such as document.write() not working in XML, the need to use the namespace-aware DOM methods in XHTML, and many more related issues.

The Future: HTML or XHTML

The discussion of XHTML versus HTML (via) has popped up again, and until now I’ve managed to resist the urge to throw in my 2¢. Well, no longer will I sit on the side line while the same arguments (via) get rehashed again and again, which will not get us anywhere. The question, which I will attempt to answer, is whether the future of the Internet lies with HTML or XHTML.

Firstly, I’m just going to set a few ground rules. This is not going to be another version of XHTML as text/html is considered harmful or there are no real benefits to use XHTML or an XHTML isn’t even supported kind of article. I’m going to get straight to the facts, so here goes…

HTML

HTML is all but dead. It’s been getting beaten to death ever since the early versions of Netscape and IE. It’s been on life support and holding on by a thread (albeit a particularly strong, yet very much frayed, thread) ever since IE5/Mac threw it a lifeline called DOCTYPE sniffing. Yet no attempt to revive it has been, or will ever be, successful in prolonging its life more than a few years past its use-by date, and it is almost time to let it rest in peace.

I know what you’re all thinking. I’m either insane or just over a week late for April Fools. How could, arguably, the most successful document format in the history of the web, and computing in general, have been so irreparably damaged to be this close to death?

The answer, and the reason for my temporary insanity which has led to these rather shocking and completely outrageous yet incredibly accurate claims, all comes down to the question of what HTML is supposed to be, compared with the mind-numbingly deformed representation we all know and love today, and how it can and cannot be improved in the future.

What is HTML Supposed to Be?

From its humble beginnings, HTML was a small, light-weight, non-proprietary, easy-to-use document format designed for the publication and distribution of scientific documents (created by the mastermind aptly titled the inventor of the World Wide Web, whom we all know as Tim Berners-Lee), and it closely resembled the international standard, ISO 8879 – Standard Generalised Markup Language (SGML).

While HTML was not originally based on SGML, the similarities in syntax and the lack of formal parsing rules for HTML led to the decision to resolve the differences and formalise HTML 2.0 as an application of SGML. This was eventually published by the IETF as RFC 1866 in November 1995. Martin Bryan provides a relatively short summary of how HTML began, and the process to convert it into an application of SGML.

What is HTML Now?

Sadly, by the time HTML was formalised as an application of SGML, the irreparable damage to the language (which would eventually lead to the coining of the term tag soup by Dan Connolly) had already been done. None of the HTML browsers that were implemented prior to HTML 2.0 contained conforming SGML parsers, few have ever done so since, and no mainstream browser ever will.

As a result, browsers don’t read DTDs. Instead they have all known elements, attributes and their content models essentially hard coded, and basically ignore any element they have never heard of. For this reason it is widely believed that DTDs serve absolutely no purpose for anything other than a validator, and DOCTYPEs are for nothing but triggering standards mode in modern browsers.

There are many intentionally broken features in existing HTML parsers, directly violating both the HTML recommendation and the SGML standard, that will never be fixed. The reason is the simple fact that fixing them would break millions of legacy documents, which would only end up affecting the user’s ability to access them. See HTML 4.01 Appendix B for a brief, yet very incomplete, summary of unsupported SGML features.

How Can HTML Be Improved?

The simple answer is not much at all. The ability of HTML to progress and improve is severely limited by the aforementioned non-conforming parsers and millions of legacy documents that would break if any serious improvements were to be made. As Hixie put it: we can at best add new elements when it comes to the HTML parser.

The element content models for many existing elements cannot be changed much. (e.g. The p element cannot be updated to allow nested lists, tables or blockquotes, the title element cannot be updated to contain any semantic inline-markup, etc.) Much of the quirky non-conformant behaviour exhibited by existing browsers will have to be inherited by any future implementations. In fact, such behaviour is being retroactively standardised by Ian Hickson and the WHAT Working Group.

There is even speculation about whether or not HTML should retain the pretence of being an application of SGML. Other than the benefits of validation with SGML DTDs, and the triggering of standards mode with an SGML DOCTYPE, there is little reason to do so. However, the extensive conformance criteria expressed within the WHAT Working Group drafts that simply cannot be expressed within a DTD would make validation – as a quality assurance or conformance tool – limited, at best.

Not only that, but any serious attempt at retaining backwards compatibility with existing browsers is expected to require an extensive library of hacks (like Dean Edwards’ IE7) to make existing browsers do anything useful with the new extensions. Not even style sheets will have any effect on the new elements without this library of hacks, as the new elements will be essentially ignored.

The question is: do we really want to hold onto a dying language any longer than we need to, with any and all progressions and enhancements being so extremely limited; or should we really start pushing to move to a much more flexible and beneficial alternative?

XHTML

Despite all prior claims of XHTML having no benefit whatsoever, when it comes to extending the language with new elements, attributes and content models, the benefits far outweigh the negatives. In fact, all claims that XHTML has no benefits over HTML only apply to XHTML 1.0, because the semantics of both document formats are identical.

What is XHTML Supposed to be?

XHTML is supposed to be an application of XML with very strict parsing rules. Do I really need to continue? I will assume we all know what XML and XHTML are, so no need for me to reiterate it all. For anyone that doesn’t, that’s what search engines are for. 🙂

What is XHTML Now?

Unfortunately, most XHTML on the web is nothing more than tag soup, or is at least not well-formed, served as text/html. As previous surveys have shown, a majority of sites claiming to be XHTML don’t even validate, and most would end up with browsers choking on them if the correct MIME type were used.

Some of the other problems are: that XHTML is not implemented by IE, incremental rendering for XHTML in Gecko doesn’t yet work, scripts written for tag-soup often won’t work in real XHTML, style sheets need to be fixed, etc., etc… Most of this stuff is discussed in Ian Hickson’s document Sending XHTML as text/html is Considered Harmful (which I’m sure everyone has read by now) and elsewhere on the web.

However, the major benefit of XHTML over HTML is that we do already have (mostly) very strictly conforming XML parsers. While these do still have a few bugs, they can be fixed without any detrimental effect on legacy content. This fact alone allows much greater room for enhancement than HTML ever will.

How Can XHTML Be Improved?

With a proper understanding of how to use XML and XHTML, there are really no limitations on how far XHTML can progress. We will not be held up by extreme browser bugs and limitations; there’s no non-conformant behaviour that will have to be replicated by future implementations, element content models can be changed for existing elements, and new elements can be added and supported very easily. And at least with full style sheet support they will not be rendered totally useless (as in HTML without a library of hacks) in existing XHTML UAs.

It is completely true that, if you are not using any of the XML-only features such as mixed-namespace documents (e.g. XHTML+MathML), there are almost no benefits to be gained from using XHTML 1.0. However, there will be benefits in using either XHTML 2.0 or the WHAT Working Group’s (X)HTML Applications, including Web Forms 2.0, Web Apps 1.0 and Web Controls 1.0, which I think should be collectively known as HAppy 1.0 (for HTML Applications), not (X)HTML 5.0.

By using the XHTML variant of HAppy 1.0 (if that’s what it gets called – with or without the uppercase A – let me know what you think ;-)) backwards compatibility with existing XHTML UAs will be much easier, because at least style sheets will work and the new elements will simply behave like divs and spans. Backwards compatibility with IE and other legacy UAs will require a bit more work, though: you will need to arrange for your XHTML document to be converted into HTML, as serving this new version of XHTML as text/html will be strictly forbidden.