Lachlan Hunt | Lachy’s Log

The discussion of XHTML versus HTML (via) has popped up again, and until now I’ve managed to resist the urge to throw in my 2¢. Well, no longer will I sit on the side line while the same arguments (via) get rehashed again and again, which will not get us anywhere. The question, which I will attempt to answer, is whether the future of the Internet lies with HTML or XHTML.

Firstly, I’m just going to set a few ground rules. This is not going to be another version of XHTML as text/html is considered harmful or there are no real benefits to use XHTML or an XHTML isn’t even supported kind of article. I’m going to get straight to the facts, so here goes…

HTML

HTML is all but dead. It’s been getting beaten to death ever since the early versions of Netscape and IE. It’s been on life support and holding on by a thread (albeit a particularly strong, yet very much frayed, thread) ever since IE5/Mac threw it a lifeline called DOCTYPE sniffing. Yet no attempt to revive it has been, or will ever be, successful in prolonging its life more than a few years past its use-by-date and it is almost time to let it rest in peace.

I know what you’re all thinking. I’m either insane or just over a week late for April Fools. How could, arguably, the most successful document format in the history of the web, and computing in general, have been so irreparably damaged to be this close to death?

The answer and the reason for my temporary insanity, which has lead to these rather shocking and completely outrageous yet incredibly accurate claims, all comes down to the question of what HTML is supposed to be, compared with the mind numbingly deformed representation we all know and love today, and how it can and cannot be improved in the future.

What is HTML Supposed to Be?

From its humble beginnings as a small, light-weight, non-proprietary, easy-to-use document format designed for the publication and distribution of scientific documents (created by the mastermind who is aptly titled the inventor of the World Wide Web and whom we all know as Tim Berners-Lee) closely resembled the international standard, ISO:8879 – Standard Generalised Markup Language (SGML).

While HTML was not originally based on SGML, the similarities in syntax and the lack of formal parsing rules for HTML led to the decision to resolve the differences and formalise HTML 2.0 as an application of SGML. This was eventually published by the IETF as RFC 1866 in November 1995. Martin Bryan provides a relatively short summary of how HTML began, and the process to convert it into an application of SGML.

What is HTML Now?

Sadly, by the time HTML was formalised as an application of SGML, the irreparable damage to the language (which would eventually lead to the coining of the term tag soup by Dan Connolly) had already been done. None of the HTML browsers that were implemented prior to HTML 2.0 contained conforming SGML parsers, few have ever done so since, and no mainstream browser ever will.

As a result, browsers don’t read DTDs. Instead they have all known elements, attributes and their content models essentially hard coded, and basically ignore any element they have never heard of. For this reason it is widely believed that DTDs serve absolutely no purpose for anything other than a validator, and DOCTYPEs are for nothing but triggering standards mode in modern browsers.

There are many intentionally broken features in existing HTML parsers that directly violate both the HTML recommendation and SGML standard that will never be fixed. The reason is the simple fact that to do so would break millions of legacy documents, which would only end up affecting the user’s ability to access them. See HTML 4.01 Appendix B for a brief, yet very incomplete, summary of unsupported SGML features.

How Can HTML Be Improved?

The simple answer is not much at all. The ability of HTML to progress and improve is severely limited by the aforementioned non-conforming parsers and millions of legacy documents that would break if any serious improvements were to be made. As Hixie put it: we can at best add new elements when it comes to the HTML parser.

The element content models for many existing elements cannot be changed much. (e.g. The p element cannot be updated to allow nested lists, tables or blockquotes, the title element cannot be updated to contain any semantic inline-markup, etc.) Much of the quirky non-conformant behaviour exhibited by existing browsers will have to be inherited by any future implementations. In fact, such behaviour is being retroactively standardised by Ian Hickson and the WHAT Working Group.

There is even speculation about whether or not HTML should retain the pretence of being an application of SGML. Other than the benefits of validation with SGML DTDs, and the triggering of standards mode with an SGML DOCTYPE, there is little reason to do so. However, the extensive conformance criteria expressed within the WHAT Working Group drafts that simply cannot be expressed within a DTD would make validation – as a quality assurance or conformance tool – limited, at best.

Not only that, but any serious attempt at retaining backwards compatibility with existing browsers is expected to require an extensive library of hacks (like Dean Edward’s IE7) to make existing browsers do anything useful with the new extensions. Not even style sheets will have any effect on the new elements without this library of hacks, as the new elements will be essentially ignored.

The question is: do we really want to hold onto a dying language any longer than we need to, with any and all progressions and enhancements being so extremely limited; or should we really start pushing to move to a much more flexible and beneficial alternative?

XHTML

Despite all prior claims of XHTML having no benefit whatsoever, when it comes to extending the language with new elements, attributes and content models, the benefits far out weigh the negatives. In fact, all claims that XHTML has no benefits over HTML only apply to XHTML 1.0 because the semantics of both document formats are identical.

What is XHTML Supposed to be?

XHTML is supposed to be an application of XML with very strict parsing rules. Do I really need to continue? I will assume we all know what XML and XHTML are, so no need for me to reiterate it all. For anyone that doesn’t, that’s what search engines are for. 🙂

What is XHTML Now?

Unfortunately, most XHTML on the web is nothing more than tag soup, or is at least not well-formed, served as text/html. As previous surveys have shown, a majority of sites claiming to be XHTML don’t even validate, and most would end up with browsers choking on them if the correct MIME type were used.

Some of the other problems are: that XHTML is not implemented by IE, incremental rendering for XHTML in Gecko doesn’t yet work, scripts written for tag-soup often won’t work in real XHTML, style sheets need to be fixed, etc., etc… Most of this stuff is discussed in Ian Hickson’s document Sending XHTML as text/html is Considered Harmful (which I’m sure everyone has read by now) and elsewhere on the web.

However, the major benefit of XHTML over HTML is that we do already have (mostly) very strictly conforming XML parsers. While these do still have a few bugs, they can be fixed without any detrimental effect on legacy content. This fact alone allows much greater room for enhancement than HTML ever will.

How Can XHTML Be Improved?

With a proper understanding of how to use XML and XHTML, there are really no limitations on how far XHTML can progress. We will not be held up by extreme browser bugs and limitations; there’s no non-conformant behaviour that will have to be replicated by future implementations, element content models can be changed for existing elements, and new elements can be added and supported very easily. And at least with full style sheet support they will not be rendered totally useless (as in HTML without a library of hacks) in existing XHTML UAs.

It is completely true that, if you are not using any of the XML only features such as mixed namespace documents (e.g. XHTML+MathML), there are almost no benefits to be gained from using XHTML 1.0. However, there will be benefits in using either XHTML 2.0 or the WHAT Working Group’s (X)HTML Applications, including Web Forms 2.0, Web Apps 1.0 and Web Controls 1.0, which I think should be collectively known as HAppy 1.0 (for HTML Applications), not (X)HTML 5.0.

By using the XHTML variant of HAppy 1.0 (if that’s what it gets called – with or without the uppercase A – let me know what you think ;-)) backwards compatibility with existing XHTML UAs will be much easier, because at least style sheets will work and the new elements will simply behave like divs and spans. Backwards compatibility with IE and other legacy UAs will require a bit more work, though: you will need to arrange for your XHTML document to be converted into HTML, as serving this new version of XHTML as text/html will be strictly forbidden.

Lachy’s Log

If I start now, I'll be finished later!

All posts by Lachlan Hunt

I’m on SitePoint

The Future: HTML or XHTML