The Future: HTML or XHTML

The discussion of XHTML versus HTML (via) has popped up again, and until now I’ve managed to resist the urge to throw in my 2¢. Well, no longer will I sit on the sidelines while the same arguments (via) get rehashed again and again, which gets us nowhere. The question I will attempt to answer is whether the future of the web lies with HTML or XHTML.

Firstly, I’m going to set a few ground rules. This is not going to be another version of XHTML as text/html is considered harmful or there are no real benefits to using XHTML or XHTML isn’t even supported kind of article. I’m going to get straight to the facts, so here goes…

HTML

HTML is all but dead. It’s been getting beaten to death ever since the early versions of Netscape and IE. It’s been on life support and holding on by a thread (albeit a particularly strong, yet very much frayed, thread) ever since IE5/Mac threw it a lifeline called DOCTYPE sniffing. Yet no attempt to revive it has been, or will ever be, successful in prolonging its life more than a few years past its use-by date, and it is almost time to let it rest in peace.

I know what you’re all thinking. I’m either insane or just over a week late for April Fools. How could, arguably, the most successful document format in the history of the web, and computing in general, have been so irreparably damaged that it is this close to death?

The answer, and the reason for my temporary insanity, which has led to these rather shocking and completely outrageous yet incredibly accurate claims, all comes down to the question of what HTML is supposed to be, compared with the mind-numbingly deformed representation we all know and love today, and how it can and cannot be improved in the future.

What is HTML Supposed to Be?

From its humble beginnings, HTML was a small, light-weight, non-proprietary, easy-to-use document format designed for the publication and distribution of scientific documents (created by the mastermind who is aptly titled the inventor of the World Wide Web and whom we all know as Tim Berners-Lee), and it closely resembled the international standard, ISO 8879 – Standard Generalised Markup Language (SGML).

While HTML was not originally based on SGML, the similarities in syntax and the lack of formal parsing rules for HTML led to the decision to resolve the differences and formalise HTML 2.0 as an application of SGML. This was eventually published by the IETF as RFC 1866 in November 1995. Martin Bryan provides a relatively short summary of how HTML began, and the process to convert it into an application of SGML.

What is HTML Now?

Sadly, by the time HTML was formalised as an application of SGML, the irreparable damage to the language (which would eventually lead to the coining of the term tag soup by Dan Connolly) had already been done. None of the HTML browsers that were implemented prior to HTML 2.0 contained conforming SGML parsers, few have ever done so since, and no mainstream browser ever will.

As a result, browsers don’t read DTDs. Instead, they have all known elements, attributes and their content models essentially hard-coded, and basically ignore any element they have never heard of. For this reason it is widely believed that DTDs serve absolutely no purpose for anything other than a validator, and DOCTYPEs are for nothing but triggering standards mode in modern browsers.

There are many intentionally broken features in existing HTML parsers that directly violate both the HTML recommendation and SGML standard that will never be fixed. The reason is the simple fact that to do so would break millions of legacy documents, which would only end up affecting the user’s ability to access them. See HTML 4.01 Appendix B for a brief, yet very incomplete, summary of unsupported SGML features.

How Can HTML Be Improved?

The simple answer is not much at all. The ability of HTML to progress and improve is severely limited by the aforementioned non-conforming parsers and millions of legacy documents that would break if any serious improvements were to be made. As Hixie put it: we can at best add new elements when it comes to the HTML parser.

The element content models for many existing elements cannot be changed much. (e.g. The p element cannot be updated to allow nested lists, tables or blockquotes, the title element cannot be updated to contain any semantic inline-markup, etc.) Much of the quirky non-conformant behaviour exhibited by existing browsers will have to be inherited by any future implementations. In fact, such behaviour is being retroactively standardised by Ian Hickson and the WHAT Working Group.

There is even speculation about whether or not HTML should retain the pretence of being an application of SGML. Other than the benefits of validation with SGML DTDs, and the triggering of standards mode with an SGML DOCTYPE, there is little reason to do so. However, the extensive conformance criteria expressed within the WHAT Working Group drafts that simply cannot be expressed within a DTD would make validation – as a quality assurance or conformance tool – limited, at best.

Not only that, but any serious attempt at retaining backwards compatibility with existing browsers is expected to require an extensive library of hacks (like Dean Edwards’ IE7) to make existing browsers do anything useful with the new extensions. Not even style sheets will have any effect on the new elements without this library of hacks, as the new elements will be essentially ignored.

The question is: do we really want to hold onto a dying language any longer than we need to, when any progress and enhancement will be so severely limited; or should we start pushing to move to a much more flexible and beneficial alternative?

XHTML

Despite all prior claims of XHTML having no benefit whatsoever, when it comes to extending the language with new elements, attributes and content models, the benefits far outweigh the negatives. In fact, all claims that XHTML has no benefits over HTML apply only to XHTML 1.0, because the semantics of the two document formats are identical.

What is XHTML Supposed to be?

XHTML is supposed to be an application of XML with very strict parsing rules. Do I really need to continue? I will assume we all know what XML and XHTML are, so no need for me to reiterate it all. For anyone that doesn’t, that’s what search engines are for. 🙂

What is XHTML Now?

Unfortunately, most XHTML on the web is nothing more than tag soup, or is at least not well-formed, served as text/html. As previous surveys have shown, a majority of sites claiming to be XHTML don’t even validate, and most would end up with browsers choking on them if the correct MIME type were used.
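To make the well-formedness point concrete, here is a minimal sketch using Python’s standard (non-validating) XML parser; the `is_well_formed` helper is my own naming, not part of any library. The tag soup that a lenient text/html parser silently accepts is refused outright by an XML parser, which is exactly what would happen if that markup were served with the correct MIME type:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Tag soup: the <b> element is closed after its parent <p>.
print(is_well_formed("<p><b>bold text</p></b>"))  # False: an XML parser refuses it
# The well-formed equivalent parses without complaint.
print(is_well_formed("<p><b>bold text</b></p>"))  # True
```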

Some of the other problems are: that XHTML is not implemented by IE, incremental rendering for XHTML in Gecko doesn’t yet work, scripts written for tag-soup often won’t work in real XHTML, style sheets need to be fixed, etc., etc… Most of this stuff is discussed in Ian Hickson’s document Sending XHTML as text/html is Considered Harmful (which I’m sure everyone has read by now) and elsewhere on the web.

However, the major benefit of XHTML over HTML is that we do already have (mostly) very strictly conforming XML parsers. While these do still have a few bugs, they can be fixed without any detrimental effect on legacy content. This fact alone allows much greater room for enhancement than HTML ever will.

How Can XHTML Be Improved?

With a proper understanding of how to use XML and XHTML, there are really no limitations on how far XHTML can progress. We will not be held up by extreme browser bugs and limitations; there’s no non-conformant behaviour that will have to be replicated by future implementations, element content models can be changed for existing elements, and new elements can be added and supported very easily. And at least with full style sheet support they will not be rendered totally useless (as in HTML without a library of hacks) in existing XHTML UAs.

It is completely true that, if you are not using any of the XML-only features, such as mixed-namespace documents (e.g. XHTML+MathML), there are almost no benefits to be gained from using XHTML 1.0. However, there will be benefits in using either XHTML 2.0 or the WHAT Working Group’s (X)HTML Applications, including Web Forms 2.0, Web Apps 1.0 and Web Controls 1.0, which I think should be collectively known as HAppy 1.0 (for HTML Applications), not (X)HTML 5.0.

By using the XHTML variant of HAppy 1.0 (if that’s what it gets called – with or without the uppercase A – let me know what you think ;-)) backwards compatibility with existing XHTML UAs will be much easier, because at least style sheets will work and the new elements will simply behave like divs and spans. Backwards compatibility with IE and other legacy UAs will require a bit more work, though: you will need to arrange for your XHTML document to be converted into HTML, as serving this new version of XHTML as text/html will be strictly forbidden.
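As a sketch of how that arrangement might work on the server, the helper below (a hypothetical function of my own, not from any plugin or spec) picks a MIME type from the client’s Accept header; any UA that does not explicitly advertise support for application/xhtml+xml gets the document converted to, and served as, HTML:

```python
def choose_content_type(accept_header: str) -> str:
    """Choose a MIME type for an XHTML document from an Accept header.

    Simplified sketch: a production version should also honour
    q-values rather than just splitting on commas.
    """
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    # Legacy UAs such as IE never list application/xhtml+xml, so they
    # must receive the converted-to-HTML version as text/html.
    return "text/html"

print(choose_content_type("application/xhtml+xml,text/html;q=0.9"))  # application/xhtml+xml
print(choose_content_type("text/html, */*"))                         # text/html
```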

21 thoughts on “The Future: HTML or XHTML”

  1. You can style new elements actually. At least, in Mozilla you can do so in text/html documents. Try adding a DI element (the one I described once), set it to display: block and have fun.

    Also, you are saying that when I want my stuff to work in a legacy browser like IE and in HTML5-compliant browsers I should really use HTML, right? Because XHTML would only make the incremental loading slower and possibly do damage when an error sneaks in, no?

    And what Henrik said, obviously. I’d say HTML as SGML is dead. That is something we all knew, but never acknowledged.

    Make sure to read Matthew Thomas’ response on the WHATWG list about this: “It will perhaps be a good thing once a future version of XML gives authors the option of more graceful error-handling.”

    (Please fix the markup of this reply. The markup of this reply has been fixed. It is totally unobvious how to add any kind of formatting in the comments.)

  2. I personally believe there has been a vicious cycle of bad markup, followed by bloated browsers designed to deal with it, followed by laziness on the part of developers (because you can get away with it), followed by, well, more of the same.

    Be it HTML or XHTML, it doesn’t really matter to me as long as we can clean-up the code that is being produced. Perhaps it’s too late, I don’t know. Billions of pages of slop is a lot of slop.

    I do know this much, none of the software we have would work at all if compiler designers were forced to deal with source code that may or may not pass the test of being valid for the compiler they are building.

    I mean, if you’re a writer and your product is filled with misspellings and bad grammar, that doesn’t make you much of a writer, does it? Why should it be okay to design Web pages with invalid markup?

  3. Interesting piece.

    I really do agree with what Doug said.
    Personally, I think XHTML is something that (should) lead to more valid and well-formed code, but the real main thing is to shape up the general coding for the internet, be it HTML or XHTML.

  4. HTML is a fine tool, but “designers” pushed it out of its domain. HTML needs to stay as a baseline format, just like plain text. HTML is a text tool, not a layout/design tool. The limits are a virtue.

    XHTML? What is its domain? The standards are a moving target. So many over-designed pages, so much “junk”. It has its place, but will it ever converge on a single standard?

  5. For this reason it is widely believed that DTDs serve absolutely no purpose for anything other than a validator, and DOCTYPEs are for nothing but triggering standards mode in modern browsers.

    Your choice of words indicates that you know that this belief is wrong. Could you elaborate on that? I was under the impression that non-validating user agents do not have to read an external DTD. Am I wrong about that? (I’m not all that clued-in on SGML … either.)

    Anyway, now that you have proclaimed HTML to be all but dead, and I’ve said that XHTML is dead (or at least the idea of XHTML1), should we all start publishing in Flash? 🙂

  6. I don’t care what people say about graceful error handling in XML. This cry and whine that draconian handling will break your page and make your users suffer for you if you have a single error is just another legacy of HTML we’ve gotten used to: our toolchains tend to be of the “glue strings together” (aka templates) variety.

    But they should be XML all the way from bottom to top. The right way to build something like a CMS is to never stick user input right into the output. Either require the use of something like wiki markup which is always, always translated to valid XHTML, or let the user enter tagsoup which is corrected before storing (offer the user a forced preview of their fixed-up tagsoup or something, interface wise).

    There should never be any part of your publishing toolchain just gluing strings together. Ever.

    Everyone saying something else is just noise in the background.
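One way to read the commenter’s advice in code: build the output through an XML tree API so user input is always stored as element text and escaped on serialisation, never spliced in as a raw string. A minimal sketch in Python (the `render_comment` helper and its markup are made up for illustration):

```python
import xml.etree.ElementTree as ET

def render_comment(author: str, body: str) -> str:
    """Serialise a comment fragment via a tree API, not string gluing.

    Whatever the user typed ends up as element *text*, so '<' and '&'
    are escaped automatically and the result is always well-formed.
    """
    div = ET.Element("div", {"class": "comment"})
    cite = ET.SubElement(div, "cite")
    cite.text = author
    p = ET.SubElement(div, "p")
    p.text = body  # markup characters in user input are neutralised here
    return ET.tostring(div, encoding="unicode")

fragment = render_comment("Aristotle", "I <3 tagsoup & entities")
print(fragment)
ET.fromstring(fragment)  # round-trips: the output is well-formed XML
```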

  7. Tommy, although it may seem that my choice of words indicates that belief to be wrong, I actually tried to choose my words in a way that describes the popular belief without taking sides myself.

    I don’t agree that XHTML is dead; it is simply not yet well enough supported on the client side to be worth using. However, I do believe that XHTML is the future and that HTML will only be retained for backwards compatibility with legacy clients.

  8. I was under the impression that non-validating user agents do not have to read an external DTD. Am I wrong about that? (I’m not all that clued-in on SGML … either.)

    Sorry I forgot to answer this question before. Here’s a short, but not entirely complete, summary (sorry if there are any inaccuracies, though).

    For SGML, the DTD needs to be read by a conforming SGML parser in order to have knowledge of which elements have required start-tags and end-tags, whether or not certain features, such as SHORTTAG, are enabled, entities, default attribute values, etc. Current browsers get away without reading the DTD by having this knowledge essentially hard-coded, but they don’t implement certain features like SHORTTAG NET at all and get many things wrong.

    In XML, only a validating parser needs to read a DTD, because many of those SGML features are not available and the DOM can be built easily without any knowledge of the DTD. However, a non-validating parser will not necessarily know about things like entity references (unless it has such knowledge available from elsewhere, which is how current browsers support entity references in XHTML).
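A quick illustration of that last point using Python’s non-validating, expat-based parser: numeric character references need no DTD knowledge at all, but a named entity like &nbsp;, which is defined in the XHTML DTD, is undefined to a parser that never reads the DTD.

```python
import xml.etree.ElementTree as ET

# Numeric character references parse without any DTD knowledge:
ET.fromstring("<p>caf&#233; &#160;</p>")  # parses fine

# A named entity such as &nbsp; is declared in the XHTML DTD, which a
# non-validating parser never reads, so the reference is undefined:
try:
    ET.fromstring("<p>&nbsp;</p>")
except ET.ParseError as err:
    print("rejected:", err)  # reports an undefined entity
```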

  9. Thanks, Lachlan. I was afraid that I had missed something vital somewhere. 🙂

    Since HTML DTDs are static, user agents get away with hard-coding them. Most of them get the SHORTTAG feature wrong anyway, which is why the pretend-XHTML sent as text/html doesn’t cause random greater-than characters to be scattered around the screen.

    My ‘XHTML is dead’ title is a bit misleading, something which I readily admit. The meaning was (and is) that the idea of XHTML leading the way to better markup, better accessibility etc. did not come to fruition. And now the bad habits are so firmly rooted in so many designers and developers that XHTML1 will never be able to accomplish that. Maybe XHTML2, whenever that arrives.

  10. Hopefully the break in compatibility that XHTML2 brings will cause a ‘paradigm shift’ in developers’ attitudes. I’m not holding my breath, though.

  11. I wonder if the message of my Case for XHTML article was really clear to you (or, hell, if you’ve even read it).

    In essence, I agree strongly that XHTML is the future, and I agree with pretty much every one of your points here, but the core of my article is something I don’t see mentioned at all.

    All technical details aside, XHTML still serves a very valuable purpose, and that is to spread awareness of web standards, semantics, proper markup, separation of content and presentation, and so forth. HTML simply fails to do that, and has failed for years on end now. XHTML is doing a spectacular job of introducing people to this ‘new’ world of standards-based web development.

    That, on its own, makes XHTML an immensely valuable asset to all who care about web standards and its use worldwide.

  12. You might actually be interested in a plugin I’ve been working on; it replaces the standard header and serves sites as application/xhtml+xml to all browsers that will accept it.

    WP Content Negotiator

  13. People don’t code to standards, people code to implementations.

  14. Zaur, no, XHTML is not compatible with all browsers. It just has enough similarities with HTML to give the appearance of compatibility under some conditions.

Comments are closed.