Validation Quiz Explanation

Last week, I’m sure you all had fun trying to get your head around the incredibly complex, yet almost entirely valid, markup in the validation quiz. It was solved a lot sooner than I had expected, by both Anne van Kesteren and David Håsäther. Anne was correct, but his explanation wasn’t quite satisfactory enough to win. David’s explanation was spot on. Well done to both of them.

For the rest of you who aren’t SGML experts, and are still trying to figure out how a non-conformant XML declaration in an HTML document with 2 DOCTYPEs can be valid, read on to find out.

The XML Declaration

Despite appearances to the contrary, the first line is not an XML declaration at all.

<?xml version="1.0" comment="Find the Error!" ?>

In SGML, it is a Processing Instruction (PI), which just happens to look somewhat like an XML declaration. Although the meaning of this PI is undefined in SGML and HTML, it still passes validation. If the document were served with an XML MIME type, rather than text/html, then an XML parser would try to process it as an XML declaration, although it would be non-conformant since there is no comment attribute defined for it in the XML recommendation.

It is actually included as the first of two exploits for known validator bugs. This bug — bug 14 to be precise — prematurely sets the validator to XML parsing mode. Because of this, the validator incorrectly parses the comments, DOCTYPEs and everything else that follows as ill-formed and invalid XHTML. This bug is the cause of the 80 incorrect errors being issued for the entire document.

The XHTML 1.0 Strict DOCTYPE

Knowing that the pseudo-XML-declaration above is really a valid SGML PI and that the document is served as text/html, the comment declarations and DOCTYPEs should be parsed with SGML rules, not with XML rules, as the validator incorrectly does.

<!-- -- -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- -- -->

As I briefly mentioned in HTML Comments in Scripts and which is discussed in more detail by the WDG, a comment declaration starts with a markup declaration open (MDO): <!, ends with a markup declaration close (MDC) > and contains one or more comments. Each comment within the comment declaration starts with and ends with a matching pair of hyphens. Because of this, despite appearances, there is actually only one comment declaration that surrounds the XHTML 1.0 DOCTYPE and contains 3 separate comments.

The first comment, between the first and second pair of hyphens, and the third comment, between the fifth and sixth pair, each only contain a space. The second comment, between the third and fourth pair, contains everything in between, including the XHTML DOCTYPE. The entire comment declaration ends just before the HTML 4.01 DOCTYPE, making it an HTML 4.01 document.

This section is essentially the same as the following:

<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
--> 
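
The hyphen-pair rule described above can be checked mechanically. The following is a minimal Python sketch, not a real SGML parser: the function name split_comment_declaration and the quiz_prologue string are my own, and the internal subset is left out of the second DOCTYPE to keep the sample short. It simply pairs up each "--" inside a declaration and stops at the first ">" that falls outside a comment.

```python
def split_comment_declaration(text, start=0):
    """Split the comment declaration at text[start:] into its comments.

    Returns (comments, end), where end is the offset just past the MDC ">".
    """
    if not text.startswith("<!--", start):
        raise ValueError("not a comment declaration")
    pos = start + 2  # skip the MDO "<!"
    comments = []
    while True:
        # White space may separate the comments inside one declaration.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if text.startswith(">", pos):
            # MDC outside any comment: the declaration ends here.
            return comments, pos + 1
        if not text.startswith("--", pos):
            raise ValueError(f"unexpected text at offset {pos}")
        close = text.find("--", pos + 2)  # the matching pair of hyphens
        if close == -1:
            raise ValueError("unterminated comment")
        comments.append(text[pos + 2:close])
        pos = close + 2


quiz_prologue = (
    '<!-- -- -->\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<!-- -- -->\n'
    '<!doctype html public "-//W3C//DTD HTML 4.01//EN">'
)
comments, end = split_comment_declaration(quiz_prologue)
print(len(comments))                # how many comments the declaration holds
print(quiz_prologue[end:].strip())  # the markup that follows the declaration
```

Run against the quiz prologue, the sketch finds three comments in a single declaration, and the first markup after that declaration is the lowercase HTML 4.01 doctype.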

The HTML 4.01 DOCTYPE

The second exploit for a known bug in the validator — bug 24 — is actually used in the HTML 4.01 DOCTYPE.

<!doctype html public "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>

In SGML, it is perfectly valid to write doctype html public in lowercase, although it is conventional to use uppercase. With a lower case DOCTYPE, the validator does not identify the document as HTML 4.01, and (if it were valid) would only state This Page Is Valid!, rather than This Page Is Valid HTML 4.01 Strict!

The HTML 4.01 DOCTYPE also includes an internal subset, i.e. everything between the square brackets. This, in addition to everything defined in the HTML 4.01 Strict DTD, defines a new entity “smile”, representing the character U+263A, a white smiling face (☺). This entity may be referenced using the entity reference &smile;, as is done later in the document.

This is equivalent to the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>

However, because browsers do not support internal subsets, it will be omitted from the final document, and the entity reference will later be replaced with the real character.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

The Document Head

<html lang="en">
<title/validation quiz/
</head>

There is nothing strange about the html start-tag, it’s fairly standard and just describes the document’s language as being English (en).

The start-tag for the head element has been omitted. This is perfectly valid since both the start- and end-tags for the html, head, body and tbody elements are optional. Even though the start-tag is omitted, the head element’s start-tag is still implied by the presence of the title element and, despite the missing start-tag, it is still valid to include the end-tag.

The title element uses a special syntax, known as SHORTTAG NET (Null End Tag). Many of the SGML SHORTTAG features are unsupported in real world browsers, but that doesn’t make this any less valid. The first solidus closes the start-tag (known as a net-enabling start-tag close delimiter) and begins the element’s content. The second solidus is the null-end-tag, which closes the element.

The markup for this section is exactly equivalent to this:

<html lang="en">
<head>
  <title>validation quiz</title>
</head> 
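
To make the NET rewrite concrete, here is a rough Python sketch. It assumes the simplest case only: a start-tag with no attributes and content that contains no further markup. The names NET_ELEMENT and expand_net are mine, and a real SGML parser does far more than this regular expression.

```python
import re

# <name/content/ : the first solidus is the NET-enabling start-tag close,
# the second solidus is the null end-tag.
NET_ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(markup):
    """Rewrite attribute-less NET elements in the commonly supported syntax."""
    return NET_ELEMENT.sub(r"<\1>\2</\1>", markup)


print(expand_net("<title/validation quiz/"))
```

The same pattern also handles an em element written as <em/very/, which turns up later in the quiz.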

The Document Body

Like the head element, the start-tag for the body element has been omitted. However, as we will see later, the end-tag hasn’t been, although its presence is not immediately obvious. The body element is implied by the first paragraph, immediately following the end-tag for the head element. (Note: It is not implied simply by the end-tag for the head element, even though the next element must be a body element.)

Paragraph 1

The first paragraph is quite straightforward.

<p>In this document, there &exist;s a single validation error.
   It makes use of some <strong<em/very/</strong> uncommon &
   unsupported markup techniques designed to fool the faint
   hearted.

It starts with the p start-tag, which is required, but the optional end-tag is omitted (which is important, as we will see for paragraph 2). It contains an entity reference, &exist; which is defined in the character entity references for symbols, mathematical symbols, and Greek letters.

The strong and em elements in this may look invalid, but they are not. The em element makes use of the same SHORTTAG NET syntax used for the title element. The strong element, however, is a little more confusing. The start-tag is unclosed: it omits the tag close delimiter (TAGC) >. In SGML, TAGC may be omitted when the next non-white-space character after the tag’s name and attributes is a tag open delimiter (TAGO) <.

There is also a lone ampersand within this paragraph. Because the ampersand usually represents the start of an entity reference, ampersands are, in many circumstances, required to be written using the entity reference &amp;. However, there are valid cases in SGML where this is not required, such as when the ampersand is immediately followed by white space or another character that may not be part of an entity name.
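
As an illustration of that rule, the Python sketch below lists the ampersands in a string that are plain character data. It is a simplification under my own assumptions: an ampersand is taken to open a reference only when a letter follows it, or a "#" followed by a letter or digit. The names REFERENCE_OPEN and literal_ampersands are hypothetical.

```python
import re

# An ampersand opens a reference only when a name start character follows
# (entity reference), or "#" plus a digit or letter (character reference).
REFERENCE_OPEN = re.compile(r"&(?=[A-Za-z]|#[0-9A-Za-z])")


def literal_ampersands(text):
    """Return the offsets of ampersands that are plain character data."""
    return [
        match.start()
        for match in re.finditer("&", text)
        if not REFERENCE_OPEN.match(text, match.start())
    ]


sample = "uncommon &\n   unsupported, yet there &exist;s a rule"
print(literal_ampersands(sample))
```

Only the ampersand followed by the line break is reported; the one that starts &exist; is recognised as a reference.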

As stated previously, because this is the first paragraph, it implies the presence of the body start-tag. The end-tag is actually implied by the start of the next paragraph, but I’ve included it in this section because it’s easier and makes no difference to the end result. Thus the markup for this section is equivalent to this:

<body>
  <p>In this document, there &exist;s a single validation error.
     It makes use of some <strong><em>very</em></strong> uncommon
     &amp; unsupported markup techniques designed to fool the
     faint hearted.</p>

Paragraph 2

The second paragraph is a little more complex. It starts and ends with tags that are missing the tag name. In SGML terms, these are respectively known as empty start-tags and empty end-tags.

<>This exploits some known bugs in <a href=http://validator.w3.org/
  to both help prevent cheaters and confuse even the most experienced
  authors.</>

According to the rules of SGML, an empty start-tag represents the same element as the most recently opened element within the tree. There is a condition in SGML that changes this rule, but that small detail will be omitted because it is not relevant to HTML. For a full explanation, see section 9.3.1 Empty Tags in Martin Bryan’s SGML and HTML Explained.

Because the end-tag for the first paragraph was omitted (recall that I said it was important), the paragraph element is still open, and thus the empty start-tag is recognised as a paragraph start-tag as well. As I stated previously, this start-tag also implies the end-tag for the previous paragraph. The empty end-tag simply closes the most recently opened element within the tree, which is this paragraph.
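
The bookkeeping behind empty tags can be sketched with a stack of open elements. This Python sketch is deliberately small and rests on assumptions of mine: tags arrive as a pre-split list with no attributes, and the no_self_nesting tuple is a hand-picked stand-in for what the DTD’s content models would really decide. The name resolve_empty_tags is hypothetical.

```python
def resolve_empty_tags(tags, no_self_nesting=("p", "li", "td", "tr")):
    """Replace empty start-tags ("<>") and empty end-tags ("</>") with named tags.

    `tags` is a list of strings such as "<p>", "<>", "</>" or "</p>".
    """
    open_elements = []  # stack of currently open element names
    resolved = []
    for tag in tags:
        if tag == "</>":
            # Empty end-tag: close the most recently opened element.
            resolved.append(f"</{open_elements.pop()}>")
        elif tag.startswith("</"):
            name = tag[2:-1]
            if name not in open_elements:
                raise ValueError(f"no open <{name}> element to close")
            # Supply any omitted end-tags down to the named element.
            while open_elements[-1] != name:
                resolved.append(f"</{open_elements.pop()}>")
            open_elements.pop()
            resolved.append(tag)
        else:
            # Empty start-tag: same element as the most recently opened one.
            name = open_elements[-1] if tag == "<>" else tag[1:-1]
            if open_elements and open_elements[-1] == name and name in no_self_nesting:
                # The new start-tag implies the omitted end-tag.
                resolved.append(f"</{open_elements.pop()}>")
            open_elements.append(name)
            resolved.append(f"<{name}>")
    return resolved


print(resolve_empty_tags(["<body>", "<p>", "<>", "</>"]))
```

Fed the tag sequence of the first two paragraphs, the sketch turns the empty start-tag into a p start-tag preceded by the implied p end-tag, and the empty end-tag into a p end-tag.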

The a element in this paragraph also looks invalid because it appears to be missing both the TAGC and an end-tag, yet neither is really missing and it is not invalid. However, the markup does not mean what it appears to mean at first glance. As explained above for the SHORTTAG NET syntax used for the title element, the first solidus is the NET enabling start-tag close delimiter and the second is the null end-tag. These are processed in this way because the attribute value is not quoted. If it were quoted, the solidus would not represent the NET syntax.

The markup for this section is equivalent to this:

<p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
   to both help prevent cheaters and confuse even the most experienced
   authors.</p>

The Form

This section is quite easy, given that most of the concepts used have been covered earlier. To make it easier to explain, I’ve indented the lines a little, but the markup is otherwise unchanged.

<form method="get" action="http://validator.w3.org/check"
  <table
      <tr
        <td<input text checked id=uri name=uri size=40/>
        <><label for=uri>Is this test too hard?</label></>
      <><td<button button>Don't Cheat!</>
  </tbody
 ></table>

The start-tags for the form, table, tr and td elements are unclosed start-tags again. The empty start-tags and end-tags should be fairly self-explanatory.

The tbody element, like the head and body elements, is missing its start-tag but not its end-tag. The tbody end-tag is not, in this case, an unclosed tag, because it is closed on the following line with the TAGC delimiter just before the table end-tag.

The attributes for the input and button elements make use of a feature called attribute minimisation. Contrary to popular belief, attribute minimisation allows the omission of the attribute name, not the value, where the attribute value may be unambiguously associated with a particular attribute. The text attribute is not actually a text attribute. It is a value that may be unambiguously associated with the type attribute and is, therefore, the minimised form of type="text". The checked attribute is more commonly known as the short form of checked="checked". This is exactly the same for the button element, where the value button is unambiguously associated with type="button".
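
A lookup table is enough to illustrate how a bare value finds its attribute name. The Python sketch below uses a short excerpt of the enumerated values the HTML 4.01 DTD declares for input and button; the table is incomplete, and the names ENUMERATED_VALUES and expand_minimised are my own.

```python
# Excerpt of enumerated attribute values for two elements (not the full DTD).
ENUMERATED_VALUES = {
    "input": {
        "type": {"text", "password", "checkbox", "radio", "submit",
                 "reset", "file", "hidden", "image", "button"},
        "checked": {"checked"},
        "disabled": {"disabled"},
        "readonly": {"readonly"},
    },
    "button": {
        "type": {"button", "submit", "reset"},
        "disabled": {"disabled"},
    },
}


def expand_minimised(element, value):
    """Return name="value" for a bare value found in a start-tag."""
    value = value.lower()
    names = [
        name
        for name, allowed in ENUMERATED_VALUES[element].items()
        if value in allowed
    ]
    if len(names) != 1:
        raise ValueError(f"{value!r} is not unambiguous for <{element}>")
    return f'{names[0]}="{value}"'


for element, value in (("input", "text"), ("input", "checked"), ("button", "button")):
    print(element, expand_minimised(element, value))
```

The lookup only succeeds when exactly one attribute of the element accepts the value, which is the "unambiguously associated" condition in practice.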

The input element also uses a net enabling start-tag. Because it is an empty element, for which end-tags are forbidden, it does not need the null end-tag to be present. The net enabling start-tag is also followed by a greater than symbol >, which is designed to make it look like XHTML syntax. However, because the start-tag and element are ended with the net enabling start-tag close delimiter, the greater than symbol actually follows the element, and should be treated as character data, not markup.

Finally, the required end-tag for the form element is not in this section, but it does appear later in the document, which will be discussed when we get to it. The markup for this section is equivalent to:

<form method="get" action="http://validator.w3.org/check">
  <table>
    <tbody>
      <tr>
        <td><input type="text" checked="checked" id="uri"
                   name="uri" size="40">&gt;</td>
        <td><label for="uri">Is this test too hard?</label></td>
      </tr>
      <tr>
        <td><button type="button">Don't Cheat!</button></td>
      </tr>
    </tbody>
  </table>

The List

The list is made up of several sections. First, some list items, followed by a processing instruction and lastly a really complicated comment declaration containing what appears to be invalid markup.

<ul/
<li><![CDATA[
<li Oops<!-- ?]]> -->
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ---->
<blink>I'll bet this is &#147;annoying&#148;!</blink>
<p align="right">Remeber, it's a Strict DOCTYPE!
<!-- ------ Don't give up now! ----- >
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<p>Is the error here --><li>?/

The ul element starts with a net enabling start-tag, as discussed previously. The null end-tag appears at the very end of this section.

The first list item contains a CDATA section. Many people know this syntax from XML, but it is also valid in SGML, though it is unsupported in most browsers. Opera 8 is the only browser I know of that supports it for HTML. The CDATA section means that its content (everything up to the CDATA section end ]]>) should be treated as character data.

Thus, the second list item, which looks like an invalid unclosed start-tag containing an invalid attribute, is not really markup. It is character data that should be output as such. If it were markup, it would be an error because “Oops” is not a valid attribute or value. For the same reason, the comment declaration surrounding the CDATA section end is not really a comment. Because no markup is recognised, it’s also treated as character data and output as text.
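
The effect of a CDATA section can be imitated by escaping its content, which is how the equivalent markup for this section is produced. This is a small Python sketch using the standard library’s html.escape; the names CDATA_SECTION and escape_cdata_sections are my own, and it makes no attempt to handle anything else in the document.

```python
import html
import re

# Everything between "<![CDATA[" and the first "]]>" is character data.
CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def escape_cdata_sections(markup):
    """Replace each CDATA section with its content, escaped as plain text."""
    return CDATA_SECTION.sub(
        lambda match: html.escape(match.group(1), quote=False), markup
    )


print(escape_cdata_sections("<li><![CDATA[\n<li Oops<!-- ?]]> -->"))
```

The fake list item and the opening of the fake comment come out escaped, while the text after the section end is left exactly as it was.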

The second real list item (following the CDATA section) contains an unencoded less than symbol. In most cases, this should be encoded as &lt;. While this is compulsory in XHTML, in HTML (as with the unencoded ampersand in paragraph 1) there are circumstances where it is valid to leave it unencoded.

The processing instruction <?hello ... ?> is also valid. These may appear almost anywhere within an SGML document. Although this PI has no defined meaning, it does not affect validation in any way.

The big comment declaration is actually fairly complicated, but easy to understand with a basic understanding of SGML comment syntax. If you recall the discussion of the comment syntax above in the DOCTYPE section, you’ll remember that a comment starts and ends with matching pairs of hyphens. By counting the number of hyphen pairs, you should notice that the blink element, the p elements and the meta element are actually commented out.

Within the commented out blink element, there are also some invalid numeric character references used. The decimal code points 147 and 148 are actually Windows-1252 code points which are commonly used and supported in web browsers. However, because numeric character references are supposed to use Unicode code points and these are control characters, these character references are invalid.
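
The mismatch is easy to see with Python’s standard library. This short sketch compares what each number means as a Unicode code point with what it means as a Windows-1252 byte; the describe helper is just a name I made up.

```python
import unicodedata


def describe(code_point):
    """Compare a number read as Unicode with the same number read as Windows-1252."""
    category = unicodedata.category(chr(code_point))  # "Cc" means control character
    intended = bytes([code_point]).decode("cp1252")   # what the author meant
    return category, f"U+{ord(intended):04X}", unicodedata.name(intended)


for code_point in (147, 148):
    print(code_point, describe(code_point))
```

Both numbers fall in Unicode’s control character category, whereas the Windows-1252 bytes map to U+201C and U+201D, the left and right double quotation marks, which are the code points the references should have used.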

The whole comment is actually closed on the last line of this section, just before a new list item is opened. This is valid because the unordered list has not yet been closed. The final list item simply contains a question mark, and is immediately followed by the null end-tag for the ul element.

The markup for this section is equivalent to this:

<ul>
  <li>
  &lt;li Oops&lt;!-- ? --&gt;</li>
  <li>There are &lt; 2 validation errors in this document</li>
  <?hello comment="What's this doing here?"?>
  <!--- Found the error yet? ...
    <p>Is the error here --> <li>?</li>
</ul>

Paragraph 3

The final paragraph, which is also the location of the only validation error within the document, should be extremely easy to understand, if you successfully read and understood the entire explanation so far.

<p/>The question is: Is this<br>HTML or<br/>XHTML
   served as text/html? &smile</></></>

The p element contains a net enabling start-tag, followed by a greater than symbol. Again this was designed to look like XHTML’s empty element syntax, but it is not.

The two br elements are actually both valid HTML, despite the second appearing to use XHTML empty element syntax also. The second br element is closed by the net enabling start-tag and also followed by a greater than symbol.

The p element is actually closed by the solidus in text/html. Thus, everything following the solidus is outside of the p element and a direct child of the form element, which, as you should recall, is still open.

The entity reference &smile is missing the reference close (REFC) delimiter (semi-colon ;). This is valid in HTML where the entity reference is followed by any non-name character. However, because the entity declaration was removed from the DOCTYPE above due to lack of browser support, the reference will be replaced with the real white smiling face character.
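
Here is a rough Python sketch of how a reference without REFC can still be resolved. It assumes that a name consists of letters, digits, full stops and hyphens, and it knows only a handful of entities: the ENTITIES table, with smile mirroring the declaration from the quiz’s internal subset, and the function name expand_entities are my own.

```python
import re

# Entities known to this sketch; HTML's full entity set is left out.
ENTITIES = {"smile": "\u263a", "amp": "&", "lt": "<", "gt": ">"}

# A name runs up to the first non-name character; the REFC ";" is optional.
ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")


def expand_entities(text):
    """Replace known entity references, with or without the closing ";"."""
    def replace(match):
        # Unknown names are left untouched.
        return ENTITIES.get(match.group(1), match.group(0))

    return ENTITY_REFERENCE.sub(replace, text)


print(expand_entities("served as text/html? &smile</></></>"))
```

Because the "<" of the first empty end-tag cannot be part of a name, it ends the reference just as a semi-colon would.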

The first empty end-tag is for the form element, which is still open. The second is for the body element and the third is for the html element.

    <p>&gt;The question is: Is this<br>HTML or<br>&gt;XHTML
       served as text</p>html? ☺
  </form>
</body>
</html>

Putting it All Together

After we combine each of the newly marked up sections using the commonly supported syntax, we end up with the entire document looking like this:

<?xml version="1.0" comment="Find the Error!" ?>
<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
  <title>validation quiz</title>
</head>
<body>
  <p>In this document, there &exist;s a single validation error.
     It makes use of some <strong><em>very</em></strong> uncommon
     &amp; unsupported markup techniques designed to fool the faint
     hearted.</p>
  <p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
     to both help prevent cheaters and confuse even the most experienced
     authors.</p>

  <form method="get" action="http://validator.w3.org/check">
    <table>
      <tbody>
        <tr>
          <td><input type="text" checked="checked" id="uri"
                     name="uri" size="40">&gt;</td>
          <td><label for="uri">Is this test too hard?</label></td>
        </tr>
        <tr>
          <td><button type="button">Don't Cheat!</button></td>
        </tr>
      </tbody>
    </table>

    <ul>
      <li>
      &lt;li Oops&lt;!-- ? --&gt;</li>
      <li>There are &lt; 2 validation errors in this document</li>
      <?hello comment="What's this doing here?"?>
      <!--- Found the error yet? ...
        <p>Is the error here -->
      <li>?</li>
    </ul>

    <p>&gt;The question is: Is this<br>HTML or<br>&gt;XHTML
       served as text</p>html? ☺
  </form>
</body>
</html>

Well, that’s it. If you have any questions or need clarification for anything I’ve discussed, please don’t hesitate to ask me.

6 thoughts on “Validation Quiz Explanation”

  1. Very interesting stuff. With all the emphasis these days on XHTML, it’s easy to forget about HTML’s SGML roots, and all the strange things that result from it. One correction though: you got the URI wrong for SGML and HTML Explained.

    BTW has anyone documented which browsers support these SGML tricks? Might be interesting to get a list of “safe” ones that you can use in web pages.

  2. Hello, I saw code like <html lang="en">

    Could you help me to find all list of languages, I meant for other european languages?

    Thanks in advance

  3. Sarkis, firstly, I am not a help desk. Secondly, that is off topic for this entry. Thirdly, do a search for “ISO 639” (language codes) and “ISO 3166” (country codes) and you’ll find what you’re looking for.
