Last week, I’m sure you all had fun trying to get your mind around the incredibly complex, yet almost entirely valid, markup in the validation quiz. It was solved a lot sooner than I had expected by both Anne van Kesteren and David Håsäther. Anne was correct, but his explanation wasn’t quite satisfactory enough to win. David’s explanation was spot on. Well done to both of them.
For the rest of you who aren’t SGML experts, and are still trying to figure out how a non-conformant XML declaration in an HTML document with 2 DOCTYPEs can be valid, read on to find out.
The XML Declaration
Despite appearances to the contrary, the first line is not an XML declaration at all.
<?xml version="1.0" comment="Find the Error!" ?>
In SGML, it is a Processing Instruction (PI), which just happens to look somewhat like an XML declaration. Although the meaning of this PI is undefined in SGML and HTML, it still passes validation. If the document were served with an XML MIME type, rather than text/html, then an XML parser would try to process it as an XML declaration, although it would be non-conformant since there is no comment attribute defined for it in the XML recommendation.
It is actually included as the first of two exploits for known validator bugs. This bug — bug 14 to be precise — prematurely sets the validator to XML parsing mode. Because of this, the validator incorrectly parses the comments, DOCTYPEs and everything else that follows as ill-formed and invalid XHTML. This bug is the cause of the 80 incorrect errors being issued for the entire document.
The XHTML 1.0 Strict DOCTYPE
Knowing that the pseudo-XML-declaration above is really a valid SGML PI and that the document is served as text/html, the comment declarations and DOCTYPEs should be parsed with SGML rules, not with XML rules like the validator does incorrectly.
<!-- -- --> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <!-- -- -->
As I briefly mentioned in HTML Comments in Scripts, and as is discussed in more detail by the WDG, a comment declaration starts with a markup declaration open (<!), ends with a markup declaration close (>) and contains one or more comments. Each comment within the comment declaration starts and ends with a matching pair of hyphens. Because of this, despite appearances, there is actually only one comment declaration that surrounds the XHTML 1.0 DOCTYPE, and it contains 3 separate comments.
The first comment, between the first and second pair of hyphens, and the third comment, between the fifth and sixth pair, each only contain a space. The second comment, between the third and fourth pair, contains everything in between, including the XHTML DOCTYPE. The entire comment declaration ends just before the HTML 4.01 DOCTYPE, making it an HTML 4.01 document.
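The pair-counting rule above can be checked mechanically. The following is only a minimal Python sketch under a simplified model (white space is the only thing allowed between comments); it is not a real SGML parser, and the function name split_comment_declaration is made up for this illustration.

```python
def split_comment_declaration(text: str) -> list[str]:
    """Return the comments inside one SGML comment declaration.

    Simplified model: '<!' opens the declaration, each comment runs from
    one '--' to the next '--', white space may separate comments, and '>'
    closes the declaration.
    """
    if not text.startswith("<!--"):
        raise ValueError("not a comment declaration")
    pos = 2  # skip the markup declaration open '<!'
    comments = []
    while True:
        if text.startswith("--", pos):
            end = text.find("--", pos + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            comments.append(text[pos + 2:end])
            pos = end + 2
        elif pos < len(text) and text[pos].isspace():
            pos += 1
        elif text.startswith(">", pos):
            return comments  # markup declaration close
        else:
            raise ValueError(f"unexpected input at offset {pos}")


quiz = ('<!-- -- --> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <!-- -- -->')
comments = split_comment_declaration(quiz)
print(len(comments))        # 3
print(repr(comments[0]))    # ' '
```

Under this model the second of the three comments is the one that swallows the XHTML DOCTYPE, which matches the hand count above.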
This section is essentially the same as the following:
<!-- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -->
The HTML 4.01 DOCTYPE
The second exploit for a known bug in the validator — bug 24 — is actually used in the HTML 4.01 DOCTYPE.
<!doctype html public "-//W3C//DTD HTML 4.01//EN" [ <!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE --> ]>
In SGML, it is perfectly valid to write doctype html public in lowercase, although it is conventional to use uppercase. With lowercase, however, the validator does not identify the document as HTML 4.01 and (if it were valid) would only state “This Page Is Valid!”, rather than “This Page Is Valid HTML 4.01 Strict!”
The HTML 4.01 DOCTYPE also includes an internal subset, i.e. everything between the square brackets. This, in addition to everything defined in the HTML 4.01 Strict DTD, defines a new entity, “smile”, representing the character white smiling face (☺). This entity may be referenced using the entity reference &smile;, as is done later in the document.
This is equivalent to the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" [ <!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE --> ]>
However, because internal subsets are unsupported by browsers, the subset will be omitted from the final document and the entity reference will be replaced with the real character later.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
The Document Head
<html lang="en"> <title/validation quiz/ </head>
There is nothing strange about the html start-tag; it’s fairly standard and just describes the document’s language as being English (lang="en").
The start-tag for the head element has been omitted. This is perfectly valid since both the start- and end-tags for the head element are optional. Even though the start-tag is omitted, the head element is still implied by the presence of the title element and, despite the missing start-tag, it is still valid to include the end-tag.
The title element uses a special syntax, known as SHORTTAG NET (Null End Tag). Many of the SGML SHORTTAG features are unsupported in real world browsers, but that doesn’t make this any less valid. The first solidus closes the start-tag (known as a net-enabling start-tag close delimiter) and begins the element’s content. The second solidus is the null end-tag, which closes the element.
The markup for this section is exactly equivalent to this:
<html lang="en"> <head> <title>validation quiz</title> </head>
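The NET rewrite described for the title element can be sketched in a few lines of Python. This is a deliberately small illustration, not an SGML parser: the name expand_net is hypothetical, and the pattern assumes unquoted attribute values and plain text content (no nested tags), so it leaves anything more complicated untouched.

```python
import re

# '<name attrs/' (net-enabling start-tag), text content, then '/' (null end-tag).
NET_ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>/]*)/([^<>/]*)/")


def expand_net(markup: str) -> str:
    """Rewrite '<name/content/' as '<name>content</name>'."""
    return NET_ELEMENT.sub(r"<\1\2>\3</\1>", markup)


print(expand_net("<title/validation quiz/"))
# <title>validation quiz</title>
print(expand_net("<a href=http://validator.w3.org/ to both"))
# <a href=http:></a>validator.w3.org/ to both
```

The second call shows why an unquoted URL is dangerous under NET: the first solidus ends the start-tag and the second one ends the element.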
The Document Body
Like the head element, the start-tag for the body element has been omitted. However, as we will see later, the end-tag hasn’t been, although its presence is not immediately obvious. The body start-tag is implied by the first paragraph, immediately following the end-tag for the head element. (Note: It is not implied simply by the end-tag for the head element, even though the next element must be the body element.)
The first paragraph is quite straightforward.
<p>In this document, there &exist;s a single validation error. It makes use of some <strong<em/very/</strong> uncommon & unsupported markup techniques designed to fool the faint hearted.
It starts with the p start-tag, which is required, but the optional end-tag is omitted (which is important, as we will see for paragraph 2). It contains an entity reference, &exist;, which is defined in the character entity references for symbols, mathematical symbols, and Greek letters.
The strong and em elements in this paragraph may look invalid, but they are not. The em element makes use of the same SHORTTAG NET syntax used for the title element. The strong element, however, is a little more confusing. The start-tag is unclosed: it omits the tag close delimiter (TAGC, >). In SGML, the TAGC may be omitted when the first non-white space character that follows is a tag open delimiter (<).
There is also a lone ampersand within this paragraph. Because the ampersand usually represents the start of an entity reference, ampersands are, in many circumstances, required to be written using the entity reference &amp;. However, there are valid cases in SGML where this is not required, such as when the ampersand is immediately followed by white space or another character that may not be part of an entity name.
As stated previously, because this is the first paragraph, it implies the presence of the body start-tag. The paragraph’s end-tag is actually implied by the start of the next paragraph, but I’ve included it in this section because it’s easier and makes no difference to the end result. Thus the markup for this section is equivalent to this:
<body> <p>In this document, there ∃s a single validation error. It makes use of some <strong><em>very</em></strong> uncommon & unsupported markup techniques designed to fool the faint hearted.</p>
The second paragraph is a little more complex. It starts and ends with tags that are missing the tag name. In SGML terms, these are respectively known as empty start-tags and empty end-tags.
<>This exploits some known bugs in <a href=http://validator.w3.org/ to both help prevent cheaters and confuse even the most experienced authors.</>
According to the rules of SGML, an empty start-tag represents the same element as the most recently opened element within the tree. There is a condition in SGML that changes this rule, but that small detail will be omitted because it is not relevant to HTML. For a full explanation, see section 9.3.1 Empty Tags in Martin Bryan’s SGML and HTML Explained.
Because the end-tag for the first paragraph was omitted (recall that I said it was important), the paragraph element is still open, and thus the empty start-tag is recognised as a paragraph start-tag as well. As I stated previously, this start-tag also implies the end-tag for the previous paragraph. The empty end-tag simply closes the most recently opened element within the tree, which is this paragraph.
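To make the empty-tag rule concrete, here is a rough Python sketch that resolves empty start-tags and empty end-tags against a stack of open elements. It is an illustration only: the token list is pre-split by hand, the function name resolve_empty_tags is made up, and it assumes (as is true for p) that the element named by an empty start-tag cannot contain itself, so the open one is closed first.

```python
def resolve_empty_tags(tokens: list[str]) -> list[str]:
    """Replace '<>' and '</>' with full tags, tracking open elements."""
    open_elements: list[str] = []  # innermost open element is last
    resolved: list[str] = []
    for token in tokens:
        if token == "<>":
            if not open_elements:
                raise ValueError("empty start-tag with no open element")
            name = open_elements[-1]
            # Assumption: the element cannot nest inside itself, so the
            # new start-tag implies the end-tag of the one still open.
            resolved += [f"</{name}>", f"<{name}>"]
        elif token == "</>":
            if not open_elements:
                raise ValueError("empty end-tag with no open element")
            resolved.append(f"</{open_elements.pop()}>")
        elif token.startswith("</"):
            if not open_elements or open_elements[-1] != token[2:-1]:
                raise ValueError(f"unexpected end-tag {token}")
            open_elements.pop()
            resolved.append(token)
        elif token.startswith("<"):
            open_elements.append(token[1:-1].split()[0])
            resolved.append(token)
        else:
            resolved.append(token)  # character data
    return resolved


tokens = ["<body>", "<p>", "first paragraph", "<>", "second paragraph", "</>"]
print("".join(resolve_empty_tags(tokens)))
# <body><p>first paragraph</p><p>second paragraph</p>
```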
The a element in this paragraph also looks invalid because it appears to be missing both the TAGC and an end-tag, yet neither is really missing and it is not invalid. However, the markup does not mean what it appears to mean at first glance. As explained above for the SHORTTAG NET syntax used for the title element, the first solidus is the net-enabling start-tag close delimiter and the second is the null end-tag. These are processed in this way because the attribute value is not quoted. If it were quoted, the solidus would not represent the NET syntax.
The markup for this section is equivalent to this:
<p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/ to both help prevent cheaters and confuse even the most experienced authors.</p>
This section is quite easy, given that most of the concepts used have been covered earlier. To make it easier to explain, I’ve indented the lines a little, but the markup is otherwise unchanged.
<form method="get" action="http://validator.w3.org/check" <table <tr <td<input text checked id=uri name=uri size=40/> <><label for=uri>Is this test too hard?</label></> <><td<button button>Don't Cheat!</> </tbody ></table>
The start-tags for the form, table, tr and td elements are unclosed start-tags again. The empty start-tags and end-tags should be fairly self explanatory.
The tbody element, like the head and body elements, is missing its start-tag but not its end-tag. The tbody end-tag is not, in this case, an unclosed tag, because it is closed on the following line with the TAGC delimiter just before the table end-tag.
The attributes for the input and button elements make use of a feature called attribute minimisation. Contrary to popular belief, attribute minimisation allows the omission of the attribute name (not the value) where the attribute value may be unambiguously associated with a particular attribute. The text attribute is not actually a text attribute. It is a value that may be unambiguously associated with the type attribute and is, therefore, the minimised form of type="text". The checked attribute is more commonly known as the short form of checked="checked". This is exactly the same for the button element, where the value button is unambiguously associated with the type attribute.
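Attribute minimisation amounts to a reverse lookup from a value to the one attribute that declares it. The Python sketch below illustrates that idea; the ENUMERATED table is a small hand-copied subset of the enumerated attribute values for input and button in HTML 4.01, and the function name expand_minimised is invented for the example.

```python
# Hand-copied subset of enumerated attribute values (assumed from HTML 4.01).
ENUMERATED = {
    "input": {
        "type": {"text", "password", "checkbox", "radio", "submit",
                 "reset", "file", "hidden", "image", "button"},
        "checked": {"checked"},
        "disabled": {"disabled"},
        "readonly": {"readonly"},
    },
    "button": {
        "type": {"button", "submit", "reset"},
        "disabled": {"disabled"},
    },
}


def expand_minimised(element: str, value: str) -> tuple[str, str]:
    """Find the single attribute of `element` whose value set contains `value`."""
    matches = [name for name, values in ENUMERATED[element].items()
               if value.lower() in values]
    if len(matches) != 1:
        raise ValueError(f"{value!r} is not unambiguous for <{element}>")
    return matches[0], value


print(expand_minimised("input", "text"))     # ('type', 'text')
print(expand_minimised("input", "checked"))  # ('checked', 'checked')
print(expand_minimised("button", "button"))  # ('type', 'button')
```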
The input element also uses a net-enabling start-tag. Because it is an empty element, for which end-tags are forbidden, it does not need the null end-tag to be present. The net-enabling start-tag is also followed by a greater than symbol (>), which is designed to make it look like XHTML syntax. However, because the start-tag and element are ended with the net-enabling start-tag close delimiter, the greater than symbol actually follows the element and should be treated as character data, not markup.
Finally, the required end-tag for the form element is not in this section, but it does appear later in the document, and will be discussed when we get to it. The markup for this section is equivalent to:
<form method="get" action="http://validator.w3.org/check"> <table> <tbody> <tr> <td><input type="text" checked="checked" id="uri" name="uri" size="40">></td> <td><label for="uri">Is this test too hard?</label></td> </tr> <tr> <td><button type="button">Don't Cheat!</button></td> </tr> </tbody> </table>
The list is made up of several sections. First, some list items, followed by a processing instruction and lastly a really complicated comment declaration containing what appears to be invalid markup.
<ul/ <li><![CDATA[ <li Oops<!-- ?]]> --> <li>There are < 2 validation errors in this document</li> <?hello comment="What's this doing here?"?> <!--- Found the error yet? ----> <blink>I'll bet this is &#147;annoying&#148;!</blink> <p align="right">Remeber, it's a Strict DOCTYPE! <!-- ------ Don't give up now! ----- > <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> <p>Is the error here --><li>?/
The ul element starts with a net-enabling start-tag, as discussed previously. The null end-tag appears at the very end of this section.
The first list item contains a CDATA section. Many people know this syntax from XML, but it is also valid in SGML, though it is unsupported in most browsers. Opera 8 is the only browser I know of that supports it for HTML. The CDATA section means that its content (everything up to the CDATA section end, ]]>) should be treated as character data.
Thus, the second list item, which looks like an invalid unclosed start-tag containing an invalid attribute, is not really markup. It is character data that should be output as such. If it were markup, it would be an error because “Oops” is not a valid attribute or value. For the same reason, the comment declaration surrounding the CDATA section end is not really a comment. Because no markup is recognised, it’s also treated as character data and output as text.
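One way to see what the CDATA section does is to rewrite it with ordinary character references. The Python sketch below does that with a regular expression and the standard html.escape function; the name cdata_to_text is made up for the example, and it handles only the section itself, not the character data that follows it.

```python
import html
import re

# Everything between '<![CDATA[' and the first ']]>' is character data.
CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_text(markup: str) -> str:
    """Replace each CDATA section with its content, escaped as plain text."""
    return CDATA_SECTION.sub(
        lambda match: html.escape(match.group(1), quote=False), markup)


print(cdata_to_text("<li><![CDATA[ <li Oops<!-- ?]]> -->"))
# <li> &lt;li Oops&lt;!-- ? -->
```

The trailing --> is left as it is, since with no comment open it is only character data.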
The second real list item (following the CDATA section) contains an unencoded less than symbol. In most cases, this should be encoded as &lt;. While in XHTML this is compulsory, in HTML (as with the unencoded ampersand in paragraph 1) there are circumstances where it is valid to leave it unencoded.
The processing instruction <?hello ... ?> is also valid. These may appear almost anywhere within an SGML document. Although this PI has no defined meaning, it does not affect validation in any way.
The big comment declaration is actually fairly complicated, but easy to understand with a basic understanding of SGML comment syntax. If you recall the discussion of the comment syntax above in the DOCTYPE section, you’ll remember that a comment starts and ends with matching pairs of hyphens. By counting the number of hyphen pairs, you should notice that the blink element, the p elements and the meta element are actually commented out.
Within the commented out blink element, there are also some invalid numeric character references used. The decimal code points 147 and 148 are actually Windows-1252 code points, which are commonly used and supported in web browsers. However, because numeric character references are supposed to use Unicode code points, and in Unicode these are control characters, these character references are invalid.
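The difference between the two readings of 147 and 148 is easy to confirm with Python’s standard library. In this small sketch the helper name describe_ncr is hypothetical; it reports the Unicode general category of the code point itself and the name of the character a Windows-1252 interpretation would give.

```python
import unicodedata


def describe_ncr(code_point: int) -> tuple[str, str]:
    """Return (Unicode category of the code point, name of the cp1252 character)."""
    category = unicodedata.category(chr(code_point))   # 'Cc' means control
    intended = bytes([code_point]).decode("cp1252")    # Windows-1252 reading
    return category, unicodedata.name(intended)


print(describe_ncr(147))  # ('Cc', 'LEFT DOUBLE QUOTATION MARK')
print(describe_ncr(148))  # ('Cc', 'RIGHT DOUBLE QUOTATION MARK')
```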
The whole comment is actually closed on the last line of this section, just before a new list item is opened. This is valid because the unordered list has not yet been closed. The final list item simply contains a question mark, and is immediately followed by the null end-tag for the ul element.
<ul> <li> <li Oops<!-- ? --></li> <li>There are < 2 validation errors in this document</li> <?hello comment="What's this doing here?"?> <!--- Found the error yet? ... <p>Is the error here --> <li>?</li> </ul>
The final paragraph, which is also the location of the only validation error within the document, should be extremely easy to understand, if you successfully read and understood the entire explanation so far.
<p/>The question is: Is this<br>HTML or<br/>XHTML served as text/html? &smile</></></>
The p element contains a net-enabling start-tag, followed by a greater than symbol. Again this was designed to look like XHTML’s empty element syntax, but it is not.
The br elements are actually both valid HTML, despite the second appearing to use XHTML empty element syntax also. The second br element is closed by the net-enabling start-tag and also followed by a greater than symbol.
The p element is actually closed by the solidus in text/html. Thus, everything following the solidus is outside of the p element and a direct child of the form element, which, as you should recall, is still open.
The entity reference &smile is missing the reference close (;). This is valid in HTML where the entity reference is followed by any non-name character. However, because the entity declaration was removed from the DOCTYPE above due to lack of browser support, it will be replaced with the real white smiling face character.
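The rules for where an entity reference ends, and for when a lone ampersand is harmless, can be approximated with one regular expression. This Python sketch is a simplification rather than a full SGML implementation: the function name expand_entity_references and the DECLARED table are invented for the example, and only the smile entity from the quiz’s internal subset is declared.

```python
import re

# '&' followed by a name; the reference close ';' is optional because any
# non-name character also ends the reference.
ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

# Only the entity from the quiz's internal subset is declared here.
DECLARED = {"smile": "\u263a"}


def expand_entity_references(text: str, declared: dict[str, str]) -> str:
    """Replace declared entity references; leave everything else untouched."""
    def replace(match: re.Match) -> str:
        return declared.get(match.group(1), match.group(0))
    return ENTITY_REFERENCE.sub(replace, text)


print(expand_entity_references("text/html? &smile</></></>", DECLARED))
# text/html? ☺</></></>
print(expand_entity_references("uncommon & unsupported", DECLARED))
# uncommon & unsupported
```

The ampersand followed by a space never matches the pattern, which mirrors why the lone ampersand in the first paragraph is valid.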
The first empty end-tag is for the form element, which is still open. The second is for the body element and the third is for the html element. The markup for this section is equivalent to:
<p>>The question is: Is this<br>HTML or<br>>XHTML served as text</p>html? ☺ </form> </body> </html>
Putting it All Together
After we combine each of the newly marked up sections using the commonly supported syntax, we end up with the entire document looking like this:
<?xml version="1.0" comment="Find the Error!" ?> <!-- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> --> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"> <html lang="en"> <head> <title>validation quiz</title> </head> <body> <p>In this document, there ∃s a single validation error. It makes use of some <strong><em>very</em></strong> uncommon & unsupported markup techniques designed to fool the faint hearted.</p> <p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/ to both help prevent cheaters and confuse even the most experienced authors.</p> <form method="get" action="http://validator.w3.org/check"> <table> <tbody> <tr> <td><input type="text" checked="checked" id="uri" name="uri" size="40">></td> <td><label for="uri">Is this test too hard?</label></td> </tr> <tr> <td><button type="button">Don't Cheat!</button></td> </tr> </tbody> </table> <ul> <li> <li Oops<!-- ? --></li> <li>There are < 2 validation errors in this document</li> <?hello comment="What's this doing here?"?> <!--- Found the error yet? ... <p>Is the error here --> <li>?</li> </ul> <p>>The question is: Is this<br>HTML or<br>>XHTML served as text</p>html? ☺ </form> </body> </html>
Well, that’s it. If you have any questions or need clarification for anything I’ve discussed, please don’t hesitate to ask me.