Last week, I’m sure you all had fun trying to get your mind around the incredibly complex, yet almost entirely valid, markup in the validation quiz. It was solved a lot sooner than I had expected by both Anne van Kesteren and David Håsäther. Anne was correct, but his explanation wasn’t quite satisfactory enough to win. David’s explanation was spot on. Well done to both of them.
For the rest of you who aren’t SGML experts, and are still trying to figure out how a non-conformant XML declaration in an HTML document with 2 DOCTYPEs can be valid, read on to find out.
The XML Declaration
Despite appearances to the contrary, the first line is not an XML declaration at all.
<?xml version="1.0" comment="Find the Error!" ?>
In SGML, it is a Processing Instruction (PI), which just happens to look somewhat like an XML declaration. Although the meaning of this PI is undefined in SGML and HTML, it still passes validation. If the document were served with an XML MIME type, rather than text/html, then an XML parser would try to process it as an XML declaration, although it would be non-conformant since there is no comment attribute defined for it in the XML recommendation.
It is actually included as the first of two exploits for known validator bugs. This bug — bug 14 to be precise — prematurely sets the validator to XML parsing mode. Because of this, the validator incorrectly parses the comments, DOCTYPEs and everything else that follows as ill-formed and invalid XHTML. This bug is the cause of the 80 incorrect errors being issued for the entire document.
The XHTML 1.0 Strict DOCTYPE
Knowing that the pseudo-XML-declaration above is really a valid SGML PI and that the document is served as text/html, the comment declarations and DOCTYPEs should be parsed with SGML rules, not with XML rules, as the validator incorrectly does.
<!-- -- -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- -- -->
As I briefly mentioned in HTML Comments in Scripts, and as the WDG discusses in more detail, a comment declaration starts with a markup declaration open (MDO) delimiter, <!, ends with a markup declaration close (MDC) delimiter, >, and contains one or more comments. Each comment within the comment declaration starts and ends with a pair of hyphens. Because of this, despite appearances, there is actually only one comment declaration, which surrounds the XHTML 1.0 DOCTYPE and contains 3 separate comments.
The first comment, between the first and second pair of hyphens, and the third comment, between the fifth and sixth pair, each only contain a space. The second comment, between the third and fourth pair, contains everything in between, including the XHTML DOCTYPE. The entire comment declaration ends just before the HTML 4.01 DOCTYPE, making it an HTML 4.01 document.
This section is essentially the same as the following:
<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-->
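The hyphen-pair counting described above can be checked mechanically. The following is only a minimal Python sketch of the comment declaration rules as explained in this section, not a real SGML parser; the function name split_comment_declaration is made up for illustration, and the sample string is the quiz markup with the HTML 4.01 DOCTYPE trimmed to its first line.

```python
def split_comment_declaration(text, start=0):
    """Split one SGML comment declaration into its individual comments.

    Returns (comments, end), where end is the index just past the MDC (">").
    """
    if not text.startswith("<!--", start):
        raise ValueError("not a comment declaration")
    pos = start + 2                      # skip the MDO "<!"
    comments = []
    while True:
        # Only white space may separate the comments inside a declaration.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if text.startswith("--", pos):   # a comment opens with a hyphen pair...
            close = text.find("--", pos + 2)
            if close == -1:
                raise ValueError("unterminated comment")
            comments.append(text[pos + 2:close])
            pos = close + 2              # ...and ends at the next hyphen pair
        elif text.startswith(">", pos):  # the MDC ends the whole declaration
            return comments, pos + 1
        else:
            raise ValueError("unexpected text between comments")


quiz = (
    '<!-- -- -->\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<!-- -- -->\n'
    '<!doctype html public "-//W3C//DTD HTML 4.01//EN" ['
)
comments, end = split_comment_declaration(quiz)
print(len(comments))      # 3
print(quiz[end:].strip()) # the HTML 4.01 DOCTYPE is the first real markup
```

The second of the three comments swallows the whole XHTML DOCTYPE, which is why the document is read as HTML 4.01.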
The HTML 4.01 DOCTYPE
The second exploit for a known bug in the validator — bug 24 — is actually used in the HTML 4.01 DOCTYPE.
<!doctype html public "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>
In SGML, it is perfectly valid to write doctype html public in lowercase, although it is conventional to use uppercase. With a lowercase DOCTYPE, the validator does not identify the document as HTML 4.01, and (if it were valid) would only state “This Page Is Valid!”, rather than “This Page Is Valid HTML 4.01 Strict!”
The HTML 4.01 DOCTYPE also includes an internal subset, i.e. everything between the square brackets. This, in addition to everything defined in the HTML 4.01 Strict DTD, defines a new entity, “smile”, representing the character U+263A, a white smiling face (☺). This entity may be referenced using the entity reference &smile;, as is done later in the document.
This is equivalent to the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>
However, because browsers do not support internal subsets, it will be omitted from the final document and the entity reference will later be replaced with the real character.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
The Document Head
<html lang="en">
<title/validation quiz/
</head>
There is nothing strange about the html start-tag; it’s fairly standard and just describes the document’s language as being English (en).
The start-tag for the head element has been omitted. This is perfectly valid since both the start- and end-tags for the html, head, body and tbody elements are optional. Even though the start-tag is omitted, the head element is still implied by the presence of the title element and, despite the missing start-tag, it is still valid to include the end-tag.
The title element uses a special syntax, known as SHORTTAG NET (Null End Tag). Many of the SGML SHORTTAG features are unsupported in real-world browsers, but that doesn’t make this any less valid. The first solidus closes the start-tag (it is known as a net-enabling start-tag close delimiter) and begins the element’s content. The second solidus is the null end-tag, which closes the element.
The markup for this section is exactly equivalent to this:
<html lang="en">
<head>
<title>validation quiz</title>
</head>
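To make the NET expansion of the title element concrete, here is a small Python sketch. It is an illustration under narrow assumptions only: it handles a start-tag with no attributes whose content contains no other markup or solidus, and the helper name expand_net is hypothetical.

```python
import re

# <name/content/  ->  <name>content</name>
# Assumes no attributes and no "/" or "<" inside the content.
NET_ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(markup):
    """Rewrite simple NET-style elements using ordinary start- and end-tags."""
    return NET_ELEMENT.sub(r"<\1>\2</\1>", markup)


print(expand_net("<title/validation quiz/"))  # <title>validation quiz</title>
```

A real SGML parser tracks which open element enabled the null end-tag, so a regular expression like this is only good for reading simple cases by eye.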
The Document Body
As with the head element, the start-tag for the body element has been omitted. However, as we will see later, the end-tag hasn’t been, although its presence is not immediately obvious. The body element is implied by the first paragraph, immediately following the end-tag for the head element.
(Note: It is not implied simply by the end-tag for the head element, even though the next element must be a body element.)
Paragraph 1
The first paragraph is quite straightforward.
<p>In this document, there &exist;s a single validation error.
It makes use of some <strong<em/very/</strong> uncommon &
unsupported markup techniques designed to fool the faint
hearted.
It starts with the p start-tag, which is required, but the optional end-tag is omitted (which is important, as we will see for paragraph 2). It contains an entity reference, &exist;, which is defined in the character entity references for symbols, mathematical symbols, and Greek letters.
The strong and em elements in this paragraph may look invalid, but they are not. The em element makes use of the same SHORTTAG NET syntax used for the title element.
The strong element, however, is a little more confusing. The start-tag is unclosed: it omits the tag close delimiter (TAGC), >. In SGML, TAGC may be omitted when the first non-white-space character that follows is a tag open delimiter (TAGO), <.
There is also a lone ampersand within this paragraph. Because the ampersand usually represents the start of an entity reference, ampersands are, in many circumstances, required to be written using the entity reference &amp;. However, there are valid cases in SGML where this is not required, such as when the ampersand is immediately followed by white space or another character that may not be part of an entity name.
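As a rough illustration of when an ampersand counts as markup, the Python sketch below treats "&" as the start of a reference only when a letter (or "#" plus a number) follows it. The pattern and the function name find_references are my own simplification of the SGML rule, and the closing semicolon is assumed to be optional.

```python
import re

# "&" followed by a letter starts an entity reference; "&#" followed by
# digits (or "x" and hex digits) starts a numeric character reference.
# Anything else, such as "& " followed by a space, is plain character data.
REFERENCE = re.compile(
    r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9.\-_:]*);?"
)


def find_references(text):
    """Return the names (or #numbers) of the references found in text."""
    return [match.group(1) for match in REFERENCE.finditer(text)]


print(find_references("there &exist;s a single validation error"))  # ['exist']
print(find_references("uncommon & unsupported markup"))             # []
```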
As stated previously, because this is the first paragraph, it implies the presence of the body start-tag. The paragraph’s end-tag is actually implied by the start of the next paragraph, but I’ve included it in this section because it’s easier and makes no difference to the end result. Thus the markup for this section is equivalent to this:
<body>
<p>In this document, there &exist;s a single validation error.
It makes use of some <strong><em>very</em></strong> uncommon
& unsupported markup techniques designed to fool the
faint hearted.</p>
Paragraph 2
The second paragraph is a little more complex. It starts and ends with tags that are missing the tag name. In SGML terms, these are respectively known as empty start-tags and empty end-tags.
<>This exploits some known bugs in <a href=http://validator.w3.org/
to both help prevent cheaters and confuse even the most experienced
authors.</>
According to the rules of SGML, an empty start-tag represents the same element as the most recently opened element within the tree. There is a condition in SGML that changes this rule, but that small detail will be omitted because it is not relevant to HTML. For a full explanation, see section 9.3.1 Empty Tags in Martin Bryan’s SGML and HTML Explained.
Because the end-tag for the first paragraph was omitted (recall that I said it was important), the paragraph element is still open, and thus the empty start-tag is recognised as a paragraph start-tag as well. As I stated previously, this start-tag also implies the end-tag for the previous paragraph. The empty end-tag simply closes the most recently opened element within the tree, which is this paragraph.
The a element in this paragraph also looks invalid because it appears to be missing both the TAGC and an end-tag, yet neither is really missing and it is not invalid. However, the markup does not mean what it appears to mean at first glance. As explained above for the SHORTTAG NET syntax used for the title element, the first solidus is the net-enabling start-tag close delimiter and the second is the null end-tag. These are processed in this way because the attribute value is not quoted. If it were quoted, the solidus would not represent the NET syntax.
The markup for this section is equivalent to this:
<p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
to both help prevent cheaters and confuse even the most experienced
authors.</p>
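The reason the href value stops at http: is that an unquoted attribute value can only consist of name characters. The Python sketch below assumes the name characters of HTML 4 (letters, digits, period, hyphen, underscore and colon); the helper unquoted_value is hypothetical and only splits the text, it does not parse tags.

```python
import re

# Characters allowed in an unquoted attribute value, assuming HTML 4's
# name characters: letters, digits and . - _ :
NAME_CHARS = re.compile(r"[A-Za-z0-9.\-_:]*")


def unquoted_value(rest):
    """Split the text after "attr=" into (unquoted value, remaining text)."""
    value = NAME_CHARS.match(rest).group(0)
    return value, rest[len(value):]


value, remainder = unquoted_value("http://validator.w3.org/")
print(value)      # http:
print(remainder)  # //validator.w3.org/
```

The first solidus of the remainder is then read as the net-enabling start-tag close, the second as the null end-tag, and everything after that is ordinary text.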
The Form
This section is quite easy, given that most of the concepts used have been covered earlier. To make it easier to explain, I’ve indented the lines a little, but the markup is otherwise unchanged.
<form method="get" action="http://validator.w3.org/check"
<table
<tr
<td<input text checked id=uri name=uri size=40/>
<><label for=uri>Is this test too hard?</label></>
<><td<button button>Don't Cheat!</>
</tbody
></table>
The start-tags for the form, table, tr and td elements are unclosed start-tags again. The empty start-tags and end-tags should be fairly self-explanatory.
The tbody element, like the head and body elements, is missing its start-tag but not its end-tag. The tbody end-tag is not, in this case, an unclosed tag, because it is closed on the following line with the TAGC delimiter just before the table end-tag.
The attributes for the input and button elements make use of a feature called attribute minimisation. Contrary to popular belief, attribute minimisation allows the omission of the attribute name (not the value) where the attribute value may be unambiguously associated with a particular attribute. The text attribute is not actually a text attribute. It is a value that may be unambiguously associated with the type attribute and is, therefore, the minimised form of type="text".
The checked attribute is more commonly known as the short form of checked="checked". This is exactly the same for the button element, where the value button is unambiguously associated with type="button".
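Attribute minimisation can be pictured as a lookup from a bare value to the one attribute that accepts it. The Python sketch below uses a hand-written and deliberately abridged table of enumerated values for input and button; the table ENUMERATED and the function expand_minimised are illustrative assumptions, not something generated from the DTD.

```python
# Abridged, hand-written excerpt of enumerated attribute values
# (an assumption for illustration; the real DTD declares more attributes).
ENUMERATED = {
    "input": {
        "type": {"text", "password", "checkbox", "radio", "submit",
                 "reset", "file", "hidden", "image", "button"},
        "checked": {"checked"},
        "disabled": {"disabled"},
        "readonly": {"readonly"},
    },
    "button": {
        "type": {"button", "submit", "reset"},
        "disabled": {"disabled"},
    },
}


def expand_minimised(element, value):
    """Return the full attribute for a bare value, e.g. text -> type="text"."""
    value = value.lower()
    owners = [name for name, allowed in ENUMERATED[element].items()
              if value in allowed]
    if len(owners) != 1:
        raise ValueError(f"{value!r} is not unambiguous on <{element}>")
    return f'{owners[0]}="{value}"'


print(expand_minimised("input", "text"))     # type="text"
print(expand_minimised("input", "checked"))  # checked="checked"
print(expand_minimised("button", "button"))  # type="button"
```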
The input element also uses a net-enabling start-tag. Because it is an empty element, for which end-tags are forbidden, it does not need the null end-tag to be present. The net-enabling start-tag is also followed by a greater-than symbol, >, which is designed to make it look like XHTML syntax. However, because the start-tag and element are ended with the net-enabling start-tag close delimiter, the greater-than symbol actually follows the element and should be treated as character data, not markup.
Finally, the required end-tag for the form element is not in this section, but it does appear later in the document and will be discussed when we get to it. The markup for this section is equivalent to:
<form method="get" action="http://validator.w3.org/check">
<table>
<tbody>
<tr>
<td><input type="text" checked="checked" id="uri"
name="uri" size="40">></td>
<td><label for="uri">Is this test too hard?</label></td>
</tr>
<tr>
<td><button type="button">Don't Cheat!</button></td>
</tr>
</tbody>
</table>
The List
The list is made up of several sections. First, some list items, followed by a processing instruction and lastly a really complicated comment declaration containing what appears to be invalid markup.
<ul/
<li><![CDATA[
<li Oops<!-- ?]]> -->
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ---->
<blink>I'll bet this is &#147;annoying&#148;!</blink>
<p align="right">Remeber, it's a Strict DOCTYPE!
<!-- ------ Don't give up now! ----- >
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<p>Is the error here --><li>?/
The ul element starts with a net-enabling start-tag, as discussed previously. The null end-tag appears at the very end of this section.
The first list item contains a CDATA section. Many people know this syntax from XML, but it is also valid in SGML, though it is unsupported in most browsers. Opera 8 is the only browser I know of that supports it for HTML. The CDATA section means that its content (everything up to the CDATA section end, ]]>) should be treated as character data.
Thus, the second list item, which looks like an invalid unclosed start-tag containing an invalid attribute, is not really markup. It is character data that should be output as such. If it were markup, it would be an error because “Oops” is not a valid attribute or value. For the same reason, the comment declaration surrounding the CDATA section end is not really a comment. Because no markup is recognised, it’s also treated as character data and output as text.
The second real list item (following the CDATA section) contains an unencoded less-than symbol. In most cases, this should be encoded as &lt;. While this is compulsory in XHTML, in HTML (as with the unencoded ampersand in paragraph 1) there are circumstances where it is valid to leave it unencoded.
The processing instruction <?hello ... ?> is also valid. These may appear almost anywhere within an SGML document. Although this PI has no defined meaning, it does not affect validation in any way.
The big comment declaration looks complicated, but it is easy to follow with a basic understanding of SGML comment syntax. If you recall the discussion of the comment syntax above in the DOCTYPE section, you’ll remember that a comment starts and ends with matching pairs of hyphens. By counting the number of hyphen pairs, you should notice that the blink element, the p elements and the meta element are actually commented out.
Within the commented-out blink element, there are also some invalid numeric character references. The decimal code points 147 and 148 are actually Windows-1252 code points, which are commonly used and supported in web browsers. However, because numeric character references are supposed to use Unicode code points, and in Unicode these are control characters, these character references are invalid.
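The difference between the two readings of 147 and 148 is easy to see with Python's standard library. This is just a quick check, with a hypothetical helper named describe: it looks at each number first as a Unicode code point and then as a Windows-1252 byte.

```python
import unicodedata


def describe(code):
    """Return (Unicode category of the code point, Windows-1252 character)."""
    as_unicode = chr(code)                      # what &#code; refers to
    as_cp1252 = bytes([code]).decode("cp1252")  # what authors usually meant
    return unicodedata.category(as_unicode), as_cp1252


for code in (147, 148):
    category, intended = describe(code)
    print(code, category, f"U+{ord(intended):04X}", unicodedata.name(intended))
```

The category Cc marks a control character, while the Windows-1252 bytes decode to the left and right double quotation marks, U+201C and U+201D.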
The whole comment is actually closed on the last line of this section, just before a new list item is opened. This is valid because the unordered list has not yet been closed. The final list item simply contains a question mark, and is immediately followed by the null end-tag for the ul element.
<ul>
<li>
&lt;li Oops&lt;!-- ? --></li>
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ...
<p>Is the error here --> <li>?</li>
</ul>
Paragraph 3
The final paragraph, which is also the location of the only validation error within the document, should be extremely easy to understand, if you successfully read and understood the entire explanation so far.
<p/>The question is: Is this<br>HTML or<br/>XHTML
served as text/html? &smile</></></>
The p element begins with a net-enabling start-tag, followed by a greater-than symbol. Again, this was designed to look like XHTML’s empty element syntax, but it is not.
The two br elements are actually both valid HTML, despite the second also appearing to use XHTML empty element syntax. The second br element is closed by the net-enabling start-tag and is also followed by a greater-than symbol.
The p element is actually closed by the solidus in text/html. Thus, everything following the solidus is outside of the p element and a direct child of the form element, which, as you should recall, is still open. Since HTML 4.01 Strict does not allow text directly inside a form element, this stray text is the document’s single validation error.
The entity reference &smile is missing the reference close (REFC) delimiter, the semicolon (;). This is valid in HTML where the entity reference is followed by any non-name character. However, because the entity declaration was removed from the DOCTYPE above due to lack of browser support, it will be replaced with the real white smiling face character.
The first empty end-tag is for the form element, which is still open. The second is for the body element and the third is for the html element.
<p>>The question is: Is this<br>HTML or<br>>XHTML
served as text</p>html? ☺
</form>
</body>
</html>
Putting it All Together
After we combine each of the newly marked up sections using the commonly supported syntax, we end up with the entire document looking like this:
<?xml version="1.0" comment="Find the Error!" ?>
<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
<title>validation quiz</title>
</head>
<body>
<p>In this document, there &exist;s a single validation error.
It makes use of some <strong><em>very</em></strong> uncommon
& unsupported markup techniques designed to fool the faint
hearted.</p>
<p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
to both help prevent cheaters and confuse even the most experienced
authors.</p>
<form method="get" action="http://validator.w3.org/check">
<table>
<tbody>
<tr>
<td><input type="text" checked="checked" id="uri"
name="uri" size="40">></td>
<td><label for="uri">Is this test too hard?</label></td>
</tr>
<tr>
<td><button type="button">Don't Cheat!</button></td>
</tr>
</tbody>
</table>
<ul>
<li>
&lt;li Oops&lt;!-- ? --></li>
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ...
<p>Is the error here -->
<li>?</li>
</ul>
<p>>The question is: Is this<br>HTML or<br>>XHTML
served as text</p>html? ☺
</form>
</body>
</html>
Well, that’s it. If you have any questions or need clarification for anything I’ve discussed, please don’t hesitate to ask me.
I guess the quiz mainly proves why throwing away a huge swath of SGML and calling the result XML was a good idea…
Very interesting stuff. With all the emphasis these days on XHTML, it’s easy to forget about HTML’s SGML roots, and all the strange things that result from it. One correction though: you got the URI wrong for SGML and HTML Explained.
BTW has anyone documented which browsers support these SGML tricks? Might be interesting to get a list of “safe” ones that you can use in web pages.
Hello, I saw code like <html lang="en">
Could you help me to find all list of languages, I meant for other european languages?
Thanks in advance
Sarkis, firstly, I am not a help desk. Secondly, that is off topic for this entry. Thirdly, do a search for “ISO 639” (language codes) and “ISO 3166” (country codes) and you’ll find what you’re looking for.
Lachlan Hunt – Thanks & sorry.