Monthly Archives: May 2005

Separator Elements

The traditional, structured way to mark up and delimit separate sections in SGML and XML is to put each part within its own container element, e.g.

Example 1:

<part>
  <p>part 1</p>
</part>
<part>
  <p>part 2</p>
</part>

From a structural point of view, this clearly indicates a certain separation between the two parts, whatever the actual container element or semantics may be.

However, there is an anomaly in all versions of HTML and XHTML: some elements are designed as separators rather than container elements. Specifically, I’m referring to HTML’s hr and br elements, and XHTML 2’s proposed separator element. To separate one part from another, these elements are placed between the content, which some argue goes against the spirit of SGML and XML. e.g.

Example 2:

<p>part 1</p>
<separator/>
<p>part 2</p>

Tag Name

The XHTML 2 separator element has been renamed from HTML 4’s hr element. The reason for this is that the presentation of the separation is not always horizontal, and not always a rule. According to a recent presentation by Steven Pemberton, the Japanese community were asking for a vr (Vertical Rule) element. He also provided some examples from different editions of the same book where, in each case, the presentation of the separator was different, but the semantics were the same.

In order to remove all presentational suggestion from the element name, and in an attempt to address the true structure of the construct, the element was renamed to separator. However, this name has been questioned because of its apparent spelling difficulty. It seems that ‘separator’ is often incorrectly spelt with an ‘e’ in place of the first ‘a’ as: ‘seperator’. There has been a suggestion for the tag name to be reduced to ‘sep’ to avoid all this confusion and make it easier to type.

However, regardless of the tag name’s spelling, the question still remains as to whether the name truly represents the element’s semantics; or, indeed, whether it has any semantics at all, or if it’s just a placeholder for a presentational construct.

Structure

Jukka Korpela has previously argued strongly against these empty separator-like elements in HTML, and suggested that the br element be replaced with a much more structured line element instead, and that the hr element be replaced with a similar structural equivalent. He compares hr with the original structure of the p element as a separator between paragraphs, rather than a container; and Unicode control codes; and explains why this structure is not appropriate in an SGML or XML context.

There is also the question of what exactly does it separate? In example 2 above, the separator element clearly separates part 1 from part 2. However, the separation is not always as clear. For example, if the separator were to be included within its own part, then the question of what it actually separates becomes a little more complex.

Example 3:

<part>
  <p>part 1</p>
</part>
<part>
  <separator/>
</part>
<part>
  <p>part 2</p>
</part>

It would appear that the intention of this markup structure is to mark up the separation between parts 1 and 2. However, from a structural point of view, the separator element in this case really only serves to separate the content within the second part element. If the separator element is really intended to both structurally and semantically separate parts 1 and 2, then it would seem that the boundary of separation extends beyond the separator element’s parent, which really seems to break the tree-like structure of XML markup even more.

Now, you may be asking what possible reason there is that would require such a structure with any real XHTML elements; but take my word for it, as we will see later, this kind of structure is required in some circumstances, for the separator element.

Alternative Markup

The issue with the br element has in fact already been addressed with the introduction of the l element in XHTML 2, but the issues with the hr and separator elements still remain. In essence, the hr element is the block-level equivalent of the inline br element; yet it is not as easy to replace it with a structural and semantic equivalent as it is to replace br with l, because its semantics are harder to define.

Examples 1 and 2 above are logically identical, in that they both mark up the separation between parts 1 and 2. They are clearly structurally different; but the question of semantics still remains. It has been argued that the structured, semantic equivalent already exists in XHTML 2 as the section element. For example, the following is structurally equivalent to example 1 and logically equivalent to both examples 1 and 2, but is it semantically equivalent to either?

Example 4:

<section>
  <p>part 1</p>
</section>
<section>
  <p>part 2</p>
</section>

Presentation

One of the arguments against the semantic equivalence of examples 2 and 4 is that example 2 typically results in some default presentation as an indication of separation between sections, which may convey some semantics to the reader. This indication is typically a horizontal rule, stars, or other graphical representation in a visual medium; a long pause in an aural medium, etc.

Although the presentation alone is not, in any way, a reason to retain any element in a semantic language – particularly if a non-presentational, well structured, semantic equivalent already exists – the issue of presentation is not simply about how it looks in this case; but the semantics that the presentation conveys to the reader, which makes it semantically different from other existing markup.

Semantics

As we have seen throughout this discussion, each issue comes down to the issue of semantics, and what the element actually means. One of the problems with defining the semantics is that the definition was rather vague in HTML 2.0, and only became more so later. The definitions for the hr element in all versions from HTML 2.0 through to XHTML 1.1 and the separator element in the XHTML 2.0 draft are as follows:

Horizontal Rule: HR (HTML 2.0):
The <HR> element is a divider between sections of text; typically a full width horizontal rule or equivalent graphic.
Horizontal Rules (HTML 3.0 – Expired draft)
The <HR> element is used for horizontal rules that act as dividers between sections. The SRC attribute can be used to designate a custom graphic, otherwise subclass HR with the CLASS attribute and specify the appropriate rendering with an associated style sheet.
HR – horizontal rules (HTML 3.2)
Horizontal rules may be used to indicate a change in topic. In a speech based user agent, the rule could be rendered as a pause.
Rules: the HR element (HTML 4.0 to XHTML 1.1 (Modularization of XHTML))
The HR element causes a horizontal rule to be rendered by visual user agents.
The separator element (XHTML 2.0)
The separator element separates parts of the document from each other.

With such poor definitions, it’s easy to understand why there is such a debate over the element’s semantics, and why it is often viewed as a poorly structured, non-semantic, presentational element. In order to understand its true semantics, and how it differs from other elements like section, we must analyse its legitimate, non-presentational usage in the real world. There are two common use cases we will investigate, including separators used in books, and separators used to group menu items in a typical GUI.

Usage

Book Chapters and Topics

In books, and other similar publications, such separators are often used to indicate a minor change in topic, scene or perspective. These changes, or divisions, are usually smaller than a whole chapter and, in fact, a chapter may contain many such divisions with each separated visually with some kind of rule, stars or other graphical representation.

Compare this with the section element, which is designed to structure a document into sections – each with its own heading (in most cases). The sections delimited with the separator element tend not to have headings – they’re still related to the same section – but they do indicate a slight, yet related, change in topic. Thus, while both are similar, there is a semantic difference between the two.

Menu Items

For menu items, such separators usually group related items. For example, options to create, open and save documents are separated from those used to preview and print the document, yet they are still operations performed on the file as a whole, and therefore generally belong in the File menu. In this case, the separator doesn’t really indicate a change in topic, but rather a way to group related items. Using the separator element, this structure could be marked up as follows.

Example 5:

<nl>
  <label>Menu</label>
  <li>Item 1</li>
  <li>Item 2</li>
  <li><separator/></li>
  <li>Item 3</li>
  <li>Item 4</li>
</nl>

Note the similarity in structure to example 3 where the separator element is within its own separate container element.

It should be noted that there is another common method used to group related menu items: submenus, which usually fly out to the side of the parent menu item. For this reason, it is often believed that nested lists represent the semantics of submenus, while separators indicate the semantics of the lower-level grouping. However, as I will demonstrate, this is not always the case, and structures already exist to semantically mark up both kinds of groupings without the use of a separator element.

Solutions

The first, and most often proposed, solution is to introduce a new container element for these purposes. The benefit is that the element will (hopefully) have well defined semantics and allow a UA to present it appropriately. However, the difficulty with this is that a new element that would be appropriate for use within a section would not be appropriate for use within a list, due to the differing content models and semantics of the separation.

The next commonly proposed solution is for authors to include a class attribute on the elements that require separation and then use CSS to style them appropriately, such as with a border. While this solution is perfectly reasonable when the separation is purely presentational with no semantic meaning whatsoever, it is not suitable for the semantic cases discussed above.

If it were used for a semantic case, then it would be a case of moving the semantics out with the presentation (rather than separating) and actually depend upon the availability of stylesheets to accurately express the semantics. As a result, the document may inadvertently change meaning when stylesheets are disabled or unsupported.

Another solution is to make use of a semantic attribute on existing container elements, such as section, div or whatever else. Luckily, in XHTML 2, there already exists an attribute for the purpose of extending elements with more specific and well defined semantics, known as the role attribute.

Book Chapters and Topics

Using the example of the book chapters given earlier, a section element could be used to represent a chapter and div elements to represent the change in topic.

Example 6:

<section role="chapter">
  <h>Chapter Title</h>
  <div role="topic">
    <p>part 1</p>
  </div>
  <div role="topic">
    <p>part 2</p>
  </div>
</section>

Ordinarily, the div element is semantically meaningless; however semantics are added to it with the use of the role attribute in this case. Since the semantics of the role attribute and value would be well defined, unlike the class attribute, UAs may freely default to drawing a horizontal rule, or other presentation, between each topic; even in a non-CSS environment. Of course, authors may freely use CSS to alter this presentation as desired. The important point is that the semantics of the separation is retained, while providing a much more structured and semantic method.

Menu Items

Unlike books, the separation in the case of menu items generally doesn’t indicate a change of topic; but rather a method to logically group related items. This separation is still somewhat semantic – not entirely presentational – and, as such, requires structurally correct, semantic markup. This can be easily achieved using various kinds of nested lists.

In the following example, the nested unordered lists represent the same grouping achieved with a separator element, and the nested navigation list represents a sub menu.

Example 7:

<nl>
  <label>Menu</label>
  <li>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
  </li>
  <li>
    <ul>
      <li>Item 3</li>
      <li>Item 4</li>
    </ul>
  </li>
  <li>
    <nl>
      <label>Submenu</label>
      <li>Submenu Item 1</li>
      <li>Submenu Item 2</li>
    </nl>
  </li>
</nl>
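
The move from example 5 to example 7 is a mechanical regrouping, which the following Python sketch illustrates. It is only an illustration under assumptions: the SEPARATOR marker and the group_items and to_nested_list helpers are hypothetical names introduced here, and the output simply mirrors the XHTML 2 draft markup used in example 7.

```python
from itertools import groupby

# Hypothetical marker standing in for <li><separator/></li> in a flat menu.
SEPARATOR = object()


def group_items(items):
    """Split a flat item list into groups at each separator marker."""
    return [
        list(group)
        for is_separator, group in groupby(items, key=lambda item: item is SEPARATOR)
        if not is_separator
    ]


def to_nested_list(label, groups):
    """Render each group as a nested ul inside its own li, as in example 7."""
    lines = ["<nl>", f"  <label>{label}</label>"]
    for group in groups:
        lines += ["  <li>", "    <ul>"]
        lines += [f"      <li>{item}</li>" for item in group]
        lines += ["    </ul>", "  </li>"]
    lines.append("</nl>")
    return "\n".join(lines)


flat_menu = ["Item 1", "Item 2", SEPARATOR, "Item 3", "Item 4"]
groups = group_items(flat_menu)
markup = to_nested_list("Menu", groups)
```

The grouping that the separator only hinted at becomes part of the tree itself, so nothing has to look outside its parent element to find out what is being separated.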

Conclusion

It should be clear that the structure of the separator element is not really appropriate for the tree-like structure of an XML document — it’s an anomaly introduced in early versions of HTML that is still hanging on by a thread. Regardless of what the element is actually called: hr, separator or even sep, the most important issue that needs to be addressed is its semantic definition, and thus whether the structure is really appropriate within a semantic language.

Only after clearly defining the intended semantics of the element can it be determined whether or not suitable alternatives already exist, and, if not, whether or not it is possible to apply the same semantics in a more well structured manner.

There are already elements in XHTML intended to mark up sections and divisions within a document, as well as many other structures. In many of these structures, there may be reasons to logically, structurally and semantically group and/or separate sections within. However, because the semantics of the separation for one structure may differ from another, a single general purpose grouping or separation mechanism may not provide the required structure and semantics in all cases.

Where the separation indicates a change in topic, scene or perspective, the use of existing structural elements with a semantic attribute is a good, feasible solution. However, where the separation merely indicates groups of related items, such as in a list, then adjusting the markup structure may be all that is needed; perhaps combined with some additional semantic attributes as well.

Validation Quiz Explanation

Last week, I’m sure you all had fun trying to get your mind around understanding the incredibly complex, yet almost entirely valid markup in the validation quiz. It was solved a lot sooner than I had expected by both Anne van Kesteren and David Håsäther. Anne was correct, but his explanation wasn’t quite satisfactory enough to win. David’s explanation was spot on. Well done to both of them.

For the rest of you who aren’t SGML experts, and are still trying to figure out how a non-conformant XML declaration in an HTML document with 2 DOCTYPEs can be valid, read on to find out.

The XML Declaration

Despite appearances to the contrary, the first line is not an XML declaration at all.

<?xml version="1.0" comment="Find the Error!" ?>

In SGML, it is a Processing Instruction, which just happens to look somewhat like an XML declaration. Although the meaning of the PI is undefined in SGML and HTML, it still passes validation. If the document were served with an XML MIME type, rather than text/html, then an XML parser would try to process it as an XML declaration, although it would be non-conformant since there is no comment attribute defined for it in the XML recommendation.
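
The XML side of this can be checked with any conforming XML parser. As a small sketch, assuming Python's standard xml.etree.ElementTree module (which is built on expat), the quiz's first line should be rejected while an ordinary declaration is accepted. The is_well_formed helper and the bare html root element are assumptions made for the illustration, not part of the quiz.

```python
import xml.etree.ElementTree as ET

QUIZ_FIRST_LINE = '<?xml version="1.0" comment="Find the Error!" ?>'


def is_well_formed(document: str) -> bool:
    """Return True if the expat-based parser accepts the document as XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A bare root element stands in for the rest of the quiz document.
quiz_as_xml = is_well_formed(QUIZ_FIRST_LINE + "<html/>")
plain_declaration = is_well_formed('<?xml version="1.0" ?><html/>')
```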

It is actually included as the first of two exploits for known validator bugs. This bug — bug 14 to be precise — prematurely sets the validator to XML parsing mode. Because of this, the validator incorrectly parses the comments, DOCTYPEs and everything else that follows as ill-formed and invalid XHTML. This bug is the cause of the 80 incorrect errors being issued for the entire document.

The XHTML 1.0 Strict DOCTYPE

Knowing that the pseudo-XML-declaration above is really a valid SGML PI and that the document is served as text/html, the comment declarations and DOCTYPEs should be parsed with SGML rules, not with XML rules like the validator does incorrectly.

<!-- -- -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- -- -->

As I briefly mentioned in HTML Comments in Scripts and which is discussed in more detail by the WDG, a comment declaration starts with a markup declaration open (MDO): <!, ends with a markup declaration close (MDC) > and contains one or more comments. Each comment within the comment declaration starts with and ends with a matching pair of hyphens. Because of this, despite appearances, there is actually only one comment declaration that surrounds the XHTML 1.0 DOCTYPE and contains 3 separate comments.

The first comment, between the first and second pair of hyphens, and the third comment, between the fifth and sixth pair, each only contain a space. The second comment, between the third and fourth pair, contains everything in between, including the XHTML DOCTYPE. The entire comment declaration ends just before the HTML 4.01 DOCTYPE, making it an HTML 4.01 document.

This section is essentially the same as the following:

<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
--> 
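
The hyphen-pair counting described above can be sketched in a few lines of Python. This is a simplified reading of the SGML comment declaration rules, not a full SGML parser, and parse_comment_declaration is a hypothetical helper written for this explanation.

```python
def parse_comment_declaration(text: str, start: int = 0):
    """Split one SGML comment declaration into its individual comments.

    Returns the list of comment bodies and the index just past the MDC (>).
    """
    if not text.startswith("<!--", start):
        raise ValueError("not a comment declaration")
    pos = start + 2  # skip the MDO (<!)
    comments = []
    while True:
        # White space is allowed between comments inside the declaration.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if text.startswith("--", pos):
            # A comment runs from one pair of hyphens to the next pair.
            end = text.find("--", pos + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            comments.append(text[pos + 2:end])
            pos = end + 2
        elif text.startswith(">", pos):
            return comments, pos + 1  # MDC closes the declaration
        else:
            raise ValueError("unexpected character outside a comment")


quiz_prolog = (
    '<!-- -- -->\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<!-- -- -->\n'
    '<!doctype html public "-//W3C//DTD HTML 4.01//EN">'
)
comments, end = parse_comment_declaration(quiz_prolog)
remainder = quiz_prolog[end:].lstrip()
```

Here comments holds three entries, with the XHTML DOCTYPE inside the second one, and remainder begins with the HTML 4.01 DOCTYPE.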

The HTML 4.01 DOCTYPE

The second exploit for a known bug in the validator — bug 24 — is actually used in the HTML 4.01 DOCTYPE.

<!doctype html public "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>

In SGML, it is perfectly valid to write doctype html public in lowercase, although it is conventional to use uppercase. With a lower case DOCTYPE, the validator does not identify the document as HTML 4.01, and (if it were valid) would only state This Page Is Valid!, rather than This Page Is Valid HTML 4.01 Strict!

The HTML 4.01 DOCTYPE also includes an internal subset, i.e. everything between the square brackets. This, in addition to everything defined in the HTML 4.01 Strict DTD, defines a new entity “smile”, representing the character U+263A, a white smiling face (☺). This entity may be referenced using the entity reference &smile;, as is done later in the document.

This is equivalent to the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>

However, because internal subsets are unsupported, it will be omitted from the final document and the entity reference will be replaced with the real character later.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

The Document Head

<html lang="en">
<title/validation quiz/
</head>

There is nothing strange about the html start-tag, it’s fairly standard and just describes the document’s language as being English (en).

The start-tag for the head element has been omitted. This is perfectly valid since both the start- and end-tags for the html, head, body and tbody elements are optional. Even though the start-tag is omitted, the head element’s start-tag is still implied by the presence of the title element and, despite the missing start-tag, it is still valid to include the end-tag.

The title element uses a special syntax, known as SHORTTAG NET (Null End Tag). Many of the SGML SHORTTAG features are unsupported in real world browsers, but that doesn’t make this any less valid. The first solidus closes the start-tag (known as a net-enabling start-tag close delimiter) and begins the element’s content. The second solidus is the null-end-tag, which closes the element.

The markup for this section is exactly equivalent to this:

<html lang="en">
<head>
  <title>validation quiz</title>
</head> 
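
To make the NET rewriting concrete, here is a rough Python sketch that expands the simple cases used in the quiz. It is deliberately naive: the expand_net helper is hypothetical, it knows nothing about the DTD, and so it handles neither EMPTY elements such as br and input nor quoted attribute values.

```python
import re

# <name, optional unquoted attributes, solidus, content, solidus
NET_ELEMENT = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)"   # element name
    r"((?:\s+[^\s/<>]+)*)"       # unquoted attributes, none containing a solidus
    r"/([^/<]*)/"                # NET-enabling solidus, content, null end-tag
)


def expand_net(markup: str) -> str:
    """Rewrite <name/content/ as <name>content</name>."""
    return NET_ELEMENT.sub(
        lambda m: f"<{m.group(1)}{m.group(2)}>{m.group(3)}</{m.group(1)}>",
        markup,
    )


title = expand_net("<title/validation quiz/")
emphasis = expand_net("<em/very/")
```

Because an unquoted attribute value cannot contain a solidus, the same pattern also shows why an unquoted URI ends up split in two: the first solidus after http: closes the start-tag and the second ends the element.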

The Document Body

Like the head element, the start-tag for the body element has been omitted. However, as we will see later, the end-tag hasn’t been, although its presence is not immediately obvious. The body element is implied by the first paragraph, immediately following the end-tag for the head element. (Note: It is not implied simply by the end-tag for the head element, even though the next element must be a body element.)

Paragraph 1

The first paragraph is quite straightforward.

<p>In this document, there &exist;s a single validation error.
   It makes use of some <strong<em/very/</strong> uncommon &
   unsupported markup techniques designed to fool the faint
   hearted.

It starts with the p start-tag, which is required, but the optional end-tag is omitted (which is important, as we will see for paragraph 2). It contains an entity reference, &exist; which is defined in the character entity references for symbols, mathematical symbols, and Greek letters.

The strong and em elements in this may look invalid, but they are not. The em element makes use of the same SHORTTAG NET syntax used for the title element. The strong element, however, is a little more confusing. The start-tag is unclosed — it omits the tag close delimiter (TAGC) >. In SGML, TAGC may be omitted when the first non-white space character is a tag open delimiter (TAGO) <.

There is also a lone ampersand within this paragraph. Because the ampersand usually represents the start of an entity reference; ampersands are, in many circumstances, required to be written using the entity reference &amp;. However, there are valid cases in SGML where this is not required, such as when immediately followed by white space or other character that may not be part of an entity name.

As stated previously, because this is the first paragraph, it implies the presence of the body start-tag. The end-tag is actually implied by the start of the next paragraph, but I’ve included it in this section because it’s easier and makes no difference to the end result. Thus the markup for this section is equivalent to this:

<body>
  <p>In this document, there &exist;s a single validation error.
     It makes use of some <strong><em>very</em></strong> uncommon
     &amp; unsupported markup techniques designed to fool the
     faint hearted.</p>
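
The rule for the lone ampersand can be approximated with a regular expression: an ampersand only starts an entity reference when a name start character follows it. The sketch below is an approximation under assumptions. The name character set (letters, digits, and the characters . - _ :) is my reading of the HTML 4 SGML declaration, numeric character references are ignored, and find_entity_references is a hypothetical helper. The trailing semicolon is optional in the pattern because SGML allows the REFC delimiter to be omitted before a non-name character, which the quiz also relies on later.

```python
import re

# "&" followed by a name; the closing ";" (REFC) may be omitted in SGML.
ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9._:-]*);?")


def find_entity_references(text: str) -> list:
    """Return the entity names referenced in text, ignoring bare ampersands."""
    return ENTITY_REFERENCE.findall(text)


paragraph_1 = "there &exist;s a single validation error ... uncommon & unsupported"
found = find_entity_references(paragraph_1)
```

Only exist is reported for paragraph 1; the ampersand before "unsupported" is followed by white space, so it is treated as character data.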

Paragraph 2

The second paragraph is a little more complex. It starts and ends with tags that are missing the tag name. In SGML terms, these are respectively known as empty start-tags and empty end-tags.

<>This exploits some known bugs in <a href=http://validator.w3.org/
  to both help prevent cheaters and confuse even the most experienced
  authors.</>

According to the rules of SGML, an empty start-tag represents the same element as the most recently opened element within the tree. There is a condition in SGML that changes this rule, but that small detail will be omitted because it is not relevant to HTML. For a full explanation, see section 9.3.1 Empty Tags in Martin Bryan’s SGML and HTML Explained.

Because the end-tag for the first paragraph was omitted (recall that I said it was important), the paragraph element is still open, and thus the empty start-tag is recognised as a paragraph start-tag as well. As I stated previously, this start-tag also implies the end-tag for the previous paragraph. The empty end-tag simply closes the most recently opened element within the tree, which is this paragraph.

The a element in this paragraph also looks invalid because it appears to be missing both the TAGC and an end-tag, yet neither is really missing and it is not invalid. However, the markup does not mean what it appears to mean at first glance. As explained above for the SHORTTAG NET syntax used for the title element, the first solidus is the NET enabling start-tag close delimiter and the second is the null end-tag. These are processed in this way because the attribute value is not quoted. If it were quoted, the solidus would not represent the NET syntax.

The markup for this section is equivalent to this:

<p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
   to both help prevent cheaters and confuse even the most experienced
   authors.</p>
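
The way empty start-tags and end-tags are resolved can be modelled with a stack of open elements. The following Python sketch assumes the markup has already been split into tokens, follows only the rule described above (an empty start-tag repeats the most recently opened element that is still open), and uses a small hard-coded set of elements that cannot contain themselves in place of a real DTD. The resolve_empty_tags helper and the NO_SELF_NESTING set are hypothetical names for this illustration.

```python
# Elements whose start-tag implies the end of an open element of the same type.
NO_SELF_NESTING = {"p", "li", "tr", "td"}


def resolve_empty_tags(tokens):
    """Replace <> and </> with explicit tags, adding implied end-tags."""
    open_elements = []
    output = []

    def start(name):
        # A new p directly inside an open p first closes the open one.
        if open_elements and open_elements[-1] == name and name in NO_SELF_NESTING:
            output.append(f"</{open_elements.pop()}>")
        open_elements.append(name)
        output.append(f"<{name}>")

    for token in tokens:
        if token == "<>":
            start(open_elements[-1])  # same element as the most recently opened
        elif token == "</>":
            output.append(f"</{open_elements.pop()}>")  # closes the innermost element
        elif token.startswith("</"):
            name = token[2:-1]
            # Close open elements until the named one has been closed.
            while open_elements:
                closed = open_elements.pop()
                output.append(f"</{closed}>")
                if closed == name:
                    break
        elif token.startswith("<"):
            start(token[1:-1])
        else:
            output.append(token)  # character data
    return output


paragraphs = resolve_empty_tags(["<p>", "part 1", "<>", "part 2", "</>"])
```

For the two quiz paragraphs this yields an explicit end-tag for the first p, a second p start-tag, and a final p end-tag, matching the equivalent markup shown above.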

The Form

This section is quite easy, given that most of the concepts used have been covered earlier. To make it easier to explain, I’ve indented the lines a little, but the markup is otherwise unchanged.

<form method="get" action="http://validator.w3.org/check"
  <table
      <tr
        <td<input text checked id=uri name=uri size=40/>
        <><label for=uri>Is this test too hard?</label></>
      <><td<button button>Don't Cheat!</>
  </tbody
 ></table>

The start-tags for the form, table, tr and td elements are unclosed start-tags again. The empty start-tags and end-tags should be fairly self explanatory.

The tbody element, like the head and body elements, is missing its start-tag but not its end-tag. The tbody end-tag is not, in this case, an unclosed tag, because it is closed on the following line with the TAGC delimiter just before the table end-tag.

The attributes for the input and button elements make use of a feature called attribute minimisation. Contrary to popular belief, attribute minimisation allows the omission of the attribute name, not the value, where the attribute value may be unambiguously associated with a particular attribute. The text attribute is not actually a text attribute. It is a value that may be unambiguously associated with the type attribute and is, therefore, the minimised form of type="text". The checked attribute is more commonly known as the short form of checked="checked". This is exactly the same for the button element, where the value button is unambiguously associated with type="button".

The input element also uses a net enabling start-tag. Because it is an empty element, for which end-tags are forbidden, it does not need the null end-tag to be present. The net enabling start-tag is also followed by a greater than symbol >, which is designed to make it look like XHTML syntax. However, because the start-tag and element are ended with the net enabling start-tag close delimiter, the greater than symbol actually follows the element, and should be treated as character data, not markup.

Finally, the required end-tag for the form element is not in this section, but it does appear later in the document, which will be discussed when we get to it. The markup for this section is equivalent to:

<form method="get" action="http://validator.w3.org/check">
  <table>
    <tbody>
      <tr>
        <td><input type="text" checked="checked" id="uri"
                   name="uri" size="40">&gt;</td>
        <td><label for="uri">Is this test too hard?</label></td>
      </tr>
      <tr>
        <td><button type="button">Don't Cheat!</button></td>
      </tr>
    </tbody>
  </table>
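
Attribute minimisation amounts to a reverse lookup from a value to the one attribute that declares it. The Python sketch below illustrates this with a hand-copied subset of the enumerated attribute values from the HTML 4.01 DTD for input and button. The ENUMERATED table is incomplete and the expand_minimised helper is a hypothetical name, so treat it as an illustration rather than a validator.

```python
# Subset of enumerated attribute values for two elements (HTML 4.01 DTD).
ENUMERATED = {
    "input": {
        "type": {"text", "password", "checkbox", "radio", "submit",
                 "reset", "file", "hidden", "image", "button"},
        "checked": {"checked"},
        "disabled": {"disabled"},
        "readonly": {"readonly"},
    },
    "button": {
        "type": {"button", "submit", "reset"},
        "disabled": {"disabled"},
    },
}


def expand_minimised(element: str, value: str) -> str:
    """Return name="value" for a bare value that identifies a single attribute."""
    matches = [
        name
        for name, values in ENUMERATED[element].items()
        if value.lower() in values
    ]
    if len(matches) != 1:
        raise ValueError(f"{value!r} is not an unambiguous value for {element}")
    return f'{matches[0]}="{value}"'


expanded = [
    expand_minimised("input", "text"),
    expand_minimised("input", "checked"),
    expand_minimised("button", "button"),
]
```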

The List

The list is made up of several sections. First, some list items, followed by a processing instruction and lastly a really complicated comment declaration containing what appears to be invalid markup.

<ul/
<li><![CDATA[
<li Oops<!-- ?]]> -->
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ---->
<blink>I'll bet this is &#147;annoying&#148;!</blink>
<p align="right">Remeber, it's a Strict DOCTYPE!
<!-- ------ Don't give up now! ----- >
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<p>Is the error here --><li>?/

The ul element starts with net enabling start-tag, as discussed previously. The null end-tag appears at the very end of this section.

The first list item contains a CDATA section. Many people know this syntax from XML, but it is also valid in SGML, though it is unsupported in most browsers. Opera 8 is the only browser I know of that supports it for HTML. The CDATA section means that its content (everything up to the CDATA section end ]]>) should be treated as character data.

Thus, the second list item, which looks like an invalid unclosed start-tag containing an invalid attribute, is not really markup. It is character data that should be output as such. If it were markup, it would be an error because “Oops” is not a valid attribute or value. For the same reason, the comment declaration surrounding the CDATA section end is not really a comment. Because no markup is recognised, it’s also treated as character data and output as text.

The second real list item (following the CDATA section) contains an unencoded less than symbol. In most cases, this should be encoded as &lt;. While in XHTML, this is compulsory, in HTML (like for the unencoded ampersand in paragraph 1) there are circumstances where it is valid to leave it unencoded.

The processing instruction <?hello ... ?> is also valid. These may appear almost anywhere within an SGML document. Although this PI has no defined meaning, it does not affect validation in any way.

The big comment declaration is actually fairly complicated, but easy to understand with a basic understanding of SGML comment syntax. If you recall the discussion of the comment syntax above in the DOCTYPE section, you’ll remember that a comment starts and ends with matching pairs of hyphens. By counting the number of hyphen pairs, you should notice that the blink element, the p elements and the meta element are actually commented out.

Within the commented out blink element, there are also some invalid numeric character references used. The decimal code points 147 and 148 are actually Windows-1252 code points which are commonly used and supported in web browsers. However, because numeric character references are supposed to use Unicode code points and these are control characters, these character references are invalid.
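
The difference between the two interpretations is easy to see in Python, which ships both the Unicode character database and a Windows-1252 codec. The describe_reference helper below is a hypothetical name used only for this sketch, and it is only meaningful for code points in the 128 to 159 range that Windows-1252 actually assigns.

```python
import unicodedata


def describe_reference(code_point: int) -> dict:
    """Compare the Unicode and Windows-1252 readings of a numeric reference."""
    literal = chr(code_point)                         # what the reference really names
    intended = bytes([code_point]).decode("cp1252")   # what the author meant
    return {
        "is_control": unicodedata.category(literal) == "Cc",
        "intended": intended,
        "correct_reference": f"&#{ord(intended)};",
    }


opening = describe_reference(147)
closing = describe_reference(148)
```

Both code points are C1 control characters in Unicode, while the intended curly quotes live at U+201C and U+201D.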

The whole comment is actually closed on the last line of this section, just before a new list item is opened. This is valid because the unordered list has not yet been closed. The final list item simply contains a question mark, and is immediately followed by the null end-tag for the ul element.

The markup for this section is equivalent to this:


<ul>
  <li>
  &lt;li Oops&lt;!-- ? --&gt;</li>
  <li>There are &lt; 2 validation errors in this document</li>
  <?hello comment="What's this doing here?"?>
  <!--- Found the error yet? ...
    <p>Is the error here --> <li>?</li>
</ul>

Paragraph 3

The final paragraph, which is also the location of the only validation error within the document, should be extremely easy to understand, if you successfully read and understood the entire explanation so far.

<p/>The question is: Is this<br>HTML or<br/>XHTML
   served as text/html? &smile</></></>

The p element contains a net enabling start-tag, followed by a greater than symbol. Again this was designed to look like XHTML’s empty element syntax, but it is not.

The two br elements are actually both valid HTML, despite the second appearing to use XHTML empty element syntax also. The second br element is closed by the net enabling start-tag and also followed by a greater than symbol.

The p element is actually closed by the solidus in text/html. Thus, everything following the solidus is outside of the p element and a direct child of the form element; which, as you should recall, is still open.

The entity reference &smile is missing the reference close (REFC) delimiter (semi-colon ;). This is valid in HTML where the entity reference is followed by any non-name character. However, because the entity declaration was removed from the DOCTYPE above due to lack of browser support, it will be replaced with the real white smiling face character.

The first empty end-tag is for the form element, which is still open. The second is for the body element and the third is for the html element.

The markup for this section is equivalent to this:

    <p>&gt;The question is: Is this<br>HTML or<br>&gt;XHTML
       served as text</p>html? ☺
  </form>
</body>
</html>

Putting it All Together

After we combine each of the newly marked up sections using the commonly supported syntax, we end up with the entire document looking like this:

<?xml version="1.0" comment="Find the Error!" ?>
<!--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
  <title>validation quiz</title>
</head>
<body>
  <p>In this document, there &exist;s a single validation error.
     It makes use of some <strong><em>very</em></strong> uncommon
     &amp; unsupported markup techiques designed to fool the faint
     hearted.</p>
  <p>This exploits some known bugs in <a href="http:"></a>validator.w3.org/
     to both help prevent cheaters and confuse even the most experienced
     authors.</p>

  <form method="get" action="http://validator.w3.org/check">
    <table>
      <tbody>
        <tr>
          <td><input type="text" checked="checked" id="uri"
                     name="uri" size="40">&gt;</td>
          <td><label for="uri">Is this test too hard?</label></td>
        </tr>
        <tr>
          <td><button type="button">Don't Cheat!</button></td>
        </tr>
      </tbody>
    </table>

    <ul>
      <li>
      &lt;li Oops&lt;!-- ? --&gt;</li>
      <li>There are &lt; 2 validation errors in this document</li>
      <?hello comment="What's this doing here?"?>
      <!--- Found the error yet? ...
        <p>Is the error here -->
      <li>?</li>
    </ul>

    <p>&gt;The question is: Is this<br>HTML or<br>&gt;XHTML
       served as text</p>html? ☺
  </form>
</body>
</html>

Well, that’s it. If you have any questions or need clarification for anything I’ve discussed, please don’t hesitate to ask me.

Validation Quiz

Let’s say you’ve been writing HTML and XHTML for years. Being a standards activist, you always write well-formed, valid markup. You meticulously validate every document you write. Not only that, but you’ve installed the web developer toolbar in Firefox or Mozilla and, as a hobby, you run the validator on every site you visit. With years of experience under your belt, you think you can handle any error the validator throws at you, and you’re confident you can fix whatever it is in under a minute.

If that description fits you, then I hereby challenge you to find the one and only real validation error within the following sample HTML or XHTML document (I’m not telling you which; you figure it out). Do you think the validator will help? Go ahead and test it! I’ve exploited some known bugs in the validator to ensure you can’t cheat quite so easily. The validator will, in its current state, issue 80 errors, none of which are real!

There is one, and only one, true validation error within this document. The first person to comment with the correct answer and explanation will be featured in a follow-up post to give them some recognition for their hard work. Feel free to discuss and ask questions here in the comments (or wherever else you like). This is designed to be a fun exercise for you to realise just how much you really don’t know about HTML and/or XHTML.

Do you think you’re ready to take the quiz? Do you think this will be a walk in the park, and you’ll be the first across the line with the right answer? OK, here it is, and remember, have fun!

Assume the HTTP headers contain: Content-Type: text/html;charset=UTF-8

<?xml version="1.0" comment="Find the Error!" ?>
<!-- -- -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- -- -->
<!doctype html public "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY smile   CDATA "☺" -- U+263A WHITE SMILING FACE -->
]>
<html lang="en">
<title/validation quiz/
</head>
<p>In this document, there &exist;s a single validation error.  It makes
use of some <strong<em/very/</strong> uncommon & unsupported markup techiques
designed to fool the faint hearted.
<>This exploits some known bugs in <a href=http://validator.w3.org/
to both help prevent cheaters and confuse even the most experienced
authors.</>
<form method="get" action="http://validator.w3.org/check"
<table
<tr
<td<input text checked id=uri name=uri size=40/>
<><label for=uri>Is this test too hard?</label></>
<><td<button button>Don't Cheat!</>
</tbody
></table>
<ul/
<li><![CDATA[
<li Oops<!-- ?]]> -->
<li>There are < 2 validation errors in this document</li>
<?hello comment="What's this doing here?"?>
<!--- Found the error yet? ---->
<blink>I'll bet this is &#147;annoying&#148;!</blink>
<p align="right">Remeber, it's a Strict DOCTYPE!
<!-- ------ Don't give up now! ----- >
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<p>Is the error here --><li>?/
<p/>The question is: Is this<br>HTML or<br/>XHTML
served as text/html? &smile</></></>