All posts by Lachlan Hunt

HTML Comments in Scripts

When including inline scripts within HTML markup, it is common practice to surround the entire script with what appears to be an HTML comment, like this:

Example 1:

<script type="text/javascript"><!--
    ...
//--></script> 

This technique is discussed in HTML 4.01, section 18.3.2 Hiding script data from user agents. However, few people really understand its purpose, what it really is, the problems it creates within XHTML documents, or the theoretical problems it creates for HTML.

In this article, the term legacy user agents (or legacy UAs, or equivalent) refers only to user agents implemented before the script and style elements were introduced as placeholders in HTML 3.2.

Note: Although this document focuses primarily on the script element, the concepts presented apply equally to the style element.

Purpose of the Comment

The purpose of the comment is to allow documents to degrade gracefully in legacy user agents by preventing them from outputting the script as content for the user to read. Since HTML user agents output the content of any unknown elements and because it would be unwise to do so for a script, the use of the comment is designed to ensure that this does not occur within legacy user agents.

It should be noted that there are no user agents in use today that don’t support the script element (regardless of whether they support the actual script or not), so using this technique on the web today seems rather superfluous. Yet it’s interesting to note that so many sites still make use of this old technique that was designed for accessibility reasons, despite the fact that so many of these sites choke in many other ways in browsers with scripts disabled or unsupported.

It’s Not Really a Comment

The content model of the script element in HTML is declared as CDATA, which stands for character data and means that the content within the element is not processed as markup, but as plain text. The only piece of markup which is recognised is the end-tag open (ETAGO) delimiter: </. Where the ETAGO occurs, it must be for the element’s end-tag ( </script> in this case). This is actually the cause of a really common validation error in scripts for people that use the document.write() function or the innerHTML property.
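The usual way around that validation error is to make sure the two characters </ never appear together inside the script, typically by writing <\/ in string literals. Here is a minimal sketch of that idea; toInlineSafeLiteral is a hypothetical helper name of my own, not something from HTML 4.01 or any library.

```javascript
// Hypothetical helper (not from the article): turns markup into a JavaScript
// string literal that can be pasted into an inline HTML script element.
function toInlineSafeLiteral(markup) {
  // JSON.stringify produces a valid double-quoted JavaScript string literal.
  const quoted = JSON.stringify(markup);
  // Writing "</" as "<\/" hides the ETAGO from the HTML parser; the script
  // engine reads "\/" as a plain "/".
  return quoted.replace(/<\//g, '<\\/');
}

const literal = toInlineSafeLiteral('<p>Hello</p>');
console.log('document.write(' + literal + ');');
// prints: document.write("<p>Hello<\/p>");
```

The markup that the script eventually writes is unchanged; only the way it is spelled in the source differs.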

Because no other markup is recognised, the comment declaration is not really a comment, but rather plain text that looks like a comment. It’s designed for legacy UAs that don’t read the DTD and are, therefore, unaware that the script element actually contains CDATA. Because it is designed for backwards compatibility with legacy UAs that process it as markup, there are certain consequences to consider.

There are two small theoretical problems which most people are unaware of but, since no browser has ever been a strictly conforming SGML parser, neither is of any practical concern. As you will see, it is in fact the bugs in the legacy UAs for which this technique is designed that ensure it always works as intended.

Legacy browsers encountering the unknown script element may or may not know its content model, depending on whether they have read the DTD or obtained the information from elsewhere.

In the case of the unknown content model, the parser would treat the element’s content as markup and hide the comment as expected, in most cases. However, there may be a problem caused by the presence of two hyphens within the script that is contained within the comment. Consider the following example:

Example 2:

<script type="text/javascript"><!--
 var i;
 for (i = 10; i > 0; i--) {
 // do something
 }
 //--></script>

Although that is perfectly valid HTML 4.01, given that it is not really a comment, it creates a problem for a legacy user agent that processes it as a comment. A comment delimiter in SGML is a pair of hyphens (--), and only white space may occur between the second comment delimiter and the markup declaration close (MDC) delimiter: >. See the WDG’s explanation of HTML comments for more information.

For an SGML parser treating the element’s content as markup, the invalid comment syntax may potentially cause a problem and result in part of the script being output. As we will see later, this example will actually cause a fatal error in XHTML documents.
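If you want to check a script for this before wrapping it in the pseudo-comment, something along these lines would do. It is only a sketch: findDoubleHyphens is a hypothetical helper of my own, and it does a plain text search without understanding JavaScript syntax.

```javascript
// Hypothetical lint helper (not from the article): returns the offset of every
// "--" in a script body that is about to be wrapped in a pseudo-comment.
function findDoubleHyphens(scriptBody) {
  const offsets = [];
  let index = scriptBody.indexOf('--');
  while (index !== -1) {
    offsets.push(index);
    // Skip past this pair so "----" counts as two delimiters, not three.
    index = scriptBody.indexOf('--', index + 2);
  }
  return offsets;
}

// The body of Example 2, without the surrounding pseudo-comment.
const example2Body = [
  'var i;',
  'for (i = 10; i > 0; i--) {',
  '  // do something',
  '}',
].join('\n');

// Writing the decrement as "i -= 1" keeps the body free of comment delimiters.
const rewrittenBody = example2Body.replace('i--', 'i -= 1');

console.log(findDoubleHyphens(example2Body).length); // 1
console.log(findDoubleHyphens(rewrittenBody).length); // 0
```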

In the case of the known content model, since the parser is aware that the content model is CDATA, yet the script element is still an unknown element for the UA, it would process it as such and actually end up outputting the entire content of the element as text, thus defeating the purpose of attempting to hide the script.

For backwards compatibility, the technique depends on legacy UAs processing the element’s content as markup in order to hide it. For the above reasons, however, it is not backwards compatible in all cases with any hypothetical legacy user agent using a strictly conforming SGML parser. In practice, there are no browsers that read the DTD and, due to the bugs in all real legacy browsers, neither of these issues has ever caused any real-world problems in HTML. Nor will they cause any problems in the future, since all future implementations will support the script element.

XHTML Problems

This technique does in fact cause real problems for XHTML documents that many authors are unaware of. It seems that Microsoft have fallen into this trap with their new Visual Web Developer 2005 Express application, as discussed by Charl van Niekerk in ASP.NET 2.0 – Part 2. It also seems that the Movable Type developers have made this mistake too, as Jacques Distler pointed out recently.

In XHTML, the content model of the script element is declared as #PCDATA (parsed character data), not CDATA, thus the content of the script element is supposed to be parsed as markup and the comment declaration really is a comment. Because of this, XHTML UAs will (when the document is served as XML) ignore the content of the comment, and thus ignore the script entirely.

However, because the HTML script element contains CDATA and user agents treat XHTML documents served as text/html as HTML, they also treat the content of the script element as CDATA, and thus the comment is not treated as a comment by HTML 4 UAs. This is one of the many problems with serving XHTML as text/html.

As I mentioned earlier, example 2, which contains the extra pair of hyphens, will actually cause a fatal error in XHTML. In SGML, if it were a real comment, it would also be invalid, yet it would not be fatal since UAs employ error handling techniques to continue processing the document. In XML, however, it is a well-formedness error, which is fatal. Thus not only would the script be ignored because of the comment, but the entire document would be rendered totally useless for the user.
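To make the difference concrete, here is a rough model of what an XML parser does with the content of such a script element. It is a sketch under stated assumptions, not a real XML parser: xmlTextAfterComments is a hypothetical function, it only understands comments, and run() stands in for the elided script body of Example 1.

```javascript
// Simplified model (an assumption, not a real XML parser): returns the text an
// XML parser would leave in a script element once comments are removed, and
// throws where a real parser would report a well-formedness error.
function xmlTextAfterComments(content) {
  let text = '';
  let position = 0;
  while (position < content.length) {
    const open = content.indexOf('<!--', position);
    if (open === -1) {
      text += content.slice(position);
      break;
    }
    text += content.slice(position, open);
    // In XML the first "--" after the opening must be the start of "-->".
    const firstPair = content.indexOf('--', open + 4);
    if (firstPair === -1) {
      throw new Error('not well-formed: unterminated comment');
    }
    if (content[firstPair + 2] !== '>') {
      throw new Error('not well-formed: "--" inside a comment');
    }
    position = firstPair + 3;
  }
  return text;
}

const example1 = '<!--\n    run();\n//-->';
const example2 = '<!--\nvar i;\nfor (i = 10; i > 0; i--) {\n}\n//-->';

// Example 1: the whole script disappears into the comment.
console.log(JSON.stringify(xmlTextAfterComments(example1))); // ""

// Example 2: the decrement operator makes the document not well-formed.
try {
  xmlTextAfterComments(example2);
} catch (error) {
  console.log(error.message);
}
```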

However, because most authors do incorrectly serve their XHTML documents as text/html, and the UAs parse it as HTML, authors are generally not aware of these issues. There are many other problems with using scripts for both HTML and XHTML, but those issues are out of scope for this article and best left for another day.

The Correct Method for XHTML

The correct way to use an inline script within XHTML is to escape it as character data using a CDATA section:

Example 3:

<script type="text/javascript"><![CDATA[
    var i = 5;
    if (i < 10) {
        // do something
    }
//]]></script> 

The CDATA section is necessary to ensure that special characters such as < and & are not treated as markup, which would result in a well-formedness error unless they were escaped as character references. The alternative is to encode such characters with references like &lt; and &amp;; however, this reduces the readability of the script and is not backwards compatible with HTML UAs when XHTML is incorrectly served as text/html.
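For comparison, this is roughly what the escaping alternative does to a script. The helper name escapeForXml and the sample source line are my own illustrations, and the sketch only covers the two characters mentioned above.

```javascript
// Hypothetical helper: escapes the two characters that XML would otherwise
// treat as markup inside a script element.
function escapeForXml(script) {
  // "&" must be replaced first, or the "&" in "&lt;" would be escaped again.
  return script.replace(/&/g, '&amp;').replace(/</g, '&lt;');
}

const source = 'if (i < 10 && ready) { start(); }';
console.log(escapeForXml(source));
// prints: if (i &lt; 10 &amp;&amp; ready) { start(); }
```

An XML parser turns the references back into < and & before the script engine sees them, but an HTML 4 UA hands the escaped text to the script engine as it is, which is why this form breaks when served as text/html.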

If your script doesn’t make use of either of those special characters, then the CDATA section is not necessary, but it’s a good habit to always include it anyway.

Backwards Compatibility

Ignoring the fact that XHTML should not be served as text/html, and accepting that it does happen in the real world, the CDATA section markup in the above XHTML script element results in a JavaScript syntax error when processed by an HTML 4 UA. This is because the only markup-like syntax that scripting engines allow is an SGML comment declaration at the beginning of the first line. In order to allow the script to be correctly recognised and processed by current HTML and XHTML UAs, while still hiding it from legacy UAs, a clever combination of comments and a CDATA section needs to be used.

Example 4:

<script type="text/javascript"><!--//--><![CDATA[//><!--
    ...
//--><!]]></script> 

An HTML UA which correctly treats the script element as CDATA is supposed to ignore everything following the comment declaration open delimiter on the first line; that is, everything after the first “<!--” should be ignored. An XHTML UA will treat it as a comment followed by a CDATA section to escape the entire script.

In summary, an HTML 4 UA will pass the entire content of the script element off to the JavaScript engine, which will quite happily ignore the first line of markup, whereas an XHTML UA will only pass everything between <![CDATA[ and ]]> (not inclusive). The end result is that both HTML and XHTML UAs treat the script as CDATA, as intended, with neither of them ignoring any of the actual script, while providing some level of backwards compatibility with legacy user agents.
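If you generate pages from a template, the wrapper from Example 4 can be produced by a small function. This is only a sketch: wrapInlineScript is a hypothetical helper, and the two checks are my own additions based on the rules described above (the ETAGO ends the element in HTML, and ]]> ends the CDATA section in XML).

```javascript
// Hypothetical helper: wraps a script body in the Example 4 pattern.
function wrapInlineScript(body) {
  // "]]>" would close the CDATA section early for an XML parser.
  if (body.includes(']]>')) {
    throw new Error('script body must not contain "]]>"');
  }
  // "</" followed by a letter is an end-tag open delimiter in HTML.
  if (/<\/[a-z]/i.test(body)) {
    throw new Error('script body must not contain an end tag such as "</p>"');
  }
  return [
    '<script type="text/javascript"><!--//--><![CDATA[//><!--',
    body,
    // Spelled with "\/" so this source can itself live in an inline script.
    '//--><!]]><\/script>',
  ].join('\n');
}

console.log(wrapInlineScript('var i = 5;'));
```

The first and last lines of what each kind of UA passes to the script engine begin with //, so the engine treats the leftover markup as ordinary JavaScript comments.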

This does, however, open up a whole new theoretical problem for any hypothetical legacy UA that is unaware of the script element and processes it as markup with a conforming SGML parser. When example 4 is processed as markup within SGML, it should be parsed identically to XML. So the unknown script element in a legacy user agent with a strictly conforming SGML parser would output the content of the script (within the CDATA section) as text. However, the only user agent I know of that supports CDATA sections in HTML documents is the new Opera 8. Thus, the above syntax is really only backwards compatible with the non-conformant parsing behaviour of real world legacy HTML UAs, which does not cause any practical problems.

Avoiding All of These Problems

To avoid all of these problems with scripts and the decision of whether or not to include the pseudo-comment declaration in HTML documents at all, the best solution is to always include scripts as external files. This has the advantage of not being unintentionally ignored by XHTML UAs, and not erroneously processed in any way by legacy HTML/SGML UAs, regardless of whether they are conforming or not. It also helps to better separate the markup from the script, and makes the script more easily reusable in multiple documents.

However, you must remember that simply solving all of these markup issues for your scripts in an XHTML document doesn’t necessarily mean that the script will work correctly for both XHTML and HTML when served with the correct MIME types. There are more issues, such as document.write() not working for XML, the need to use the namespace-aware DOM methods in XHTML, and many more related issues.
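As a small taste of the namespace-aware approach, the following sketch builds a paragraph with DOM calls rather than document.write(). The function name appendParagraph and its parameters are hypothetical; createElementNS, createTextNode and appendChild are the standard DOM methods.

```javascript
// The XHTML namespace, needed by createElementNS.
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Builds a paragraph with namespace-aware DOM calls instead of document.write().
// "doc" is any object implementing the DOM Document interface, and "parent" is
// the element that should receive the new paragraph.
function appendParagraph(doc, parent, text) {
  const paragraph = doc.createElementNS(XHTML_NS, 'p');
  paragraph.appendChild(doc.createTextNode(text));
  parent.appendChild(paragraph);
  return paragraph;
}
```

In a browser it would be called with the page's document object and whichever element should contain the new paragraph.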

Handy CSS

As I promised over two weeks ago, I’ve written some useful CSS that I’m going to share with you all. The first is used to identify links using relationships like rel="nofollow" and rel="tag", the second for identifying the language of a document being linked to, the third can be used either as a substitute for rel="external" on external links, or for identifying links to a particular site, and the fourth and final is Hixie’s Cat Attack Hack.

They’re not all new; some of you have probably seen or used similar variants before. However, that doesn’t make them any less useful, which is why I’m publishing them.

Relationships

The rel attribute is supposed to indicate a semantic link relationship from one resource to another. However, it has been very much abused lately for non-semantic, functional uses designed for a particular user agent (namely, search engines). The relationships I’m specifically referring to are Google’s nofollow, which I argued strongly against, and Technorati’s tag. (I know, I know, my current blog has them both because they’re embedded in WordPress by default with no easy way to turn them off, but they will both be getting removed later).

a[rel~="nofollow"]::after {
    content: "\2620";
    color: #933;
    font-size: x-small;
}
a[rel~="tag"]::after {
    content: url(http://www.technorati.com/favicon.ico);
}

These make use of the attribute selector for space separated lists of values. Any a element with a relationship containing those values will be matched. Links with the nofollow relationship will be followed by a dark red skull and crossbones (☠) and those with the tag relationship will be followed by the Technorati icon. For the adventurous, you may wish to convert the Technorati icon to a data: URI; but if you do, remember to either quote the data: URI or escape the special characters. See the CSS <uri> value for more information.

The nofollow style is a slight variation of the suggestion I found listed with the Nofollow Finder Greasemonkey script, and a slightly less irritating variation of Phil Ringnalda’s blinking lime.
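If the ~= matching rule is unfamiliar, it can be modelled in a few lines of script. This is a sketch for illustration only: matchesListWord is a hypothetical function, and it always compares case-sensitively, which is a simplification of what browsers do for some HTML attributes.

```javascript
// Emulates the matching rule behind [rel~="word"]: the attribute value is a
// whitespace-separated list and one item must equal the word exactly.
// Simplification: the comparison here is always case-sensitive.
function matchesListWord(attributeValue, word) {
  // An empty word, or one containing white space, never matches anything.
  if (word === '' || /\s/.test(word)) {
    return false;
  }
  return attributeValue.split(/\s+/).includes(word);
}

console.log(matchesListWord('external nofollow', 'nofollow')); // true
console.log(matchesListWord('nofollowed', 'nofollow')); // false
```

The second call shows why ~= is used here rather than a substring match: a value that merely contains the word as part of a longer one is not selected.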

Language

The hreflang attribute in HTML is used to indicate the language of the document being linked to; however, most user agents do absolutely nothing with it that is helpful for the user. So, basically, the only people that have used it in the past are people that actually care about semantics. Well, I’ve come up with a useful way to make use of it to identify links pointing to a document in a language I won’t be able to understand. Since I only understand English and because I don’t need to be explicitly told when a document is in English, those links won’t be indicated with this CSS.

:link[hreflang]:not([hreflang|=en])::after {
    content: " [" attr(hreflang) "]";
    font-size: xx-small;
    font-weight: 100;
    vertical-align: super;
}

Again, this makes use of attribute selectors, as well as the commonly used :link pseudo-class. The first attribute selector matches any link with that attribute, and the second matches when the attribute has a hyphen-separated list of values beginning with en. The negation pseudo-class, :not(…), is a functional notation taking a simple selector.

External Links

Unlike the previous examples, this one is designed for use in an author stylesheet, not a user stylesheet. Many people make use of the non-standard rel="external" relationship to indicate a link to an external site. However, adding that to each and every link is time consuming and unnecessary. This style rule will place a north-east arrow (↗) after any link on your site to an external site.
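The combined effect of the two attribute selectors and the negation can be modelled like this. It is a sketch only: both function names are hypothetical, null stands for a link with no hreflang attribute, and the :link (unvisited) condition is left out.

```javascript
// Emulates [hreflang|=prefix]: the value is exactly the prefix, or starts with
// the prefix immediately followed by a hyphen.
function matchesLanguagePrefix(value, prefix) {
  return value === prefix || value.startsWith(prefix + '-');
}

// Emulates [hreflang]:not([hreflang|=en]) for a link's hreflang value.
function getsLanguageLabel(hreflang) {
  return hreflang !== null && !matchesLanguagePrefix(hreflang, 'en');
}

console.log(getsLanguageLabel('fr')); // true
console.log(getsLanguageLabel('en-AU')); // false
console.log(getsLanguageLabel(null)); // false
```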

[href^="http://"]:not([href*="lachy.id.au"])::after {
    content: "\2197";
}

It works by selecting any link whose href attribute begins with "http://" but does not contain your domain name. It uses two separate attribute selectors so that the domain is recognised whether or not the URI uses the www. prefix. So, the above will not match either <a href="http://lachy.id.au…"> or <a href="http://www.lachy.id.au…">. If you want to actually highlight links to a certain site, then just remove the negation pseudo-class, leaving just the two attribute selectors.
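The same test, written out as script, makes the rule's limits easier to see. This is a sketch with a hypothetical function name, and example.com is just a placeholder address.

```javascript
// Emulates [href^="http://"]:not([href*="domain"]) for a single href value.
function isMarkedExternal(href, ownDomain) {
  return href.startsWith('http://') && !href.includes(ownDomain);
}

console.log(isMarkedExternal('http://example.com/', 'lachy.id.au')); // true
console.log(isMarkedExternal('http://www.lachy.id.au/log/', 'lachy.id.au')); // false
console.log(isMarkedExternal('/log/', 'lachy.id.au')); // false
console.log(isMarkedExternal('https://example.com/', 'lachy.id.au')); // false
```

The last call shows one limitation: because the selector only looks for "http://", an https: link is not marked. Also, since the domain test is a plain substring match, an external URI that happens to contain your domain name somewhere (in a query string, say) is not marked either.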

Hixie’s Cat Attack Hack

Anyone that regularly reads Hixie’s Natural Log will be aware of Hixie’s obsession with cats, and those of us using Firefox will have noticed something very strange happening with the autoscroll icon. When used, the normal autoscroll icon becomes extra large, there’s a picture of a cat in the background and the cursor becomes a pointer.

That trick, which I will call Hixie’s Cat Attack, makes use of the way Firefox inserts the autoscroll image as a direct descendant of the html element. This, which I will call Hixie’s Cat Attack Hack, simply reverses the effects of Hixie’s Cat Attack. This CSS, like the first two, should be placed in userContent.css in your profile directory.

html>img {
    width: 28px !important;
    height: 28px !important;
    background: none !important;
    cursor: default !important;
}