It is common practice, when including inline scripts within HTML markup, to
surround the entire script within what appears to be an HTML comment, like this:
Example 1:
<script type="text/javascript"><!--
...
//--></script>
This technique is discussed in HTML 4.01, section 18.3.2 Hiding
script data from user agents. However, few people really understand its purpose, what it
really is, the problems it creates within XHTML documents, or the theoretical
problems it creates for HTML.
In this article, the term legacy user agents, legacy UAs (or
equivalent) is used to refer only to user agents implemented before the script
and style elements were introduced as placeholders in HTML 3.2.
Note: Although this document will be primarily focussing on the script
element, the concepts presented apply equally to the style element as well.
Purpose of the Comment
The purpose of the comment is to allow documents to degrade gracefully in
legacy user agents by preventing them from outputting the script as content
for the user to read. Since HTML user agents output the content of any unknown
elements and because it would be unwise to do so for a script, the use of the
comment is designed to ensure that this does not occur within legacy user agents.
It should be noted that there are no user agents in use today that don’t support
the script element (regardless of whether they support the actual
script or not), so using this technique on the web today seems rather superfluous.
Yet it’s interesting to note that so many sites still make use of this old technique
that was designed for accessibility reasons, despite the fact that so many
of these sites choke in many other ways in browsers with scripts disabled or
unsupported.
The content model of the script element in HTML is declared as CDATA, which
stands for character data and means that the content within the element is not
processed as markup, but as plain text. The only piece of markup which is recognised
is the end-tag open (ETAGO) delimiter: </. Where the ETAGO occurs, it must
be for the element’s end-tag (</script> in this case). This is actually
the cause of a really common validation error in scripts for people that use
the document.write() function or the innerHTML property.
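The validation error mentioned above typically appears when a string passed to
document.write() contains an end tag, because the </ inside the string is an
ETAGO that does not belong to </script>. A common workaround is to break up the
delimiter with a backslash, which JavaScript ignores inside string literals. The
following is only a minimal sketch of that idea; the helper name escapeEtago is
hypothetical and exists purely for illustration.

```javascript
// Hypothetical helper (not part of any library): rewrites every "</" in a
// piece of script source as "<\/" so that an SGML parser no longer sees an
// ETAGO delimiter inside the script element.
function escapeEtago(scriptSource) {
  return scriptSource.replace(/<\//g, "<\\/");
}

// Inside a JavaScript string literal, "\/" is simply "/", so the escaped
// form yields exactly the same string at run time.
const plain = "<p>Hello</p>";
const escaped = "<p>Hello<\/p>";
const sameAtRunTime = plain === escaped;

// Source text as an author might write it inside an inline script.
const before = 'document.write("<p>Hello</p>");';
const after = escapeEtago(before);
```

Here, after holds the same statement with </p> written as <\/p>, which a
validator should no longer mistake for an end-tag.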
Because no other markup is recognised, the comment declaration is not really
a comment, but rather plain text that looks like a comment. It’s designed for
legacy UAs that don’t read the DTD and are, therefore, unaware that the script
element actually contains CDATA. Because it is designed for backwards compatibility
with legacy UAs processing it as markup, there are certain considerations that
should be made as a result of this.
There are two small theoretical problems which most people are unaware of
but, since no browser has ever been a strictly conforming SGML parser, neither
of them is of any practical concern. As you will see, it is in fact the bugs
in the legacy UAs for which this technique is designed that ensure it always
works as intended.
Legacy browsers encountering the unknown script element may or may
not know the content model, depending on whether or not they’ve read the DTD
or obtained the information from elsewhere.
In the case of the unknown content model, the parser would treat the element’s
content as markup and hide the comment as expected, in most cases. However,
there may be a problem caused by the presence of two hyphens within the script
that is contained within the comment. Consider the following example:
Example 2:
<script type="text/javascript"><!--
var i;
for (i = 10; i > 0; i--) {
// do something
}
//--></script>
Although that is perfectly valid HTML 4.01, given that it is not really a comment,
this creates a problem for a legacy user agent that processes it as a comment.
A comment delimiter in SGML is a pair of hyphens (--) and only white
space may occur between the second comment delimiter and the markup declaration
close (MDC) delimiter: >. See the WDG’s explanation of HTML comments for more
information.
For an SGML parser treating the element’s content as markup, the invalid comment
syntax may potentially cause a problem and result in part of the script being
output. As we will see later, this example will actually cause a fatal error
in XHTML documents.
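To make the SGML rule concrete, the following is a rough sketch of how a
strictly conforming parser might check a comment declaration: after <!, it
accepts only --...-- comments separated by white space until the closing >.
The function name isValidSgmlCommentDeclaration is an assumption made for
this illustration, and the check is deliberately simplified; it is not taken
from any real parser.

```javascript
// Simplified sketch: returns true when `text` is "<!" followed only by
// "--...--" comments and white space, and then the closing ">".
function isValidSgmlCommentDeclaration(text) {
  if (!text.startsWith("<!")) return false;
  let pos = 2;
  while (pos < text.length) {
    if (text.startsWith("--", pos)) {
      // A comment opens here; it runs to the next pair of hyphens.
      const close = text.indexOf("--", pos + 2);
      if (close === -1) return false; // comment never closed
      pos = close + 2;
    } else if (/\s/.test(text[pos])) {
      pos += 1; // white space between comments is allowed
    } else if (text[pos] === ">") {
      return pos === text.length - 1; // MDC must end the declaration
    } else {
      return false; // anything else outside a comment is invalid
    }
  }
  return false; // ran out of input before reaching ">"
}

// The pseudo-comment from example 2: the "--" in "i--" closes the comment
// early, leaving ") {" outside any comment.
const example2 =
  "<!--\nvar i;\nfor (i = 10; i > 0; i--) {\n// do something\n}\n//-->";

const simpleIsValid = isValidSgmlCommentDeclaration("<!-- a simple comment -->");
const example2IsValid = isValidSgmlCommentDeclaration(example2);
```

With these inputs the simple comment is accepted and example 2 is rejected,
which is the situation a conforming SGML parser would face.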
In the case of the known content model, since the parser is aware that the
content model is CDATA, yet the script element is still an unknown element for
the UA, it would process it as such and actually end up outputting the entire
content of the element as text, thus defeating the purpose of attempting to
hide the script.
Backwards compatibility depends on legacy UAs processing the element’s content
as markup in order to hide it. For the above reasons, this means the technique
is not backwards compatible, in all cases, with any hypothetical legacy user
agent using a strictly conforming SGML parser.
However, there are no browsers that do read the DTD and, due to the bugs in
all real legacy browsers, neither of these issues has ever caused any real
world problems in HTML. Nor will they cause any problem in the future,
since all future implementations will support the script element.
XHTML Problems
This technique does in fact cause real problems for XHTML documents that many
authors are unaware of. It seems that Microsoft have fallen into this trap with
their new Visual Web Developer 2005 Express application, as discussed by Charl
van Niekerk in ASP.NET 2.0 – Part 2. It also seems that the Movable
Type developers have made this mistake too, as Jacques
Distler pointed out recently.
In XHTML, the content model of the script element is declared as #PCDATA (parsed
character data), not CDATA, thus the content of the script element is supposed
to be parsed as markup and the comment declaration really is a comment. Because
of this, XHTML UAs will (when the document is served as XML) ignore the content
of the comment, and thus ignore the script entirely.
However, because the HTML script element contains CDATA and user agents treat
XHTML documents served as text/html as HTML, they also treat the content of
the script element as CDATA, and thus the comment is not treated as a comment
by HTML 4 UAs. This is one of the many problems with serving XHTML as text/html.
As I mentioned earlier, example 2, which contains the extra pair of hyphens,
will actually cause a fatal error in XHTML. In SGML, if it were a real comment,
it would also be invalid, yet it would not be fatal since UAs employ error handling
techniques to continue processing the document. In XML, however, it is a well-formedness
error, which is fatal. Thus not only would the script be ignored because of
the comment, but the entire document would be rendered totally useless for the
user.
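The XML rule is simple to state: the text of a comment must not contain a pair
of hyphens and must not end with a hyphen. The check below is a minimal sketch
of that rule applied to the body of example 2; the function name
isWellFormedXmlCommentBody is made up for this illustration and the code is not
a substitute for a real XML parser.

```javascript
// Sketch of the XML well-formedness rule for the text between "<!--" and
// "-->": no "--" anywhere inside, and no trailing "-".
function isWellFormedXmlCommentBody(body) {
  return !body.includes("--") && !body.endsWith("-");
}

// The text between "<!--" and "-->" in example 2.
const example2Body =
  "\nvar i;\nfor (i = 10; i > 0; i--) {\n// do something\n}\n//";

// The same loop written without a decrement operator.
const rewrittenBody =
  "\nvar i;\nfor (i = 10; i > 0; i -= 1) {\n// do something\n}\n//";

const example2Ok = isWellFormedXmlCommentBody(example2Body);
const rewrittenOk = isWellFormedXmlCommentBody(rewrittenBody);
```

Example 2 fails the check because of i--, while the rewritten loop passes. Note
that passing only avoids the fatal error; the script would still be ignored as
a comment by an XHTML UA.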
However, because most authors do incorrectly serve their XHTML documents as
text/html, and the UAs parse it as HTML, authors are generally not aware of
these issues. There are many other problems with using scripts for both HTML
and XHTML, but those issues are out of scope for this article and best left
for another day.
The Correct Method for XHTML
The correct way to use an inline script within XHTML is to escape it as character
data using a CDATA section:
Example 3:
<script type="text/javascript"><![CDATA[
var i = 5;
if (i < 10) {
// do something
}
//]]></script>
The CDATA section is necessary to ensure that special characters such as < and &
are not treated as markup, which would otherwise result in a well-formedness error
if they were not escaped as character references. The alternative is to encode
such characters with character references like &lt; and &amp;. However, that
reduces the readability of the script and is not backwards compatible
with HTML UAs when XHTML is incorrectly served as text/html.
If your script doesn’t make use of either of those special characters, then
the CDATA section is not necessary, but it’s a good habit to always
include it anyway.
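To see why the character reference alternative hurts readability, consider the
sketch below. It escapes the two special characters the way an author would have
to by hand when no CDATA section is used. The helper names escapeForXml and
needsEscaping are hypothetical, chosen only for this example.

```javascript
// Hypothetical helper: encodes the two characters that XML would otherwise
// treat as markup inside a script element. "&" must be replaced first so
// that the ampersands introduced by "&lt;" are not escaped again.
function escapeForXml(script) {
  return script.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical helper: reports whether a script contains either character,
// i.e. whether it needs a CDATA section or character references at all.
function needsEscaping(script) {
  return /[<&]/.test(script);
}

const original = "if (i < 10 && j) { run(); }";
const escaped = escapeForXml(original);
// escaped is now: if (i &lt; 10 &amp;&amp; j) { run(); }
```

An XML parser turns the references back into < and & before the script engine
sees them, but an HTML UA treating the element as CDATA passes &lt; and &amp;
through literally, which is why this form breaks when served as text/html.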
Backwards Compatibility
Ignoring the fact that XHTML should not be served as text/html and accepting
that it does happen in the real world, for an HTML 4 UA processing the above
XHTML script element, the CDATA section markup results in a JavaScript syntax
error, since the only markup-like syntax that scripting engines allow is an SGML
comment declaration at the beginning of the first line. In order to allow the
script to be correctly recognised and processed by current HTML and XHTML UAs,
while still hiding from legacy UAs, a clever combination of comments and a CDATA
section needs to be used.
Example 4:
<script type="text/javascript"><!--//--><![CDATA[//><!--
...
//--><!]]></script>
An HTML UA which correctly treats the script element as CDATA is
supposed to ignore everything following the comment declaration open delimiter
on the first line, i.e. everything after the first “<!--” should be
ignored. An XHTML UA will treat it as a comment followed by a CDATA section
to escape the entire script.
In summary, an HTML 4 UA will pass the entire content of the script element
off to the JavaScript engine, which will quite happily ignore the first line
of markup, whereas an XHTML UA will only pass everything between <![CDATA[
and ]]> (not inclusive). The end result is that both HTML and XHTML UAs treat
the script as CDATA, as intended, with neither of them ignoring any of the actual
script, while providing some level of backwards compatibility with legacy user
agents.
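The XHTML half of that summary can be sketched in a few lines. The code below
wraps a script in the delimiters from example 4, extracts what an XML parser
would hand to the script engine, and checks whether a given piece of text parses
as JavaScript. All three function names are invented for this illustration, and
the extraction is a plain string search, not real XML parsing.

```javascript
// Hypothetical helper: surrounds a script with the wrapper from example 4.
function wrapInlineScript(code) {
  return "<!--//--><![CDATA[//><!--\n" + code + "\n//--><!]]>";
}

// Hypothetical helper: returns the text between "<![CDATA[" and "]]>",
// which is what an XHTML UA would pass to the script engine.
function xmlScriptText(content) {
  const open = "<![CDATA[";
  const start = content.indexOf(open);
  if (start === -1) return null;
  const end = content.indexOf("]]>", start + open.length);
  if (end === -1) return null;
  return content.slice(start + open.length, end);
}

// Hypothetical helper: compiles (but never runs) the source to find out
// whether the engine accepts it.
function parsesAsJavaScript(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const wrapped = wrapInlineScript("var i = 5;");
const seenByXml = xmlScriptText(wrapped);

// What an HTML UA passes to the engine for example 3 (CDATA section only).
const example3SeenByHtml = "<![CDATA[\nvar i = 5;\n//]]>";

const xmlTextParses = parsesAsJavaScript(seenByXml);
const example3Parses = parsesAsJavaScript(example3SeenByHtml);
```

The text extracted for XML starts and ends with lines that begin with //, so
the leftover markup is hidden inside ordinary JavaScript comments, whereas the
bare CDATA section of example 3 is rejected as a syntax error when treated as
script text.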
This does, however, open up a whole new theoretical problem for any hypothetical
legacy UA that is unaware of the script element and processes it
as markup with a conforming SGML parser. When example 4 is
processed as markup within SGML, it should be parsed identically to XML.
So the unknown script element in a
legacy user agent with a strictly conforming SGML parser would output the
content of the script (within the CDATA section) as text. However,
the only user agent I know of that supports the CDATA section for HTML
documents is the new Opera 8. Thus, the above syntax is really only backwards
compatible with the non-conformant parsing behaviour of real world legacy
HTML UAs, which does not cause any practical problems.
Avoiding All of These Problems
To avoid all of these problems with scripts and the decision of whether or
not to include the pseudo-comment declaration in HTML documents at all, the
best solution is to always include scripts as external files. This has the advantage
of not being unintentionally ignored by XHTML UAs, and not erroneously processed
in any way by legacy HTML/SGML UAs, regardless of whether they are conforming
or not. It also helps to better separate the markup from the script, and makes
the script more easily reusable in multiple documents.
However, you must remember that simply solving all of these markup issues
for your scripts in an XHTML document doesn’t necessarily mean the script
will work correctly for both XHTML and HTML when served with the correct
MIME types. There are more issues, such as document.write() not working for
XML, the need to use the namespace aware DOM methods in XHTML and many more
related issues.
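As a brief illustration of the last point, the sketch below adds a paragraph
using the namespace aware createElementNS() method rather than
document.write(). The function name appendParagraph and its parameters are
assumptions made for this example; the document object is passed in explicitly
so the function does not depend on a browser global.

```javascript
// The XHTML namespace URI, required by the namespace aware DOM methods.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: creates a p element in the XHTML namespace, fills it
// with text and appends it to `parent`. `doc` is any DOM Document.
function appendParagraph(doc, parent, text) {
  const paragraph = doc.createElementNS(XHTML_NS, "p");
  paragraph.appendChild(doc.createTextNode(text));
  parent.appendChild(paragraph);
  return paragraph;
}
```

In a browser this might be called as appendParagraph(document, document.body,
"Hello"), which should behave the same whether the document was parsed as HTML
or as XML.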