One thing I come across frequently is incorrect terminology. I’ve written
about this topic once before (see HTML Tags) and others have discussed similar
topics as well, particularly relating to elements, attributes and tags. But
a more specific area that deserves a little more attention is the distinction
between the DOCTYPE, the XML declaration and the XML prolog and other things
within it.
The XML Prolog is the section at the beginning of an XML document which includes
everything that appears before the document’s root element. The XML declaration,
the DOCTYPE and any processing instructions or comments may all be a part of
it. The following figure illustrates this concept.
In fact, the XML Prolog is present in every XML document, though it may be empty because each of those items is optional in some circumstances.
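To make the idea concrete, here is a minimal sketch in Python that lists what a parser finds before the root element. It assumes the standard library's `xml.dom.minidom`; the sample document and the `prolog_kinds` helper are my own illustrations, not part of any specification. Note that minidom does not expose the XML declaration as a node, so it will not show up in the list even when it is present.

```python
from xml.dom import Node, minidom

# Hypothetical sample document, assembled from the snippets in this article.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/style/design"?>
<!-- This is a comment -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Example</title></head><body/></html>
"""

# Human-readable labels for the node types that may appear in the prolog.
KIND = {
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
    Node.DOCUMENT_TYPE_NODE: "doctype",
}


def prolog_kinds(xml_bytes):
    """Return the kinds of nodes that appear before the root element."""
    doc = minidom.parseString(xml_bytes)
    kinds = []
    for node in doc.childNodes:
        if node is doc.documentElement:
            break  # the prolog ends where the root element starts
        kinds.append(KIND.get(node.nodeType, "other"))
    return kinds


print(prolog_kinds(SAMPLE))
print(prolog_kinds(b"<root/>"))  # a document whose prolog is empty
```

The second call uses a document that consists of nothing but a root element, which is still well-formed: its prolog is simply empty.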
The XML Declaration
<?xml version="1.0" encoding="UTF-8"?>
The XML declaration, if present, must occur at the very beginning of the file. It may not be preceded by anything except for a possible Byte Order Mark (depending on the character encoding). It is mostly used to provide XML version information and to declare the character encoding of the document. There is another thing called the standalone document declaration; but since it’s rarely needed or used and its purpose is not easy to explain, just ignore it.
Presently, only XML 1.0 and XML 1.1 are defined. Either may be used, but
the decision should not be made lightly. Do not just use version="1.1" because
it is a higher version number. For most authors these days, version="1.0" should
be used. In fact, unless you have a specific reason that requires the use of
XML 1.1 features, you should stick with 1.0.
The encoding declaration, if present, must declare the encoding of the document. Authors may use any encoding supported by user agents, but are encouraged to use charsets registered with IANA (preferably UTF-8 or UTF-16). If the declaration is not present, the document must be encoded as UTF-8 or UTF-16 (unless it is specified by a higher level protocol, like HTTP).
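The rule above can be sketched roughly as follows. This is a simplified illustration, not a conforming implementation: the `guess_encoding` function and its regular expression are my own, it only handles declarations written in an ASCII-compatible encoding, it does not recognise UTF-32, and it ignores any encoding supplied by a higher level protocol such as HTTP.

```python
import codecs
import re

# Matches an XML declaration at the very start of the data and captures the
# value of the optional encoding declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[ \t\r\n]+version[ \t\r\n]*=[ \t\r\n]*["\'][^"\']*["\']'
    rb'(?:[ \t\r\n]+encoding[ \t\r\n]*=[ \t\r\n]*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)


def guess_encoding(data: bytes) -> str:
    """Guess a document's encoding from its BOM and encoding declaration."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # the declaration follows the BOM
    match = _XML_DECL.match(data)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    # No declaration (or no encoding in it): UTF-8 is the remaining default.
    return "UTF-8"


print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'))
print(guess_encoding(b"<root/>"))
```

A real parser does considerably more work here, so treat this only as a way of seeing how the Byte Order Mark, the encoding declaration and the default fit together.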
Processing Instructions
<?xml-stylesheet type="text/css" href="/style/design"?>
Processing Instructions are used to provide instructions to applications processing
the document. The example of the xml-stylesheet PI given in the above diagram
is used to instruct an application to apply a stylesheet to the document.
PIs can be used almost anywhere within the document, though only those that appear before the root element are considered part of the prolog.
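As a rough illustration of that distinction, the following Python sketch collects only the PIs that belong to the prolog. It again assumes the standard library's `xml.dom.minidom`; the `prolog_pis` helper and the `inside` and `after` PI targets in the sample are hypothetical names made up for the example.

```python
from xml.dom import Node, minidom


def prolog_pis(xml_bytes):
    """Return (target, data) pairs for PIs that appear before the root element."""
    doc = minidom.parseString(xml_bytes)
    found = []
    for node in doc.childNodes:
        if node is doc.documentElement:
            break  # anything from here on is outside the prolog
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
    return found


# One PI in the prolog, one inside the root element and one after it.
SAMPLE = (
    b'<?xml-stylesheet type="text/css" href="/style/design"?>'
    b"<root><?inside example?></root>"
    b"<?after example?>"
)

print(prolog_pis(SAMPLE))
```

Only the xml-stylesheet PI is reported. The other two are perfectly legal PIs, but they are not part of the prolog.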
Comments
<!-- This is a comment -->
Most people know what comments are, so there’s not much I need to say about them. However, like PIs, they’re only considered part of the prolog if they appear before the root element.
The Document Type Declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Many authors will have seen and used a DOCTYPE in their documents, although
there are still many who don’t. The DOCTYPE is used to reference a Document
Type Definition and is mostly used for validation purposes.
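For the curious, here is a small sketch of how the pieces of a DOCTYPE can be read back in Python. It assumes the standard library's `xml.dom.minidom`, and the `doctype_parts` helper is my own name for illustration. It only reads the identifiers: the parser behind minidom is non-validating, so nothing is checked against the DTD.

```python
from xml.dom import minidom


def doctype_parts(xml_bytes):
    """Return (name, public identifier, system identifier), or None if there is no DOCTYPE."""
    doctype = minidom.parseString(xml_bytes).doctype
    if doctype is None:
        return None
    return (doctype.name, doctype.publicId, doctype.systemId)


SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml"/>'
)

print(doctype_parts(SAMPLE))
print(doctype_parts(b"<root/>"))  # no DOCTYPE at all
```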
Many people know that using specific DOCTYPEs will trigger standards mode
in browsers, but this does not apply to XML documents. DOCTYPE sniffing only
applies to HTML documents (i.e. any document served as text/html). Browsers
have, thankfully, not introduced it into XML processing. Henri Sivonen explains
more about this in Activating the Right Layout Mode Using the Doctype Declaration.