When it comes to the web, one of the most important yet least understood concepts
is the media type of a file and, for text files, the character encoding. Raise
your hands now if you’ve ever been guilty of including the following meta element
(or equivalent) in an HTML or XHTML document:
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
Anyone who has ever created an HTML document and did not raise their hand
to that question is a liar: every single HTML author in the world has used
it. Today, I am going to explain what it does and does not do, and what you
should use instead.
HTTP Response Headers
HTTP response headers are sent along with every single HTTP response and contain
metadata about the file being sent. The response header contains a number of
header fields used to specify a variety of information such as the last-modified
date, content length, encoding information and, in particular, the Content-Type.
Each header field appears on a new line and takes the following format (white
space is optional):
Header-Field: value; parameter=parameter-value
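To make that format concrete, here is a minimal Python sketch that splits one header line into its name, value and parameters. The function name `parse_header_line` is made up for this illustration, and the splitting is deliberately naive: it assumes parameter values never contain a semicolon inside quotes, which real header parsers have to handle.

```python
def parse_header_line(line):
    """Split 'Name: value; param=pvalue' into (name, value, params).

    Naive illustration only: assumes no quoted semicolons in the value.
    """
    # The field name ends at the first colon; the rest is value + parameters.
    name, _, rest = line.partition(":")
    parts = [part.strip() for part in rest.split(";")]
    value = parts[0]
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, val = part.partition("=")
        # Parameter names are case-insensitive, so normalise them.
        params[key.strip().lower()] = val.strip().strip('"')
    return name.strip(), value, params


name, value, params = parse_header_line("Content-Type: text/html; charset=UTF-8")
print(name, value, params)
```

Because white space is optional, `text/html;charset=ISO-8859-1` and `text/html; charset=ISO-8859-1` yield the same result.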
There are various tools available for you to examine the HTTP headers sent
by your server, such as the Web Developer toolbar, the Live HTTP Headers
extension, Fiddler or an online tool like the W3C’s HTTP HEAD service.
What is Content-Type?
Content-Type is an HTTP header field that the server uses to specify, and the
browser uses to determine, what type of file has been sent, so that the browser
knows how to process it. The field value is a MIME type, preferably one
registered with IANA, followed by zero or more parameters.
For HTML documents, this value is text/html with an optional charset parameter.
Take a look at the meta element above and you will see the value of the content
attribute contains this MIME type and the charset parameter, separated by a
semicolon, which matches the format of the HTTP header field value. Thus,
the HTTP Content-Type header field should look something like this:
Content-Type: text/html; charset=UTF-8
Although the charset parameter is technically optional, it should always be
included, with a value that correctly identifies the document’s encoding.
The meta element in HTML has two attributes of interest in this case: http-equiv
and content. The http-equiv attribute, which was designed as a method to include
HTTP header information within the document, contains the name of the header
field and the content attribute contains its value.
The intention was that HTTP servers would use it to create real HTTP response
headers before sending the document, but in reality none (at least none that
I’m aware of) ever do this. It was not really intended for processing by user
agents on the client side, although the section on specifying the character
encoding does say that user agents should, in the absence of information from
a higher-level protocol, observe the meta element when determining the
character encoding.
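That order of precedence can be sketched in Python as follows. This is an illustration under stated assumptions, not how any particular browser works: `choose_encoding`, the `META_CHARSET` pattern and the 1024-byte scan window are all invented for this example, and the fallback of ISO-8859-1 is simply the HTTP default discussed later in this article.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern: finds charset=... inside a meta element.
# Real user agents use a considerably more involved prescan.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def choose_encoding(content_type_header, body, default="ISO-8859-1"):
    """Return (encoding, source), preferring HTTP over the meta element."""
    if content_type_header:
        # Let the standard library parse the header value and its parameters.
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset, "http"
    # No charset from the higher-level protocol: look at the document itself.
    match = META_CHARSET.search(body[:1024])  # assumed scan window
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return default.lower(), "default"


html = b'<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">'
print(choose_encoding("text/html; charset=UTF-8", html))
print(choose_encoding("text/html", html))
```

In the first call the HTTP header wins and the meta element is never consulted. In the second, the header carries no charset parameter, so the inline declaration is used.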
It is, however, not used by any user agent for determining any other HTTP
header information and thus including it for anything but Content-Type is
nothing short of completely useless, regardless of the examples given in the
HTML 4.01 recommendation.
The content Attribute
When the meta element is used to specify the Content-Type, its content attribute
includes both the media type and the charset parameter, yet browsers only ever
use it to determine the character encoding. Despite the popular misconception,
it is not used to determine the MIME type, because the MIME type needs to be
known before parsing of the file can begin and (as always) the information
specified by a higher-level protocol (like HTTP) takes precedence.
The Content-Type header is always included for HTML files sent over HTTP and
it must at least contain the MIME type: text/html. In the absence of this
header, HTTP provides some guidance on how to handle the response, but the file
will likely end up being treated as application/octet-stream, which typically
results in the user agent prompting the user for what to do with the file.
Therefore, regardless of the MIME type included within the meta element, the
MIME type used for HTML documents will always be text/html. (XHTML documents
served as text/html are considered to be HTML documents for the purpose of this
discussion.) This makes the practice of using the following within XHTML
documents completely useless for specifying the MIME type:
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
In fact, for XHTML served as XML, this meta element is not used at all, not
even for the character encoding. In such cases, XML rules apply and the encoding
is determined based on protocol information (e.g. HTTP headers), the XML
declaration or the Byte Order Mark.
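As a rough illustration of those XML rules, the following Python sketch checks the three sources in turn. The function name `xml_encoding` and the `XML_DECL` pattern are assumptions made for this example. It is simplified in several ways: it ignores UTF-32, it only reads XML declarations written in an ASCII-compatible encoding, and it falls back to UTF-8, which is XML's default when nothing else is declared.

```python
import codecs
import re

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_encoding(http_charset, body):
    """Guess the encoding of an XML document served as XML (simplified)."""
    # 1. Protocol information, e.g. the charset parameter of Content-Type.
    if http_charset:
        return http_charset.lower()
    # 2. A Byte Order Mark at the very start of the file.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. The encoding named in the XML declaration, if there is one.
    match = XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing declared: XML defaults to UTF-8.
    return "utf-8"


print(xml_encoding(None, b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'))
```

Note that nothing in this function looks for a meta element, which is the point: for XHTML served as XML it plays no part.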
Determining Character Encoding
As mentioned, browsers do make use of the meta element for determining the
encoding in HTML. However, when the document is served over HTTP, this is in
direct violation of HTTP/1.1 [RFC 2616], which specifies a default value of
ISO-8859-1 for text/* subtypes. That default is itself in violation of RFC 2046,
which specifies US-ASCII, but the discussion of this issue is best saved for
another post.
Additionally, for text/* subtypes, web intermediaries are allowed to transcode
the file (i.e., convert one character encoding to another). If such an
intermediary assumes the default encoding while a different one is declared
inline (which the intermediary would not parse), the results may not be good.
For these reasons, it is not recommended that inline encoding information be
relied upon in text/html.
(Interestingly, these same reasons apply to the use of text/xml, which is partly
why application/xml is recommended in favour of text/xml.)
Setting HTTP Headers
Although it may seem much easier to copy and paste the meta element
into every HTML document published, it is almost as trivial to configure
the server to send the correct HTTP headers. The method to do so will vary
depending on the server or server-side technology used, but specific information
can usually be found in the appropriate documentation. The W3C’s I18N activity
has provided a useful summary of how to specify the encoding information using
various servers and languages.
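As one example of a server-side technology, here is a minimal sketch of a Python WSGI application that sends the header itself. The names `app` and `serve`, the sample markup and port 8000 are arbitrary choices for this illustration. Your own server will have its own mechanism, so treat this as a demonstration of the idea rather than a recommended setup.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Serve one HTML page with its encoding declared in the HTTP header."""
    # Encode the body with the same encoding that the header declares.
    body = "<!DOCTYPE html><title>Example</title><p>caf\u00e9</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port=8000):
    """Run the application locally until interrupted."""
    with make_server("localhost", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and then inspecting the response with one of the header tools mentioned earlier should show the Content-Type header with its charset parameter, with no meta element in the document at all.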