Content-Type

When it comes to the web, one of the most important yet least understood concepts is the media type of a file and, for text files, the character encoding. Raise your hands now if you’ve ever been guilty of including the following meta element (or equivalent) in an HTML or XHTML document:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

Anyone who has ever created an HTML document and did not raise their hand to that question is a liar — every single HTML author in the world has used it and, today, I am going to explain what it does and does not do, and explain what you should use instead.

HTTP Response Headers

HTTP response headers are sent with every HTTP response and contain metadata about the file being sent. They consist of a number of header fields that carry information such as the last-modified date, content length, encoding information and, in particular, the Content-Type.

Each header field appears on a new line and takes the following format (white space is optional):

Header-Field: value; parameter=parameter-value
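To make the format concrete, here is a minimal Python sketch that splits one header line into its field name, value and parameters. The function name `parse_header_line` is made up for illustration. The parsing is deliberately naive: it assumes parameters are separated by plain semicolons and does not handle quoted values that themselves contain semicolons.

```python
def parse_header_line(line):
    """Split 'Name: value; param=val' into (name, value, params).

    Naive sketch: assumes no semicolons inside quoted parameter values.
    """
    # Everything before the first colon is the field name.
    name, _, rest = line.partition(":")
    parts = [p.strip() for p in rest.split(";")]
    value = parts[0]
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, val = part.partition("=")
        # Parameter names are case-insensitive, so normalise them.
        params[key.strip().lower()] = val.strip().strip('"')
    return name.strip(), value, params


name, value, params = parse_header_line("Content-Type: text/html; charset=UTF-8")
print(name, value, params)
```

Because white space is optional, the same function accepts `Content-Type:text/html;charset=UTF-8` and returns the same result.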

There are various tools available for you to examine the HTTP headers sent by your server, such as the Web Developer toolbar, the Live HTTP Headers extension, Fiddler or an online tool like the W3C’s HTTP HEAD service.

What is Content-Type?

Content-Type is an HTTP header field that the server uses to specify, and the browser uses to determine, what type of file has been sent, so that the browser knows how to process it. The field value is a MIME type, preferably one registered with IANA, followed by zero or more parameters.

For HTML documents, this value is text/html with an optional charset parameter. Take a look at the meta element above and you will see that the value of the content attribute contains this MIME type and the charset parameter, separated by a semicolon, which matches the format of the HTTP header field value. Thus, the HTTP Content-Type header field should look something like this:

Content-Type: text/html; charset=UTF-8

Although, technically, the charset parameter is optional, it should always be included correctly.

The Meta Element

The meta element in HTML has two attributes of interest in this case: http-equiv and content. The http-equiv attribute, which was designed as a method to include HTTP header information within the document, contains the name of the header field and the content attribute contains its value.

The intention was that HTTP servers would use it to create or set real HTTP response headers before sending the document, but in reality none (at least none that I’m aware of) ever do this. It was not really intended for processing by user agents on the client side, although the HTML 4.01 section on specifying the character encoding does say that user agents should, in the absence of information from a higher-level protocol, observe the meta element when determining the character encoding.

It is, however, not used by any user agent for determining any other HTTP header information and thus including it for anything but Content-Type is nothing short of completely useless, regardless of the examples given in the HTML 4.01 recommendation.

The content Attribute

When used for specifying the Content-Type, despite the fact that it includes both the media type and the charset parameter, it is only ever used by browsers to determine the character encoding. Despite the popular misconception, it is not used to determine the MIME type, as the MIME type needs to be known before parsing the file can begin and (as always) the information specified by a higher level protocol (like HTTP) takes precedence.
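The precedence described here can be sketched in a few lines of Python. This is a simplified illustration, not how any particular browser implements it: the names `pick_charset` and `charset_from_header`, the regular expression, the 1024-byte scan window and the fallback default are all assumptions made for the example. Note that the MIME type inside the meta element is never consulted.

```python
import re
from email.message import Message

# Loose pattern for a charset declared inside a meta element (illustrative only).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None when absent


def pick_charset(content_type, body, default="iso-8859-1"):
    """Prefer the HTTP header; fall back to the meta element, then a default."""
    header_charset = charset_from_header(content_type) if content_type else None
    if header_charset:
        return header_charset
    # Only look near the start of the document (the window size is an assumption).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = b'<meta http-equiv="Content-Type" content="text/html;charset=KOI8-R">'
print(pick_charset("text/html; charset=UTF-8", page))  # header wins
print(pick_charset("text/html", page))                 # meta is the fallback
```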

The Content-Type header is always included for HTML files sent over HTTP and it must at least contain the MIME type: text/html. In the absence of this header, the HTTP protocol provides some guidance on how to handle it, but it will likely end up being treated as application/octet-stream, which typically results in the user agent prompting the user for what to do with the file.

Therefore, regardless of the MIME type included within the meta element, the MIME type used for HTML documents will always be text/html. (XHTML documents served as text/html are considered to be HTML documents for the purpose of this discussion). This makes the practice of using the following within XHTML documents completely useless for specifying the MIME type:

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

In fact, for XHTML served as XML, this meta element is not used at all – not even for the character encoding. In such cases, XML rules apply and the encoding is determined based on protocol information (e.g. HTTP headers), the XML declaration or the Byte Order Mark.
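As a rough illustration of those XML rules, the Python sketch below checks the three sources in turn. The function name `xml_encoding` and the order of the checks (protocol information first, then the BOM, then the XML declaration) are assumptions made for the example. It ignores UTF-32 and only reads an XML declaration written in an ASCII-compatible encoding. It never looks at a meta element.

```python
import codecs
import re

# Byte Order Marks this sketch recognises (UTF-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# encoding="..." inside an XML declaration at the very start of the file.
XML_DECL = re.compile(
    rb"""^<\?xml[^>]*encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_encoding(data, protocol_charset=None):
    """Guess the encoding of an XML document from its first bytes."""
    if protocol_charset:                 # e.g. the charset from an HTTP header
        return protocol_charset.lower()
    for bom, name in BOMS:               # Byte Order Mark
        if data.startswith(bom):
            return name
    match = XML_DECL.match(data)         # XML declaration
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                       # XML's default when nothing is declared


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'
print(xml_encoding(doc))
print(xml_encoding(doc, protocol_charset="UTF-8"))
```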

Determining Character Encoding

As mentioned, browsers do make use of the meta element for determining the encoding in HTML. However, when the document is served over HTTP, this is in direct violation of HTTP/1.1 [RFC 2616], which specifies a default value of ISO-8859-1 for text/* subtypes. That default is itself in violation of RFC 2046, which specifies US-ASCII, but the discussion of this issue is best saved for another post.

Additionally, for text/* subtypes, web intermediaries are allowed to transcode the file (i.e., convert it from one character encoding to another). If the intermediary assumes the default encoding while a different one is declared inline (which such an intermediary would not parse), the document may end up being decoded incorrectly. For these reasons, it is not recommended that inline encoding information be relied upon in text/html. (Interestingly, these same reasons apply to the use of text/xml, which is partly why application/xml is recommended over text/xml.)
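The related risk of a stale inline declaration can be shown with a short Python sketch. The `transcode` helper is a hypothetical stand-in for an intermediary: it re-encodes the bytes, and we assume it would also update the HTTP header, but it leaves the meta element inside the document untouched.

```python
def transcode(body, src, dst):
    """Re-encode a body the way a transcoding intermediary might."""
    return body.decode(src).encode(dst)


html = (
    '<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">'
    "<p>caf\u00e9</p>"
)
original = html.encode("iso-8859-1")

# The intermediary converts the bytes to UTF-8 (and would update the header),
# but the meta element inside the document still claims ISO-8859-1.
delivered = transcode(original, "iso-8859-1", "utf-8")

via_header = delivered.decode("utf-8")     # trusting the updated HTTP header
via_meta = delivered.decode("iso-8859-1")  # trusting the stale meta element

print("caf\u00e9" in via_header)  # the text survives intact
print("caf\u00e9" in via_meta)    # the accented character is garbled
```

Decoding according to the header recovers the original text, while decoding according to the meta element turns the single accented character into two wrong ones.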

Setting HTTP Headers

Although it may seem much easier to copy and paste the meta element into every HTML document published, it is almost as trivial to configure the server to send the correct HTTP headers. The method to do so will vary depending on the server or server-side technology used, but specific information can usually be found in the appropriate documentation. The W3C’s I18N activity has provided a useful summary of how to specify the encoding information using various servers and languages.
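As one example of a server-side technology, here is a minimal sketch using Python's standard `http.server` module. The helper `content_type`, the `Handler` class, the `serve` function and the port number are illustrative choices, not a recommendation for production use. The point is only that the charset travels in the real HTTP header rather than in the document.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def content_type(mime="text/html", charset="UTF-8"):
    """Build a Content-Type header value with an explicit charset."""
    return f"{mime}; charset={charset}"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The body is encoded with the same charset the header announces.
        body = "<!DOCTYPE html><title>Hi</title><p>caf\u00e9</p>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type())
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve on localhost until interrupted (call this to try the sketch)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and inspecting the response with any of the header tools mentioned earlier should show the Content-Type field with its charset parameter.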

10 thoughts on “Content-Type”

  1. When you say that the proper HTTP header should be used instead of the meta element, do you mean that the element should in some way be considered harmful? That is, as long as the HTTP header and the meta element agree WRT the charset, is there any harm in including the element in the page?

  2. The only use I see for the http-equiv attribute is for ‘offline’ documents (who uses them?!).

    Opera writes down (and uses) a meta element into the head (to specify the charset) when you save a page (that hasn’t got one already). In this case it’s really an equivalent of the header itself, right?

  3. Andrew, the only problem with including the meta element as well as the correct HTTP header is that if the document is transcoded somewhere along the way, then the HTTP headers would be updated but the meta element wouldn’t be. In other words, you can’t guarantee with 100% certainty that they will match at the visitor’s end. On the other hand, very few intermediaries (if any) do transcode files, so in most cases it’s not harmful at all.

    Krijn, offline documents and documents served over a protocol that does not include encoding information are indeed use-cases for the meta element. For offline documents, however, it is unfortunate that current operating systems/file systems don’t store the encoding as external metadata attached to the file just like they store creation/modified dates, read-only attributes, etc.

    Although, another alternative to the meta element is to use UTF-8 and include the BOM, which allows for detection to occur in the first few bytes of the file.

  4. Although, technically, the charset parameter is optional, it should always be included correctly.

    This is only true for text/ types. For application/ types, it is much better to not include a charset parameter. In particular, you should always send XML documents with an application/xml or application/foo+xml type and no charset, so that the <?xml?> preamble inside the document can specify the encoding. (Sam Ruby’s postulate: The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.)

    Btw, I have in fact never ever used such a meta tag. I was fuzzy on the details until a bit over a year ago, but I always knew the rough semantics and thought it was a really ugly kludge. Turns out, it is: it was never supposed to be seen at the client end. The (hackneyed) idea when it was proposed was that the server would parse the file and actually send those headers with the response, rather than the client interpreting these tags and then retroactively pretending that the information was in the header.

  5. Aristotle, yes that is correct about the charset for application/* subtypes, but my statement was indeed said in relation to text/html, not for all media types in general.

    I find it hard to believe that you’ve really never used the meta element. Even if you only ever used it for the first few pages you made while learning about HTML and haven’t used it since, that still counts.

  6. Nope. :-) Never ever. I only learned about it a long time after I had learned HTML, and refused to use it from the start. Maybe it was due to the mix of expertise and naïveté; I already had some rough knowledge of HTTP (I started learning about webservers very early, and my first interest about the web has always been building dynamic sites) but I didn’t know a thing about charsets at the time (I thought one always must write &auml; instead of ä in HTML). So all I saw was an ugly kludge without any discernible purpose.

    Well, that’s still all I see…

  7. Well, Aristotle, you’re certainly the exception to the rule. Most learn about web servers and HTTP after they’ve learned HTML (I certainly did).

  8. So did I. (All free or even affordable hosting sucked something fierce in 1997, so I didn’t get to do much web programming for my first two years on the web. First large script that ran online was in 1999, on a gaming site. Heh.) Thinking about it, it may also have played a role that I learned from (very basic) tutorials instead of View Source.

    Not that I claimed to be anything but an outlier; just saying that these outliers do exist, contrary to your absolute claim. When I got to that point, I went, “hey now.” :-)

    Anyway, belabouring minor points aside, the article is great as always.

  9. Actually, there used to be a decent reason for including one. If I remember the details correctly (it’s been a while), the most reliable character encoding for making characters show up properly in Netscape 4.x was UTF-8, with the non-ASCII characters encoded directly rather than with numerical references. However, if you used that character encoding, there would be a visible flicker during the page load. If you included a meta element that matched the information you were providing in your HTTP headers, the flicker problem would go away.

    It’s a stupid thing to have to include, but it did no harm and fixed an annoying bug.
