Category Archives: Server-Side

Server-side scripting and configuration.

Hosting Plans

Well, it’s that time of year again and my web hosting needs renewing. What a pain having to go through all that hassle of forking out $US60 every year just so I can keep this site going.

Imagine. What if there were a plan that would be covered for life – well, actually, beyond just my life and well into the next millennium? What if there were a plan that I could pay for once, once only, and that would be it? What if I never needed to worry about hosting again (provided the company doesn’t collapse)?

Oh wait… There is! The lifetime hosting plans from A Small Orange (available for a limited time only). I’m going to be taking the lifetime medium plan: 1GB disk space, 25GB/month bandwidth. I’m currently on the small plan, but I figure this site will only grow. The once-off $US300 payment is equal to 5 years on the small plan or 2.5 years on the medium plan, which is nothing compared with the life of the site itself.

Content-Type

When it comes to the web, one of the most important yet least understood concepts is the media type of a file and, for text files, the character encoding. Raise your hands now if you’ve ever been guilty of including the following meta element (or equivalent) in an HTML or XHTML document:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

Anyone who has ever created an HTML document and did not raise their hand to that question is a liar — every single HTML author in the world has used it and, today, I am going to explain what it does and does not do, and explain what you should use instead.

HTTP Response Headers

HTTP response headers are sent along with every single HTTP response and contain metadata about the file being sent. The response header contains a number of header fields used to specify a variety of information such as the last modified date, content length, encoding information and, in particular, the Content-Type.

Each header field appears on a new line and takes the following format (white space is optional):

Header-Field: value; parameter=parameter-value

There are various tools available for you to examine the HTTP headers sent by your server, such as the Web Developer toolbar, the Live HTTP Headers extension, Fiddler or an online tool like the W3C’s HTTP HEAD service.

What is Content-Type?

Content-Type is an HTTP header field that is used by the server to specify, and by the browser to determine, what type of file has been sent and received, respectively, in order to know how to process it. The field value is a MIME type, preferably one registered with IANA, followed by zero or more parameters.

For HTML documents, this value is text/html with an optional charset parameter. Take a look at the meta element above and you will see the value of the content attribute contains this MIME type and the charset parameter, separated by a semi-colon, which matches the format of the HTTP header field value. Thus, the HTTP Content-Type header field should look something like this:

Content-Type: text/html; charset=UTF-8

Although the charset parameter is technically optional, it should always be included and it should always name the encoding the document is actually saved in.
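To make the header field format concrete, here is a minimal Python sketch that splits a header line into its name, value and parameters. The function name parse_header_field is hypothetical, and the parsing is deliberately naive: it assumes no semicolons appear inside quoted parameter values, which a real HTTP parser would have to handle.

```python
def parse_header_field(line):
    """Split 'Name: value; param=value' into (name, value, params).

    Illustrative only: quoted strings containing ';' are not supported.
    """
    name, _, rest = line.partition(":")
    parts = [part.strip() for part in rest.split(";")]
    value = parts[0]
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, val = part.partition("=")
        # Parameter names are case-insensitive; values may be quoted.
        params[key.strip().lower()] = val.strip().strip('"')
    return name.strip(), value, params


name, media_type, params = parse_header_field(
    "Content-Type: text/html; charset=UTF-8"
)
print(name, media_type, params.get("charset"))
```

The same function works on the content attribute of the meta element if you prefix it with a field name, since the value follows the same format.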

The Meta Element

The meta element in HTML has two attributes of interest in this case: http-equiv and content. The http-equiv attribute, which was designed as a method to include HTTP header information within the document, contains the name of the header field and the content attribute contains its value.

The intention was that it be used by HTTP servers to create/set real HTTP response headers prior to sending the document, but the reality is that there are none (at least none that I’m aware of) that ever do this. It was not really intended for processing by user agents on the client side, although the HTML 4.01 section on specifying the character encoding does say that user agents should, in the absence of information from a higher level protocol, observe the meta element when determining the character encoding.

It is, however, not used by any user agent for determining any other HTTP header information and thus including it for anything but Content-Type is nothing short of completely useless, regardless of the examples given in the HTML 4.01 recommendation.

The content Attribute

When used for specifying the Content-Type, despite the fact that it includes both the media type and the charset parameter, it is only ever used by browsers to determine the character encoding. Despite the popular misconception, it is not used to determine the MIME type, as the MIME type needs to be known before parsing the file can begin and (as always) the information specified by a higher level protocol (like HTTP) takes precedence.

The Content-Type header is always included for HTML files sent over HTTP and it must at least contain the MIME type: text/html. In the absence of this header, the HTTP protocol provides some guidance on how to handle it, but it will likely end up being treated as application/octet-stream, which typically results in the user agent prompting the user for what to do with the file.

Therefore, regardless of the MIME type included within the meta element, the MIME type used for HTML documents will always be text/html. (XHTML documents served as text/html are considered to be HTML documents for the purpose of this discussion). This makes the practice of using the following within XHTML documents completely useless for specifying the MIME type:

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

In fact, for XHTML served as XML, this meta element is not used at all – not even for the character encoding. In such cases, XML rules apply and the encoding is determined based on protocol information (e.g. HTTP headers), the XML declaration or the Byte Order Mark.
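As a rough illustration of those XML rules, the following Python sketch picks an encoding from the protocol information, the Byte Order Mark or the XML declaration, in that order. The function name and the exact precedence are assumptions made for the example, and it is simplified: it does not distinguish UTF-32 from UTF-16 and does not handle an XML declaration that is itself encoded in UTF-16.

```python
import re

# Matches the encoding pseudo-attribute of an ASCII-compatible XML declaration.
XML_DECL_ENCODING = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data, http_charset=None):
    """Guess the encoding of an XML document (simplified sketch)."""
    # 1. Information from a higher level protocol (e.g. HTTP) comes first.
    if http_charset:
        return http_charset
    # 2. A Byte Order Mark at the very start of the file.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    # 3. The encoding named in the XML declaration, if there is one.
    match = XML_DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 4. With no other information, XML is treated as UTF-8.
    return "UTF-8"


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'))
```

Note that nothing in this function ever looks at a meta element.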

Determining Character Encoding

As mentioned, browsers do make use of the meta element for determining the encoding in HTML. However, when the document is served over HTTP, this is in direct violation of the HTTP 1.1 protocol [RFC 2616] which specifies a default value of ISO-8859-1 for text/* subtypes. This too is in violation of RFC 2046, which specifies US-ASCII, but the discussion of this issue is best saved for another post.
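The order of precedence can be sketched in a few lines of Python. This is only an approximation of browser behaviour, not a description of any particular browser: the function name html_encoding is hypothetical, the 1024-byte scan limit is an assumption, and the regular expression only recognises a meta element whose http-equiv attribute comes before its content attribute.

```python
import re

HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._-]+)', re.IGNORECASE)
META_CHARSET = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?'
    rb'[^>]*?charset\s*=\s*([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)


def html_encoding(content_type_header, body):
    """Pick the encoding for a text/html response (simplified sketch)."""
    # 1. A charset parameter in the real HTTP header always wins.
    match = HEADER_CHARSET.search(content_type_header or "")
    if match:
        return match.group(1)
    # 2. Otherwise browsers fall back to the meta element near the top.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii")
    # 3. Otherwise the HTTP 1.1 default for text/* applies.
    return "ISO-8859-1"


page = b'<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">'
print(html_encoding("text/html; charset=UTF-8", page))  # header wins
print(html_encoding("text/html", page))                 # meta is consulted
```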

Additionally, for text/* subtypes, web intermediaries are allowed to transcode the file (i.e., convert one character encoding to another). If such an intermediary assumes the default encoding while a different one is declared inline (which the intermediary would not parse), the results may not be good. For these reasons, it is not recommended that inline encoding information be relied upon in text/html. (Interestingly, these same reasons apply to text/xml, which is partly why application/xml is recommended over text/xml.)

Setting HTTP Headers

Although it may seem much easier to copy and paste the meta element into every HTML document published, it is almost as trivial to configure the server to send the correct HTTP headers. The method to do so will vary depending on the server or server-side technology used, but specific information can usually be found in the appropriate documentation. The W3C’s I18N activity has provided a useful summary of how to specify the encoding information using various servers and languages.
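As one example of doing this from a server-side language, here is a minimal Python WSGI application that sends the Content-Type header with a charset parameter. It is only a sketch: the page content is made up, and your own server or framework will have its own way of setting the same header.

```python
def application(environ, start_response):
    """Minimal WSGI app that declares its encoding in a real HTTP header."""
    # Encode the body in the same encoding that the header declares.
    body = "<!DOCTYPE html><title>Caf\u00e9</title><p>na\u00efve</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# To try it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, application).serve_forever()
```

The important part is that the declared charset and the bytes actually sent agree. Declaring UTF-8 while sending ISO-8859-1 bytes is worse than declaring nothing.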

Handling Character Encodings

Anyone who’s ever written a form for user input and actually cares about ensuring the correct character encoding is submitted has had trouble with users submitting Windows-1252 where ISO-8859-1 was expected. Even if you were intelligent and were using a Unicode encoding like UTF-8 and accepting such input from your forms, there’s still a problem with Trackbacks, since you have no control over what encoding they’re sent in.

This is commonly ignored by implementations, which results in invalid characters being used within HTML, and you end up with a few question marks (commonly shown by browsers as U+FFFD, the Replacement Character) scattered around the text.

Now there is a solution. I’ve written some PHP to first detect the most likely encoding: UTF-8, ISO-8859-1 or Windows-1252. If it is UTF-8, nothing needs to be done with it. If it’s ISO-8859-1 or Windows-1252, we need to convert it to UTF-8.

Determining the Encoding

The first 3 functions I’ve written will allow you to determine what character encoding is used. These are isUTF8(), isISO88591() and isCP1252(), and each returns true if the string validates as the respective encoding. They work by using a regular expression that matches valid octet sequences for the encoding. The regular expression for UTF-8 was adapted from the Perl code provided by the W3C in an article about multilingual forms.

My version is a little more restrictive than that, in that it will reject any character with a code point from 128 to 159. Although these code points are valid in XML and can be validly encoded in UTF-8, they are Unicode control characters and they are invalid within HTML 4. Additionally, the chances of a user legitimately submitting those characters are slim to nil, so it’s better to reject them than try to convert them to something else.
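To give an idea of what such a check looks like, here is a Python sketch of the restrictive UTF-8 test. It is not the PHP source: the pattern follows the general shape of the W3C expression as I understand it, with the two-byte range split so that the sequences for code points 128 to 159 (C2 80 to C2 9F) are rejected. The name is_utf8 is chosen for the example.

```python
import re

VALID_UTF8 = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII: tab, LF, CR and printable
      | \xC2[\xA0-\xBF]                     # U+00A0 to U+00BF, skipping U+0080 to U+009F
      | [\xC3-\xDF][\x80-\xBF]              # remaining 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1 to 3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4 to 15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_utf8(data):
    """Return True if the bytes are valid UTF-8 with no C1 control characters."""
    return VALID_UTF8.match(data) is not None


print(is_utf8("caf\u00e9".encode("utf-8")))  # valid UTF-8
print(is_utf8(b"caf\xe9"))                   # ISO-8859-1 bytes, not UTF-8
print(is_utf8(b"\xc2\x85"))                  # U+0085, deliberately rejected
```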

The ISO-8859-1 function works in the same way. It too rejects characters with those code points, as it is far more likely that the user has submitted Windows-1252 than the control characters.
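The single-byte checks are much simpler, since they only need a character class. The Python sketch below is an illustration rather than the PHP source. The names are chosen for the example, and rejecting the five bytes that Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90 and 0x9D) is an assumption about how strict such a check should be.

```python
import re

# ISO-8859-1 without the C0/C1 control ranges (tab, LF and CR are kept).
ISO_8859_1 = re.compile(rb"\A[\x09\x0A\x0D\x20-\x7E\xA0-\xFF]*\Z")

# Windows-1252: as above, plus the assigned bytes in the 0x80 to 0x9F range.
CP1252 = re.compile(
    rb"\A[\x09\x0A\x0D\x20-\x7E\x80\x82-\x8C\x8E\x91-\x9C\x9E-\xFF]*\Z"
)


def is_iso88591(data):
    """Return True if the bytes are plausible ISO-8859-1 text."""
    return ISO_8859_1.match(data) is not None


def is_cp1252(data):
    """Return True if the bytes are plausible Windows-1252 text."""
    return CP1252.match(data) is not None


print(is_iso88591(b"caf\xe9"))          # plain ISO-8859-1
print(is_iso88591(b"\x93quoted\x94"))   # curly quotes are Windows-1252 only
print(is_cp1252(b"\x93quoted\x94"))
```

Many valid UTF-8 strings also pass these single-byte checks (the two bytes of an encoded é are both legal ISO-8859-1 characters), so it makes sense to test for UTF-8 first and only fall back to these afterwards.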

Converting to UTF-8

In PHP, the utf8_encode() function can be used to convert from ISO-8859-1 to UTF-8. However, the real world forces us to handle ISO-8859-1 as Windows-1252, yet the utf8_encode() function will not handle that as well as we would like.

Since Windows-1252 is a superset of ISO-8859-1, these can both be handled by the same function: utf8FromCP1252(). Internally, this makes use of the pre-existing utf8_encode() function. Afterwards, it searches the newly encoded UTF-8 string for characters in the offending code points and remaps them to their correct Unicode code points and encodes them.

To do this, a second function is used which accepts the Windows-1252 encoded character, determines the code point, uses a lookup table in an array to find the Unicode code point and then calls a third function to generate the UTF-8 encoded character from that code point.
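The idea can be sketched in Python as follows. This is not the PHP implementation: decoding as latin-1 stands in for utf8_encode(), and the remapping is done on code points before encoding rather than by searching the encoded UTF-8 string afterwards. The function and table names are chosen for the example.

```python
# Windows-1252 bytes 0x80 to 0x9F mapped to their Unicode code points.
# Five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned and left alone.
CP1252_TO_UNICODE = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}


def utf8_from_cp1252(data):
    """Convert ISO-8859-1 or Windows-1252 bytes to UTF-8 bytes."""
    # Step 1: treat every byte as the code point of the same number,
    # which is the ISO-8859-1 interpretation.
    text = data.decode("latin-1")
    # Step 2: remap the code points 128 to 159 using the lookup table.
    text = text.translate(CP1252_TO_UNICODE)
    # Step 3: encode the corrected code points as UTF-8.
    return text.encode("utf-8")


print(utf8_from_cp1252(b"\x93caf\xe9\x94").decode("utf-8"))
```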

The third function has been adapted from Anne van Kesteren’s Character references to UTF-8 converter, which was itself adapted from Henri Sivonen’s UTF-8 to Code Point Array Converter. The main difference with my version is that I renamed it and changed the variable names used to something a little more sensible.
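For readers who have not seen one, a code point to UTF-8 encoder is mostly bit shifting. The Python sketch below shows the general technique only; it is not a port of either of the converters mentioned above, and the function name is chosen for the example.

```python
def utf8_from_code_point(cp):
    """Encode a single Unicode code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_from_code_point(0x20AC))  # the euro sign: b'\xe2\x82\xac'
```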

Code and Demo

You can see it all in action on the demonstration page. Enter some characters in the UTF-8 and ISO-8859-1 forms and see how it flawlessly handles the detection and conversion of your input into valid UTF-8 output. The source code is also available.