Category Archives: Characters

Character encodings, repertoires and related issues, including Unicode.

Web Developer Quiz Update

I’ve received quite a few responses to yesterday’s Web Developer Quiz, including some feedback about the type of questions I asked and criticism that they focus too much on SGML, which I’d like to take the opportunity to address.

Firstly, out of all the responses received in the last 24 hours (although they’re not yet published), not one person has answered all questions correctly. Indeed, there are questions in there that no-one has answered correctly yet, which surprises me greatly; I was expecting to receive, at least collectively, the correct answers for all questions.

Secondly, I’m going to go through each section and explain, without giving away the answers just yet, why I asked each question and why it’s important for authors to know the answers to them.

Validation

Looking at the sample document, it’s not hard to see that it makes use of unsupported SGML features that cannot be used in the real world. However, this does not mean that authors do not need to be aware of them.

In fact, the document demonstrates just how easy it is to make unintentional use of such features which, while it may not be what the author intended, will result in one of two possibilities. 1. Completely unexpected errors that don’t seem to make sense: a problem I see a lot of beginners struggle with. 2. As is the case with this document, a combination of two specific authoring errors that results in no validation error being reported for the mistakes at all.

At this point, I’d like to point out that there is just 1 error within the document (most people have picked it so far), but it has nothing to do with the unsupported SGML features, and everything to do with the declared DOCTYPE. This will, perhaps, become more apparent to you when I reveal the answers and explain the reasons for the errors, or lack thereof, in more detail next week.

Elements in the DOM

The first of these questions is very much related to an unsupported SGML feature, rather than real world, practical HTML, and I admit, I just threw it in as a challenge for the more advanced authors. It is, however, important to be aware of the syntax and that it is unsupported, and thus cannot be used, even inadvertently.

The second question is testing your knowledge of real world, supported markup. You need to be aware that start-tags and end-tags can be omitted for some elements, yet the elements will still be present in the DOM. You also need to be aware of the HTML/SGML comment syntax and, although it wasn’t really tested with these questions, the syntactic differences between SGML and XML comments.

Semantics

These are, perhaps, the easiest and most practical questions in the quiz. So far, nearly everybody has answered these questions correctly, and I don’t feel I need to explain why they were included; it seems quite obvious to everyone.

Character References

Surprisingly, nobody has correctly answered any of these 3 questions. Yet it is important, from both a practical and a validation point of view, to understand the similarities and differences between HTML, XHTML and XML with respect to character references. It is also important to have an understanding of the Unicode character repertoire and code points, which is what everyone has failed on so far.

Media Types

Again, this is important from a practical perspective. Authors need to understand that they should not use XHTML with the wrong media type, and also understand the practical limitations of doing so. Conversely, although this was not tested with these questions, it is important to understand the current practical limitations of using the correct media type for XHTML.

I’ll be revealing the answers, including all the responses to the quiz, on Sunday evening (local time). Until then, tell others who haven’t seen it yet about the quiz; I’m interested in finding out how much an average web developer really knows about the technologies they use every day.

Web Developer Quiz

This quiz is designed to test whether or not web developers have an understanding of the basic technologies used on the web, primarily HTML, HTTP, Media Types (MIME) and character repertoires and encodings. Personally, I expect every single web developer to pass this quiz with flying colours, yet reality tells me that a large proportion will struggle. So, in the interests of finding out exactly how much web developers in general do and do not know, and for your own personal benefit, I decided to publish this quiz (or survey, if you like).

Firstly, a few ground rules. Please don’t cheat. I expect all web developers to know the answers to these questions without the need for reference material or the use of automated tools. That means, please don’t make use of the validator or look up the specifications to answer these questions; they’re designed to be easy enough to answer without such tools, yet still provide enough of a challenge for all but the most knowledgeable authors. Secondly, in order to give everyone a fair go and avoid the chance of having all the correct answers given away in the first response, I’ve temporarily enabled comment moderation and no comments will appear until I publish the results and answers next week. OK, so on with the quiz…

This sample document applies to the first 3 questions. You may assume the HTTP headers contain:

Content-Type: text/html;charset=UTF-8

Note: This document uses some special syntax that is not widely supported in existing browsers; it is only designed to test your knowledge of HTML.

1. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
2. <html lang="en">
3.   <title/Sample HTML 4.01 Document/
4.   <p align="right">This is a sample HTML 4.01 Strict document.
5.   <>How much do you know about HTML?</>
6.   <!-- -- --> <em>It’s not hard!</em> <!-- -- -->
7.   <p>Created by <a href=http://lachy.id.au/">Lachlan Hunt
8. </html>

Validation

Which lines in the above HTML document contain validation errors, if any? Note: I’m only looking for those errors that will be reported by a conforming SGML based validator.

Elements in the DOM

  1. How many p elements are there within the above document?
  2. Which of these elements, if any, will not be present within the Document Object Model of the above document?
    • <head>
    • <body>
    • <em>

Semantics

  1. Which markup structure is the most semantically correct for a navigational link menu, regardless of how it will be presented visually?
    1. <div class="menu">
          <a href="…">Link 1</a> |
          <a href="…">Link 2</a> |
          <a href="…">Link 3</a>
      </div>
    2. <div class="menu">
          <a href="…">Link 1</a><br>
          <a href="…">Link 2</a><br>
          <a href="…">Link 3</a><br>
      </div>
    3. <ul class="menu">
          <li><a href="…">Link 1</a></li>
          <li><a href="…">Link 2</a></li>
          <li><a href="…">Link 3</a></li>
      </ul>
  2. Which markup structure is the most semantically correct for a title within the document body that may be horizontally centred in a visual medium (eg. screen) using a large, bold font?
    1. <div class="title">Document Title</div>
    2. <h1>Document Title</h1>
    3. <p align="center"><font size="+3"><b>Document Title</b></font></p>
    4. <h1 style="font-weight:bold;font-size:large;text-align:center;">Document Title</h1>
    5. <h1 class="LargeBoldCenterHeading">Document Title</h1>

Character References

Given these three numeric character references, and two character entity references:

  • &#x2019;
  • &#8217;
  • &#146;
  • &rsquo;
  • &apos;
  1. Which ones are invalid for an HTML 4.01 document?
  2. Which ones are invalid for an XHTML 1.0 document?
  3. Which ones are invalid for a generic XML document? (assume no DTD or Schema)

Media Types (MIME)

  1. Which of these MIME types SHOULD NOT be used for an XHTML 1.1 document?
    • application/xhtml+xml
    • text/html
    • application/xml
    • text/xml
  2. Using the answer from the previous question, under what conditions MAY (according to the recommendation) an XHTML 1.0 document use that MIME type?

Guide to Unicode, Part 3

In the beginning we discussed character repertoires, code points and HTML character references; together with their relationship to the Unicode standard. We then looked at character encodings, examining the differences between single- and multiple-octet encodings, and how to create files in UTF-8. If you are unfamiliar with those concepts, I recommend that you read parts 1 and 2 of this guide first and then return to this section when you are ready.

The saga continues in this third and final thrilling chapter, where we will look at some of the problems encountered with the use of Unicode and, in particular, UTF-8 and the BOM. This will involve taking a look at the debugging of the most common problems and how the tools we’ve looked at previously can help you with this. Following this I will discuss the importance and practicalities of ensuring that the character encoding is correctly declared. Finally I will discuss the purpose of the BOM and show the difference between the UTF-16 and UTF-32 variants: Little Endian and Big Endian.

The biggest problem with using Unicode on the web is that not all editors support Unicode, although some authors don’t realise this and declare the encoding as UTF-8 anyway. This only becomes a problem when characters outside the US-ASCII subset are used. However, since we have already covered the creation and editing of Unicode files, there is no need to discuss this problem further.

The next major issue encountered is that user agents may not always display the characters correctly. Ignoring the availability of Unicode fonts for now, people will often attempt to use UTF-8, yet find that some of the characters they have used get turned into 2 or 3 seemingly random characters. At that point they tend to give up and revert to ISO-8859-1 (or Windows-1252), using character references, or US-ASCII substitutes, for the characters outside those repertoires.

For those of you that have read part 2, you should now recognise that the displayed characters are not random, but are in fact the multi-octet UTF-8 encoded characters interpreted as a single-octet encoding. This error is usually caused by incorrectly declaring a single-octet encoding, most often ISO-8859-1. For example, the BOM (if present) may be displayed as ï»¿ when a UTF-8 file is incorrectly declared as ISO-8859-1 by the HTTP response headers (similar to the demonstration in part 2 where the character encoding was manually overridden).
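If you would like to see that mechanism for yourself, here is a minimal sketch in Python; the sample text is my own, and Windows-1252 is used as the single-octet decoder so that every octet maps to a printable character:

# Encode text as UTF-8, then (incorrectly) decode the octets as a
# single-octet encoding, reproducing the seemingly random characters.
text = "\ufeffIt\u2019s not hard!"   # a BOM plus a curly apostrophe

octets = text.encode("utf-8")
print(octets.decode("windows-1252"))
# ï»¿Itâ€™s not hard!  (each multi-octet character becomes 2-3 characters)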

For any sequence of characters appearing incorrectly like this, you may use Ian Hickson’s UTF-8 decoder to determine what the character is. Conversely, you may also use the UTF-8 encoder to reveal both the ISO-8859-1 representation as well as the hexadecimal octet values for any Unicode character.

If you suspect that you are experiencing this kind of error, then the first thing to check is which encoding is being used by the user agent. Most popular UAs will allow you to view the character encoding, such as in Mozilla’s Page Info dialog or Opera’s Info panel. The W3C’s MarkUp Validator in verbose mode will also let you know this information. If you find that the character encoding is incorrect, it is then necessary to determine whence the user agent is acquiring this information and to correct the error. If, however, you find that the character encoding is correct, then the problem is caused elsewhere and will be discussed later.

Depending on the file format, the character encoding may be declared on a file-by-file basis through, for example, the use of the <meta> element in HTML, the <?xml?> declaration in XML documents, the @charset rule in CSS or indicated by the presence of the BOM. On a server-wide basis it depends upon the server configuration and may be declared by the charset parameter of the Content-Type field in the HTTP response headers. It may also be indicated by a referencing document using, for example, the charset attribute in HTML or the encoding of the referencing document itself. Finally, in the absence of any of these indications, some specifications define the default encoding that should be used – commonly UTF-8, UTF-16, ISO-8859-1 or US-ASCII.

The order of precedence for each applicable method is defined in the relevant specification for the language being used (eg. Specifying a Character Encoding in HTML 4.01), often beginning with the Content-Type header field having the highest precedence and ending with the charset attribute or encoding of the referencing document, having the lowest. Of course, it can be fun when specifications collide on this issue, which illustrates why it is not only important to declare the character encoding appropriately, but to ensure that it is declared correctly.
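As a rough illustration of that order of precedence, here is a hedged sketch in Python; the function and parameter names are mine for illustration and are not defined by any specification:

# Illustrative only: resolve the encoding of an HTML 4.01 document by
# checking each source from highest to lowest precedence.
def effective_encoding(http_charset=None, meta_charset=None,
                       referrer_charset=None):
    for source in (http_charset, meta_charset, referrer_charset):
        if source is not None:
            return source
    return "ISO-8859-1"   # a common, though not universal, default

# The HTTP header wins, even when the meta element disagrees:
print(effective_encoding(http_charset="UTF-8", meta_charset="ISO-8859-1"))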

Most of those declaration methods may be easily checked simply by opening the file and seeing which are included. The only one you may have difficulty with is the HTTP headers, since these are not so readily viewable without the right tool. Thankfully, there are several to choose from, ranging from browser extensions for Mozilla and Firefox, such as Live HTTP Headers, to online tools like the W3C’s HTTP HEAD Service.
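For a quick one-off check, any HTTP client will do; for example, a couple of lines of Python (the URL is a placeholder):

# Fetch a page and print the Content-Type header, including any
# charset parameter the server sends.
from urllib.request import urlopen

response = urlopen("http://example.org/")      # placeholder URL
print(response.headers.get("Content-Type"))    # e.g. text/html; charset=UTF-8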

Once you have determined the method by which the encoding is being indicated, it is a simple matter of correcting the error (correcting HTTP headers will be discussed later). If more than one method is used (eg. the XML declaration, the meta element and/or the HTTP Content-Type header field), you should also ensure that all of them indicate the same encoding. For most document formats, the recommended method to use is the HTTP headers. For XML documents (served as XML), however, since they are self-describing file formats, it is recommended that the server be configured to omit the charset parameter from the Content-Type header and that the <?xml?> declaration be used instead.

One thing to note: even if the character encoding is being correctly declared, the use of the BOM may still cause problems. Although many applications do support the use of the BOM (namely those that support UTF-8 properly), there are still many that don’t (in particular, older web browsers) and require that the file be saved with the BOM omitted. The problem is that not every editor that supports UTF-8 has an option to control the output of the BOM. When the file is read by an application that does not support UTF-8, or the BOM, these three octets (EF BB BF) may be interpreted as single-octet characters. For this reason, the W3C MarkUp Validator will issue a warning about its use, and it is recommended that the BOM be omitted from HTML documents. (Note: this does not apply to XML or XHTML documents served as XML, since XML user agents are required to support UTF-8 and UTF-16 fully, including the BOM.)
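If your editor provides no option to omit the BOM, it can be stripped after the fact; a minimal sketch in Python:

# Remove a leading UTF-8 BOM (the octets EF BB BF) from a file, for
# consumers that would otherwise treat it as character data.
import codecs

def strip_utf8_bom(path):
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):       # b'\xef\xbb\xbf'
        with open(path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])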

To modify the HTTP headers, it is necessary to edit a server configuration file. For the Apache HTTPd server, most web hosting providers will allow web content authors to use .htaccess files for this, and many other, purposes. Other web servers may offer similar abilities, though the method may be very different; consult your server’s documentation or contact your web host for information about how to configure the character encoding correctly. For those of you using an Apache server, it is simply a matter of including the appropriate directives in your .htaccess files or, if you have access to it (e.g. you are the server administrator), in the httpd.conf file. (However, most web hosting providers do not allow general access to that file.)

The AddDefaultCharset and/or AddCharset directives may be used. For example, you may set the default charset to UTF-8, but still wish that some files use ISO-8859-1. These directives in either .htaccess or httpd.conf will accomplish this:

AddDefaultCharset UTF-8
AddCharset ISO-8859-1 .latin1

The AddCharset directive takes both a character encoding name and a file extension. For files encoded as ISO-8859-1, it simply requires that the file have a .latin1 file extension. This feature is most useful when Content Negotiation is enabled: a file may be saved as, say, page.html.latin1 (a hypothetical name), yet the .latin1 extension need not be included in the URI and the correct encoding will still be sent.

Besides the fact that some user agents do not support the BOM, you may still encounter problems with its use in files intended to be processed by some server-side technology, such as PHP or JSP. For example, Pierre Igot described the problem he encountered with the presence of the BOM in WordPress PHP files. This issue occurs because upon encountering non-whitespace character data, the processor will assume that the content has begun, send out all HTTP headers and begin to output the resultant file. If, for example, you then attempt to use the PHP header() function, an error will be received indicating that it is too late to modify headers because the content has already begun.

Other errors may not cause any server-side errors, and are thus harder to catch, but they may result in invalid markup being transmitted by the server due to character data appearing where character data is not allowed. For example, consider an include file containing a set of link elements to be included within the head elements of several (X)HTML files using some scripting language or SSI: if the include file is encoded as UTF-8 and begins with the BOM, yet the scripting language processor does not support UTF-8 and therefore neither recognises the BOM nor strips it from the output, it will end up being included within the head elements. Since a head element may not contain character data, a markup validator will detect this and issue an appropriate error message.
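Here is a deliberately contrived sketch of that failure mode in Python; the include’s file name and markup are illustrative only:

# A processor that is unaware of UTF-8 copies an include byte-for-byte,
# so the include's BOM leaks into the head element of the output.
include = '\ufeff<link rel="stylesheet" href="style.css">'.encode("utf-8")
output = b"<head><title>Example</title>" + include + b"</head>"

# The BOM (U+FEFF) now appears as character data inside the head element:
print("\ufeff" in output.decode("utf-8"))   # True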

The exact error will differ depending on whether you are using HTML or XHTML, though the cause will be the same. For HTML, the BOM will implicitly end the head element, and begin the body element. Thus, anything following will be treated as being within the body element, not the head, which may cause several errors depending on the content. For XHTML, however, the BOM will simply be treated as character data appearing within the head element, where no character data is allowed.

The final problem that I have encountered is slightly more complicated and proved quite difficult to solve. While using JSP to develop a web site, I had encoded all files as UTF-8, omitting the BOM. The issue was that one include file included the copyright symbol (U+00A9), which is encoded in UTF-8 as the octets C2 A9. At first, it appeared as though the UA was interpreting the character as a single octet encoding, thus displaying the characters Â©. However, the HTML document was being correctly interpreted as UTF-8, since the character encoding was being declared by the HTTP Content-Type header field.

After much investigation, I found that if I encoded the include file as ISO-8859-1, but the main file as UTF-8, the desired UTF-8 output was received. It turned out that the JSP processor, Apache Tomcat with JBoss, thought that the include file was to be interpreted as ISO-8859-1 (the default for JSP); however the output was required to be UTF-8. Because of this, the JSP processor was attempting to convert the character encoding of the include file into UTF-8 on the fly.

Thus, when it encountered the octets C2 A9, it interpreted them as ISO-8859-1 characters, which map to the Unicode characters: U+00C2 and U+00A9. These characters, when encoded as UTF-8, form the octets C3 82 and C2 A9, respectively, which is the output I was receiving in the HTML document. I ended up solving this problem by correctly informing the JSP processor that the include files were also encoded as UTF-8, and not the default ISO-8859-1.
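The entire round trip can be reproduced in a few lines of Python; this is a sketch of the fault itself, not of the JSP processor:

# The UTF-8 octets for U+00A9 are mis-read as ISO-8859-1, then
# re-encoded as UTF-8: exactly the double encoding described above.
copyright_sign = "\u00a9"
utf8_octets = copyright_sign.encode("utf-8")               # c2 a9
double_encoded = utf8_octets.decode("iso-8859-1").encode("utf-8")

print(double_encoded.hex(" "))         # c3 82 c2 a9
print(double_encoded.decode("utf-8"))  # Â©, which is what the browser displayed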

Up until now, we have looked at the BOM, discussed how it is encoded in UTF-8 and some of the problems it may cause; however, we have not looked at its purpose. As mentioned, the BOM is optional in UTF-8; but it is required in UTF-16 and UTF-32 to indicate the order of octets for each character as either Little Endian, where the least significant byte appears first, or Big Endian, where the most significant byte appears first. For this guide, we will only look at UTF-16, but a similar technique applies to UTF-32 documents.

In UTF-16LE the BOM (U+FEFF) will appear in the file as the octet sequence FF FE (least significant byte first), but in UTF-16BE it appears as FE FF (most significant byte first). Because the code point U+FFFE is defined never to be a character in Unicode, the octet sequence FF FE can safely be taken as the BOM of a UTF-16LE file, rather than as a UTF-16BE file beginning with the character U+FFFE, and vice versa.
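Both byte orders can be observed directly; for example, in Python:

# The BOM (U+FEFF) as it appears on disk in each UTF-16 byte order.
bom = "\ufeff"
print(bom.encode("utf-16-le").hex(" "))   # ff fe  (least significant octet first)
print(bom.encode("utf-16-be").hex(" "))   # fe ff  (most significant octet first)

# The plain "utf-16" codec prepends the BOM in the platform's byte
# order; on a little-endian machine:
print("A".encode("utf-16").hex(" "))      # ff fe 41 00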

Now that you have a rather more in-depth understanding of Unicode (including: the character repertoire, code points and the various encoding forms; the ability to create and edit Unicode encoded files; and an understanding of several of the problems that may be encountered) it is time to go forth and prosper: to make use of Unicode to its full potential, to simplify the use of non-US-ASCII characters and to help promote the i18n of the web. However, as with everything else, there’s always more to be learned. Although it may seem that I have covered much, I’m sure you will find that I have only just scratched the surface. So, to help you out, I’ve compiled a short but comprehensive list of additional resources that will provide further information: