Category Archives: Standards

Standards, protocols, recommendations and guidelines.

HTML Tags

Well, after taking a long break from writing anything on this blog, I’m back and better than ever. I’ll try to post more regularly from now on, with much better content. I hope you, my loyal readers, didn’t miss me too much while I was gone, but anyway, let’s get on with the good stuff.

One thing I’ve noticed is that a lot of people believe that in HTML, everything is a tag (or at least can be called a tag). This is most certainly not the case. The most recent offender I’ve seen, and the reason I decided to write this, is the author of Firefox, ALT Tags, and Tooltips, which, as you can see by the title, incorrectly refers to attributes as tags. The article itself is quite good, and I fully agree with its message about tooltips for alt attributes; it’s just the incorrect references to attributes as tags that bug me. This author is not the first, nor the last, to make the mistake, but it is about time people learned to call things by their real names.

If you read part 5, Terminology, of Joe English’s humorous document, “Not the comp.text.sgml Frequently Asked Questions List”, you will see that the common name for everything except a tag is a tag. The common name for a tag is a command, which, of course, makes perfect sense!

    --------------------------------------------------
    ISO/W3C terminology			Common name
    --------------------------------------------------
    attribute				tag
    attribute value			tag
    attribute value literal		tag
    attribute value specification	tag
    character reference			tag
    comment				tag
    comment declaration 		tag
    declaration				tag
    document type declaration		tag
    document type definition		tag
    element				tag
    element type			tag
    element type name			tag
    entity				tag
    entity reference			tag
    general entity			tag
    generic identifier			tag
    literal				tag
    numeric character reference		tag
    parameter entity			tag
    parameter literal			tag
    processing instruction		tag
    tag					command
    --------------------------------------------------

So what exactly is a tag then? Well, before I get to that, I’ll just explain what some of the more common SGML and XML terminology means and what a tag is not.

Firstly, tags are not commands. People believe they are commands because of the misconception that HTML is a presentational language, or even a programming language. HTML is certainly not a programming language, and while it is true that presentational features have crept in, they have already been deprecated and/or removed from (X)HTML, or at least will be in future versions.

It is the presentational elements and attributes that could be seen as commands or instructions to display the content in a certain way; however, they are in fact suggestions, just like CSS properties – the only difference being that these presentational suggestions are mixed in with the markup, and have no real semantics that indicate what the content is, only what the author wants it to look like, usually in a visual medium. Any presentational feature, whether done with CSS or the presentational elements and attributes, can be overridden by a user with a user stylesheet (assuming the user agent supports that facility). Therefore, they are only suggestions that a user does not have to accept, not commands that a user agent or user must obey.

HTML, since it has been formally based on SGML, is intended to mark up the structure and semantics of the content by saying what it is, not what it does, nor how it looks (with the exception of the aforementioned presentational features). Basically, HTML is not a procedural programming language; it is a descriptive markup language, so tags are not commands.

Attribute Tags

There’s no excuse for calling attributes tags, other than complete laziness and/or ignorance, but as already shown, calling attributes tags is a common mistake. An attribute is a property of an element that is written within the start-tag of an element, and should be referred to as simply an attribute. eg. The alt attribute… is the simplest way of referring to an attribute, and is only slightly longer than writing tag. However, a shorthand method of referring to attributes, which I occasionally see within plain text e-mails, is to write it within vertical bars, or some other delimiter. eg. |alt|.
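As a rough illustration of the distinction, the sketch below uses Python’s standard html.parser module to pull apart a single start-tag. The class name StartTagInspector and the sample markup are made up for this example. Note that the library itself names the parameter "tag", even though what it receives is the element name.

```python
from html.parser import HTMLParser


class StartTagInspector(HTMLParser):
    """Record the element name and the attributes found in each start-tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # `tag` is the element name; `attrs` is a list of (name, value)
        # pairs, one pair per attribute written inside the start-tag.
        self.seen.append((tag, dict(attrs)))


inspector = StartTagInspector()
inspector.feed('<img src="logo.png" alt="Company logo">')
print(inspector.seen)
```

Here the one start-tag yields one element name (img) and two attributes (src and alt); neither attribute is a tag in its own right.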

Character Tags (or Entities)

Character references are sometimes called tags, but are more often called entities. Just like attributes, they are not tags either, but what’s wrong with calling them entities?

According to section 3.2.3 of the HTML 4.01 recommendation, Character references are numeric or symbolic names for characters that may be included in an HTML document. Section 5.3 also states:

Character references in HTML may appear in two forms:

  • Numeric character references (either decimal or hexadecimal).
  • Character entity references.

The numeric character references take the form &#nnnn; (decimal) or &#xnnnn; (hex). Character entity references are the named entities for the ISO-8859-1 characters (from 160 to 255), symbols, mathematical symbols and Greek letters, and finally, markup-significant and internationalization characters.
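The three forms can be compared with a small sketch, assuming Python’s standard html.unescape function as a stand-in for what a user agent does when it resolves references. The copyright sign is just an example character.

```python
import html

# Three different references to the same character, U+00A9 COPYRIGHT SIGN.
decimal = html.unescape("&#169;")      # decimal numeric character reference
hexadecimal = html.unescape("&#xA9;")  # hexadecimal numeric character reference
named = html.unescape("&copy;")        # character entity reference

print(decimal, hexadecimal, named)
```

All three resolve to the same single character; only the way of referring to it differs.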

Based on that, you may think that it is only the numeric references that are incorrectly referred to as entities; however, it is indeed both forms. In SGML and XML there are several types of entities, and the simplest explanation of what an entity is, is that which comes from ISO-8879 itself, the SGML specification: an entity is a collection of characters that can be referenced as a unit. The purpose of entities can be easily understood, but understanding exactly what an entity is and separating that concept from the markup, is more difficult.

An entity is a concept that is defined in a DTD using an entity declaration defining both the name, and the replacement text. The entities are referred to within a document using an entity reference in the form: &name;. The entity declaration and the entity reference are just the markup for the entity, but they are not the entity itself.
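A minimal sketch of that separation, assuming an XML document with an internal DTD subset and Python’s standard xml.etree.ElementTree parser. The element name note and the entity name greeting are hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity declaration lives in the DTD (here, the internal subset);
# the entity reference &greeting; appears in the document content.
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY greeting "Hello, world">
]>
<note>&greeting;</note>"""

root = ET.fromstring(DOCUMENT)

# The parser has replaced the reference with the entity's replacement text.
print(root.text)
```

Neither the declaration nor the reference survives into the parsed content; what remains is the replacement text of the entity they both point to.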

Generally, when people say entities in regard to an HTML document, they are actually referring to the character entity references and/or the numeric character references, not the entities themselves. This is not always the case, since SGML and XML experts will usually get it right, but luckily, the speaker’s intended meaning can generally be understood from the context.

The DOCTYPE Tag

The Document Type Declaration, or simply DOCTYPE, is often referred to as the DTD, or the DOCTYPE tag. The acronym, DTD, can be mistakenly used to refer to the Document Type Declaration, since it has the same initials as the acronym’s defined meaning: Document Type Definition.

The DOCTYPE is not a tag either; it is a declaration, so calling it the DOCTYPE tag is incorrect. However, more often than not, it is easier to simply refer to it as just the DOCTYPE.

The <?xml?> Tag

The XML declaration, often referred to as a Processing Instruction or Prolog, is also sometimes called the <?xml?> tag. As you can probably guess, it is not a tag. It is not a processing instruction either, but that, at least, is forgivable, since it does have the appearance of an XML PI, though it is defined separately as the XML Declaration. Nor is it the prolog, but it is part of the prolog.
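One way to see the difference is to ask an XML parser which processing instructions a document contains. The sketch below assumes Python’s standard xml.dom.minidom; the sample document and its stylesheet reference are invented for the example.

```python
from xml.dom import Node, minidom

# An XML declaration followed by a genuine processing instruction.
SOURCE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<doc/>"
)

document = minidom.parseString(SOURCE)

# Collect the target of every processing instruction node at document level.
instructions = [
    node.target
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

Only xml-stylesheet is reported as a processing instruction. The XML declaration looks similar in the source, but the parser handles it separately and it never appears as a PI node.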

Elements and Tags

An element is not a tag, as noted at the end of section 3.2.1 Elements, in the HTML 4.01 recommendation:

Elements are not tags. Some people refer to elements as tags (e.g., “the P tag”). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.

Tag only refers to either the start- or end-tags. Every element has a start-tag (eg. <p>) and, with the exception of empty elements, an end-tag (eg. </p>). Empty elements never have an end-tag in HTML, though one is required in XML, and thus XHTML (which can use the special empty element tag syntax). As noted, in HTML, the start- or end-tags may be omitted for some elements, but those elements are still present.
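To make the tag/element difference concrete, here is a small sketch that simply counts the tags written in a fragment, assuming Python’s standard html.parser (which reports tags as they appear and does not infer omitted ones). The class name TagCounter and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Record every start-tag and end-tag that literally appears in the markup."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


counter = TagCounter()
# Two p elements (the first end-tag is omitted) and one empty br element.
counter.feed("<p>First paragraph<p>Second paragraph</p><br>")
counter.close()

print("start-tags:", counter.start_tags)
print("end-tags:", counter.end_tags)
```

The fragment contains three elements but only four tags: three start-tags and a single end-tag. The first p element is still there even though its end-tag was never written.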

An element is more of a concept that is defined using an element declaration, and comprises an element name, that appears within the start- and end-tags, any attributes within the start-tag, and (with the exception of empty elements) its content model and finally, its content. An element is included in a document by writing its start and end tags, as needed, but (like entity declarations and references) the element declaration and tags are only the markup for an element; they are not the element itself. It is important that this distinction be made and understood by authors – I just hope I’ve explained it well enough.

iiNet Standards Redesign

Recently, and to my surprise, iiNet have redesigned their entire site. Not only that, but it validates as XHTML 1.0 Transitional, separates presentation from structure (no tables for layout!), makes reasonably good use of alt text (it’s not perfect, but it’s quite good), fairly accessible use of JavaScript (no serious problems caused with it disabled) and even makes good use of sIFR!

iiNet have discussed this re-branding from a marketing perspective. That’s fair enough; a typical customer isn’t going to want to hear about their newfound standards-compliant and accessible design methods. So, in the process of congratulating them for this fine effort, I’ll take a look at exactly what they have done.

Structure and Presentation

Disabling stylesheets quickly reveals that they have put a lot of effort into this redesign. The old design was a typical table layout with spacer gifs and invalid markup, and a few pages just didn’t work correctly in anything but IE. With the redesign, they’ve used reasonably semantic markup – headings use <hn> elements, paragraphs use <p>, navigation menus and other lists use <li> and there’s no use of presentational class names or ids.

They have, unfortunately, used a few style attributes, but not many. Most of the presentation is specified in an external stylesheet. Ideally, they would use a semantic class name on those few elements that currently use the style attribute, but the damage caused is minimal.

As I mentioned earlier, the markup does validate as XHTML 1.0 Transitional, with the exception of one image missing an alt attribute on this page discussing their re-branding. However, it is a presentational image, and only requires an empty alt="" attribute anyway. The home page nearly validates as Strict; the only errors seem to be structural, due to <input> elements being directly inside a <form> element, the use of a name attribute in a <form> element and the use of a target attribute, which I strongly discourage. They’ve also used an invalid value: target="_new". The HTML 4.01 Specification states:

Except for the reserved names listed below, frame target names (%FrameTarget; in the DTD) must begin with an alphabetic character (a-zA-Z). User agents should ignore all other target names.

This means that, except for the special defined values, _blank, _self, _parent and _top, the value must begin with an alphabetic character. Thus, _new is invalid, even though the validator does not detect it. But, you must keep in mind that the validator is just a tool, and cannot check every conformance requirement, only those specifiable with the DTD. So, technically, they should be using _blank, but ideally, they should remove the target attribute completely, since the user should decide when they want a new window, not the author.
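The rule quoted from the specification is simple enough to check mechanically, even though a DTD cannot express it. The following is only a sketch: the function name is_valid_frame_target is hypothetical, and it assumes the reserved names are compared exactly as written.

```python
import re

# Reserved target names listed in the HTML 4.01 specification.
RESERVED_TARGETS = {"_blank", "_self", "_parent", "_top"}


def is_valid_frame_target(name: str) -> bool:
    """Return True if `name` is a reserved name or begins with a-z or A-Z."""
    if name in RESERVED_TARGETS:
        return True
    # re.match anchors at the start, so this tests only the first character.
    return re.match(r"[A-Za-z]", name) is not None


for candidate in ("_blank", "_new", "content"):
    print(candidate, is_valid_frame_target(candidate))
```

With this check, _blank and content pass while _new fails, which matches the reading of the specification given above.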

XHTML vs HTML Markup

Update: One thing I forgot to mention earlier, and hence why I’m adding this update, is that technically, they should not be using XHTML since they are serving it as text/html, and doing so is considered harmful. If they’re going to use XHTML, they should be using content negotiation to deliver it correctly as application/xhtml+xml to decent UAs that support it, and as text/html to IE and other legacy UAs that don’t. However, as many of you will know, this issue has been discussed recently. Some say it’s OK, others (like myself) think it should be avoided, and others insist that it should not be done. These people are categorised into either the Strict or Transitional Party.
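The negotiation itself can be sketched as a decision based on the request’s Accept header. This is a simplified illustration rather than a complete implementation: the function name choose_content_type is hypothetical, it only looks for an explicit application/xhtml+xml entry with a non-zero q value, and it does not compare that q value against the one for text/html.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick application/xhtml+xml only if the Accept header lists it explicitly."""
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        if media_type.strip().lower() != "application/xhtml+xml":
            continue
        quality = 1.0  # default when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparsable q value as "not acceptable"
        if quality > 0:
            return "application/xhtml+xml"
    # Wildcards such as */* are deliberately not treated as XHTML support.
    return "text/html"


print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(choose_content_type("*/*"))
```

A UA that only sends */* is served text/html here, on the assumption that a wildcard alone says nothing about real XHTML support.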

If they’re not going to serve XHTML properly, then they may as well use HTML 4.01 Strict. I recommend they change to Strict, because Transitional actually triggers Almost Standards Mode in Mozilla. It is near enough to standards compliance, but it adds a small quirk that should not be there, and only exists to support the thousands of pages that depend on IE’s bugginess, yet still use a valid DOCTYPE.

Images and Alt Text

As mentioned, there is one image that I found without alt text, but other than that, they seem to have actually done a reasonable job. Ideally (in this case), the alt text should be exactly the same as the text in the images; however, they have used text with a similar meaning, and viewing the site without images doesn’t lose too much.

For example, one image they have at the moment, states christmas broadband specials. Free Setup + Modem. Save up to $199.95. However, they have set alt="christmas special - free setup/modem". It misses the price, but it still passes a message that is close enough, especially compared with the vast majority of sites that use very poor, or no alt text whatsoever.

They have made use of image replacement techniques for the navigation items, though not in the most accessible way. However, that’s a limitation of CSS and image replacement techniques in general. iiNet have done image replacement by setting the background image on the <a> element for each link, so that hover effects still work in IE, and setting the font to 1px, white to effectively hide the text from view. However, like many image replacement techniques, this is ineffective in the rare case that images are disabled but CSS is enabled.

For many of the headings, they have made use of sIFR, which was designed and developed to be accessible in the majority of cases. It has known limitations, but so far, it is the most accessible image replacement technique available.

JavaScript

The site does make some use of JavaScript, however the site does not require it. With JavaScript disabled, the only issue I found was that the what’s webmail and what’s toolbox links don’t work. They are JavaScript links with the purpose of showing additional information about the webmail and toolbox services. Ideally, with JavaScript disabled, that information should be visible by default, but the additional information is not that much, and can be obtained by disabling stylesheets also. The links should also be added using JavaScript, so that useless links do not appear for users with JavaScript and/or CSS disabled, but again, it’s a minor issue.

So, in conclusion, I would like to congratulate iiNet for taking the initiative to move towards standards compliance, and for actually hiring a web developer that knows what they’re doing. Well Done!

Validating (X)HTML With IE Using File Upload

Warning: The following describes how to modify the registry in order to trick Windows XP SP2 into allowing text/html to be sent with file uploads. This hack has known side effects which may affect other applications running on your system, some of which are discussed in the comments. As a result, I accept no responsibility for damage caused to your system as a result of applying this hack, and this solution is provided as-is, with no guarantee, warranty or support. If you do not understand the registry, or how to reverse any change, then do not apply these changes – use them at your own risk.

Update: This technique is no longer required for HTML. Please see Validation by file upload and Internet Explorer on WinXP SP2

After downloading Windows XP Service Pack 2 recently, I was shocked that IE was now sending HTML documents with a .htm or .html extension as text/plain, thus causing the W3C Markup Validator to issue this warning message:

Sorry, I am unable to validate this document because its content type is text/plain, which is not currently supported by this service.

The Content-Type field is sent by your web server (or web browser if you use the file upload interface) and depends on its configuration. Commonly, web servers will have a mapping of filename extensions (such as “.html”) to MIME Content-Type values (such as text/html).

That you received this message can mean that your server is not configured correctly, that your file does not have the correct filename extension, or that you are attempting to validate a file type that we do not support yet. In the latter case you should let us know that you need us to support that content type (please include all relevant details, including the URL to the standards document defining the content type) using the instructions on the Feedback Page.

This essentially means that it was impossible to validate any local HTML document using IE. This is really annoying, especially for any unfortunate developers who are forced to develop using only IE at work. Although I do pity anyone in that situation, there is now some relief!

After spending about half an hour searching through the registry for any setting that could be causing .html files to be sent as text/plain, I realised that it would be easier to find the setting for other content types that do work, such as CSS. So, I found the setting for that, modified it, and tested. When the CSS Content Type value was set to anything but text/html, IE uploaded the file with that MIME type. Thus, I came to the conclusion that it was not that the setting was incorrect, but that something in Windows security was preventing any text/html content from being sent, by changing it to text/plain on the way.

After that, I tried setting the value for .html files to another type that the validator may support, such as text/sgml or application/sgml, but sadly, without luck! But, just before giving up all hope, I realised that perhaps Windows security, being as insecure as ever, is only checking for an exact match on the content type being set by IE with file uploads. I was correct!

In a normal HTTP header, the Content-Type can also include a charset parameter. For example:

Content-Type: text/html; charset=UTF-8

So, I figured, what if I want IE to send a charset parameter also? I set the Content Type value in the registry to that above, and it worked perfectly: the file validated!!! However, the charset will not always be UTF-8, or any other charset for that matter, so I removed the charset parameter, and was left with the value “text/html;”. That extra little semi-colon on the end is enough to bypass Windows security, and validate any HTML file.
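Why the trailing semi-colon helps can be sketched by comparing an exact string match with a proper parse of the header value. This does not reproduce whatever Windows actually does; the exact-match check merely mirrors the guess described above. The helper name media_type is hypothetical, and Python’s standard email.message module stands in for a receiver that parses the header.

```python
from email.message import Message


def media_type(header_value: str) -> str:
    """Parse a Content-Type value and return just the type/subtype part."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type()


for value in ("text/html", "text/html; charset=UTF-8", "text/html;"):
    exact = value == "text/html"  # the kind of naive check being guessed at
    print(f"{value!r}: exact match={exact}, parsed type={media_type(value)}")
```

A filter that compares strings exactly would only catch the first value, while a receiver that parses the header still sees text/html in all three cases.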

Then, I remembered that IE does not know how to validate XHTML documents either. So, I went to the registry key for .xhtml files, added the application/xhtml+xml MIME type, tested and Guess What! It Worked.

I have exported the required settings from the registry and they are available here. IE6-SP2-Content-Type-text-html.reg will fix the value for text/html, and IE6-SP2-Content-Type-application-xhtml+xml.reg will add the MIME type for XHTML documents. Download them both, inspect their contents to ensure that they are safe, and apply them by launching them. You will be prompted by Windows to confirm that you want to apply the settings.

Update: For any users of ICQ: If you change the text/html value to text/html; then each time the ICQ advertisement rotates, you may be prompted to save the file, because it is an unknown file type. I don’t know why this happens, because IE still works the same as always, full of bugs! But for some reason it affects ICQ. I recommend you only apply that workaround on computers that you do not use ICQ on, or else change it each time you need to validate with IE.