Category Archives: MarkUp

SGML, (X)HTML, XML and other markup languages.

HTML Tags

Well, after taking a long break from writing anything on this blog, I’m back and better than ever. I’ll try to post more regularly from now on, with much better content. I hope you, my loyal readers, didn’t miss me too much while I was gone, but anyway, let’s get on with the good stuff.

One thing I’ve come to notice is that a lot of people believe that in HTML, everything is a tag (or at least can be called a tag). This is most certainly not the case. The most recent offender I’ve seen, and the reason I decided to write this, is the author of Firefox, ALT Tags, and Tooltips, which, as you can see by the title, incorrectly refers to attributes as tags. The article itself is quite good, and I fully agree with its message about tooltips for alt attributes; it’s just the incorrect references to attributes as tags that bug me. This author is not the first, nor will they be the last, to make the mistake, but it is about time people learned to call things by their real names.

If you read part 5, Terminology, of Joe English’s humorous document, “Not the comp.text.sgml Frequently Asked Questions List”, you will see that the common name for everything except a tag is a tag. The common name for a tag is a command, which, of course, makes perfect sense!

    --------------------------------------------------
    ISO/W3C terminology			Common name
    --------------------------------------------------
    attribute				tag
    attribute value			tag
    attribute value literal		tag
    attribute value specification	tag
    character reference			tag
    comment				tag
    comment declaration 		tag
    declaration				tag
    document type declaration		tag
    document type definition		tag
    element				tag
    element type			tag
    element type name			tag
    entity				tag
    entity reference			tag
    general entity			tag
    generic identifier			tag
    literal				tag
    numeric character reference		tag
    parameter entity			tag
    parameter literal			tag
    processing instruction		tag
    tag					command
    --------------------------------------------------

So what exactly is a tag then? Well, before I get to that, I’ll just explain what some of the more common SGML and XML terminology means and what a tag is not.

Firstly, tags are not commands. People believe they are commands because of the misconception that HTML is a presentational language, or even a programming language. HTML is certainly not a programming language, and while it is true that presentational features have crept in, they have already been deprecated in and/or removed from (X)HTML, or at least will be in future versions.

It is the presentational elements and attributes that could be seen as commands or instructions to display the content in a certain way; however, they are in fact suggestions, just like CSS properties. The only difference is that these presentational suggestions are mixed in with the markup, and have no real semantics that indicate what the content is, only what the author wants it to look like, usually in a visual medium. Any presentational feature, whether done with CSS or with the presentational elements and attributes, can be overridden by a user with a user stylesheet (assuming the user agent supports that facility). Therefore, they are only suggestions that a user does not have to accept, not commands that a user agent or a user must obey.

HTML, since it has been formally based on SGML, is intended to mark up the structure and semantics of the content by saying what it is, not what it does, nor how it looks (with the exception of the aforementioned presentational features). Basically, HTML is not a procedural programming language; it is a descriptive markup language, so tags are not commands.

Attribute Tags

There’s no excuse for calling attributes tags, other than complete laziness and/or ignorance, but as already shown, it is a common mistake. An attribute is a property of an element that is written within the start-tag of that element, and should be referred to simply as an attribute. For example, “the alt attribute” is the simplest way of referring to an attribute, and is only slightly longer than writing “the alt tag”. However, a shorthand method of referring to attributes, which I occasionally see within plain text e-mails, is to write the name within vertical bars, or some other delimiter, e.g. |alt|.
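To make the distinction concrete, here is a minimal sketch using Python's standard html.parser module. The class name StartTagInspector and the sample image markup are my own inventions for illustration. The parser hands back the element name found in the start-tag and, separately, the attributes written inside that start-tag.

```python
from html.parser import HTMLParser


class StartTagInspector(HTMLParser):
    """Records each start-tag's element name and its attributes separately."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # `tag` is the element name found in the start-tag;
        # `attrs` is a list of (attribute name, attribute value) pairs.
        self.seen.append((tag, dict(attrs)))


inspector = StartTagInspector()
# Hypothetical markup: one start-tag carrying two attributes.
inspector.feed('<img src="logo.png" alt="Company logo">')

name, attributes = inspector.seen[0]
print(name)               # img
print(attributes["alt"])  # Company logo
```

The alt attribute turns up as a name and value pair belonging to the start-tag; at no point is it a tag in its own right.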

Character Tags (or Entities)

Character references are sometimes called tags, but are more often called entities. Just like attributes, they are not tags either, but what’s wrong with calling them entities?

According to section 3.2.3 of the HTML 4.01 recommendation, “Character references are numeric or symbolic names for characters that may be included in an HTML document.” Section 5.3 also states:

Character references in HTML may appear in two forms:

  • Numeric character references (either decimal or hexadecimal).
  • Character entity references.

The numeric character references take the form &#nnnn; (decimal) or &#xnnnn; (hex). Character entity references are the named entities for the ISO-8859-1 characters (from 160 to 255), symbols, mathematical symbols and Greek letters, and finally, markup-significant and internationalization characters.
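As a quick illustration of the forms described above, the following sketch uses html.unescape from Python's standard library to resolve a decimal reference, a hexadecimal reference and a character entity reference. The choice of é (U+00E9) is arbitrary, and this only shows how a typical HTML-aware tool treats the three spellings.

```python
import html

# Three ways of referring to the same character, U+00E9.
decimal = html.unescape("&#233;")      # numeric character reference (decimal)
hexadecimal = html.unescape("&#xE9;")  # numeric character reference (hex)
named = html.unescape("&eacute;")      # character entity reference

print(decimal == hexadecimal == named == "\u00e9")  # True
```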

Based on that, you may think that it is only the numeric references that are incorrectly referred to as entities; however, it is indeed both forms. In SGML and XML there are several types of entities, and the simplest explanation of what an entity is, is that which comes from ISO-8879 itself, the SGML specification: an entity is a collection of characters that can be referenced as a unit. The purpose of entities can be easily understood, but understanding exactly what an entity is and separating that concept from the markup, is more difficult.

An entity is a concept that is defined in a DTD using an entity declaration defining both the name, and the replacement text. The entities are referred to within a document using an entity reference in the form: &name;. The entity declaration and the entity reference are just the markup for the entity, but they are not the entity itself.
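The declaration and reference can be seen side by side in a small XML document. This is only a sketch: the entity name blog and the note element are made up for the example, and it relies on the parser behind Python's xml.etree.ElementTree expanding entities declared in the internal DTD subset.

```python
import xml.etree.ElementTree as ET

# The entity declaration gives the entity a name (blog) and replacement text.
# The entity reference (&blog;) is the markup that refers to it in the content.
document = """<!DOCTYPE note [
  <!ENTITY blog "Lachy's Log">
]>
<note>Posted on &blog;.</note>"""

root = ET.fromstring(document)
print(root.text)  # Posted on Lachy's Log.
```

After parsing, neither the declaration nor the reference remains in the content, only the replacement text, which is a fair hint that the markup and the entity are not the same thing.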

Generally, when people say entities in regard to an HTML document, they are actually referring to the character entity references and/or the numeric character references, not the entities themselves. This is not always the case, since SGML and XML experts will usually get it right, but luckily the intended meaning can generally be understood from the context.

The DOCTYPE Tag

The Document Type Declaration, or simply DOCTYPE, is often referred to as the DTD, or the DOCTYPE tag. The acronym, DTD, can be mistakenly used to refer to the Document Type Declaration, since it has the same initials as the acronym’s defined meaning: Document Type Definition.

The DOCTYPE is not a tag either; it is a declaration, so calling it the DOCTYPE tag is incorrect. More often than not, it is easier to refer to it simply as the DOCTYPE.

The <?xml?> Tag

The XML declaration, often referred to as a Processing Instruction or Prolog, is also sometimes called the <?xml?> tag. As you can probably guess, it is not a tag. It is not a processing instruction either, but that, at least, is forgivable, since it does have the appearance of an XML PI, though it is defined separately as the XML Declaration. Nor is it the prolog, though it is part of the prolog.

Elements and Tags

An element is not a tag, as noted at the end of section 3.2.1 Elements, in the HTML 4.01 recommendation:

Elements are not tags. Some people refer to elements as tags (e.g., “the P tag”). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.

Tag only refers to either the start- or end-tags. Every element has a start-tag (eg. <p>) and, with the exception of empty elements, an end-tag (eg. </p>). Empty elements never have an end-tag in HTML, though one is required in XML, and thus XHTML (which can use the special empty element tag syntax). As noted, in HTML, the start- or end-tags may be omitted for some elements, but those elements are still present.
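The XML side of this can be sketched with Python's xml.etree.ElementTree. The sample paragraphs are hypothetical; the point is that the empty element tag syntax and an explicit start-tag plus end-tag are two spellings of the same empty element. (Amusingly, ElementTree exposes the element name through an attribute called tag, which is exactly the loose usage I'm complaining about.)

```python
import xml.etree.ElementTree as ET

# Two ways of writing the tags for an empty br element in XML.
short_form = ET.fromstring("<p>one<br/>two</p>")
long_form = ET.fromstring("<p>one<br></br>two</p>")

for root in (short_form, long_form):
    br = root[0]
    # Same element name, no text content and no child elements either way.
    print(br.tag, br.text, len(br))  # br None 0
```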

An element is more of a concept that is defined using an element declaration, and comprises an element name, that appears within the start- and end-tags, any attributes within the start-tag, and (with the exception of empty elements) its content model and finally, its content. An element is included in a document by writing its start and end tags, as needed, but (like entity declarations and references) the element declaration and tags are only the markup for an element; they are not the element itself. It is important that this distinction be made and understood by authors – I just hope I’ve explained it well enough.

iiNet Standards Redesign

Recently, and to my surprise, iiNet have redesigned their entire site. Not only that, but it validates as XHTML 1.0 Transitional, separates presentation from structure (no tables for layout!), makes reasonably good use of alt text (it’s not perfect, but it’s quite good), makes fairly accessible use of JavaScript (no serious problems are caused with it disabled) and even makes good use of sIFR!

iiNet have discussed this re-branding from a marketing perspective. That’s fair enough; a typical customer isn’t going to want to hear about their newfound standards-compliant and accessible design methods. So, in the process of congratulating them for this fine effort, I’ll take a look at exactly what they have done.

Structure and Presentation

Disabling stylesheets quickly reveals that they have actually put a lot of effort into this redesign compared with the previous one. The old design was a typical table layout with spacer gifs and invalid markup, and a few pages just didn’t work correctly in anything but IE. With the redesign, they’ve used reasonably semantic markup: headings use <hn> elements, paragraphs use <p>, navigation menus and other lists use <li>, and there’s no use of presentational class names or ids.

They have, unfortunately, used a few style attributes, but not many. Most of the presentation is specified in an external stylesheet. Ideally, they would use a semantic class name on those few elements that currently use the style attribute, but the damage caused is minimal.

As I mentioned earlier, the markup does validate as XHTML 1.0 Transitional, with the exception of one image missing an alt attribute on this page discussing their re-branding. However, it is a presentational image, and only requires an empty alt="" attribute anyway. The home page nearly validates as Strict; the only errors seem to be structural, due to <input> elements being directly inside a <form> element, the use of a name attribute on a <form> element and the use of a target attribute, which I strongly discourage. They’ve also used an invalid value: target="_new". The HTML 4.01 Specification states:

Except for the reserved names listed below, frame target names (%FrameTarget; in the DTD) must begin with an alphabetic character (a-zA-Z). User agents should ignore all other target names.

This means that, except for the specially defined values _blank, _self, _parent and _top, the value must begin with an alphabetic character. Thus, _new is invalid, even though the validator does not detect it. You must keep in mind that the validator is just a tool, and cannot check every conformance requirement, only those specifiable with the DTD. So, technically, they should be using _blank, but ideally, they should remove the target attribute completely, since the user should decide when they want a new window, not the author.
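The rule is simple enough to check by hand, or with a few lines of code. The following is only a sketch of the requirement quoted above; the function name is_valid_frame_target is my own, and comparing the reserved names case-insensitively is an assumption on my part.

```python
import string

# The four reserved frame target names from HTML 4.01.
RESERVED_TARGETS = {"_blank", "_self", "_parent", "_top"}


def is_valid_frame_target(name):
    """Return True if `name` satisfies the frame target name rule quoted above."""
    # Assumption: reserved names are matched without regard to case.
    if name.lower() in RESERVED_TARGETS:
        return True
    # Any other name must begin with an alphabetic character (a-zA-Z).
    return bool(name) and name[0] in string.ascii_letters


print(is_valid_frame_target("_blank"))   # True
print(is_valid_frame_target("_new"))     # False
print(is_valid_frame_target("content"))  # True
```

This is the sort of conformance requirement a DTD cannot express, which is why the validator lets _new through.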

XHTML vs HTML Markup

Update: One thing I forgot to mention earlier, and hence why I’m adding this update, is that technically, they should not be using XHTML since they are serving it as text/html, and doing so is considered harmful. If they’re going to use XHTML, they should be using content negotiation to deliver it correctly as application/xhtml+xml to decent UAs that support it, and text/html to IE and other legacy UAs that don’t. However, as many of you will know, this issue has been discussed recently. Some say it’s OK, others (like myself) think it should be avoided, and others insist that it should not be done. These people are categorised into either the Strict or Transitional Party.
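For what it's worth, the server-side decision can be sketched roughly as follows. This is a simplified illustration and not how iiNet (or anyone in particular) does it: the function name choose_content_type is hypothetical, it only checks whether the UA explicitly lists application/xhtml+xml with a non-zero q value, and a fuller implementation would also compare q values and send a Vary: Accept header.

```python
def choose_content_type(accept_header):
    """Pick a MIME type for an XHTML page from an HTTP Accept header."""
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0  # the default when no q parameter is given
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not accepted"
        if media_type == "application/xhtml+xml" and quality > 0:
            return "application/xhtml+xml"
    # Legacy UAs that never mention XHTML fall back to text/html.
    return "text/html"


print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
# application/xhtml+xml
print(choose_content_type("image/gif, image/jpeg, */*"))
# text/html
```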

If they’re not going to serve XHTML properly, then they may as well use HTML 4.01 Strict. I recommend they change to Strict, because Transitional actually triggers Almost Standards Mode in Mozilla. It is near enough to standards compliance, but it adds a small quirk that should not be there, and only exists to support the thousands of pages that depend on IE’s bugginess, yet still use a valid DOCTYPE.

Images and Alt Text

As mentioned, there is one image that I found without alt text, but other than that, they seem to have actually done a reasonable job. Ideally (in this case), the alt text should be exactly the same as the text in the images; however, they have used text with a similar meaning, and viewing without images doesn’t lose too much.

For example, one image they have at the moment, states christmas broadband specials. Free Setup + Modem. Save up to $199.95. However, they have set alt="christmas special - free setup/modem". It misses the price, but it still passes a message that is close enough, especially compared with the vast majority of sites that use very poor, or no alt text whatsoever.

They have made use of image replacement techniques for the navigation items, though not in the most accessible way. However, that’s a limitation of CSS and image replacement techniques in general. iiNet have done image replacement by setting the background image on the <a> element for each link, so that hover effects still work in IE, and setting the font to 1px and white to effectively hide the text from view. However, like many image replacement techniques, this is ineffective in the rare case that images are disabled but CSS is enabled.

For many of the headings, they have made use of sIFR, which was designed and developed to be accessible in the majority of cases. It has known limitations, but so far, it is the most accessible image replacement technique available.

JavaScript

The site does make some use of JavaScript, but does not require it. With JavaScript disabled, the only issue I found was that the what’s webmail and what’s toolbox links don’t work. They are JavaScript links with the purpose of showing additional information about the webmail and toolbox services. Ideally, with JavaScript disabled, that information should be visible by default, but there is not much of it, and it can be obtained by also disabling stylesheets. The links should also be added using JavaScript, so that useless links do not appear for users with JavaScript and/or CSS disabled, but again, it’s a minor issue.

So, in conclusion, I would like to congratulate iiNet for taking the initiative to move towards standards compliance, and for actually hiring a web developer that knows what they’re doing. Well Done!

Atom Feeds

I have finally got around to setting up the Atom feeds correctly for both of my blogs. I’ve gone for a while without any feeds available, which probably means I’ve lost out on quite a few subscribers. Well, for those of you, if any, who’ve been visiting the blogs regularly, I’m happy to say the feeds are now available, and set up for auto discovery. The feeds are just the Atom feeds that Blogger provides. As I said, the feeds are available for Lachy’s Log and Net Twits.

I was considering setting up an RSS feed as well, but having read the disastrous RSS 2.0 spec today, and Mark Pilgrim’s myth of RSS compatibility, I have decided that RSS has been created as a mess of proprietary extensions worse than the HTML extensions created during the browser wars. For starters, there’s not even an agreed upon expansion for the acronym. It started out as RDF Site Summary, which was later changed to Rich Site Summary, and finally to Really Simple Syndication. That was confusing enough, and now to see that not one version is really compatible with any other, I’d rather just steer clear of the whole mess if I can.