All posts by Lachlan Hunt

The Cascade: Part 1

One of the most important yet, arguably, one of the least understood aspects of CSS is the cascade. Sure, most people will know that CSS stands for Cascading Style Sheets, but do you know what cascading really means and how it affects the way style sheets work?

From experience, I’ve seen many people struggle with specificity, often wondering why a particular rule isn’t applying and typically trying to work around it by adding id and class selectors to increase the specificity. While specificity is important to understand, it’s really only one step of the cascade; so my aim today, and over the coming weeks, is to discuss cascading and inheritance, and to clearly explain the four steps of the cascade.

Style Sheet Origins

The cascade is designed around the combination of style sheets applying to a document, each coming from one of three origins, although authors typically only consider one: theirs! The three origins are the User Agent, User and Author. I’ll be talking more about the interaction of these in the next part, but for now, you just need to be aware that styles don’t just come from one place.

Step 1: Find Declarations

The first step in the cascade is to find all the style declarations that apply to each element, from all style sheets applied to the document, including style sheets from all three origins. This step involves collecting all the style declarations that apply for the target media type. This means that if the document is being rendered on screen, for example, any styles for print media, or anything else for that matter, have already been discarded.

At this point, it doesn’t matter whether some selectors have a higher specificity than others or whether two rules set different values for the same property, or nearly anything else. The only factor is whether the selector matches the element or not.

For example, given the following stylesheets:

/* User Agent Stylesheet */
body { padding: 1em; margin: 0; line-height: 1.2em; }
p { margin: 1em 0; }

/* User Stylesheet */
body { font: large/1.4 sans-serif; }
* { background: blue none !important; color: white !important; }

/* Author Stylesheet */
html, body { margin: 0; padding: 0; background: #CCC; color: black; }
p { line-height: 1.4; }
#content p { margin: .8em 0; }

And this sample document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title>The Cascade</title>
<div id="content">
  <p>Hello World!
</div>

Find all the declarations that will apply to the p element. Before you continue reading, take a look and write down all the styles that you think will apply. This should be a fairly easy exercise; there are no difficult selectors used at all, but even if there were, the concept would be exactly the same.

Assuming you attempted the exercise, you should have something like this:

p { margin: 1em 0; }
* { background: blue none !important; color: white !important; }
p { line-height: 1.4; }
#content p { margin: .8em 0; }
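If you’d like to check the result mechanically, here’s a rough Python sketch of the collection step. It’s a toy: the matcher is hard-coded to the handful of selectors used above, not a real CSS engine.

```python
# Toy sketch of cascade step 1: collect every declaration whose selector
# matches the <p> inside <div id="content">, from all three origins.
stylesheets = {
    "user agent": [("body", "padding: 1em; margin: 0; line-height: 1.2em"),
                   ("p", "margin: 1em 0")],
    "user": [("body", "font: large/1.4 sans-serif"),
             ("*", "background: blue none !important; color: white !important")],
    "author": [("html, body", "margin: 0; padding: 0; background: #CCC; color: black"),
               ("p", "line-height: 1.4"),
               ("#content p", "margin: .8em 0")],
}

def matches_p(selector):
    # Hard-coded: these are the selectors that match our sample <p>.
    return any(s.strip() in ("p", "*", "#content p")
               for s in selector.split(","))

found = [(origin, sel, decl)
         for origin, rules in stylesheets.items()
         for sel, decl in rules
         if matches_p(sel)]

for origin, sel, decl in found:
    print(f"{origin}: {sel} {{ {decl} }}")
```

Running it lists the same four declarations as above; notice that origin, specificity and !important play no role yet.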

Now, this same process is repeated for each and every element in the document, but I’ll leave that as an exercise for the reader. In the next article in this series, we’ll talk about sorting by origin and importance, and then, later on, specificity and the order specified.

Character References Explained

It seems that many people aren’t as well informed about character references as they should be, so I’m going to clearly explain everything you need to know (and some things you probably don’t) about them, primarily for HTML and XML. There are two types of character references: numeric character references and character entity references.

Numeric Character References

There are two forms of numeric character references: decimal and hexadecimal.

Syntax

Decimal references take the form &#nnnn;, where nnnn is the Unicode code point for the character in decimal [0-9]. Hexadecimal references take the form &#xhhhh;, where hhhh is the code point in hexadecimal notation [0-9a-fA-F].

For hexadecimal character references in HTML, the x is case insensitive but in XML it must be lower case. So, for example, &#XA0; is valid in HTML but invalid in XML.
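You can see this difference with Python’s standard library, which happens to include both an HTML-style reference decoder (html.unescape) and a non-validating XML parser:

```python
# The "x" in a hex reference is case-insensitive in HTML, lowercase-only in XML.
import html
import xml.etree.ElementTree as ET

# An HTML parser accepts &#XA0; (uppercase X) as U+00A0, no-break space.
assert html.unescape("&#XA0;") == "\u00a0"

# An XML parser rejects the same reference as not well-formed.
try:
    ET.fromstring("<p>&#XA0;</p>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)

# The lowercase form is fine in both.
assert ET.fromstring("<p>&#xA0;</p>").text == "\u00a0"
```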

In HTML, the reference end is either:

  • a reference close (REFC) delimiter (a semicolon by default)
  • a record end (RE) function code
  • a character, such as a space, which is not part of a valid entity name.

That basically means that the semi-colon may be omitted from the end in certain circumstances. However, remembering exactly when it can and can’t be omitted is difficult; and to avoid the chance of error as much as possible, it’s very good practice to always include it. In XML, the semi-colon is required. These same rules apply to character entity references as well.
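For instance, an HTML parser will still recognise &amp before a space, while an XML parser treats the same text as a well-formedness error. A quick check in Python:

```python
# HTML tolerates a missing semi-colon in some contexts; XML never does.
import html
import xml.etree.ElementTree as ET

# An HTML parser still recognises "&amp" before the space.
assert html.unescape("fish &amp chips") == "fish & chips"

# The same text is a well-formedness error in XML.
try:
    ET.fromstring("<p>fish &amp chips</p>")
    xml_ok = True
except ET.ParseError:
    xml_ok = False
assert not xml_ok
```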

Character Repertoire

It is most important to note that regardless of the document’s character encoding, numeric character references always refer to the code position in the Unicode character repertoire. In SGML terms, this is referred to as the Document Character Set (DCS) and, for HTML, is defined in the SGML Declaration. For both HTML and XML, the DCS is defined as ISO-10646 (or Unicode). You should note that there is a difference between the DCS and the file’s character encoding, which can be anything, including ISO-8859-1, UTF-8, UTF-16 or Shift_JIS; but the encoding does not affect numeric character references in any way: they always refer to a character within the DCS which is, as stated, defined as Unicode.

For example: if you want to include the right single quotation mark (’) using a numeric character reference, you need to know the Unicode code point. In this case, it is U+2019 or, in decimal, 8217. Thus, the hexadecimal and decimal character references will be, respectively, &#x2019; and &#8217;.
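You can verify that both forms name the same character:

```python
# Both reference forms resolve to U+2019, the right single quotation mark.
import html

assert 0x2019 == 8217  # same code point, two notations
assert html.unescape("&#x2019;") == html.unescape("&#8217;") == "\u2019"
```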

Common Mistakes

A very common mistake is to use the code point from the Windows-1252 character repertoire instead. You should be aware that this is not correct, even though browsers have been forced to support it simply because IE does. The problematic code points range from 128 to 159. In Windows-1252, the right single quotation mark falls within this range at code position 0x92, or 146 in decimal. However, you cannot include the character using this code point, as in &#x92; or &#146;, because doing so would actually refer to the Unicode code point, not Windows-1252, and characters in this range are defined as control characters.
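A couple of lines of Python make the distinction clear: 0x92 means U+2019 only when it is a byte interpreted under the Windows-1252 encoding, while code point 146 itself is a control character. (Python’s html.unescape, following the later HTML5 spec, remaps &#146; to U+2019 in exactly the way browsers were forced to.)

```python
import html
import unicodedata

# Byte 0x92 means U+2019 only under the Windows-1252 *encoding*...
assert b"\x92".decode("windows-1252") == "\u2019"

# ...but code point 146 itself is the C1 control character U+0092.
assert unicodedata.category(chr(146)) == "Cc"

# HTML5 later standardised the browser error-recovery:
# &#146; is remapped to U+2019 (though it is still a parse error).
assert html.unescape("&#146;") == "\u2019"
```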

These control characters are defined as UNUSED in the SGML Declaration for HTML. In confusing and obscure SGML terms, that makes them non-SGML characters. According to the SGML handbook in section 13.1.1 that simply means that no meaning is assigned to that character, but explicitly states in section 9.2 that a non-SGML character can be entered as a data character within an SGML entity by using a character reference.

Validation Issues

Strictly speaking, although these characters cannot be included as data characters within an HTML document, it is not invalid to refer to these characters with character references. The problem is a combination of the fact that the meaning of a non-SGML character is nothing short of obscure and, in Unicode, these characters are non-printable control characters.

Because of this, the validator will only issue a warning; but, although its use is still technically valid, it should be treated as an error as it almost certainly does not mean what the author intended — in fact, its meaning is undefined. To clarify the validation issues a little more, compare the results of using a non-SGML character directly within the markup and a reference to a non-SGML character. The first will fail validation with an error; while the second will pass, but with a warning, even though they’re essentially using the same character.

You should note that this is not the case for XML (including XHTML). Technically, this range (from 128 to 159) is perfectly valid according to the production for Char in XML, but the references still refer to Unicode control characters; their meaning is undefined in the context of the document and thus they should not be used. Although the W3C Validator will issue the same error and warning for equivalent XHTML documents, this is a symptom of its origin as an SGML validator patched to work in a sort-of XML mode. However, validating with a true XML validator (like that provided by Page Valet) will not result in any errors or warnings at all.

It is important to realise that, in XML, using or referring to a character that does not match the production for Char violates a well-formedness constraint. For example, using a control character in the range from 0 to 31 (except for tab, newline and carriage return) either directly or with a numeric character reference results in a well-formedness error.
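A non-validating parser demonstrates this readily (using Python’s expat-based xml.etree here, though any conforming XML parser behaves the same):

```python
# Control characters outside tab/newline/CR break well-formedness,
# whether included literally or via a numeric character reference.
import xml.etree.ElementTree as ET

# Tab (U+0009) is within the Char production, so this is fine.
assert ET.fromstring("<p>&#x9;</p>").text == "\t"

# U+0001 is outside Char: both the reference and the raw byte fail.
for bad in ("<p>&#1;</p>", "<p>\x01</p>"):
    try:
        ET.fromstring(bad)
        raise AssertionError("should not be well-formed")
    except ET.ParseError:
        pass  # expected: well-formedness error
```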

Character Entity References

Character entity references use symbolic names instead of numbers, and take the form &name;. All entity references are case sensitive. So, for example, &aring; and &Aring; refer to two separate characters in HTML: å and Å, respectively. The rules for the reference end are the same as for numeric character references (discussed above).
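The case sensitivity is easy to confirm:

```python
# Entity names are case sensitive: &aring; and &Aring; are different characters.
import html

assert html.unescape("&aring;") == "\u00e5"  # å
assert html.unescape("&Aring;") == "\u00c5"  # Å
```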

Some of the well known entity references in HTML include &amp;, &lt;, &gt; and &quot;. Interestingly, &quot; was actually removed from HTML 3.2, but this was later realised to be a mistake and added back again in HTML 4.

Predefined Entity References in XML

In XML, those are 4 of the 5 predefined entity references that can be used in any XML document, without needing to be defined in a DTD. The 5th predefined entity reference in XML is &apos;, but the reason I mention it separately from the others is that it is not defined in HTML and, as a result, it is also not supported in IE for HTML. However, it is rare that one actually needs to use it, as it is only required within an attribute value delimited by single quotes (‘), rather than the more conventional double quotes (“). In such cases, a numeric character reference can always be used in its place.
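All 5 predefined entities work in any XML document with no DTD in sight, which a non-validating parser confirms:

```python
# The 5 predefined XML entities need no DTD at all.
import xml.etree.ElementTree as ET

text = ET.fromstring("<p>&amp;&lt;&gt;&quot;&apos;</p>").text
assert text == "&<>\"'"
```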

External Entity References

HTML 4, XHTML 1.x and MathML define many other character entity references in their respective DTDs. These are called external entity references. In HTML, they are divided into three groups: ISO-8859-1 characters; symbols, mathematical symbols and Greek letters; and markup-significant and internationalization characters. Digital Media Minute have provided a useful character entity reference chart containing all of these. If you’re interested in the MathML entities, see chapter 6 of MathML 2.0.

Because these are defined in the DTD, technically none of them can be used in an HTML document without a DOCTYPE declaration referencing an appropriate HTML DTD; although since browsers don’t read the DTD anyway, they will support them regardless. However, in XHTML and MathML (served with an XML MIME type), the DOCTYPE is required for practical reasons to use any entity other than the 5 predefined ones.

For example, &nbsp; and &rsquo; are defined in the XHTML DTD; they are not predefined in XML and so require the DTD to be used. Without it, their use violates a well-formedness constraint. It should also be noted that using externally defined entities is unsafe in XML because it requires a validating XML parser to read the DTD. The Mozilla Web Developer FAQ notes:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of externally defined character entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support externally defined character entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support externally defined character entities (other than the five pre-defined ones), the use of externally defined character entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.).

As noted, the alternative is to just use a numeric character reference instead, but the best option is to just use a Unicode encoding, such as UTF-8 or UTF-16, and enter the real character (see my Guide to Unicode for more information). Arguably, if you’re using a Unicode encoding, one of the only times when it is useful to use a character reference instead of the real character is for non-printable characters, such as non-breaking space (&nbsp; or, preferably, &#xA0;), Em-space, En-space, zero-width characters, etc. The main reason for that is to be able to clearly identify them when you’re reading/editing the source code.
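To see the difference concretely, here is the behaviour of a typical non-validating XML parser (Python’s expat-based xml.etree), which never reads the DTD:

```python
# Without a DTD, &nbsp; is an undefined entity in XML;
# the numeric reference (or the raw character) always works.
import xml.etree.ElementTree as ET

try:
    ET.fromstring("<p>&nbsp;</p>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False
assert not nbsp_ok  # undefined entity: not well-formed here

assert ET.fromstring("<p>&#xA0;</p>").text == "\u00a0"
```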

Summary

Numeric character references, both decimal and hexadecimal, can be safely used in (X)HTML and XML, but you need to be careful that you’re referencing the character’s code point from the Unicode character repertoire, not Windows-1252 (especially in the range from 128 to 159).

Character entity references can be used in HTML and in XML; but in XML, entities other than the 5 predefined ones need to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.