Using character escapes in markup and CSS

What kinds of character escape can be used in markup?

You can use a character escape to represent any Unicode character in XML or (X)HTML using only ASCII characters.

Different specifications give different names to these constructs. For example, the HTML5 specification calls character entity references named character references. We have chosen to use names for this article that we hope are recognizably different and clear in meaning for the reader, whatever variations they have used so far.

NCRs (numeric character references) and character entity references are types of character escape used in markup. For example, the following are different ways of representing the character U+00A0 NO-BREAK SPACE.

(The NO-BREAK SPACE character looks like a space but prevents a line wrap between the characters on either side. In French it is commonly used with punctuation such as colons and exclamation marks, which are preceded by a space but should not appear at the beginning of a line during text wrap.)

&#xA0; : A hexadecimal NCR. All NCRs begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the code point value of a Unicode character. The hex number is not case-sensitive.
Example: Vive la France&#xA0;!
&#160; : A decimal NCR. This uses a decimal number to represent the same Unicode code point.
Example: Vive la France&#160;!
&nbsp; : A character entity reference. This is a very different type of escape. Character entity references are defined in the markup language definition. This means, for example, that for HTML only a specific range of characters (defined by the HTML specification) can be represented as character entity references (and that includes only a small subset of the Unicode range). Note that the entity name is case sensitive: in HTML, &Aacute; represents the uppercase letter Á, whereas &aacute; represents the lowercase á.
Example: Vive la France&nbsp;!
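
As a rough check of the equivalence described above, the following sketch decodes each form and compares the results. It uses Python and its standard html.unescape function purely for illustration; the choice of language is an assumption of this example, not something the article prescribes.

```python
import html

# Escapes for U+00A0 NO-BREAK SPACE: hexadecimal NCR (in both letter
# cases), decimal NCR, and the character entity reference.
escapes = ["&#xA0;", "&#xa0;", "&#160;", "&nbsp;"]

decoded = [html.unescape(e) for e in escapes]

# Every form decodes to the same single character.
all_same = all(ch == "\u00a0" for ch in decoded)

# Entity names, unlike hex digits, are case sensitive:
# these two references name two different letters.
upper = html.unescape("&Aacute;")
lower = html.unescape("&aacute;")

print(all_same, upper, lower)
```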

Some browsers allow you to omit the semicolon at the end of a numeric character reference, but this is not recommended, since it may lead to interoperability problems. Using the semicolon also avoids the potential problem of the end of the escape becoming undetectable when the escape is embedded in text.

One point worth special note is that values of numeric character references (such as &#x20AC; or &#8364; for the euro sign €) are interpreted as Unicode characters – no matter what encoding you use for your document. It is a common error for people working on content encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (in hexadecimal) on the Windows 1252 code page. Using &#x80; in HTML should actually produce a control character, since the escape would be expanded as the character at position 80 in the Unicode repertoire. (In fact, browsers tend to silently correct that error. See the test pages.)
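
The difference between a code page position and a Unicode code point can be illustrated with a short Python sketch. Python is an assumption of this example. Its html.unescape function applies the HTML parsing rules, so it shows the same silent correction that browsers tend to make.

```python
import html
import unicodedata

# NCR values are Unicode code points, whatever the document encoding.
euro = html.unescape("&#x20AC;")      # hexadecimal NCR for U+20AC EURO SIGN
same_euro = html.unescape("&#8364;")  # decimal NCR for the same code point

# In Windows code page 1252 the euro sign is stored as the single byte 0x80 ...
cp1252_byte = "\u20ac".encode("cp1252")

# ... but U+0080 in Unicode is a control character, not the euro sign.
control = chr(0x80)
control_category = unicodedata.category(control)  # "Cc" means control

# html.unescape remaps this erroneous reference to the euro sign,
# much as browsers do.
corrected = html.unescape("&#x80;")

print(euro, same_euro, cp1252_byte, control_category, corrected)
```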

CSS escapes

CSS represents escaped characters in a different way. To represent a character, start with a backslash followed by the hexadecimal number that represents the character's Unicode code point value.

If there is a following character that is not in the range A–F, a–f or 0–9, that is all you need. The following example represents the word émotion.

Example: .\E9motion { ... }

If, on the other hand, the next character is one that can be used in hexadecimal numbers, it won't be clear where the end of the number is. In these cases there are two options. The first is to use a space after the escape. This space is part of the escape syntax, and does not remain after the character escape is parsed. The following example shows how you could represent the word édition.

Example: .\E9 dition { ... }

Alternatively, you can use a 6-digit hexadecimal number, with or without a space. Here is an alternative way of writing édition.

Example: .\0000E9dition { ... }

Because any white-space following the hexadecimal number is swallowed up as part of the escape, if you actually want a space to appear after the escaped character you will need to add two spaces (after a hexadecimal number of any length).
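
The rules above can be sketched as a small decoder. This is a simplified illustration in Python, not a full CSS tokenizer: the helper name css_unescape is hypothetical, only hexadecimal escapes are handled, a CR LF pair is not treated as a single white-space unit, and values beyond the Unicode range are not checked.

```python
import re

# A backslash, 1 to 6 hexadecimal digits, then at most one white-space
# character, which is consumed as part of the escape.
_CSS_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\r\f]?")


def css_unescape(text: str) -> str:
    """Decode hexadecimal CSS escapes in text (hypothetical helper)."""
    return _CSS_HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The next character is not a hex digit, so nothing more is needed.
emotion = css_unescape(r"\E9motion")

# "d" is a hex digit, so a space or a 6-digit number ends the escape.
edition_space = css_unescape(r"\E9 dition")
edition_six = css_unescape(r"\0000E9dition")

# Without either, the "d" is absorbed and the escape names U+0E9D instead.
wrong = css_unescape(r"\E9dition")

# Two spaces are needed for one real space to survive.
with_space = css_unescape(r"\E9  x")
```

Here emotion is "émotion", both edition_space and edition_six are "édition", and with_space is "é x", while wrong begins with a different character altogether.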

The backslash can also be used in CSS before a syntax character to prevent it being read as part of the code. For more information about CSS escapes, see the CSS 2.1 specification.

When not to use escapes

It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.

Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

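Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.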
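
To get a feel for the size increase, the following Python sketch converts the Czech passage to NCRs and compares lengths. Python and the variable names are assumptions of this example, and the built-in xmlcharrefreplace error handler emits decimal rather than hexadecimal NCRs.

```python
import html

passage = (
    "Jako efektivnější se nám jeví pořádání tzv. Road Show "
    "prostřednictvím našich autorizovaných dealerů v Čechách a na "
    "Moravě, které proběhnou v průběhu září a října."
)

# Replace every non-ASCII character with a decimal NCR.
escaped = passage.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Ratio of escaped length to original length, in characters.
growth = len(escaped) / len(passage)

# Decoding the NCRs recovers the original text.
round_trip = html.unescape(escaped)

print(escaped)
print(f"{len(passage)} characters become {len(escaped)} ({growth:.2f}x)")
```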

As we said before, use characters rather than escapes for ordinary text.

Use in XHTML. Using character entity references in a document that is parsed as XML may become problematic if the entities are defined externally to your document and the tools that process the XML do not read the external files. In such cases the entity references will not be replaced by characters. For this reason, if you need to use escapes, it may be safer to use numeric character references, or define the character entities you need inside the document.

If you use HTML-defined character entity references (such as &aacute;) to represent characters in XHTML, you should take care any time your content is processed using XML parsers or other tools.

When to use escapes

Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.

< (&lt;)
> (&gt;)
& (&amp;)

You may also want to represent the double-quote (") as &quot; and the single quote (') as &apos; – particularly in attribute text when you need to use the same type of quotes as those that surround the attribute value. Note, however, that, although it is part of the XML language, &apos; is not defined in HTML 4.01 and some browsers do not support &apos; in HTML. For this reason the XHTML specification recommends instead the use of &#39; if text may be passed to an HTML browser.
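
In practice these syntax characters are usually escaped by a library rather than by hand. As a minimal sketch, assuming Python, the standard html.escape function handles the three characters above and, when asked, both kinds of quote. It writes the single quote as a numeric reference rather than as the named XML entity, in line with the advice above.

```python
import html

raw = 'if (a < b && c > d) say "it\'s fine"'

# For element content: escape only <, > and &.
in_content = html.escape(raw, quote=False)

# For attribute values: also escape double and single quotes.
in_attribute = html.escape(raw, quote=True)

# Decoding either form recovers the original text.
restored = html.unescape(in_attribute)

print(in_content)
print(in_attribute)
```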

Encoding gaps. Escapes can be useful to represent characters not supported by the encoding you choose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).

Input problems. If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters – it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters. Alternatively, if you only need the occasional character, use a character map tool or character picker.

Invisible or ambiguous characters. A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.

One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
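
A source file that already contains such characters can be made easier to review by swapping them for escapes. The following Python sketch does this for the two characters discussed above. The helper name reveal_invisibles and the choice of entity names over NCRs are assumptions of this example.

```python
import html

# Invisible or ambiguous characters mapped to a visible escape.
VISIBLE_FORMS = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}


def reveal_invisibles(text: str) -> str:
    """Replace the listed characters with escapes (hypothetical helper)."""
    return text.translate(str.maketrans(VISIBLE_FORMS))


source = "Vive la France\u00a0!"
revealed = reveal_invisibles(source)

# The escapes decode back to the original characters.
restored = html.unescape(revealed)

print(revealed)
```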

Use of escapes in style attributes

It is best to choose the right encoding so that you can just use characters in CSS declarations. This section addresses what should be a very rare circumstance where you may have decided to use escapes.

It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.

A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

<span style="font-family: L\FC beck">...</span>

than

<span style="font-family: L&#xFC;beck">...</span>
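
If style values are generated by a script, the CSS form can be produced directly. The following is a minimal Python sketch under stated assumptions: the helper name css_escape_non_ascii is hypothetical, it escapes only non-ASCII characters, and it does not handle backslashes or CSS syntax characters already present in the input. It always adds the terminating space, so a following hexadecimal digit cannot be absorbed into the escape and a following real space is preserved.

```python
def css_escape_non_ascii(value: str) -> str:
    """Write non-ASCII characters as CSS escapes (hypothetical helper)."""
    out = []
    for ch in value:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Backslash, hex code point, then the terminating space.
            out.append("\\{:X} ".format(ord(ch)))
    return "".join(out)


family = css_escape_non_ascii("Lübeck")
attribute = 'style="font-family: {}"'.format(family)

# A real space after the escaped character ends up as two spaces.
spaced = css_escape_non_ascii("é x")

print(attribute)
```

The resulting value can be moved unchanged from a style attribute into a style element or an external style sheet.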