W3C I18n FAQ: Using character entities and NCRs

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), and anyone who needs guidance on how and when to use alternatives to actual characters in a document.

What are character entity and NCR escapes, and when should I use them?

What are entities and NCRs?

You can use a character escape to represent any Unicode character in XML or (X)HTML using only ASCII characters. NCRs (Numeric Character References) and character entities are types of character escape. For example, the following are different ways of representing the character U+00A0 NO-BREAK SPACE:

The NO-BREAK SPACE character looks like a space but prevents a line wrap between the characters on either side. It is commonly used with punctuation such as colons and exclamation marks in French, which are preceded by a space but should not appear at the beginning of a line during text wrap.

&#xA0; : A hexadecimal NCR. All NCRs begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the scalar value of a Unicode character, i.e. the number assigned in the Unicode code charts. The hex number is not case-sensitive.
Example: Vive la France&#xA0;!
&#160; : A decimal NCR. This uses a decimal number to represent the same scalar value.
Example: Vive la France&#160;!
&nbsp; : A character entity. This is a very different type of escape. Character entities are defined in the markup language definition. This means, for example, that for HTML only a specific range of characters (defined by the HTML specification) can be represented as entities (and that includes only a small subset of the Unicode range). Note that the entity name is case sensitive: in HTML, &Aacute; represents the uppercase letter Á, whereas &aacute; represents the lowercase á.
Example: Vive la France&nbsp;!
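
As a rough check, the three forms (hexadecimal NCR, decimal NCR and character entity) can be decoded with a small Python sketch. It assumes Python's standard html module, whose html.unescape function follows HTML5 parsing rules; the variable names are only illustrative.

```python
import html
import unicodedata

# Three spellings of the same character: hex NCR, decimal NCR, character entity.
escapes = ["&#xA0;", "&#160;", "&nbsp;"]

for escape in escapes:
    char = html.unescape(escape)
    # Each line reports the code point and Unicode name the escape expands to.
    print(f"{escape:8} -> U+{ord(char):04X} {unicodedata.name(char)}")

# The hexadecimal digits are not case-sensitive.
assert html.unescape("&#xa0;") == html.unescape("&#xA0;") == "\u00a0"

# Entity names are case-sensitive: these are two different letters.
assert html.unescape("&Aacute;") != html.unescape("&aacute;")
```

All three escapes should report the same code point, U+00A0.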

One point worth special note is that the values of numeric character references (such as &#8364; or &#x20AC; for the euro sign €) are interpreted as Unicode code points, no matter what encoding you use for your document. It is a common error for people working on content encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) in the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at position 80 (hexadecimal) in the Unicode repertoire.
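
The difference between a code page position and a Unicode code point can be illustrated with a short Python sketch. It uses only built-in functions and the standard unicodedata module; chr() stands in here for "look up this number in the Unicode repertoire", which is an illustrative simplification rather than a model of any particular parser.

```python
import unicodedata

euro = "\u20ac"  # the euro sign

# The number that belongs in an NCR is the Unicode code point.
decimal_ncr = f"&#{ord(euro)};"
hex_ncr = f"&#x{ord(euro):X};"
print(decimal_ncr, hex_ncr)

# In Windows code page 1252 the euro is stored as the single byte 0x80 ...
cp1252_byte = euro.encode("cp1252")
print(cp1252_byte)

# ... but code point U+0080 in Unicode is a control character, not the euro.
wrong_char = chr(0x80)
print(unicodedata.category(wrong_char))  # "Cc" is the control-character category
```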

When not to use escapes

It is almost always preferable to use an encoding that allows you to represent the characters in their normal form, rather than using character entities or NCRs.

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.

Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

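Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.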
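
The size cost of escaping everything can be measured with a small Python sketch. The helper to_hex_ncrs is a hypothetical name introduced for this illustration, and the sample is just the opening words of the Czech passage.

```python
def to_hex_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


# Opening words of the Czech passage quoted above.
passage = "Jako efektivnější se nám jeví pořádání"

escaped = to_hex_ncrs(passage)
print(escaped)

# Compare the storage needed for real characters in UTF-8 with the escaped form.
print(len(passage.encode("utf-8")), "bytes as UTF-8")
print(len(escaped.encode("ascii")), "bytes with NCRs")
```

The escaped form is pure ASCII, but it is both longer and much harder to read than the UTF-8 text.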

Using character entities in XML may become problematic if the entities are defined externally to your document and the tools that process the XML do not read the external files. In such cases the character entities will not be replaced by characters. For this reason, if you need to use escapes, it may be safer to use numeric character references, or define the character entities you need inside the document. If you use HTML-defined character entities (such as &aacute;) to represent characters in XHTML, you should take care any time your content is processed using XML tools, or converted to XML.
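
The following Python sketch shows the problem with one XML tool, the standard library's xml.etree.ElementTree parser, which does not load external entity definitions by default. The helper try_parse is a hypothetical name used only for this example.

```python
import xml.etree.ElementTree as ET


def try_parse(markup: str):
    """Return the root element's text, or None if the XML cannot be parsed."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


# An HTML-defined entity with no definition available: parsing fails.
print(try_parse("<p>&aacute;</p>"))

# A numeric character reference needs no definition.
print(try_parse("<p>&#xE1;</p>"))

# Defining the entity inside the document (internal subset) also works.
doc = '<!DOCTYPE p [<!ENTITY aacute "&#xE1;">]><p>&aacute;</p>'
print(try_parse(doc))
```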

When to use escapes

Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.

< (&lt;)
> (&gt;)
& (&amp;)

You may also want to represent the double-quote (") as &quot; and the single quote (') as &apos; - particularly in attribute text when you need to use the same type of quotes as those that surround the attribute value. Note, however, that, although it is part of the XML language, &apos; is not defined in HTML. For this reason the XHTML specification recommends instead the use of &#39; if text may be passed to an HTML user agent.
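
In generated pages, syntax characters are usually escaped by a library function rather than by hand. A minimal sketch, assuming Python's standard html.escape function; the sample strings are invented for illustration. Note that html.escape writes the single quote as a numeric reference (&#x27;) rather than &apos;.

```python
import html

# Element content: escape <, > and & so they are not read as markup.
text = "if a < b && b > c"
escaped_text = html.escape(text, quote=False)
print(escaped_text)

# Attribute values: the default quote=True also escapes " and '.
attr_value = 'He said "it\'s fine"'
escaped_attr = html.escape(attr_value)
print(f'<p title="{escaped_attr}">note</p>')
```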

Encoding gaps. Escapes can be useful to represent characters not supported by the encoding you choose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
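
As an illustration of filling an encoding gap, Python's built-in xmlcharrefreplace error handler writes decimal NCRs for any character the target encoding cannot represent. The sample text is invented for this sketch.

```python
text = "caf\u00e9 \u4e2d\u6587"  # "café" followed by two Chinese characters

# ISO Latin 1 can store é directly, but not the Chinese characters,
# so only those are replaced by decimal NCRs.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1_bytes)

# UTF-8 covers every character, so no escapes are needed at all.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8") == text)
```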

Input problems. If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters. Alternatively, if you only need the occasional character, use a character map tool or character picker.

Invisible or ambiguous characters. A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.

One example would be Unicode character U+200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is U+00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
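
A source-cleanup script might make such characters visible along the following lines. This is only a sketch: VISIBLE_ESCAPES and reveal_invisibles are hypothetical names, the mapping covers just the two characters discussed here, and the function does not escape syntax characters such as & or <.

```python
# Characters that are invisible or ambiguous in source, with the escape to show instead.
VISIBLE_ESCAPES = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}


def reveal_invisibles(text: str) -> str:
    """Replace the listed characters with escapes; leave everything else as-is."""
    return "".join(VISIBLE_ESCAPES.get(ch, ch) for ch in text)


print(reveal_invisibles("Vive la France\u00a0!"))
```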

Changing to UTF-8 means re-saving your file. Using an encoding such as UTF-8 means that you can avoid the need for most escapes and just work with characters. To change the encoding of your document, however, it is not enough to just change the encoding declaration at the top of the page or on the server. You need to re-save your document in that encoding. For help understanding how to do that with your application read Setting encoding in web authoring applications.

Hex vs. decimal. Typically when the Unicode Standard refers to or lists characters, it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is recommended, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes, i.e. á could be represented as &#xE1; rather than &#x00E1;.

Although most common browsers now recognize hexadecimal NCRs, some very old browsers, such as Netscape 4, do not. Bear this in mind if you have to deal with such browsers.
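
Going from a code point in the U+ notation to either kind of NCR is a matter of number formatting. A small Python sketch, again assuming html.unescape from the standard library for the round trip:

```python
import html

code_point = 0x00E1  # U+00E1 LATIN SMALL LETTER A WITH ACUTE

hex_ncr = f"&#x{code_point:X};"  # hexadecimal, no leading zeros
dec_ncr = f"&#{code_point};"     # decimal
print(hex_ncr, dec_ncr)

# Leading zeros and letter case make no difference to the result.
for escape in (hex_ncr, "&#x00E1;", "&#xe1;", dec_ncr):
    assert html.unescape(escape) == "\u00e1"
```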

Supplementary characters. Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect - you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
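
The relationship between the scalar value and the UTF-16 surrogates can be checked with a short Python sketch for U+233B4, the character in the example above. It assumes html.unescape from the standard library, which follows HTML5 rules and does not reassemble surrogate escapes into a character.

```python
import html

char = "\U000233B4"  # a supplementary character (outside the BMP)

# Correct: one escape carrying the single scalar value.
ncr = f"&#x{ord(char):X};"
print(ncr)

# For comparison, the two 16-bit values UTF-16 uses to store the same character.
utf16 = char.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"{high:04X} {low:04X}")

# The scalar-value escape round-trips; the pair of surrogate escapes does not.
assert html.unescape(ncr) == char
assert html.unescape(f"&#x{high:X};&#x{low:X};") != char
```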

Single ampersands. Although HTML user agents have tended to turn a blind eye, you should never have a single, unescaped ampersand (&) in your document. You should pay particular attention to URIs that include parameters. For example, your document should contain http://example.org/my-script.php?class=guest&amp;name=user, rather than http://example.org/my-script.php?class=guest&name=user.
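
When links are generated by a script, the ampersand can be escaped at the point where the URI is written into the markup. A minimal sketch, assuming Python's html.escape; the link text is invented for illustration.

```python
import html

url = "http://example.org/my-script.php?class=guest&name=user"

# Escape the URI when writing it into an attribute value.
link = f'<a href="{html.escape(url)}">guest page</a>'
print(link)

# A parser that expands the escape recovers the original URI.
assert html.unescape(html.escape(url)) == url
```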

Further reading:

Setting encoding in web authoring applications http://www.w3.org/International/questions/qa-setting-encoding-in-applications
Changing (X)HTML page encoding to UTF-8 http://www.w3.org/International/questions/qa-changing-encoding.en.php
List of character entities supported by HTML 4 (and therefore XHTML 1.0) http://www.w3.org/TR/html401/sgml/entities.html
Tutorial: Character sets & encodings in XHTML, HTML and CSS http://www.w3.org/International/tutorials/tutorial-char-enc/
Other W3C articles related to character encodings http://www.w3.org/International/resource-index#charset

Specification detail:

Extensible Markup Language (XML) 1.0 (Third Edition), 2.4 Character Data and Markup http://www.w3.org/TR/2004/REC-xml-20040204/#syntax
Extensible Markup Language (XML) 1.0 (Third Edition), 4.1 Character and Entity References http://www.w3.org/TR/2004/REC-xml-20040204/#dt-charref
HTML 4.01 Specification, 5.3 Character references http://www.w3.org/TR/html401/charset.html#h-5.3
HTML 4.01 Specification, 24 Character entity references in HTML 4 http://www.w3.org/TR/html401/sgml/entities.html
XHTML™ 1.0 Specification, Appendix C. HTML Compatibility Guidelines http://www.w3.org/TR/2002/REC-xhtml1-20020801/#guidelines
Character Model for the World Wide Web 1.0: Fundamentals, 4.6 Character Escaping http://www.w3.org/TR/charmod/#sec-Escaping