Character entity references in HTML 4
Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), and anyone who needs guidance on how and when to use alternatives to actual characters in a document.
How can I use character escapes in markup and CSS, and when should I use or not use them?
You can use a character escape to represent any Unicode character in XML or (X)HTML using only ASCII characters.
NCRs (numeric character references) and character entity references are types of character escape used in markup. For example, the following are different ways of representing the character U+00A0 NO-BREAK SPACE.
(The NO-BREAK SPACE character looks like a space but prevents a line wrap between the characters on either side. In French it is commonly used with punctuation such as colons and exclamation marks, which are preceded by a space but should not appear at the beginning of a line during text wrap.)
A numeric character reference in hexadecimal (&#xA0;):
<p>Vive la France&#xA0;!</p>
A numeric character reference in decimal (&#160;):
<p>Vive la France&#160;!</p>
A character entity reference (&nbsp;). Note that entity names are case-sensitive: &Aacute; represents the uppercase letter Á, whereas &aacute; represents the lowercase á.
<p>Vive la France&nbsp;!</p>
One point worth special note is that values of numeric character references (such as &#x20AC; or &#8364; for the euro sign €) are interpreted as Unicode characters – no matter what encoding you use for your document. It is a common error for people working on content encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (in hexadecimal) on the Windows 1252 code page. Using &#x80; in HTML should actually produce a control character, since the escape would be expanded as the character at position 80 (hexadecimal) in the Unicode repertoire. (In fact, browsers tend to silently correct that error. See the test pages.)
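As a rough illustration of the point above, the following Python sketch (not part of the original article) contrasts the byte that stores the euro sign in Windows-1252 with the Unicode code point that an NCR has to use. It assumes Python's standard `html` module, whose `unescape` function follows HTML5 parsing rules and therefore applies the same silent correction that browsers tend to make.

```python
import html
import unicodedata

# In Windows-1252 the euro sign is stored as the single byte 0x80 ...
euro_from_cp1252 = b"\x80".decode("cp1252")

# ... but its Unicode code point, which is what an NCR must use, is U+20AC.
euro_code_point = ord(euro_from_cp1252)

# U+0080 itself is a control character (general category "Cc").
category_of_u0080 = unicodedata.category("\u0080")

# Correct NCRs for the euro sign, in hexadecimal and in decimal.
correct_hex = html.unescape("&#x20AC;")
correct_dec = html.unescape("&#8364;")

# HTML5-style parsing remaps the erroneous &#x80; to the euro sign.
lenient = html.unescape("&#x80;")

print(hex(euro_code_point), category_of_u0080)  # 0x20ac Cc
print(correct_hex == correct_dec == lenient == euro_from_cp1252)  # True
```

The lenient result only shows that parsers recover from the mistake. A strict XML parser is under no obligation to do so, which is one more reason to write the real code point.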
CSS represents escaped characters in a different way. To represent a character, start with a backslash followed by the hexadecimal number that represents the character's Unicode code point value.
If there is a following character that is not in the range A–F, a–f or 0–9, that is all you need. The following example represents the word émotion.
.\E9motion { ... }
If, on the other hand, the next character is one that can be used in hexadecimal numbers, it won't be clear where the end of the number is. In these cases there are two options. The first is to use a space after the escape. This space is part of the escape syntax, and does not remain after the character escape is parsed. The following example shows how you could represent the word édition.
.\E9 dition { ... }
Alternatively, you can use a 6-digit hexadecimal number, with or without a space. Here is an alternative way of writing édition.
.\0000E9dition { ... }
Because any white-space following the hexadecimal number is swallowed up as part of the escape, if you actually want a space to appear after the escaped character you will need to add two spaces (after a hexadecimal number of any length).
The backslash can also be used in CSS before a syntax character to prevent it being read as part of the code. For more information about CSS escapes, see the CSS 2.1 specification.
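The hexadecimal escape rules described above can be sketched in a few lines of Python. The function name `css_escape` and its `fixed_width` parameter are hypothetical, chosen only for this illustration. The sketch escapes non-ASCII characters only and does not handle ASCII syntax characters, so it is not a complete CSS serializer.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
CSS_WHITESPACE = set(" \t\r\n\f")


def css_escape(text: str, fixed_width: bool = False) -> str:
    """Replace each non-ASCII character with a CSS hexadecimal escape."""
    out = []
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            out.append(ch)
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if fixed_width:
            # Six digits: the end of the number is unambiguous.
            out.append("\\" + format(ord(ch), "06X"))
            needs_space = nxt in CSS_WHITESPACE
        else:
            out.append("\\" + format(ord(ch), "X"))
            # A following hex digit would be read as part of the number.
            needs_space = nxt in HEX_DIGITS or nxt in CSS_WHITESPACE
        if needs_space:
            # This space is swallowed by the escape, so real white space
            # in the text is preserved by emitting one extra space.
            out.append(" ")
    return "".join(out)


print(css_escape("émotion"))                    # \E9motion
print(css_escape("édition"))                    # \E9 dition
print(css_escape("édition", fixed_width=True))  # \0000E9dition
```

For a string such as "é x", the short form yields a backslash escape followed by two spaces, matching the rule that an intended space after an escape must be doubled.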
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.
Take for example the following passage in Czech.
Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.
If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.
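Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.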
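To get a feel for the size cost, the following Python sketch escapes the Czech passage and compares lengths. It uses the standard `xmlcharrefreplace` error handler, which emits decimal rather than hexadecimal NCRs, so the output differs in notation from a hand-written hexadecimal version but makes the same point.

```python
import html

passage = (
    "Jako efektivnější se nám jeví pořádání tzv. Road Show "
    "prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, "
    "které proběhnou v průběhu září a října."
)

# Every character that ASCII cannot represent becomes a decimal NCR.
escaped = passage.encode("ascii", "xmlcharrefreplace").decode("ascii")

non_ascii = sum(1 for ch in passage if ord(ch) > 127)

print("characters in passage:", len(passage))
print("non-ASCII characters: ", non_ascii)
print("characters once escaped:", len(escaped))
print(escaped)
```

Unescaping the result with `html.unescape(escaped)` returns the original passage, which confirms that nothing is lost except readability and compactness.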
As we said before, use characters rather than escapes for ordinary text.
Use in XHTML. Using character entity references in a document that is parsed as XML may become problematic if the entities are defined externally to your document and the tools that process the XML do not read the external files. In such cases the entity references will not be replaced by characters. For this reason, if you need to use escapes, it may be safer to use numeric character references, or define the character entities you need inside the document.
If you use HTML-defined character entity references (such as &aacute;) to represent characters in XHTML, you should take care any time your content is processed using XML parsers or other tools.
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (&lt;)
> (&gt;)
& (&amp;)
You may also want to represent the double-quote (") as &quot; and the single quote (') as &apos; – particularly in attribute text when you need to use the same type of quotes as those that surround the attribute value. Note, however, that although &apos; is part of the XML language, it is not defined in HTML 4.01, and some browsers do not support &apos; in HTML. For this reason the XHTML specification recommends instead the use of &#39; if text may be passed to an HTML browser.
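When markup is generated by a script, the syntax characters are usually escaped by a library call rather than by hand. The sketch below uses Python's standard `html.escape` as one example. The sample string is made up for the illustration. Note that this function writes the single quote as the NCR &#x27; rather than as &apos;, which sidesteps the HTML 4.01 support problem mentioned above.

```python
import html

text = 'Tom & Jerry say "1 < 2" and it\'s > 0'

# Element content: only &, < and > need escaping.
as_content = html.escape(text, quote=False)

# Attribute values: quotes are escaped as well.
as_attribute = html.escape(text, quote=True)

print(as_content)
# Tom &amp; Jerry say "1 &lt; 2" and it's &gt; 0
print(as_attribute)
# Tom &amp; Jerry say &quot;1 &lt; 2&quot; and it&#x27;s &gt; 0
```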
Encoding gaps. Escapes can be useful to represent characters not supported by the encoding you choose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
Input problems. If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters – it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters. Alternatively, if you only need the occasional character, use a character map tool or character picker.
Invisible or ambiguous characters. A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.
One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.
An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
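If source text already contains such characters as literal characters, a small script can make them visible. The following Python sketch is one possible approach. The names `VISIBLE_ESCAPES` and `reveal_invisibles` are hypothetical, and the table covers only the two characters discussed here.

```python
import html

# Characters that are invisible or ambiguous, with the escape to show instead.
VISIBLE_ESCAPES = {
    "\u00A0": "&nbsp;",  # NO-BREAK SPACE
    "\u200F": "&rlm;",   # RIGHT-TO-LEFT MARK
}


def reveal_invisibles(markup: str) -> str:
    """Replace listed characters with escapes; leave everything else alone."""
    return "".join(VISIBLE_ESCAPES.get(ch, ch) for ch in markup)


source = "<p>Vive la France\u00A0!</p>"
print(reveal_invisibles(source))  # <p>Vive la France&nbsp;!</p>
```

Because only the listed characters are touched, ordinary text stays as characters, in line with the general advice to avoid escapes elsewhere.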
It is best to choose the right encoding so that you can just use characters in CSS declarations. This section addresses what should be a very rare circumstance where you may have decided to use escapes.
It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.
A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.
Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.
For example, it is better to use
<span style="font-family: L\FC beck">...</span>
than
<span style="font-family: L&#xFC;beck">...</span>
Changing to UTF-8 means re-saving your file. Using an encoding such as UTF-8 means that you can avoid the need for most escapes and just work with characters. To change the encoding of your document, however, it is not enough to just change the encoding declaration at the top of the page or on the server. You need to re-save your document in that encoding. For help understanding how to do that with your application read Setting encoding in web authoring applications.
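The difference between changing the declaration and actually re-saving can be shown with a short Python sketch. It works on bytes in memory and assumes, purely for illustration, that the legacy content is in Windows-1252. The function name `transcode_to_utf8` is hypothetical. An authoring tool does the equivalent when you save in a new encoding.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the old encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Sample content as it would be stored on disk in Windows-1252.
legacy_bytes = "Prix : 10 \u20ac".encode("cp1252")

# Merely declaring UTF-8 does not change the bytes, so a UTF-8 reader fails.
try:
    legacy_bytes.decode("utf-8")
    declaration_alone_works = True
except UnicodeDecodeError:
    declaration_alone_works = False

converted = transcode_to_utf8(legacy_bytes, "cp1252")

print(declaration_alone_works)  # False
print(legacy_bytes)             # b'Prix : 10 \x80'
print(converted)                # b'Prix : 10 \xe2\x82\xac'
```

After conversion the encoding declaration in the document and on the server still has to be updated to match.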
Hex vs. decimal. Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes, i.e. á can be written as &#xE1; rather than &#x00E1;.
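The equivalence of the notations is easy to check. This Python sketch builds the hexadecimal and decimal forms of the NCR for á from its code point and confirms that they, and a zero-padded variant, all resolve to the same character.

```python
import html

ch = "\u00E1"  # the letter a with acute, U+00E1

hex_ncr = f"&#x{ord(ch):X};"  # hexadecimal form, no leading zeros
dec_ncr = f"&#{ord(ch)};"     # decimal form
padded_ncr = "&#x00E1;"       # leading zeros are allowed but unnecessary

print(hex_ncr, dec_ncr, padded_ncr)  # &#xE1; &#225; &#x00E1;

same_character = all(
    html.unescape(ncr) == ch for ncr in (hex_ncr, dec_ncr, padded_ncr)
)
print(same_character)  # True
```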
Supplementary characters. Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect – you must use the single code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
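The following Python sketch shows where the two surrogate values come from and why they should not be written as escapes. It relies on the standard `html.unescape` function, which, following HTML5 parsing rules, turns a surrogate NCR into U+FFFD REPLACEMENT CHARACTER. Other parsers may report an error instead.

```python
import html

ch = "\U000233B4"  # a supplementary character, U+233B4

# Correct: one escape carrying the single code point value.
ncr = f"&#x{ord(ch):X};"

# The UTF-16 encoding of the same character is a pair of surrogates.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# Incorrect: one escape per surrogate.
wrong = f"&#x{high:X};&#x{low:X};"

print(ncr)    # &#x233B4;
print(wrong)  # &#xD84C;&#xDFB4;
print(html.unescape(ncr) == ch)                 # True
print(html.unescape(wrong) == "\uFFFD\uFFFD")   # True
```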
Single ampersands. Although HTML user agents have tended to turn a blind eye, you should never have an unescaped ampersand (&) in your document. You should pay particular attention to URIs that include parameters. For example, your document should contain http://example.org/my-script.php?class=guest&amp;name=user, rather than http://example.org/my-script.php?class=guest&name=user.
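For scripted output, the ampersand in a query string can be escaped at the point where the URI is written into the markup. This Python sketch builds the example URI with the standard `urllib.parse.urlencode` and then escapes it for use in an attribute. The link text and variable names are made up for the illustration.

```python
import html
from urllib.parse import urlencode

base = "http://example.org/my-script.php"

# The URI itself uses a plain ampersand between parameters.
uri = base + "?" + urlencode({"class": "guest", "name": "user"})

# Inside markup, that ampersand has to be written as &amp;.
href = html.escape(uri, quote=True)
link = f'<a href="{href}">guest page</a>'

print(uri)   # http://example.org/my-script.php?class=guest&name=user
print(link)
# <a href="http://example.org/my-script.php?class=guest&amp;name=user">guest page</a>
```

The browser unescapes the attribute value before following the link, so the request still uses the plain ampersand.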