Choosing & applying a character encoding

Use UTF-8, if you can

An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings.

A Unicode encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.

A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice of encoding.
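As a rough illustration of the point above, the following Python sketch tries to encode one mixed-language string in several encodings. The sample text and the helper name can_encode are made up for this example; the codec names are those shipped with Python's standard library.

```python
# Hypothetical sample text mixing four scripts on one "page".
MIXED_TEXT = "English, Ελληνικά, 日本語, العربية"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Only the Unicode encoding is expected to cover all four scripts at once.
    for name in ("utf-8", "iso-8859-1", "iso-8859-7", "shift_jis"):
        print(name, can_encode(MIXED_TEXT, name))
```

Each legacy encoding in the list covers some of the scripts in the sample, but none of them covers all four.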

The barriers to using Unicode are very low these days. In August 2010 Google reported that over 50% of the pages in its sample of several billion Web pages were using UTF-8. Add the figure for ASCII-only pages (since ASCII is a subset of UTF-8), and the total rises to nearly 70%.

There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32 (see Character sets, coded character sets, and encodings). Of these three, UTF-8 is recommended for use with Web content. In fact the HTML5 specification draft currently says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."
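To make the difference between the three encodings more concrete, this small Python sketch reports how many bytes each one needs for the same text. The function name encoded_sizes is an assumption of this example, and the little-endian codec variants are used only so that no byte-order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in each Unicode encoding."""
    # The "-le" variants fix the byte order, so Python does not prepend a BOM.
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


if __name__ == "__main__":
    for sample in ("A", "é", "日", "😀"):
        print(repr(sample), encoded_sizes(sample))
```

UTF-8 uses between one and four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.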

Note, in particular, that all ASCII characters in UTF-8 use exactly the same bytes as an ASCII encoding, which often helps with interoperability and backwards compatibility.
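The ASCII compatibility described above can be checked directly. This is a minimal Python sketch; the helper name same_bytes_in_ascii_and_utf8 is invented for the example.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in ASCII and UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains non-ASCII characters, so there is nothing to compare.
        return False
    return ascii_bytes == text.encode("utf-8")


if __name__ == "__main__":
    print(same_bytes_in_ascii_and_utf8("<p>Hello, world!</p>"))
    print(same_bytes_in_ascii_and_utf8("café"))
```

In practice this means that software which only understands ASCII can still read the ASCII portions of a UTF-8 file, such as markup and tag names.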

Support for a given encoding, especially one like Unicode, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.

If you don't use Unicode, select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.
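One way to compare candidate encodings on this criterion is to count how many numeric character references a piece of text would need in each. The sketch below uses Python's built-in xmlcharrefreplace error handler; the function name count_escapes and the sample text are assumptions of this example, and the count is only reliable if the text does not already contain literal escapes.

```python
import re


def count_escapes(text: str, encoding: str) -> int:
    """Count the numeric character references needed to store text in encoding."""
    # Characters the encoding cannot represent are replaced with &#NNNN; escapes.
    encoded = text.encode(encoding, errors="xmlcharrefreplace")
    return len(re.findall(rb"&#\d+;", encoded))


if __name__ == "__main__":
    sample = "café 日本"  # hypothetical sample content
    for name in ("ascii", "iso-8859-1", "shift_jis", "utf-8"):
        print(name, count_escapes(sample, name))
```

The fewer escapes an encoding needs for your typical content, the better it fits the guidance above.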

Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.

Consider a solution that minimizes complexity when dealing with multiple languages and scripts.

Avoid these encodings

The HTML5 specification calls out a number of encodings that you should avoid.

Documents should not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. This is because they allow ASCII code points to represent non-ASCII characters, which poses a security threat.
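The risk can be seen by looking at the bytes these encodings produce. The Python sketch below is an illustration only: it checks whether non-ASCII text is stored entirely as bytes in the ASCII range, using the iso2022_jp and hz codecs from Python's standard library. The helper name uses_only_ascii_bytes is invented for this example.

```python
def uses_only_ascii_bytes(text: str, encoding: str) -> bool:
    """Return True if text, once encoded, consists only of bytes below 0x80."""
    return all(byte < 0x80 for byte in text.encode(encoding))


if __name__ == "__main__":
    # Non-ASCII text stored purely as ASCII-range bytes is what makes these
    # encodings risky: those bytes could be read as markup by a program
    # that assumes an ASCII-compatible encoding.
    print(uses_only_ascii_bytes("日本語", "iso2022_jp"))
    print(uses_only_ascii_bytes("中文", "hz"))
    print(uses_only_ascii_bytes("日本語", "utf-8"))
```

In UTF-8, by contrast, every byte of a non-ASCII character is 0x80 or higher, so it can never be mistaken for an ASCII character.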

Documents must not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content.

The specification also advises against the use of UTF-32.

Applying an encoding to your content

As a content author you need to check that your editor or scripts are saving text in the encoding of your choice.
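If you want to check a saved file from a script, one simple approach is to test whether its bytes form valid UTF-8. This is a minimal Python sketch, assuming UTF-8 is the encoding you chose; the function name is_valid_utf8 is made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("café".encode("utf-8")))
    print(is_valid_utf8("café".encode("iso-8859-1")))
```

To check a real file, read it in binary mode, for example with open(path, "rb"), and pass the bytes to the function. Note that a pure-ASCII file always passes this test, and that passing it does not prove the text was intended as UTF-8, only that it can be read as such.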

Developers also need to ensure that the various parts of the system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters.

It is important to understand that just declaring an encoding inside a document or on the server using one of the methods described below won't usually change the bytes; you need to save the text in that encoding to apply it to your content. (The declaration just helps the browser interpret the sequences of bytes in which the text is stored.)
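The following Python sketch illustrates the difference between declaring and saving. The page content and the names used are hypothetical; decoding with the declared encoding stands in, very roughly, for what a browser does when it trusts the declaration.

```python
# Hypothetical page that declares UTF-8 in its markup.
PAGE = '<meta charset="utf-8"><p>café</p>'

# The same text saved in two different encodings. The declaration inside
# the text is identical in both cases; only the stored bytes differ.
saved_as_latin1 = PAGE.encode("iso-8859-1")
saved_as_utf8 = PAGE.encode("utf-8")


def read_as_declared(data: bytes) -> str:
    """Decode data as UTF-8, substituting U+FFFD for undecodable bytes."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(read_as_declared(saved_as_utf8))
    print(read_as_declared(saved_as_latin1))
```

The file saved as ISO-8859-1 still claims to be UTF-8, so the accented character is misread and shows up as a replacement character.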

The article Setting encoding in web authoring applications provides advice on how to set the encoding of a page while saving it, for a number of editing environments.

If you can, it is best to set up an encoding such as UTF-8 as the default for new documents in your editor. The picture that follows shows how you would do that in the preferences of an editor such as Dreamweaver.

Dreamweaver's new document preferences allow you to specify a default encoding.

You may also need to check that your server is serving documents with the right HTTP declarations, since it will otherwise override the in-document information (see the next section).

Why does the browser still not recognize the encoding?

Suppose you saved your data as UTF-8 and also declared in the page that its encoding is UTF-8. Your server may still be serving the page with an accompanying HTTP header that says it is something else.

Any declaration in the HTTP header will override information inside the page, causing problems for your content.
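The precedence described above can be sketched in Python as follows. This is a simplified model, not a description of any browser's actual algorithm: it ignores byte-order marks and content sniffing, and the function names are assumptions of this example. The Content-Type value is parsed with the standard library's email.message.Message class.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    charset = msg.get_param("charset")
    # get_param returns None when the parameter is absent.
    return charset.lower() if isinstance(charset, str) else None


def effective_encoding(content_type: Optional[str],
                       in_document: Optional[str]) -> Optional[str]:
    """Return the encoding that wins: the HTTP header first, then the page."""
    from_header = charset_from_content_type(content_type)
    if from_header:
        return from_header
    return in_document.lower() if in_document else None


if __name__ == "__main__":
    # The header disagrees with the page, so the header's value is used.
    print(effective_encoding("text/html; charset=ISO-8859-1", "utf-8"))
    # The header carries no charset, so the in-document declaration applies.
    print(effective_encoding("text/html", "utf-8"))
```

A page saved and declared as UTF-8 but served with charset=ISO-8859-1 would, under this model, be read as ISO-8859-1.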

You may not have control over the declarations that come with the HTTP header, and may have to contact the people who manage the server for help. On the other hand, there are sometimes ways to fix things yourself if you have limited access to server setup files or are generating pages using scripting languages. For example, see Setting the HTTP charset parameter for information about how to change the encoding information, either locally for a set of files on a server, or for content generated using a scripting language.

Before doing so, check whether this is actually the root of the problem. You could use the W3C Internationalization Checker to find out what character encoding, if any, is specified in the HTTP header. Alternatively, the article Checking HTTP Headers points to some other tools for checking the encoding information passed by the server.