The byte-order mark (BOM) in HTML

What is a byte-order mark?

At the beginning of a Unicode file you may find a few bytes that represent the Unicode code point U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). This combination of bytes is known as a byte-order mark (BOM).

When a character is encoded in UTF-16, it occupies one or two 16-bit code units (2 or 4 bytes), and the two bytes of each code unit can be stored in either of two orders (little-endian or big-endian). The picture below illustrates this. The byte-order mark indicates which order is used, so that applications can immediately decode the content. UTF-16 content should always begin with the BOM.

Bytes representing the BOM.
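As a rough illustration of the two byte orders, the following Python sketch encodes U+FEFF both ways and checks which BOM a byte string starts with. The function name `detect_utf16_byte_order` is made up for this example; the constants come from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

# U+FEFF encoded in each UTF-16 byte order.
bom_be = "\ufeff".encode("utf-16-be")  # bytes FE FF
bom_le = "\ufeff".encode("utf-16-le")  # bytes FF FE


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the byte order signalled by a leading UTF-16 BOM, if any.

    Hypothetical helper for illustration: it only looks at the first two
    bytes and does not try to rule out other encodings.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return None


# The letter "A" (U+0041) preceded by a BOM, in each byte order.
print(detect_utf16_byte_order(b"\xfe\xff\x00A"))  # big-endian
print(detect_utf16_byte_order(b"\xff\xfeA\x00"))  # little-endian
```

Without the BOM, the bytes `00 41` and `41 00` give no reliable indication of which order was intended, which is why the mark is useful for UTF-16.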

In the UTF-8 encoding, the presence of the BOM is not essential because, unlike in the UTF-16 encodings, the bytes of a character can only appear in one order. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor. In this situation, the BOM is often called the UTF-8 signature.
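To make the UTF-8 signature concrete, here is a small Python sketch. It assumes only the standard library: `codecs.BOM_UTF8` holds the three signature bytes, and the `utf-8-sig` codec drops a leading signature on decoding, whereas the plain `utf-8` codec keeps it as a character.

```python
import codecs

# U+FEFF always encodes to the same three bytes in UTF-8.
signature = "\ufeff".encode("utf-8")  # bytes EF BB BF

# Sample content that starts with the signature.
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

kept = data.decode("utf-8")         # U+FEFF stays as the first character
dropped = data.decode("utf-8-sig")  # leading signature is removed

print(repr(kept))     # '\ufeffhéllo'
print(repr(dropped))  # 'héllo'
```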

What do I need to know about the BOM?

When the BOM is used in web pages or editors for UTF-8 encoded content it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ï»¿). For this reason, it is usually best for interoperability to omit the BOM, when given a choice, for UTF-8 content.
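The strange-looking characters typically appear when the three signature bytes are read with a single-byte encoding instead of UTF-8. The sketch below reproduces this in Python, assuming Windows-1252 as the wrongly applied encoding; other legacy encodings would give different characters.

```python
import codecs

# A UTF-8 page that starts with the signature bytes EF BB BF.
raw = codecs.BOM_UTF8 + b"<!DOCTYPE html>"

# Decoding as Windows-1252 turns each signature byte into a visible character.
misread = raw.decode("cp1252")

print(misread[:3])  # ï»¿
```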

For more information about how to detect and remove a byte-order mark, see Display problems caused by the UTF-8 BOM. You can find out whether a page contains a BOM at the start or further down in the content using the W3C Internationalization Checker.
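For a quick local check, something along the following lines can locate UTF-8 signatures at the start of the content or further down, and remove a leading one. This is only a sketch: the function names are invented for this example, it works on raw bytes that are assumed to be valid UTF-8, and it is not how the W3C checker is implemented.

```python
import codecs
from typing import List


def find_utf8_boms(data: bytes) -> List[int]:
    """Return the byte offsets of every EF BB BF sequence in the data."""
    offsets = []
    start = 0
    while True:
        index = data.find(codecs.BOM_UTF8, start)
        if index == -1:
            return offsets
        offsets.append(index)
        start = index + len(codecs.BOM_UTF8)


def strip_leading_utf8_bom(data: bytes) -> bytes:
    """Remove a UTF-8 signature from the start of the data, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Two fragments, each saved with a signature, concatenated into one page.
page = codecs.BOM_UTF8 + b"<p>a</p>" + codecs.BOM_UTF8 + b"<p>b</p>"
print(find_utf8_boms(page))  # [0, 11]
```

A BOM further down in the content, as in the sample above, often comes from included files or templates that were each saved with a signature.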

If your editor allows you to specify whether you want a BOM while saving content as UTF-8, you should usually say no.

BOM preferences on a dialog panel.
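The same choice arises when a script, rather than an editor, writes the file. In Python, for instance, the `utf-8` codec writes no BOM and the `utf-8-sig` codec writes one. The helper name `save_html` and the file name below are illustrative only.

```python
import os
import tempfile


def save_html(path: str, html: str, with_bom: bool = False) -> None:
    """Write HTML as UTF-8; omit the BOM unless explicitly requested."""
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(html)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page.html")
    save_html(path, "<!DOCTYPE html>")  # default: no BOM
    with open(path, "rb") as handle:
        first_bytes = handle.read(3)

print(first_bytes)  # b'<!D'
```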

If you use UTF-16

According to the HTML5 specification, if your page is encoded as UTF-16, you must use the byte-order mark in HTML. This is what will be used to indicate the encoding of the page to the browser.

It's recommended to use UTF-8, rather than UTF-16, if you use a Unicode encoding. So for most people, this will be academic.

The HTML5 specification currently disallows the use of any other in-document encoding declaration for UTF-16, although this is still under discussion and may change. In effect, this means that the BOM is, itself, the declaration that you have to add.
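If you do need to produce a UTF-16 page, a sketch such as the following writes the BOM explicitly in front of the encoded text. The function name `encode_html_utf16` and its parameter are assumptions for this example; the BOM constants are from Python's `codecs` module.

```python
import codecs


def encode_html_utf16(html: str, little_endian: bool = True) -> bytes:
    """Encode HTML as UTF-16 with an explicit BOM in the chosen byte order."""
    if little_endian:
        return codecs.BOM_UTF16_LE + html.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + html.encode("utf-16-be")


page = encode_html_utf16("<!DOCTYPE html><title>Test</title>")
print(page[:2])  # b'\xff\xfe'
```

Writing the BOM by hand keeps the byte order explicit instead of depending on the platform's native order.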

Because HTML5 requires a BOM for UTF-16 encoded content, you should not serve HTML5 documents labeled as "UTF-16BE" or "UTF-16LE". This is because the Unicode Standard says you should not use a BOM when the text is labeled as one of those encodings. If, therefore, you want to declare the encoding in the HTTP header (which is not disallowed by the HTML5 spec), you should only use the IANA charset name "UTF-16".

Note that this is solely about the labeling of the content. The bytes that encode the text itself are the same whether you label the content as UTF-16 and add a BOM, or label it as UTF-16LE or UTF-16BE.
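The effect of the label can be seen with Python's decoders, used here only as an analogy for what a consumer of the content does. The `utf-16` codec reads and removes the BOM, while `utf-16-le` expects no BOM and keeps U+FEFF as a character. The sample markup and the header string are illustrative.

```python
import codecs

text = "<p>hi</p>"

# The same text bytes, with a little-endian BOM in front.
payload = text.encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + payload

# Label "UTF-16": the BOM selects the byte order and is not part of the text.
as_utf16 = with_bom.decode("utf-16")

# Label "UTF-16LE": no BOM is expected, so U+FEFF survives as a character.
as_utf16le = with_bom.decode("utf-16-le")

# HTTP header consistent with content that carries a BOM.
header = "Content-Type: text/html; charset=UTF-16"

print(repr(as_utf16))    # '<p>hi</p>'
print(repr(as_utf16le))  # '\ufeff<p>hi</p>'
```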