Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, Web project managers, and anyone who needs to better understand what the BOM is and how it affects HTML.
What is the byte-order mark, and what do I need to know about it when creating HTML?
At the beginning of a Unicode file you may find some bytes that represent the Unicode code point U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). This combination of bytes is known as a byte-order mark (BOM).
When a character is encoded in UTF-16, its 2 or 4 bytes can be ordered in two different ways (little-endian or big-endian). The picture below illustrates this. The byte-order mark indicates which order is used, so that applications can immediately decode the content. UTF-16 content should always begin with the BOM.
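As a rough illustration of the two byte orders, the Python sketch below encodes one character both ways and reads the byte order back from the leading BOM. The function name `detect_utf16_byte_order` is hypothetical (it is not part of any library), and the sketch assumes the input is already known to be UTF-16.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by a leading UTF-16 BOM.

    Assumption: the caller already knows the data is UTF-16. A UTF-32
    little-endian BOM (FF FE 00 00) also starts with FF FE, so this
    check alone cannot tell the two apart.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    return "unknown"


# The same character, U+00E9, serialized in both byte orders.
big = "\u00e9".encode("utf-16-be")      # bytes 00 E9
little = "\u00e9".encode("utf-16-le")   # bytes E9 00

# Prefix each with the matching BOM, as a UTF-16 file would have.
sample_be = codecs.BOM_UTF16_BE + big
sample_le = codecs.BOM_UTF16_LE + little

# Python's generic "utf-16" decoder reads the BOM and picks the order.
decoded_be = sample_be.decode("utf-16")
decoded_le = sample_le.decode("utf-16")
```

Both samples decode to the same one-character string, even though their bytes differ.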
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor. In this situation, the BOM is often called the UTF-8 signature.
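A minimal Python sketch of the UTF-8 signature follows. The helper `has_utf8_signature` and the sample markup are illustrative assumptions. The codec names `utf-8` and `utf-8-sig` are Python's own: the first keeps U+FEFF as an ordinary character, the second drops a leading signature if one is present.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


markup = "<p>caf\u00e9</p>"  # illustrative sample content
without_bom = markup.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# "utf-8-sig" strips a leading signature, so both inputs give the same text.
text_a = with_bom.decode("utf-8-sig")
text_b = without_bom.decode("utf-8-sig")

# Plain "utf-8" leaves U+FEFF in place as the first character.
raw = with_bom.decode("utf-8")
```

Because UTF-8 has only one byte order, the signature carries no ordering information; it only hints that the content is UTF-8.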
When the BOM is used in web pages or editors for UTF-8 encoded content, it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ï»¿). For this reason, it is usually best for interoperability to omit the BOM for UTF-8 content when given a choice.
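The strange-looking characters typically appear when software reads the three BOM bytes in a legacy single-byte encoding. The Python sketch below shows this under one assumption: that the software falls back to Windows-1252 (`cp1252`). Other fallback encodings would produce different characters.

```python
import codecs

# A page that starts with the UTF-8 signature (bytes EF BB BF).
page = codecs.BOM_UTF8 + b"<!DOCTYPE html>"

# Assumed failure mode: the bytes are decoded as Windows-1252 instead of
# UTF-8, so each BOM byte becomes its own visible character.
misread = page.decode("cp1252")
stray_characters = misread[:3]  # the three characters shown for EF, BB, BF

# Decoding as UTF-8 with signature handling shows no stray characters.
correct = page.decode("utf-8-sig")
```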
For more information about how to detect and remove a byte-order mark, see Display problems caused by the UTF-8 BOM. You can find out whether a page contains a BOM at the start or further down in the content using the W3C Internationalization Checker.
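For a quick local check, the Python sketch below looks for the UTF-8 BOM bytes both at the start of the content and further down, as can happen when included files each carry their own signature. The function names `find_utf8_boms` and `remove_utf8_boms` are hypothetical. The sketch assumes the input is valid UTF-8 and that every U+FEFF it finds is an unwanted BOM.

```python
import codecs
from typing import List


def find_utf8_boms(data: bytes) -> List[int]:
    """Return the byte offsets of every UTF-8 BOM sequence in the data."""
    offsets = []
    start = data.find(codecs.BOM_UTF8)
    while start != -1:
        offsets.append(start)
        start = data.find(codecs.BOM_UTF8, start + len(codecs.BOM_UTF8))
    return offsets


def remove_utf8_boms(data: bytes) -> bytes:
    """Drop every UTF-8 BOM sequence (assumes none is intentional)."""
    return data.replace(codecs.BOM_UTF8, b"")


# Illustrative input: two fragments, each saved with its own signature.
combined = codecs.BOM_UTF8 + b"<header>" + codecs.BOM_UTF8 + b"<body>"
positions = find_utf8_boms(combined)
cleaned = remove_utf8_boms(combined)
```

An offset of 0 means a BOM at the start of the content; any other offset points at a BOM further down.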
If your editor allows you to specify whether you want a BOM while saving content as UTF-8, you should usually say no.
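The same choice exists when content is written by a script rather than an editor. As a sketch in Python, the hypothetical helper `save_utf8` below writes without a BOM by default. Python's `utf-8` codec never adds the signature, while `utf-8-sig` adds it on writing.

```python
import codecs
import os
import tempfile


def save_utf8(path: str, text: str, with_bom: bool = False) -> None:
    """Write text as UTF-8, adding the signature only when asked to."""
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


def starts_with_bom(path: str) -> bool:
    """Check whether the file's first three bytes are the UTF-8 BOM."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


# Demonstration using temporary files (the file names are arbitrary).
with tempfile.TemporaryDirectory() as folder:
    plain = os.path.join(folder, "plain.html")
    signed = os.path.join(folder, "signed.html")
    save_utf8(plain, "<p>ok</p>")
    save_utf8(signed, "<p>ok</p>", with_bom=True)
    plain_has_bom = starts_with_bom(plain)
    signed_has_bom = starts_with_bom(signed)
```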
If you use UTF-16: according to the HTML5 specification, a page encoded as UTF-16 must begin with the byte-order mark. The BOM is what indicates the encoding of the page to the browser.
The HTML5 specification currently disallows the use of any other in-document encoding declaration for UTF-16, although this is still under discussion and may change. In effect, this means that the BOM is, itself, the declaration that you have to add.
The requirement to use a BOM for UTF-16 encoded content in HTML5 means, however, that you should not serve HTML5 documents labeled as "UTF-16BE" or "UTF-16LE". This is because the Unicode Standard says you should not use a BOM when the text is labeled as one of those encodings. If, therefore, you want to declare the encoding in the HTTP header (which is not disallowed by the HTML5 spec), you should only use the IANA charset name "UTF-16".
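The difference between the labels can be seen in Python's codecs, which follow the same convention: the `utf-16` codec prepends a BOM when encoding, while `utf-16-be` and `utf-16-le` do not. The sketch below pairs a BOM-prefixed body with a "UTF-16" header label. The function `build_utf16_response` and its return shape are assumptions for illustration, not any server's API.

```python
import codecs
from typing import Dict, Tuple


def build_utf16_response(markup: str) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body) for a UTF-16 page that starts with a BOM."""
    # Python's "utf-16" codec writes a BOM in the platform's byte order.
    body = markup.encode("utf-16")
    headers = {"Content-Type": "text/html; charset=UTF-16"}
    return headers, body


page = "<!DOCTYPE html><p>hi</p>"  # illustrative sample content
headers, body = build_utf16_response(page)
body_has_bom = body[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# The explicit-order codecs emit no BOM, matching the rule that text
# labeled UTF-16BE or UTF-16LE should not carry one.
unmarked = page.encode("utf-16-be")
unmarked_has_bom = unmarked[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```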
The byte-order mark is also used for text labeled as UTF-32, and should not be used for text labeled as UTF-32BE or UTF-32LE. The use of UTF-32 for HTML content, however, is strongly discouraged, so we haven't mentioned it until now.