The byte order mark or BOM is an invisible Unicode magic number that can be found at the beginning of a text stream.
We recently changed static HTML on a web page and all Unicode characters displayed incorrectly. This was caused by a missing byte order mark in the UTF-8 file, which some Windows applications actually require.
This post was inspired by my curiosity about the byte order mark, but it grew as the investigation touched on UTF encoding forms, endianness and the usage of the byte order mark in UTF-8.
If you are new to Unicode then I suggest you read up on it before continuing. You can start by reading Joel Spolsky's article from 2003.
In this post I make references to Unicode code points. They are simply the numerical positions of Unicode characters in the code space of 1,114,112 code points.
For Unicode, the particular sequence of bits is called a code unit – for the UCS-4 encoding, any code point is encoded as 4-byte (octet) binary numbers, while in the UTF-8 encoding, different code points are encoded as sequences from one to four bytes long, forming a self-synchronizing code. ~ Wikipedia
UTF stands for Unicode Transformation Format. An encoding form maps Unicode code points (from U+0000 to U+10FFFF) to sequences of one or more code units, whose size depends on the encoding form (e.g. UTF-8, UTF-16 or UTF-32) used.
Conversions between encoding forms are algorithmic, making them fast and lossless.
The table below shows the different encoding forms with some of their properties.
| | UTF-8 | UTF-16 | UTF-32 |
| --- | --- | --- | --- |
| Code unit or word size | 8-bit | 16-bit | 32-bit |
| Fewest bytes used per character | 1 byte | 2 bytes | 4 bytes |
| Most bytes used per character | 4 bytes | 4 bytes | 4 bytes |
| Code units per character | Variable | Variable | Fixed |
Essentially, a single character is represented as a sequence of one to four 8-bit code units in UTF-8, one or two 16-bit code units in UTF-16, or exactly one 32-bit code unit in UTF-32, depending on the encoding form.
As UTF-32 is fixed length, it can get rather bloated and use unnecessary memory and storage space for strings of characters. Therefore its main usage is in internal APIs where the data is single code points or glyphs.
Let's look at an example using the Pilcrow sign (¶) at Unicode code point U+00B6.

In UTF-8 it converts to C2 B6, a sequence of two 8-bit code units: [1100 0010] [1011 0110].

In UTF-16 it converts to 00 B6, a single 16-bit code unit: [0000 0000 1011 0110].

In UTF-32 it converts to 00 00 00 B6, a single 32-bit code unit: [0000 0000 0000 0000 0000 0000 1011 0110].
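The conversions above can be checked with a short sketch. The post itself contains no code, so Python is an assumption here (the `hex(" ")` separator needs Python 3.8+):

```python
# Encode the Pilcrow sign (U+00B6) in each Unicode encoding form
# and show the resulting byte sequences as hex.
pilcrow = "\u00B6"  # ¶

print(pilcrow.encode("utf-8").hex(" "))      # c2 b6
print(pilcrow.encode("utf-16-be").hex(" "))  # 00 b6
print(pilcrow.encode("utf-32-be").hex(" "))  # 00 00 00 b6
```

The `-be` suffix pins the byte order explicitly; byte order is covered next.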
This is just a fancy way of saying how the bytes must be ordered during the read and the write of a stream. This order is categorized as big-endian or little-endian.
Bytes can be processed from left to right (big-endian: most significant byte first) or from right to left (little-endian: least significant byte first).
Source of image: Clarice Bouwer
As the code unit size for UTF-8 streams is 8 bits, one byte is read or written at a time. This means the encoding form is byte-oriented, so there is no byte order problem for UTF-8.

UTF-16 and UTF-32 have code unit sizes of 2 bytes (16 bits) and 4 bytes (32 bits) respectively, so the byte order matters to certain hardware, protocols and programs. The byte order is indicated with the byte order mark, or in short, the BOM.
This table shows UTF-16 and UTF-32 with their big-endian (BE) and little-endian (LE) equivalents. If a stream is saved without a BOM, it defaults to big-endian.

| Encoding form | Big-endian | Little-endian |
| --- | --- | --- |
| UTF-16 | UTF-16BE | UTF-16LE |
| UTF-32 | UTF-32BE | UTF-32LE |
It is an invisible Unicode magic number found at the beginning of a data stream indicating the encoding and endianness.
This table shows the encoding forms with their byte order and BOM byte sequence.

| Encoding form | Byte order | BOM byte sequence |
| --- | --- | --- |
| UTF-8 | N/A | EF BB BF |
| UTF-16 | Big-endian | FE FF |
| UTF-16 | Little-endian | FF FE |
| UTF-32 | Big-endian | 00 00 FE FF |
| UTF-32 | Little-endian | FF FE 00 00 |
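A sketch of how a reader might recognize these signatures (Python here as an assumption; the constants come from the standard `codecs` module). Note that the UTF-32 patterns must be tested before the UTF-16 ones, because the UTF-32LE BOM also begins with FF FE:

```python
import codecs

def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    # Longest signatures first: FF FE 00 00 (UTF-32LE)
    # starts with FF FE (UTF-16LE).
    signatures = [
        ("utf-32-be", codecs.BOM_UTF32_BE),  # 00 00 FE FF
        ("utf-32-le", codecs.BOM_UTF32_LE),  # FF FE 00 00
        ("utf-8", codecs.BOM_UTF8),          # EF BB BF
        ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
        ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
    ]
    for name, bom in signatures:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(detect_bom(b"no BOM here"))        # None
```

Real-world readers combine this with out-of-band information (HTTP headers, declared encodings), since a BOM-less stream could still be any encoding.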
Let's say we have a text stream of the following characters: ¶@«®. The Unicode code points for each are U+00B6, U+0040, U+00AB and U+00AE respectively.
Source of image: Clarice Bouwer
When the UTF-16 streams are opened, the BOM defines the order in which the bytes must be read, two bytes (16 bits) at a time. The way the bytes are mapped back to Unicode code points is based on the endianness.
When the UTF-8 stream is opened, the BOM has no impact on the byte order as the bytes are read one at a time.
Looking at the Heavy Black Heart (❤) character at code point U+2764:

In UTF-8 it converts to E2 9D A4, a sequence of three 8-bit code units: [1110 0010] [1001 1101] [1010 0100].

In UTF-16BE it converts to 27 64, a single 16-bit code unit: [0010 0111 0110 0100].

In UTF-16LE it converts to 64 27, a single 16-bit code unit: [0110 0100 0010 0111].

In UTF-32BE it converts to 00 00 27 64, a single 32-bit code unit: [0000 0000 0000 0000 0010 0111 0110 0100].

In UTF-32LE it converts to 64 27 00 00, a single 32-bit code unit: [0110 0100 0010 0111 0000 0000 0000 0000].
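The big- and little-endian byte sequences above can be reproduced with a sketch (Python assumed, as before):

```python
heart = "\u2764"  # ❤ Heavy Black Heart

# Explicit -be / -le codecs fix the byte order and emit no BOM.
print(heart.encode("utf-16-be").hex(" "))  # 27 64
print(heart.encode("utf-16-le").hex(" "))  # 64 27
print(heart.encode("utf-32-be").hex(" "))  # 00 00 27 64
print(heart.encode("utf-32-le").hex(" "))  # 64 27 00 00
```

By contrast, Python's plain `"utf-16"` and `"utf-32"` codecs prepend a BOM to the output, which is exactly the signature mechanism this post describes.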
When the BOM exists in a UTF-8 stream, it acts as an encoding signature only and says nothing about byte order. In this case it is referred to as the UTF-8 signature.
Although it is optional and doesn't signify the byte order, some applications require its presence. I learned the hard way with Microsoft Windows and some of its applications.
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present, or the file contains only ASCII bytes. ~ Wikipedia
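In Python, for instance, the `utf-8-sig` codec handles this convention: it prepends the signature when encoding and strips it when decoding (Python is an assumption here, not something the post uses):

```python
# "utf-8-sig" prepends the EF BB BF signature when encoding ...
data = "¶ hello".encode("utf-8-sig")
print(data[:3].hex(" "))  # ef bb bf

# ... and strips it when decoding. Plain "utf-8" instead leaves
# an invisible U+FEFF character at the start of the string.
print(data.decode("utf-8-sig"))                   # ¶ hello
print(data.decode("utf-8").startswith("\ufeff"))  # True
```

Writing files with `encoding="utf-8-sig"` is one way to satisfy BOM-requiring Windows tools without hand-crafting the bytes.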
There are also instances where the BOM must not exist. This includes files that need to start with specific characters, like the human-readable shebang magic number (#!) in Unix shell scripts.
It's also nice to know that ASCII is a subset of Unicode. ASCII is a 7-bit encoding, but each character is saved as 1 byte with the remaining bit unused. This makes UTF-8 backwards compatible with ASCII, assuming the file doesn't use characters outside of the ASCII range.
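This compatibility is easy to check: for text in the ASCII range, the ASCII and UTF-8 byte sequences are identical (Python assumed for the sketch):

```python
text = "plain old ASCII"

# Identical byte sequences for ASCII-range text:
print(text.encode("ascii") == text.encode("utf-8"))  # True

# Outside the ASCII range, UTF-8 switches to multi-byte sequences:
print("\u00B6".encode("utf-8").hex(" "))  # c2 b6
```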
If the BOM exists but is not interpreted correctly, the file will appear to start with ï»¿. An example is viewing a UTF-8 file that has a BOM with the Latin-1 (ISO 8859-1) encoding.
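Those ï»¿ characters are simply the three BOM bytes decoded one at a time, as a sketch (Python assumed) shows:

```python
bom = b"\xef\xbb\xbf"  # the three UTF-8 BOM bytes

# Latin-1 maps every byte to exactly one character,
# so EF BB BF decodes to the three characters ï » ¿.
print(bom.decode("latin-1"))  # ï»¿
```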
If you are working directly with a stream or are uncertain of which encoding to use, Unicode.org has the following guidelines for dealing with the BOM:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases, where a text data stream is known to be plain Unicode text (but of unknown endianness), the BOM can be used as a signature; if there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
This excerpt can be found under the question "How I should deal with BOMs?"
The BOM is usually handled in the background by the systems you are working on. As it is invisible it generally goes unnoticed.
If you start to experience some funny visuals, you are most likely experiencing an encoding problem that can easily be fixed. Using a hex editor you can identify the invisible bytes in a file.
Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang). This document includes material copied from or derived from "The byte-order mark (BOM) in HTML". The BOM image in this post is adapted from their endianness image.