BOM in iconv

Behavior around BOM in iconv
Remove BOM in UTF-8 Encoded Files
Remove BOM in UTF-16/UTF-32 Encoded Files

iconv is a command-line utility for character set conversion: it converts text from one encoding to another. Unicode encodings such as UTF-8, UTF-16 and UTF-32 have the concept of a Byte Order Mark (BOM).
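To make the BOM concrete, here is a minimal sketch of a helper that reports which BOM, if any, a file starts with. The function name bom_type is hypothetical, and the sketch assumes head -c, od and tr are available. Note that the bytes ff fe 00 00 are ambiguous: they are the UTF-32LE BOM, but could also be a UTF-16LE BOM followed by a NUL character.

```shell
#!/bin/sh
# bom_type FILE: print the encoding whose BOM the file starts with,
# or "none". Hypothetical helper, not part of iconv.
bom_type() {
    # Hex dump of the first four bytes with spaces removed, e.g. "efbbbf61".
    hex=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$hex" in
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;  # checked before UTF-16LE (shared prefix)
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "none" ;;
    esac
}
```

Calling bom_type "$filename" prints a single word, which makes it easy to use in a script before deciding how to invoke iconv.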

Behavior around BOM in iconv

As far as I know, the iconv utility does not document its behavior regarding BOMs, so I ran some experiments. Here are the results I observed. Other iconv implementations or versions may behave differently.

Basically, commonly used Unicode encodings fall into two groups:

1. No BOM needed: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE form one group, which does not need a BOM.
2. BOM required: UTF-16 and UTF-32 form another group, which requires a BOM.

When a file in a group 1 encoding is converted to a group 2 encoding, a BOM is added regardless of whether the input already has one. If it does, the output ends up with two BOMs.

$ printf '\x31\x32' | iconv -f UTF-8 -t UTF-16 | hexdump -C
00000000  fe ff 00 31 00 32                                 |...1.2|
00000006

$ printf '\xef\xbb\xbf\x31\x32' | iconv -f UTF-8 -t UTF-16 | hexdump -C
00000000  fe ff fe ff 00 31 00 32                           |.....1.2|
00000008

$ printf '\x00\x31\x00\x32' | iconv -f UTF-16BE -t UTF-16 | hexdump -C
00000000  fe ff 00 31 00 32                                 |...1.2|
00000006

$ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16BE -t UTF-16 | hexdump -C
00000000  fe ff fe ff 00 31 00 32                           |.....1.2|
00000008

When a file in a group 2 encoding is converted to a group 1 encoding, the BOM is removed if there is one.

$ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-16BE | hexdump -C
00000000  00 31 00 32                                       |.1.2|
00000004

$ printf '\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-16BE | hexdump -C
00000000  00 31 00 32                                       |.1.2|
00000004

$ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-8 | hexdump -C
00000000  31 32                                             |12|
00000002

$ printf '\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-8 | hexdump -C
00000000  31 32                                             |12|
00000002

When a file in an encoding that does not need a BOM is converted to another encoding that does not need one either, an existing BOM is kept.

$ printf '\xef\xbb\xbf\x31\x32' | iconv -f UTF-8 -t UTF-16BE | hexdump -C
00000000  fe ff 00 31 00 32                                 |...1.2|
00000006

$ printf '\x31\x32' | iconv -f UTF-8 -t UTF-16BE | hexdump -C
00000000  00 31 00 32                                       |.1.2|
00000004

$ printf '\xff\xfe\x31\x00\x32\x00' | iconv -f UTF-16LE -t UTF-32BE | hexdump -C
00000000  00 00 fe ff 00 00 00 31  00 00 00 32              |.......1...2|
0000000c

$ printf '\x31\x00\x32\x00' | iconv -f UTF-16LE -t UTF-32BE | hexdump -C
00000000  00 00 00 31 00 00 00 32                           |...1...2|
00000008

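When a file in an encoding that requires a BOM is converted to another encoding that also requires one, an existing BOM is kept, and one is added if it is missing. Note that in the second example below the input has no BOM and is read as big-endian, so the bytes 31 00 decode to U+3100 rather than to the character 1.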

$ printf '\xff\xfe\x31\x00\x32\x00' | iconv -f UTF-16 -t UTF-32 | hexdump -C
00000000  00 00 fe ff 00 00 00 31  00 00 00 32              |.......1...2|
0000000c

$ printf '\x31\x00\x32\x00' | iconv -f UTF-16 -t UTF-32 | hexdump -C
00000000  00 00 fe ff 00 00 31 00  00 00 32 00              |......1...2.|
0000000c

Remove BOM in UTF-8 Encoded Files

If a file is encoded in UTF-8 with a BOM (verify this with the file $filename command and check whether the output contains UTF-8 Unicode (with BOM)), we can remove the BOM with sed. The \x escapes and the -i option as used here are GNU sed features:

sed -i '1s/^\xef\xbb\xbf//' "$filename"

Then verify the result with the file $filename command again. If the output contains UTF-8 Unicode but no longer contains (with BOM), the BOM has been removed.
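Where GNU sed is not available, the same result can be approximated with more widely available tools. The following is a minimal sketch, not the method used above: the function name strip_utf8_bom is hypothetical, and it assumes head -c, od, tr and tail -c are present. It writes to standard output instead of editing the file in place.

```shell
#!/bin/sh
# strip_utf8_bom FILE: copy FILE to stdout, dropping a leading UTF-8 BOM
# (bytes ef bb bf) if there is one. Hypothetical helper.
strip_utf8_bom() {
    # Hex dump of the first three bytes with spaces removed.
    first3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 "$1"   # start output at byte 4, skipping the 3-byte BOM
    else
        cat "$1"          # no BOM: pass the file through unchanged
    fi
}
```

A typical use would be strip_utf8_bom "$filename" > "$newfilename", followed by the same file check as above.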

Remove BOM in UTF-16/UTF-32 Encoded Files

If a file is encoded in UTF-16 or UTF-32 with a BOM, we can simply let iconv do the conversion. For example:

iconv -f UTF-16 -t "$targetEncoding" "$filename" > "$newfilename"

where $targetEncoding may be UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE.
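As a rough sketch, the conversion can be wrapped in a small function that also checks the result. The name utf16_to_utf8 is hypothetical, the target encoding is fixed to UTF-8 for illustration, and the check assumes head -c, od and tr are available.

```shell
#!/bin/sh
# utf16_to_utf8 SRC DST: convert SRC from UTF-16 to UTF-8 and write it to DST.
# Returns non-zero if iconv fails or if DST still starts with a UTF-8 BOM.
# Hypothetical wrapper around the iconv call shown above.
utf16_to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1" > "$2" || return 1
    # Hex dump of the first three bytes of the result, spaces removed.
    first3=$(head -c 3 "$2" | od -An -tx1 | tr -d ' \n')
    [ "$first3" != "efbbbf" ]
}
```

For example, utf16_to_utf8 "$filename" "$newfilename" || echo "conversion failed" reports a problem instead of leaving a suspect output file unnoticed.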