BOM in iconv
- Behavior around BOM in iconv
- Remove BOM in UTF-8 Encoded Files
- Remove BOM in UTF-16/UTF-32 Encoded Files
iconv is a command-line utility for character set conversion: it converts text from one encoding to another. Unicode encodings such as UTF-8, UTF-16 and UTF-32 have the concept of a Byte Order Mark (BOM).
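To make the BOM concrete, here is a minimal sketch that reports which BOM, if any, a file starts with. The function name detect_bom is hypothetical, and the sketch assumes a POSIX-style od and tr; it only inspects the first four bytes and does not validate the rest of the file.

```shell
#!/bin/sh
# detect_bom FILE: print the encoding implied by a leading BOM, or "none".
# (Hypothetical helper, not part of iconv.)
detect_bom() {
    # First four bytes as compact lowercase hex, e.g. "efbbbf31".
    hex=$(od -An -tx1 -N4 "$1" | tr -d ' \t\n')
    case "$hex" in
        efbbbf*)  echo "UTF-8" ;;
        0000feff) echo "UTF-32BE" ;;
        # ff fe 00 00 is also a UTF-16LE BOM followed by U+0000;
        # this sketch assumes UTF-32LE in that case.
        fffe0000) echo "UTF-32LE" ;;
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "none" ;;
    esac
}

# Demo: "12" in UTF-8 preceded by the BOM ef bb bf.
# Octal escapes are used because not every printf understands \xHH.
tmp=$(mktemp)
printf '\357\273\277\061\062' > "$tmp"
detect_bom "$tmp"    # prints "UTF-8"
rm -f "$tmp"
```

The UTF-32 patterns are listed before the UTF-16 ones because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.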
Behavior around BOM in iconv
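As far as I know, the iconv utility does not document its behavior regarding the BOM, so I ran some experiments. Here are the results I observed. Results can differ between iconv implementations (for example, the byte order chosen for plain UTF-16 output), so treat them as the behavior of the implementation I tested.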
Basically, commonly used Unicode encodings fall into two groups:
- No BOM needed: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE form the first group, which does not need a BOM;
- BOM required: UTF-16 and UTF-32 form the second group, which requires a BOM.
- When a file in a group 1 encoding is converted to a group 2 encoding, a BOM is added regardless of whether the file already has one.
    $ printf '\x31\x32' | iconv -f UTF-8 -t UTF-16 | hexdump -C
    00000000  fe ff 00 31 00 32  |...1.2|
    00000006
    $ printf '\xef\xbb\xbf\x31\x32' | iconv -f UTF-8 -t UTF-16 | hexdump -C
    00000000  fe ff fe ff 00 31 00 32  |.....1.2|
    00000008

    $ printf '\x00\x31\x00\x32' | iconv -f UTF-16BE -t UTF-16 | hexdump -C
    00000000  fe ff 00 31 00 32  |...1.2|
    00000006
    $ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16BE -t UTF-16 | hexdump -C
    00000000  fe ff fe ff 00 31 00 32  |.....1.2|
    00000008
- When a file in a group 2 encoding is converted to a group 1 encoding, the BOM is removed if the file has one.
    $ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-16BE | hexdump -C
    00000000  00 31 00 32  |.1.2|
    00000004
    $ printf '\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-16BE | hexdump -C
    00000000  00 31 00 32  |.1.2|
    00000004

    $ printf '\xfe\xff\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-8 | hexdump -C
    00000000  31 32  |12|
    00000002
    $ printf '\x00\x31\x00\x32' | iconv -f UTF-16 -t UTF-8 | hexdump -C
    00000000  31 32  |12|
    00000002
- When a file in an encoding that does not need a BOM is converted to another encoding that does not need one either, the BOM is kept if it exists.
    $ printf '\xef\xbb\xbf\x31\x32' | iconv -f UTF-8 -t UTF-16BE | hexdump -C
    00000000  fe ff 00 31 00 32  |...1.2|
    00000006
    $ printf '\x31\x32' | iconv -f UTF-8 -t UTF-16BE | hexdump -C
    00000000  00 31 00 32  |.1.2|
    00000004

    $ printf '\xff\xfe\x31\x00\x32\x00' | iconv -f UTF-16LE -t UTF-32BE | hexdump -C
    00000000  00 00 fe ff 00 00 00 31  00 00 00 32  |.......1...2|
    0000000c
    $ printf '\x31\x00\x32\x00' | iconv -f UTF-16LE -t UTF-32BE | hexdump -C
    00000000  00 00 00 31 00 00 00 32  |...1...2|
    00000008
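- When a file in an encoding that requires a BOM is converted to another encoding that also requires one, the output always carries a BOM: an existing BOM is kept (rewritten in the target encoding), and one is added if the input has none. Note that in the second example below, the BOM-less input was read as big-endian, so the bytes 31 00 were decoded as U+3100 rather than as the character 1.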
    $ printf '\xff\xfe\x31\x00\x32\x00' | iconv -f UTF-16 -t UTF-32 | hexdump -C
    00000000  00 00 fe ff 00 00 00 31  00 00 00 32  |.......1...2|
    0000000c
    $ printf '\x31\x00\x32\x00' | iconv -f UTF-16 -t UTF-32 | hexdump -C
    00000000  00 00 fe ff 00 00 31 00  00 00 32 00  |......1...2.|
    0000000c
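Since this behavior is not documented, it may be worth re-running the experiments against your own iconv. The following is a rough sketch of such a probe, not a definitive test suite: the helper name probe is made up for illustration, input bytes are written as octal escapes so that any POSIX printf can produce them, and the output is printed as compact hex via od instead of hexdump -C.

```shell
#!/bin/sh
# probe FROM TO BYTES: convert BYTES (a printf format string made of octal
# escapes) from encoding FROM to encoding TO and print the result as hex.
# (Hypothetical helper; BYTES must not contain '%'.)
probe() {
    printf "$3" | iconv -f "$1" -t "$2" | od -An -v -tx1 | tr -d ' \t\n'
    echo
}

# "12" in UTF-8, once without and once with a leading BOM (ef bb bf).
for to in UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE; do
    printf '%-9s plain: ' "$to"; probe UTF-8 "$to" '\061\062'
    printf '%-9s BOM:   ' "$to"; probe UTF-8 "$to" '\357\273\277\061\062'
done
```

No expected output is listed for the loop because the bytes, in particular the byte order chosen for plain UTF-16 and UTF-32, depend on the iconv implementation in use.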
Remove BOM in UTF-8 Encoded Files
If a file is encoded in UTF-8 with a BOM (verify this by running file $filename and checking whether the output contains UTF-8 Unicode (with BOM)), we can remove the BOM with sed. The expression is anchored to the start of the first line so that only a leading BOM is removed; the \xHH escapes and the argument-less -i option assume GNU sed:
    sed -i '1s/^\xef\xbb\xbf//' "$filename"
Then verify the result by running file $filename again. If the output contains UTF-8 Unicode but no longer contains (with BOM), the BOM has been removed.
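Where GNU sed is not available, the same result can be sketched with od and tail only. This is one possible approach under the assumption of a POSIX-style od, tail and mktemp; the function name strip_utf8_bom is hypothetical.

```shell
#!/bin/sh
# strip_utf8_bom FILE: remove a leading UTF-8 BOM (ef bb bf) in place.
# Files without a leading BOM are left untouched.
strip_utf8_bom() {
    # Hex of the first three bytes of the file.
    first=$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')
    if [ "$first" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the 4th byte onward;
        # writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        rm -f "$tmp"
    fi
}

# Demo: "12\n" with a leading BOM.
f=$(mktemp)
printf '\357\273\277\061\062\n' > "$f"
strip_utf8_bom "$f"
od -An -tx1 "$f"    # the three BOM bytes should no longer appear
rm -f "$f"
```

Checking the first three bytes before rewriting makes the function safe to run repeatedly on the same file.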
Remove BOM in UTF-16/UTF-32 Encoded Files
If a file is encoded in UTF-16 or UTF-32 with a BOM, we can simply use iconv to do the conversion. For example, for a UTF-16 file (use -f UTF-32 for a UTF-32 file):
    iconv -f UTF-16 -t "$targetEncoding" "$filename" > "$newfilename"
where $targetEncoding may be UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE.
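As a small end-to-end illustration, the sketch below converts a BOM-prefixed UTF-16 file to UTF-8 and then checks that the output does not start with a UTF-8 BOM. The function name to_utf8_no_bom and the temporary files are assumptions made for the example; the check reflects the behavior observed above rather than anything iconv guarantees.

```shell
#!/bin/sh
# to_utf8_no_bom SRC DST: convert a UTF-16 file to UTF-8 and succeed only
# if the conversion worked and DST does not begin with ef bb bf.
# (Hypothetical helper.)
to_utf8_no_bom() {
    iconv -f UTF-16 -t UTF-8 "$1" > "$2" || return 1
    [ "$(od -An -tx1 -N3 "$2" | tr -d ' \t\n')" != "efbbbf" ]
}

# Demo: "12" in UTF-16LE preceded by the BOM ff fe.
src=$(mktemp); dst=$(mktemp)
printf '\377\376\061\000\062\000' > "$src"
if to_utf8_no_bom "$src" "$dst"; then
    echo "converted, no BOM in output"
else
    echo "conversion failed or BOM still present" >&2
fi
rm -f "$src" "$dst"
```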