HDLBOM

Synopsis

HELP:   Reverse byte order after codepoint U+FFFE (reversed BOM) found in data (only multibyte charsets)
TYPE:   NUMBER
SYNTAX: HDLBOM/BOM=FAIL/ON/OFF

Description

This selection controls whether the endianness (byte order) in which character codepoints are processed is reversed when codepoint U+FFFE is encountered in the text data. The Unicode standard defines U+FFFE as a noncharacter, but it can also be interpreted as the BOM character U+FEFF (also known as zero width no-break space) in reversed byte order. The selection only has an effect for multi-byte input charsets (e.g. UTF-16/32 or UCS-2/4).
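The reversal described above can be illustrated with a minimal Python sketch. It is not the implementation used by this software; the function name `decode_utf16_units` and its `swap` parameter are hypothetical, and mapping `swap=True` to ON and `swap=False` to OFF is an assumption made only for illustration (FAIL is not modeled).

```python
import struct

# 0xFFFE is what the BOM U+FEFF looks like when read in the wrong byte order.
REVERSED_BOM = 0xFFFE


def decode_utf16_units(data: bytes, big_endian: bool = True, swap: bool = True) -> list:
    """Return the 16-bit code units of data (hypothetical helper).

    If swap is set, the byte order is flipped every time the reversed BOM
    is read, and the unit is reported as the regular BOM U+FEFF.
    """
    units = []
    for offset in range(0, len(data) - 1, 2):
        fmt = ">H" if big_endian else "<H"
        (unit,) = struct.unpack_from(fmt, data, offset)
        if unit == REVERSED_BOM and swap:
            big_endian = not big_endian
            unit = 0xFEFF  # the same two bytes, read in the new byte order
        units.append(unit)
    return units


# "A" in UTF-16BE, followed by a little-endian BOM and "B" in UTF-16LE.
sample = b"\x00A" + b"\xff\xfe" + b"B\x00"
swapped = decode_utf16_units(sample, big_endian=True, swap=True)
unswapped = decode_utf16_units(sample, big_endian=True, swap=False)
```

With `swap=True` the unit after the reversed BOM is read as `0x0042` ("B"); with `swap=False` the same bytes are read as `0x4200`, which is a different character.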

There are three options: FAIL, ON, and OFF.

When selecting OFF, codepoint U+FFFE might still make the character conversion fail, depending on the destination charset, because non-Unicode charsets have no representation for this codepoint. You may also want to set NONCHR to a suitable value (e.g. IGNORE).
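The following Python sketch shows why the destination charset matters. It uses Python's own codecs rather than this software's conversion, and treating Python's `errors="ignore"` as analogous to NONCHR=IGNORE is an assumption made for illustration. The helper name `try_encode` is hypothetical.

```python
def try_encode(text: str, charset: str):
    """Return the encoded bytes, or None if the charset cannot represent the text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


text = "A\ufffeB"

# A Unicode destination charset can carry U+FFFE.
as_utf8 = try_encode(text, "utf-8")

# A non-Unicode charset such as Latin-1 has no representation for it.
as_latin1 = try_encode(text, "latin-1")

# Dropping the unrepresentable codepoint lets the conversion succeed.
as_latin1_ignored = text.encode("latin-1", errors="ignore")
```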

Concatenations of data encoded in different UTF/UCS variants (e.g. UTF-16 followed by UTF-32) are not supported.

Background:

Normally, a byte order mark (also known as BOM or codepoint U+FEFF) is found at the start of a text document that is encoded in a multi-byte UTF or UCS variant, so that the endianness (byte order) can be detected without user input. If several such documents are concatenated, the BOM character can appear somewhere in the middle of the text data. The Unicode standard defines codepoint U+FEFF within text data as zero width no-break space.
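Detecting the byte order from a leading BOM can be sketched as follows. This is a minimal illustration restricted to UTF-16 (a UTF-32LE BOM begins with the same two bytes as a UTF-16LE BOM, so a complete detector would have to check UTF-32 first). The function name `detect_utf16_byte_order` is hypothetical.

```python
import codecs


def detect_utf16_byte_order(data: bytes):
    """Return "big" or "little" if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None


order_be = detect_utf16_byte_order(codecs.BOM_UTF16_BE + "A".encode("utf-16-be"))
order_le = detect_utf16_byte_order(codecs.BOM_UTF16_LE + "A".encode("utf-16-le"))
order_none = detect_utf16_byte_order("A".encode("utf-16-be"))
```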

If documents are concatenated that are encoded in the same UTF/UCS variant (e.g. UTF-16) but with different endianness (e.g. UTF-16BE and UTF-16LE), and each starts with a BOM, codepoint U+FFFE will be encountered within the text data. The Unicode standard defines codepoint U+FFFE as a noncharacter, even though it is a valid codepoint.
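The situation can be reproduced with a short Python sketch that uses only the standard `codecs` module. The two one-character documents are made up for illustration.

```python
import codecs

# Two tiny documents, each starting with its own BOM.
doc_be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41
doc_le = codecs.BOM_UTF16_LE + "B".encode("utf-16-le")  # FF FE 42 00
combined = doc_be + doc_le

# Decode everything as big endian, as the first BOM suggests.
# The "utf-16-be" codec keeps BOMs in the output instead of stripping them.
text = combined.decode("utf-16-be")
codepoints = [f"U+{ord(ch):04X}" for ch in text]
```

Here `codepoints` is `["U+FEFF", "U+0041", "U+FFFE", "U+4200"]`: the second BOM shows up as U+FFFE, and the "B" that follows it is misread as U+4200 because its bytes are still interpreted as big endian.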

Selections