CMBFRM

Synopsis

HELP:   Default combined character handling / normalization form [AUTO]
TYPE:   NUMBER
SYNTAX: CMBFRM=NFD/NFC/AUTO/ON/OFF

Description

This parameter selects a normalization form or controls combined character support during character conversion.

Some characters consist of more than one codepoint, and some characters can be written as several different codepoint sequences with the same meaning.

For example:

German small letter "u" with diaeresis (u umlaut) has the codepoint:

 0x00FC

but the same character can also be written as the letter "u" followed by the combining diaeresis:

 0x0075  // letter "u"
 0x0308  // combining diaeresis

So, the same character has two different representations. A binary comparison of two strings that look identical to the human reader fails if the strings use different codepoint sequences for the same character.
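
The comparison problem can be reproduced with a minimal sketch in Python. It uses the standard library module unicodedata, not the converter described here, and the variable names are chosen only for illustration:

```python
import unicodedata

precomposed = "\u00fc"   # single codepoint: u with diaeresis
decomposed = "u\u0308"   # letter "u" followed by combining diaeresis

# A binary comparison sees two different byte sequences.
binary_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing both sides to the same form, the strings match.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(binary_equal, normalized_equal)
```

Both strings display as the same character, but only the normalized comparison reports them as equal.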

Normalization forms

To solve this problem, one of the Unicode normalization forms can be used:

To achieve NFD (canonical decomposition), the string is transformed by decomposing all characters by canonical equivalence and putting any combining characters in a well-defined order (sorted by the Canonical-Combining-Class "CCC" value of the combining characters).

For NFC (canonical composition), characters are decomposed as in NFD and then recomposed by canonical equivalence.

Example:

    SOURCE                           NFD                        NFC

 0x1E0B 0x0323             0x0064 0x0323 0x0307             0x1E0D 0x0307
(d with dot               (d, dot below, above)           (d with dot
 above and below)                                          below and above)

 0x0071 0x0307 0x0323      0x0071 0x0323 0x0307             0x0071 0x0323 0x0307
(q with dot               (q, dot below, above)           (q with dot
 above and below)                                          below and above)
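
The two cases can be checked with a short Python sketch based on the standard library module unicodedata. This is an illustration only and says nothing about how the converter is implemented. The d case starts from U+1E0B (small d with dot above) followed by U+0323 (combining dot below); the helper hex_codepoints is a hypothetical name introduced for this example:

```python
import unicodedata

def hex_codepoints(text):
    """Format a string as a list of hexadecimal codepoints."""
    return " ".join(f"0x{ord(ch):04X}" for ch in text)

samples = [
    "\u1e0b\u0323",   # d with dot above, then combining dot below
    "q\u0307\u0323",  # q, combining dot above, combining dot below
]

for source in samples:
    nfd = unicodedata.normalize("NFD", source)
    nfc = unicodedata.normalize("NFC", source)
    print(hex_codepoints(source), "|", hex_codepoints(nfd), "|", hex_codepoints(nfc))

# The reordering follows the canonical combining class (CCC):
# marks with a lower CCC value are placed first.
ccc_dot_below = unicodedata.combining("\u0323")
ccc_dot_above = unicodedata.combining("\u0307")
```

The dot below has the lower CCC value, so it is sorted in front of the dot above. For q there is no precomposed character, so NFC leaves the sequence decomposed.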

For more information, see the Unicode normalization forms (Unicode Standard Annex #15).

With the values NFD or NFC, the data is normalized. This is usually only useful for multi-byte character sets (UTF). It is not possible to use user tables and normalization in one step.

With the default parameter value "AUTO", combined character detection is attempted on the first block of data. If combined character support is deemed useful (conversion to a single-byte charset), it is activated for character conversion.
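
Why combined character handling matters for a single-byte destination can be sketched in Python. The example uses Python's own latin-1 codec and unicodedata module as stand-ins; the function names are hypothetical and the code does not reflect the converter's internal procedure:

```python
import unicodedata

text = "u\u0308"  # letter "u" followed by combining diaeresis

def to_latin1_codepointwise(s):
    """Convert each codepoint on its own; unmappable ones become '?'."""
    return s.encode("latin-1", errors="replace")

def to_latin1_combined(s):
    """Treat base letter plus combining mark as one character first."""
    composed = unicodedata.normalize("NFC", s)
    return composed.encode("latin-1", errors="replace")

print(to_latin1_codepointwise(text))
print(to_latin1_combined(text))
```

Converted codepoint by codepoint, the combining diaeresis has no counterpart in the single-byte charset and is replaced. When the sequence is handled as one combined character, it maps to the single byte 0xFC.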

The parameter value "ON" enforces character conversion with combined character support for any destination charset. In contrast to normalization, combined character support can be used together with any character conversion feature (user table, transliteration, case mapping, ...).

The parameter value "OFF" deactivates combined character support, which results in a much faster character conversion procedure being used.

Selections