HELP: Default combined character handling / normalization form [AUTO] TYPE: NUMBER SYNTAX: CMBFRM=NFD/NFC/AUTO/ON/OFF
Combined character form parameter to manage combined character support.
Some characters consist of more than one codepoint and some characters have multiple different codepoint sequences with the same meaning:
For example: the German small letter "u" with diaeresis (u umlaut) has the codepoint:
0x00FC
but you can write the same character with the codepoint for "u" followed by the combining character "..":
0x0075 // letter "u"
0x0308 // combining diaeresis
So, the same character has two different representations. A binary comparison of two strings that look identical to the human reader fails if the strings use different codepoint sequences for the same characters.
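The effect can be reproduced outside of this tool. The following is a minimal sketch in Python (chosen only for illustration; the variable names are hypothetical) that compares the two representations of the u umlaut shown above:

```python
# Two representations of the same character: precomposed vs. combined.
precomposed = "\u00fc"       # 0x00FC
combined = "\u0075\u0308"    # 0x0075 (letter "u") + 0x0308 (combining diaeresis)

# The codepoint sequences differ, although both render as the same character.
same_codepoints = precomposed == combined

# The UTF-8 encodings differ as well, so a binary comparison fails.
precomposed_utf8 = precomposed.encode("utf-8")   # b'\xc3\xbc'
combined_utf8 = combined.encode("utf-8")         # b'u\xcc\x88'
same_bytes = precomposed_utf8 == combined_utf8

print(same_codepoints, same_bytes)   # False False
```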
Normalization forms

To solve this problem, one of the Unicode normalization forms can be used:
To achieve NFD, the string is transformed by decomposing all characters by canonical equivalence and putting any combining characters in a well-defined order (sort on Canonical-Combining-Class "CCC" value of combining characters).
For NFC, characters are decomposed like in NFD and then recomposed by canonical equivalence.
Example:
SOURCE                         NFD                     NFC
0x1E0B 0x0323                  0x0064 0x0323 0x0307    0x1E0D 0x0307
(d with dot below and above)   (d, dot below, above)   (d with dot above and below)
0x0071 0x0307 0x0323           0x0071 0x0323 0x0307    0x0071 0x0323 0x0307
(q with dot below and above)   (q, dot below, above)   (q with dot above and below)
For more information, see: Unicode normalization forms (Unicode Standard Annex #15)
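The two normalization forms can be tried out with Python's standard `unicodedata` module. This is only an illustrative sketch and not how this tool implements normalization; the helper `hexseq` is a hypothetical name. It uses 0x1E0B (d with dot above) followed by 0x0323 (combining dot below), and q followed by dot above and dot below, as input:

```python
import unicodedata

def hexseq(text):
    """Format a string as space-separated 0xNNNN codepoints."""
    return " ".join(f"0x{ord(ch):04X}" for ch in text)

sources = [
    "\u1e0b\u0323",        # d with dot above + combining dot below
    "\u0071\u0307\u0323",  # q + combining dot above + combining dot below
]

for source in sources:
    nfd = unicodedata.normalize("NFD", source)  # decompose and reorder
    nfc = unicodedata.normalize("NFC", source)  # decompose, reorder, recompose
    print(hexseq(source), "|", hexseq(nfd), "|", hexseq(nfc))

# Canonical-Combining-Class values that determine the order of the marks:
ccc_dot_below = unicodedata.combining("\u0323")  # 220
ccc_dot_above = unicodedata.combining("\u0307")  # 230
```

Because the dot below has the lower CCC value (220) than the dot above (230), it is sorted first in both forms. The q sequence is identical in NFD and NFC because no precomposed q with a dot exists.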
With the values NFD/NFC, the data is normalized. This is usually only useful with multi-byte character sets (UTF). It is not possible to use user-tables and normalization in one step.
With the default parameter value "AUTO", combined character detection is attempted on the first block. If it is deemed useful (conversion to a single-byte charset), combined character support is activated for character conversion.
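Why composition matters for a single-byte charset can be sketched as follows. This is a hypothetical illustration in Python, not the detection logic actually used for "AUTO": the function names, the use of NFC as the composition step, and the choice of Latin-1 as target charset are all assumptions made for the example.

```python
import unicodedata

def has_combining_characters(block):
    """Return True if the block contains at least one character with a
    non-zero Canonical-Combining-Class (hypothetical detection rule)."""
    return any(unicodedata.combining(ch) != 0 for ch in block)

def convert_block(block, charset="latin-1"):
    """Compose combined characters only if some were detected, then
    convert to the single-byte charset (assumed AUTO-like behaviour)."""
    if has_combining_characters(block):
        block = unicodedata.normalize("NFC", block)
    return block.encode(charset)

# "München" written with "u" + combining diaeresis (0x0075 0x0308)
first_block = "Mu\u0308nchen"
converted = convert_block(first_block)
print(converted)   # b'M\xfcnchen'
```

Without the composition step, the combining diaeresis 0x0308 has no representation in Latin-1 and `first_block.encode("latin-1")` raises a `UnicodeEncodeError`.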
The parameter value "ON" enforces character conversion with combined character support for any destination charset. In contrast to normalization, combined character support can be used together with any character conversion feature (user-table, transliteration, case-mapping, ...).
The parameter value "OFF" deactivates combined character support, which results in a much faster character conversion procedure being used.
NFD - Normalization form D (Canonical Decomposition)
NFC - Normalization form C (Canonical Decomposition, followed by Canonical Composition)
AUTO - Detect combined characters and compose them if useful
ON - Run with combined character support (slow)
OFF - Do not use combined character support