HELP: Combined character handling / normalization form [AUTO] TYPE: NUMBER SYNTAX: CMBFRM=NFD/NFC/AUTO/ON/OFF
Combined character form parameter to control combined character support and normalization.
Some characters consist of more than one code point, and some characters have multiple different code point sequences with the same meaning.
For example:
German small letter "u" with diaeresis (u umlaut) has the code point:
   0x00FC   // u umlaut
but you can write the same character with the code point for "u" followed by a combining diaeresis:
   0x0075   // letter "u"
   0x0308   // combining diaeresis
So, the same character has two different binary representations. A binary comparison of two strings that are identical to the human reader fails if the strings use different representations.
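The effect described above can be reproduced with Python's standard `unicodedata` module (a minimal sketch; `unicodedata` is not part of the tool documented here):

```python
import unicodedata

precomposed = "\u00FC"        # u umlaut as a single code point
decomposed = "\u0075\u0308"   # "u" followed by combining diaeresis

# Both render identically, but a binary comparison fails:
print(precomposed == decomposed)   # False

# After bringing both strings to the same normalization form,
# the comparison succeeds:
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```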
Normalization forms
To solve this problem, one of the Unicode normalization forms can be used:
To achieve NFD, the string is transformed by decomposing all characters by canonical equivalence and putting the combining characters into a well-defined order (sorted by the Canonical Combining Class (CCC) value of the combining characters).
For NFC, characters are first decomposed as in NFD and then recomposed by canonical equivalence.
Example:
   SOURCE                         NFD                         NFC
   0x1E0B 0x0323                  0x0064 0x0323 0x0307        0x1E0D 0x0307
   (d with dot above and below)   (d, dot below, dot above)   (d with dot below, dot above)

   0x0071 0x0307 0x0323           0x0071 0x0323 0x0307        0x0071 0x0323 0x0307
   (q with dot above and below)   (q, dot below, dot above)   (q, dot below, dot above)
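The table above can be checked with Python's `unicodedata.normalize` (a sketch for illustration only; this module is unrelated to the tool documented here). Note how NFD reorders the combining marks by their CCC value, and how NFC recomposes "d + dot below" into 0x1E0D while "q" stays decomposed because no precomposed form exists:

```python
import unicodedata

# Row 1: d with dot above (0x1E0B) followed by combining dot below (0x0323)
src_d = "\u1E0B\u0323"
print(unicodedata.normalize("NFD", src_d) == "\u0064\u0323\u0307")   # True
print(unicodedata.normalize("NFC", src_d) == "\u1E0D\u0307")         # True

# Row 2: q with combining dot above, then combining dot below
src_q = "\u0071\u0307\u0323"
# NFD sorts the marks by CCC (dot below = 220 before dot above = 230):
print(unicodedata.normalize("NFD", src_q) == "\u0071\u0323\u0307")   # True
# No precomposed "q with dot" exists, so NFC equals NFD here:
print(unicodedata.normalize("NFC", src_q) == "\u0071\u0323\u0307")   # True
```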
For more information, see: Unicode normalization forms (Unicode Standard Annex #15)
With the values NFD/NFC, the data is normalized; this is usually only useful for multi-byte character sets (UTF). It is not possible to use user tables and normalization in one step.
With the default parameter value "AUTO", combined character detection is attempted on the first block. If deemed useful (e.g. conversion to a single-byte charset), combined character support is activated for the character conversion.
The parameter value "ON" enforces character conversion with combined character support for any destination charset. In contrast to normalization, combined character support allows any character conversion feature (user table, transliteration, case mapping, ...) to be used.
The parameter value "OFF" deactivates combined character support, which results in a much faster character conversion procedure being used.
A good example of the use of combined character support is best-fit mapping without expansion. It often happens that standards or requirements allow only the first 128 code points of UTF-8 (the US-ASCII subset) and demand that an umlaut becomes a plain U/A/O, regardless of whether the umlaut was encoded as a single code point or as U/A/O followed by a combining character.
flcl conv "read.char(file='utf8tst.bin' ccsid='UTF-8' combined=NFD) write.char(file='out.txt' ccsid='us-ascii' chrmode=ignore combined=off)"
The call above reads UTF-8 and converts it to UTF-8 in NFD (base character followed by combining characters). The second conversion generates US-ASCII, which corresponds to UTF-8 for the first 128 code points. All non-convertible characters are ignored, and because combined character support is switched off, all combining characters are ignored as well, so only the base character (U/A/O) remains of an umlaut.
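The same NFD-then-strip idea can be sketched in Python to show why the two-step conversion yields the base character. The helper `to_ascii_base` below is a hypothetical illustration, not part of flcl: it decomposes to NFD, drops the combining marks (Unicode category "Mn"), and ignores any remaining non-ASCII characters, mirroring `chrmode=ignore` with `combined=off`:

```python
import unicodedata

def to_ascii_base(text):
    # Hypothetical helper illustrating the flcl example above.
    # Step 1: decompose to NFD (base character + combining marks).
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop combining marks (category "Mn" = nonspacing mark).
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    # Step 3: ignore anything that is still not US-ASCII.
    return stripped.encode("ascii", "ignore").decode("ascii")

# Both encodings of the umlaut yield the same base character:
print(to_ascii_base("\u00DCber"))       # Uber (precomposed umlaut)
print(to_ascii_base("U\u0308ber"))      # Uber (combining diaeresis)
```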
NFD - Normalization form D (Canonical Decomposition)
NFC - Normalization form C (Canonical Decomposition, followed by Canonical Composition)
AUTO - Detect combined characters and compose them if useful
ON - Run with combined character support (slow)
OFF - Do not use combined character support