HELP:   Default keyword or filename containing the user character conversion table
TYPE:   STRING
SYNTAX: USRTAB='str'/NPAS/SEPA/DELA/DLAX
With this parameter, you can choose a predefined subset or standard
manipulation for the system tables via keyword (see
flcl help "xcnv.inp.sav.fil.cnv.chr.usrtab"
for available keywords).
Alternatively, you can use a file containing a user-defined substitution
table (USRTAB) or you can specify a string containing the user table
definitions. To distinguish such a string from a file name, the user
table string must start with a colon (:). A keyword is translated to a
name with a two-colon prefix (::CCUT; for example, NPAS becomes
::CCUTNPAS), so do not start a string definition with two colons. A
string definition is mainly useful for short user tables or when the
user table is generated by an application. For a user table file, all
rules and replacements described under "FILENAME HANDLING" above apply.
For example:
    USRTAB=NPAS                              ; use XOEV predefined subset
    USRTAB=DELA                              ; German Latin single byte subset
    USRTAB='::CCUTNPAS'                      ; same as keyword NPAS
    USRTAB='~.MYUSRTAB(USRTAB1)'             ; PDS on z/OS
    USRTAB='<SYSUID>.MYUSRT02'               ; PS dataset on z/OS
    USRTAB='~/myusrtab3.txt'                 ; Unix path name
    USRTAB='<HOME>\\usrtab\\myusrtab4.txt'   ; Windows path name
    USRTAB='ssh://user@server/myusertab.txt' ; read via SSH from another system
    USRTAB=':(00C4=0041,0045)(00D6=004F,0045)(00DC=0055,0045)(00E4=0061,0065)(00F6=006F,0065)(00FC=0075,0065)(00DF=0073,0073)'
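The prefix rule for the quoted form can be pictured with a small Python sketch. It is only an illustration of the convention described above, not the tool's own logic; the function name classify_usrtab and the returned labels are hypothetical.

```python
def classify_usrtab(value):
    """Classify a quoted USRTAB value by its prefix (illustrative model only)."""
    if value.startswith("::"):
        return "keyword"        # e.g. '::CCUTNPAS'
    if value.startswith(":"):
        return "table-string"   # e.g. ':(00DF=0073,0073)'
    return "file-name"          # anything else is taken as a file name or URL


for example in ("::CCUTNPAS", ":(00DF=0073,0073)", "~/myusrtab3.txt"):
    print(example, "->", classify_usrtab(example))
```

The two-colon test has to come first, because a keyword reference also starts with a single colon.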
A user defined substitution table has the syntax listed below:
    usr_tab_file         -> usr_definition_list
    usr_definition_list  -> usr_definition usr_definition_list
                         |  EMPTY
    usr_definition       -> '(' codepoint_definition ')'
    codepoint_definition -> cp_activation
                         |  cp_deactivation
                         |  cp_transliteration
                         |  cp_standard_mapping
                         |  cp_usr_case_mapping
    cp_activation        -> ['+'] CODEPOINT [ '-' CODEPOINT ]
    cp_deactivation      -> '-' CODEPOINT [ '-' CODEPOINT ]
    cp_transliteration   -> ['+'] CODEPOINT '=' code_point_list
    cp_standard_mapping  -> '*' code_point_list '=' code_point_list
    cp_usr_case_mapping  -> '^' CODEPOINT '=' code_point_list
    code_point_list      -> CODEPOINT separator code_point_list
                         |  CODEPOINT
                         |  EMPTY
    separator            -> '/' | ',' | ';'
In the description below "/" is used as code point separator.
The optional signs in front of a code point (CP) have the following meaning:
    '+' - a valid transliteration (only if CP not in the target set)
    '*' - a valid mapping (this translation is always done)
    '^' - a case mapping/folding (only if case=usrtab is activated)
    '-' - an invalid definition (removes CP for subset definitions)
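As a rough illustration of how the grammar and the sign characters fit together, the following Python sketch pulls the definitions out of a user table text and names the kind of each one. It is a simplified model written for this description, not the parser used by the tool; split_definitions and classify_definition are hypothetical names, and syntax errors are not checked.

```python
import re


def split_definitions(table):
    """Return the text inside each pair of parentheses; the rest is comment."""
    return re.findall(r"\(([^()]*)\)", table)


def classify_definition(body):
    """Name the kind of definition based on its sign and on the presence of '='."""
    body = body.strip()
    sign = body[0] if body and body[0] in "+-*^" else ""
    if sign == "-":
        return "deactivation"
    if sign == "*":
        return "mapping"
    if sign == "^":
        return "case mapping"
    # No sign or '+': the assignment decides between the two remaining kinds.
    return "transliteration" if "=" in body else "activation"


table = "REPLACE GERMAN SZ (00DF=0073/0073) WITH ss (-00-7F) (+39-30) (*30=39)"
for body in split_definitions(table):
    print(body, "->", classify_definition(body))
```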
If no sign is used, then a transliteration for a valid code point is
defined. Valid code points are accepted; invalid code points result in
an error. Error handling depends on the mode (STOP, IGNORE,
SUBSTITUTE). Defining a + code point without a code point list, that is
without =, is the same as defining a valid code point. With an asterisk
(*), you can define an enforced mapping for a code point. For example,
this can be used to delete the character (if no code point list is
provided) or to always translate the character to another value. It can
also be used to convert code points outside of a subset into this
subset.
In contrast to the transliteration (+), which is only done if a
character doesn't exist in the target encoding, the mapping is always
done for each character. Invalid (-) code points can be used to
deactivate a code point. To activate (+) or deactivate (-) a range of
code points, use the minus sign (-) followed by a second code point. The
range goes from the smaller code point to the bigger one. For example:
    (-00-7F) deactivate all US-ASCII code points
    (+39-30) activate all decimal digits
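The range rule (always from the smaller to the bigger code point, whatever the order in which they are written) can be sketched in Python as follows. This is a simplified model for activation and deactivation definitions only; expand_range is a hypothetical helper, not part of the tool.

```python
def expand_range(definition):
    """Return (active, code_points) for a definition such as '(-00-7F)' or '(+39-30)'."""
    body = definition.strip("() ")
    active = not body.startswith("-")       # a leading '-' deactivates
    body = body.lstrip("+-")                # drop the sign, keep the range operator
    first, _, second = body.partition("-")
    # A single code point is treated as a range of length one.
    low, high = sorted((int(first, 16), int(second or first, 16)))
    return active, range(low, high + 1)


active, points = expand_range("(+39-30)")
print(active, "".join(chr(cp) for cp in points))   # True 0123456789
```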
The range operator (-) with the optional plus sign (+) can also be
used to activate code points.
To define your own case mapping or folding (required for certain
subsets), the caret sign (^) can be used. This mapping is only done if
the user table is activated for case mapping (CASE=USRTAB).
To define transliterations, substitutions or mappings, an assignment (=)
of a code point list is required. The code point list can contain a
maximum of 8 code points. If no code point list is specified, then the
character is ignored. A transliteration to itself simply activates this
code point. The same is valid for mapping definitions. A code point is a
hexadecimal number representing a 21-bit Unicode code point:

    00000000 to 001FFFFF hex
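The two limits stated above (at most 8 code points per list, each within 21 bits) can be expressed as a short check. The Python sketch below is only an illustration; check_code_point_list is a hypothetical name and the error messages are invented for this example.

```python
MAX_CODE_POINT = 0x1FFFFF   # 21 bits, as stated above
MAX_LIST_LENGTH = 8         # maximum number of code points in one list


def check_code_point_list(code_points):
    """Return the list unchanged if it respects both limits, else raise ValueError."""
    if len(code_points) > MAX_LIST_LENGTH:
        raise ValueError("more than 8 code points in one list")
    for cp in code_points:
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError("code point outside the 21-bit range: %X" % cp)
    return list(code_points)


print(check_code_point_list([0x0073, 0x0073]))
```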
If a code point is not defined in the substitution table (USRTAB and/or SYSTAB), the appropriate substitution character (SUBCHR) is used. If SUBCHR is not set, no substitution is performed. Depending on the MODE, character conversion stops with an error (STOP) or ignores the character (IGNORE).
Text before and after the parentheses is treated as a comment.
Between "(" and "=" or "/" or ")" you can define hex digits until the first non-hex digit. All non-hex digits up to the next separator are interpreted as comment. Leading whitespace is ignored.
    REPLACE GERMAN SZ (00DF=0073/0073) WITH ss
    THIS IS AN EXAMPLE FOR COMMENTS IN CODE POINT LIST:
    REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
    REPLACE BOM MARK (FEFF=) WITH NOTHING
Leading zeros are allowed. If you don't supply a hex value, then 0x00 is used.
    REPLACE GERMAN SZ (00DF=/) WITH 0x00 0x00
    REPLACE EURO MARK (20AC=00000045/00055/000052/000004F) WITH EURO
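The rules for reading a code point list (leading whitespace ignored, hex digits up to the first non-hex character, the remainder up to the next separator treated as comment, a missing value read as 0x00) can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the tool's parser: parse_code_point_list is a hypothetical name, and treating a completely empty list as "no output" follows the "WITH NOTHING" example above.

```python
import re

SEPARATORS = re.compile(r"[/,;]")            # the three separators of the grammar
LEADING_HEX = re.compile(r"\s*([0-9A-Fa-f]*)")


def parse_code_point_list(text):
    """Parse the text after '=' into a list of integer code points."""
    if not text.strip():
        return []                            # empty list: the character is dropped
    values = []
    for item in SEPARATORS.split(text):
        digits = LEADING_HEX.match(item).group(1)
        # Everything after the leading hex digits is comment; no digits means 0x00.
        values.append(int(digits, 16) if digits else 0)
    return values


print([hex(cp) for cp in parse_code_point_list(" 45#E / 55#U / 52#R / 4F#O")])
```

Note that in "45#E" the letter E after the "#" is already part of the comment, because parsing of the value stops at the first non-hex character.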
Don't use parentheses "()" or the operators in comments.
To describe your own subsets, you can use a user table without a system table. The USRTAB can also be used to overwrite or add transliterations when a system substitution table (for example SYSTAB=ICONV) is used. The transliteration works recursively: if one of the substitution code points is not in the target set, the substitution defined for that code point is used instead, and so on.
    REPLACE GERMAN OE (0000D6=004F/0045) WITH O E
    REPLACE GERMAN SZ (0000DF=0073/0073) WITH s s
    REPLACE EURO MARK (0020AC=00D6/00DF/52/4F) WITH OE SZ R O
If you have a EURO sign in your text and convert it to 'Latin-1', the resulting byte string will be 'D6DF524F'. On the other hand, if you convert it to 'ASCII', the byte string will be '4F457373524F'.
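The two byte strings can be reproduced with a small Python model of recursive transliteration. It is a sketch only: the table is copied from the three definitions above, the target sets are approximated as "code point up to 0xFF" (Latin-1) and "up to 0x7F" (ASCII), the depth guard is an arbitrary safety net, and raising an error for a missing substitution stands in for the SUBCHR/MODE handling of the real tool. All names are hypothetical.

```python
TRANSLITERATIONS = {
    0x00D6: [0x004F, 0x0045],                  # OE -> O E
    0x00DF: [0x0073, 0x0073],                  # SZ -> s s
    0x20AC: [0x00D6, 0x00DF, 0x52, 0x4F],      # EURO -> OE SZ R O
}


def transliterate(cp, in_target, depth=0, max_depth=64):
    """Replace cp recursively until every code point is in the target set."""
    if in_target(cp):
        return [cp]
    if depth >= max_depth or cp not in TRANSLITERATIONS:
        raise ValueError("no substitution for U+%04X" % cp)
    result = []
    for replacement in TRANSLITERATIONS[cp]:
        result.extend(transliterate(replacement, in_target, depth + 1, max_depth))
    return result


def to_hex(code_points):
    return "".join("%02X" % cp for cp in code_points)


print(to_hex(transliterate(0x20AC, lambda cp: cp <= 0xFF)))   # D6DF524F
print(to_hex(transliterate(0x20AC, lambda cp: cp <= 0x7F)))   # 4F457373524F
```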
The mapping described above is always done, and it is also applied
recursively to the result of a transliteration. For example, if you
define the mapping (*30=39), the resulting text won't contain any zero.
By replacing other code points recursively, you could easily cause infinite loops. To prevent this, the number of replacements and the length of one replacement are each limited to a maximum of 64. Note that this could still result in a large expansion of data. Therefore, be careful about defining recursive substitutions.
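How an enforced mapping interacts with a transliteration result can be pictured with the following Python sketch. It models only the behaviour described above and is not the tool's algorithm: the mapping is (*30=39), the transliteration of U+2070 (SUPERSCRIPT ZERO) to the digit zero is a hypothetical entry added for the demonstration, the depth guard of 64 is a simplification of the limits mentioned above, and code points without any substitution are simply dropped.

```python
MAPPINGS = {0x30: [0x39]}               # (*30=39): always applied
TRANSLITERATIONS = {0x2070: [0x30]}     # hypothetical: superscript zero -> 0


def convert_cp(cp, in_target, depth=0):
    """Apply mappings first, then transliterations, recursively on the results."""
    if depth > 64:
        raise ValueError("replacement limit exceeded")
    if cp in MAPPINGS:
        replacement = MAPPINGS[cp]
    elif not in_target(cp):
        replacement = TRANSLITERATIONS.get(cp, [])   # simplification: drop if unknown
    else:
        return [cp]
    if replacement == [cp]:
        return [cp]                     # a definition to itself only activates cp
    out = []
    for item in replacement:
        out.extend(convert_cp(item, in_target, depth + 1))
    return out


def convert(text, in_target):
    return "".join(chr(cp) for ch in text for cp in convert_cp(ord(ch), in_target))


print(convert("2020\u2070", lambda cp: cp <= 0x7F))   # 29299
```

The superscript zero is first transliterated to 0 and then mapped to 9, so no zero survives in the output.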
You can also define mappings of more than one codepoint to other codepoints (multiple mapping). This can be used, e.g., to define "tagging" mapping:
    REPLACE GERMAN OE (*D6=2F,75,27,4F,45,27) WITH /u'OE'
and undo tagging:
    REPLACE /u'OE' (*2F,75,27,4F,45,27=D6) WITH GERMAN OE
Another example for multiple mappings is to convert HTML entities to their respective characters:
    REPLACE &lt; (*26,6C,74,3B=3C) WITH <
    REPLACE &gt; (*26,67,74,3B=3E) WITH >
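A multiple mapping replaces a sequence of code points rather than a single one. The Python sketch below models this for the two HTML entity definitions above (26,6C,74,3B is the entity for the less-than sign, 26,67,74,3B the one for the greater-than sign). It is an illustration only; apply_multi_mappings is a hypothetical name and preferring the longest match is an assumption of this sketch.

```python
MULTI_MAPPINGS = {
    (0x26, 0x6C, 0x74, 0x3B): (0x3C,),   # "&lt;" -> "<"
    (0x26, 0x67, 0x74, 0x3B): (0x3E,),   # "&gt;" -> ">"
}


def apply_multi_mappings(code_points):
    """Replace every occurrence of a mapped sequence, scanning left to right."""
    keys = sorted(MULTI_MAPPINGS, key=len, reverse=True)   # try longest first
    out, i = [], 0
    while i < len(code_points):
        for key in keys:
            if tuple(code_points[i:i + len(key)]) == key:
                out.extend(MULTI_MAPPINGS[key])
                i += len(key)
                break
        else:
            out.append(code_points[i])   # no sequence starts here: copy as is
            i += 1
    return out


text = "a &lt; b &gt; c"
print("".join(chr(cp) for cp in apply_multi_mappings([ord(c) for c in text])))
# a < b > c
```

The sketch compares every defined sequence at every input position, which gives an impression of why sequence matching costs more than a single table lookup per character.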
If multiple mappings are defined in a user table, it is important to know that character conversion (cnvchr) becomes much slower because finding multibyte sequences is a more expensive operation. An example of String.Latin subset (XOEF) normalisation mapping NFD (Normalization Form D) and NFC (Normalization Form C) is located in the SAMPLE directory under NPASNFD and NPASNFC.
A sample user table for the ICONV system table to change the transliteration of German umlauts to AE, OE or UE is located in the SAMPLE directory under CCUTDEXL. Another sample user table called CCUTNPAS defines the 'string.latin' subset (XOEF) which is mainly used for statutory reporting.
The order of definition is the order of processing the definitions. For example, if you define a mapping or a transliteration for code point X and later you make this code point invalid, then the code point is invalid. On the other hand, you can first deactivate all code points, then activate your subset and define your transliterations (best fit) or mappings.
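The effect of the definition order on a subset can be shown with a short Python model. It only tracks which code points are valid and ignores transliterations and mappings; build_valid_set and the tuple layout (active, low, high) are hypothetical and chosen for this illustration.

```python
def build_valid_set(definitions):
    """Process (active, low, high) ranges in order; later entries win."""
    valid = set()
    for active, low, high in definitions:
        points = range(low, high + 1)
        if active:
            valid.update(points)
        else:
            valid.difference_update(points)
    return valid


# (-00-7F) followed by (+30-39): only the decimal digits remain valid.
digits_only = build_valid_set([(False, 0x00, 0x7F), (True, 0x30, 0x39)])
print("".join(chr(cp) for cp in sorted(digits_only)))   # 0123456789

# The same two definitions in reverse order leave nothing valid.
print(build_valid_set([(True, 0x30, 0x39), (False, 0x00, 0x7F)]))   # set()
```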
For example, to define a non-expanding best-fit mapping, a user table can be used to delete all combined characters for single-byte code pages. For the ICONV system transliteration table, a sample user table can be found under the name CCUTBFM1.
Some other subset user table definitions were added. For example:
All user table definitions are pre-calculated once at the beginning of execution to reduce the required CPU time for the conversion. So, the use of a user table increases the cost to open a file but has no effect on CPU utilization when converting the data.
To minimize the effort and to simplify the usage of user table, the most common subsets can be selected by keyword. In this case, a pre-calculated load module (DLL) is used. The subset definition can be for UTF (21 Bit Unicode) or UCS (16 Bit Unicode). If it is a UCS subset, then no valid code point is greater than 0xFFFF and you must use UTF CCSIDS to ensure that no code point greater than 0xFFFF is accepted.
If you have a working user table definition for a popular subset or system table manipulation, you can request a keyword for it. Please report this requirement via our issue tracking system:
https://www.flam.de/en/technology/support/issues/
Feel free to attach the corresponding user table definition.
NPAS - UCS subset for statutory reporting (New passport, XOEF, String.Latin)
SEPA - UCS subset of valid SEPA characters smaller than 128
DELA - UCS subset for German Latin of IBM1141, ISO8859-15 and CP1252
DLAX - UCS subset for German Latin of IBM1141, ISO8859-15, CP1252 and XOEF