USRTAB

Synopsis

HELP:   Default keyword or filename containing the user character conversion table
TYPE:   STRING
SYNTAX: USRTAB='str'/NPAS/SEPA/DELA/DLAX

Description

With this parameter, you can select a predefined subset or a standard manipulation of the system tables by keyword (see flcl help "xcnv.inp.sav.fil.cnv.chr.usrtab" for the available keywords). Alternatively, you can specify a file containing a user-defined substitution table (USRTAB), or a string containing the user table definitions.

To distinguish such a string from a file name, the user table string must start with a colon (:). A keyword definition results in a value prefixed with two colons (::CCUT), so do not start a string definition with two colons. A string definition is mainly useful for short user tables or if the user table is generated by an application. For the user table file, all rules and replacements described under "FILENAME HANDLING" above apply. For example:

   USRTAB=NPAS                         ; use XOEV predefined subset
   USRTAB=DELA                         ; German Latin single byte subset
   USRTAB='::CCUTNPAS'                 ; same as keyword NPAS
   USRTAB='~.MYUSRTAB(USRTAB1)'        ; PDS on z/OS
   USRTAB='<SYSUID>.MYUSRT02'          ; PS dataset on z/OS
   USRTAB='~/myusrtab3.txt'            ; Unix path name
   USRTAB='<HOME>\\usrtab\\myusrtab4.txt' ; Windows path name
   USRTAB='ssh://user@server/myusertab.txt' ; read per SSH from other system
   USRTAB=':(00C4=0041,0045)(00D6=004F,0045)(00DC=0055,0045)(00E4=0061,0065)(00F6=006F,0065)(00FC=0075,0065)(00DF=0073,0073)'
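
If the user table is generated by an application, the string form can be assembled programmatically. The following Python sketch is only an illustration: the helper build_usrtab_string is hypothetical and not part of FLCL. It rebuilds the umlaut table from the last example above, using four-digit hex code points, a comma as separator and the limit of 8 code points per list stated in this description.

```python
def build_usrtab_string(translit: dict) -> str:
    """Return a USRTAB string (leading ':') for the given transliterations.

    Hypothetical helper for illustration only; not part of FLCL.
    """
    parts = []
    for source, target in translit.items():
        if len(source) != 1:
            raise ValueError("each key must be a single character")
        if len(target) > 8:
            raise ValueError("a code point list holds at most 8 code points")
        target_list = ",".join(f"{ord(ch):04X}" for ch in target)
        parts.append(f"({ord(source):04X}={target_list})")
    return ":" + "".join(parts)


# German umlauts and sharp s, written as escapes to stay encoding-neutral.
german = {
    "\u00C4": "AE", "\u00D6": "OE", "\u00DC": "UE",
    "\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue",
    "\u00DF": "ss",
}
usrtab = build_usrtab_string(german)
print(f"USRTAB='{usrtab}'")
```

The printed value can be pasted into the command as shown in the examples above.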

A user defined substitution table has the syntax listed below:

   usr_tab_file         -> usr_definition_list
   usr_definition_list  -> usr_definition usr_definition_list
                        |  EMPTY
   usr_definition       -> '(' codepoint_definition ')'
   codepoint_definition -> cp_activation
                        |  cp_deactivation
                        |  cp_transliteration
                        |  cp_standard_mapping
                        |  cp_usr_case_mapping
   cp_activation        -> ['+'] CODEPOINT  [ '-' CODEPOINT ]
   cp_deactivation      ->  '-'  CODEPOINT  [ '-' CODEPOINT ]
   cp_transliteration   -> ['+'] CODEPOINT '=' code_point_list
   cp_standard_mapping  ->  '*'  code_point_list '=' code_point_list
   cp_usr_case_mapping  ->  '^'  CODEPOINT '=' code_point_list
   code_point_list      -> CODEPOINT separator code_point_list
                        |  CODEPOINT
                        |  EMPTY
   separator            -> '/'
                        |  ','
                        |  ';'

In the description below "/" is used as code point separator.
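
To make the grammar above more tangible, the following Python sketch splits a user table into its definitions and names the grammar rule each one matches. It is a simplified illustration, not the parser used by FLAM: the function names are hypothetical, and code points are not validated.

```python
import re

# Only text inside parentheses is a definition; everything else is comment.
DEFINITION = re.compile(r"\(([^()]*)\)")


def classify(body: str) -> str:
    """Return the name of the grammar rule that matches a definition body."""
    body = body.strip()
    if body.startswith("*"):
        return "cp_standard_mapping"
    if body.startswith("^"):
        return "cp_usr_case_mapping"
    if body.startswith("-"):
        return "cp_deactivation"
    if "=" in body:
        return "cp_transliteration"
    return "cp_activation"


def definitions(table: str) -> list:
    """Split a user table into (rule name, definition body) pairs."""
    return [(classify(m.group(1)), m.group(1)) for m in DEFINITION.finditer(table)]


sample = "ASCII off (-00-7F) digits on (+30-39) sz (00DF=0073/0073) zero (*30=39)"
for rule, body in definitions(sample):
    print(f"{rule:20} {body}")
```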

The optional signs in front of a code point (CP) have the following meaning:

   '+' - a valid transliteration (only if CP not in the target set)
   '*' - a valid mapping         (this translation is always done)
   '^' - a case mapping/folding  (only if case=usrtab is activated)
   '-' - an invalid definition   (removes CP for subset definitions)

If no sign is used, a transliteration for a valid code point is defined. Valid code points are accepted; invalid code points result in an error. Error handling depends on the mode (STOP, IGNORE, SUBSTITUTE). Defining a + code point without a code point list, that is without =, is equivalent to defining a valid code point. With an asterisk (*), you can define an enforced mapping for a code point. For example, this can be used to delete the character if no code point list is provided, or to always translate the character to another value. It can also be used to convert code points outside of a subset into this subset.

In contrast to the transliteration (+), which is only applied if a character doesn't exist in the target encoding, the mapping is always applied to each character. Invalid (-) code points can be used to deactivate a code point. To activate (+) or deactivate (-) a range of code points, use the minus sign (-) followed by a second code point. The range runs from the smaller code point to the larger one. For example:

   (-00-7F) deactivate all US-ASCII code points
   (+39-30) activate all decimal digits

The range operator (-) with the optional plus sign (+) can also be used to activate code points.
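
The range rule can be sketched in Python as follows. This is a minimal illustration of the behaviour described above (the range always runs from the smaller to the larger code point), not FLAM code; parse_range is a hypothetical name.

```python
def parse_range(body: str):
    """Parse a body such as '+39-30' or '-00-7F'.

    Returns (activate, code_points), where code_points is a Python range
    that runs from the smaller to the larger code point, inclusive.
    """
    activate = not body.startswith("-")          # a leading '-' deactivates
    first, _, last = body.lstrip("+-").partition("-")
    low, high = sorted((int(first, 16), int(last or first, 16)))
    return activate, range(low, high + 1)


activate, digits = parse_range("+39-30")
print(activate, "".join(chr(cp) for cp in digits))
```

The order of the two code points does not matter: (+39-30) and (+30-39) describe the same range.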

To define your own case mapping or folding (required for certain subsets), use the caret sign (^). This mapping is only applied if the user table is activated for case mapping (CASE=USRTAB).

To define transliterations, substitutions or mappings, an assignment (=) of a code point list is required. The code point list can contain a maximum of 8 code points. If no code point list is specified, the character is ignored. A transliteration to itself simply activates this code point. The same is valid for mapping definitions. A code point is a hexadecimal number representing a 21-bit Unicode code point:

   00000000 to 001FFFFF hex

If a code point is not defined in the substitution table (USRTAB and/or SYSTAB), the appropriate substitution character (SUBCHR) is used. If SUBCHR is not set, no substitution is performed. Depending on the MODE, character conversion stops with an error (STOP) or ignores the character (IGNORE).

Text before and after the parentheses is treated as comment.

Between "(" and "=" or "/" or ")" you can define hex digits until the first non-hex digit. All non-hex digits up to the next separator are interpreted as comment. Leading whitespace is ignored.

   REPLACE GERMAN SZ (00DF=0073/0073)                  WITH ss
 THIS IS AN EXAMPLE FOR COMMENTS IN CODE POINT LIST:
   REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
   REPLACE BOM MARK  (FEFF=)                           WITH NOTHING

Leading zeros are allowed. If you don't supply a hex value, then 0x00 is used.

   REPLACE GERMAN SZ (00DF=/)                   WITH 0x00 0x00
   REPLACE EURO MARK (20AC=00000045/00055/000052/000004F) WITH EURO
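
The rules for hex digits, inline comments and missing values can be illustrated with a short Python sketch. It is an interpretation of the description above, not FLAM code: the function names are hypothetical, and treating a completely empty list as "no code points" (as in the "WITH NOTHING" example) is an assumption.

```python
import re

HEX_PREFIX = re.compile(r"[0-9A-Fa-f]*")


def parse_code_point(item: str) -> int:
    """Read leading hex digits; the rest up to the separator is comment."""
    digits = HEX_PREFIX.match(item.lstrip()).group(0)
    return int(digits, 16) if digits else 0x00   # no hex value: 0x00 is used


def parse_code_point_list(text: str) -> list:
    """Split a code point list at '/', ',' or ';' and parse each entry."""
    if not text.strip():
        return []                                # "(FEFF=)": nothing at all
    return [parse_code_point(item) for item in re.split(r"[/,;]", text)]


euro = parse_code_point_list(" 45#E / 55#U / 52#R / 4F#O")
print("".join(chr(cp) for cp in euro))
print(parse_code_point_list("/"))
```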

Don't use parentheses "()" or the operators in comments.

To describe your own subsets, you can use a user table without a system table. The USRTAB can also be used to override or add transliterations when a system substitution table (for example SYSTAB=ICONV) is used. Transliteration works recursively: if one of the substitution code points is not in the target set, its own substitution is used instead, and so on.

   REPLACE GERMAN OE (0000D6=004F/0045)       WITH O E
   REPLACE GERMAN SZ (0000DF=0073/0073)       WITH s s
   REPLACE EURO MARK (0020AC=00D6/00DF/52/4F) WITH OE SZ R O

If you have a EURO sign in your text and convert it to 'Latin-1', the resulting byte string will be 'D6DF524F'. On the other hand, if you convert it to 'ASCII', the byte string will be '4F457373524F'.
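
The recursion can be reproduced with a small Python model of the three definitions above. This is a simplified sketch and not the FLAM implementation: the target sets are approximated as "code point <= 0xFF" for Latin-1 and "code point <= 0x7F" for ASCII, and the depth guard is an illustrative assumption.

```python
# The three transliterations from the example above.
TRANSLIT = {
    0x00D6: [0x004F, 0x0045],               # German OE   -> O E
    0x00DF: [0x0073, 0x0073],               # German SZ   -> s s
    0x20AC: [0x00D6, 0x00DF, 0x52, 0x4F],   # EURO mark   -> OE SZ R O
}
MAX_DEPTH = 64  # illustrative guard against endless recursion


def transliterate(cp, in_target, depth=0):
    """Replace cp recursively until every code point is in the target set."""
    if in_target(cp):
        return [cp]
    if cp not in TRANSLIT or depth >= MAX_DEPTH:
        raise ValueError(f"no usable transliteration for U+{cp:04X}")
    result = []
    for substitute in TRANSLIT[cp]:
        result.extend(transliterate(substitute, in_target, depth + 1))
    return result


def to_hex(code_points):
    return "".join(f"{cp:02X}" for cp in code_points)


latin1 = to_hex(transliterate(0x20AC, lambda cp: cp <= 0xFF))
ascii_ = to_hex(transliterate(0x20AC, lambda cp: cp <= 0x7F))
print(latin1)
print(ascii_)
```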

The mapping described above is always applied, and it is applied recursively, including to the result of a transliteration. For example, if you define the mapping (*30=39), the resulting text won't contain any zero digit.

Replacing code points recursively can easily cause infinite loops. To prevent this, the number of replacements and the length of a single replacement are both limited to a maximum of 64. Note that this can still result in a large expansion of the data. Therefore, be careful when defining recursive substitutions.

You can also define mappings of more than one code point to other code points (multiple mapping). This can be used, for example, to define a "tagging" mapping:

   REPLACE GERMAN OE (*D6=2F,75,27,4F,45,27)       WITH /u'OE'

and undo tagging:

   REPLACE /u'OE'    (*2F,75,27,4F,45,27=D6)       WITH GERMAN OE

Another example for multiple mappings is to convert HTML entities to their respective characters:

   REPLACE &lt;    (*26,6C,74,3B=3C)       WITH <
   REPLACE &gt;    (*26,67,74,3B=3E)       WITH >
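
Writing the hex lists for multiple mappings by hand is error-prone. The following Python sketch generates such definitions from plain strings; multiple_mapping is a hypothetical helper for illustration, and the check for at most 8 code points per list follows the limit stated earlier in this description.

```python
def hex_list(text: str) -> str:
    """Render each character as a hex code point, separated by commas."""
    if len(text) > 8:
        raise ValueError("a code point list holds at most 8 code points")
    return ",".join(f"{ord(ch):02X}" for ch in text)


def multiple_mapping(source: str, target: str) -> str:
    """Build an enforced mapping definition '(*source=target)'."""
    return f"(*{hex_list(source)}={hex_list(target)})"


print(multiple_mapping("&lt;", "<"))
print(multiple_mapping("&gt;", ">"))
print(multiple_mapping("\u00D6", "/u'OE'"))
```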

If multiple mappings are defined in a user table, be aware that character conversion (cnvchr) becomes much slower, because finding multibyte sequences is a more expensive operation. Examples of normalisation mappings for the String.Latin subset (XOEF) to NFD (Normalization Form D) and NFC (Normalization Form C) are located in the SAMPLE directory under NPASNFD and NPASNFC.

A sample user table for the ICONV system table to change the transliteration of German umlauts to AE, OE or UE is located in the SAMPLE directory under CCUTDEXL. Another sample user table called CCUTNPAS defines the 'string.latin' subset (XOEF) which is mainly used for statutory reporting.

The definitions are processed in the order in which they are defined. For example, if you define a mapping or a transliteration for code point X and later make this code point invalid, then the code point is invalid. On the other hand, you can first deactivate all code points, then activate your subset and define your transliterations (best fit) or mappings.

For example, to define a non-expanding best fit mapping, a user table can be used to delete all combined characters for single byte code pages. For the ICONV system transliteration table, a sample user table can be found under the name CCUTBFM1.
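
The effect of the processing order on activation and deactivation can be modelled with a set of valid code points. The Python sketch below is only a model of the rule "later definitions win" and is not FLAM code; apply_in_order and its tuple format are hypothetical.

```python
def apply_in_order(definitions, valid=None):
    """Apply (sign, first, last) definitions in the given order.

    '-' removes the range from the valid set, '+' adds it. The optional
    'valid' argument is the set of code points that is valid beforehand.
    """
    valid = set(valid or ())
    for sign, first, last in definitions:
        low, high = sorted((first, last))
        code_points = range(low, high + 1)
        if sign == "-":
            valid.difference_update(code_points)
        else:
            valid.update(code_points)
    return valid


subset = apply_in_order([
    ("-", 0x00, 0x7F),   # (-00-7F) deactivate all US-ASCII code points
    ("+", 0x30, 0x39),   # (+30-39) activate the decimal digits
    ("-", 0x30, 0x30),   # (-30)    defined later, so zero is invalid again
])
print("".join(chr(cp) for cp in sorted(subset)))
```

If the last two definitions are swapped, the zero digit stays valid, because its activation is then processed after its deactivation.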

Some other subset user table definitions were added. For example:

All user table definitions are pre-calculated once at the beginning of execution to reduce the required CPU time for the conversion. So, the use of a user table increases the cost to open a file but has no effect on CPU utilization when converting the data.

To minimize the effort and to simplify the usage of user tables, the most common subsets can be selected by keyword. In this case, a pre-calculated load module (DLL) is used. The subset definition can be for UTF (21-bit Unicode) or UCS (16-bit Unicode). If it is a UCS subset, then no valid code point is greater than 0xFFFF and you must use UTF CCSIDs to ensure that no code point greater than 0xFFFF is accepted.

If you have a working user table definition for a popular subset or system table manipulation, you can request a keyword for it. Please report this requirement via our issue tracking system:

https://www.flam.de/en/technology/support/issues/

Feel free to attach the corresponding user table definition.

Selections