TVD

Synopsis

HELP:   Read a data field between a tag and a corresponding delimiter
TYPE:   OBJECT
SYNTAX: TVD(ROOT='str',PATH='str',VALUE='str',CCSID='str'/DEFAULT/ASCII/EBCDIC/SYSTEM/LOCAL,SEPCHR[num/COMMA/COLON/SEMICOLON/TABULATOR/BLANK...],NOCASE)

Description

This object is used to read a column with data that is enclosed by a certain tag and a trailing delimiter. The format was mainly developed to parse SWIFT MT transaction data.

The format can have an arbitrary hierarchical depth. This is expressed via a path specification with an opening tag and a closing delimiter, with each level separated by a slash. If TVD data from an underlying tag and delimiter are included, they can occur in any order. If this is not the case, the order must correspond to the information in the row specification. In such hierarchical formats, the beginning of the table is determined by the beginning of the repeating data. This beginning is indicated by the longest root path. The repeating data can itself be structured arbitrarily deep, which is indicated by a further path specification. The repeating data can be preceded by so-called headers or followed by trailer data. This header data is distinguished by a shorter root specification which must be placed before the repeating data as a template in the row specifications. The same applies to trailer data, except that it must follow after the repeating data. A short root in between will result in an error. A change in the header or trailer data, leads to an interruption of the table to output the changed header data.

The complete or parts of the read delimiter can be pushed back to be matched again in the next column as the starting tag. This push back mechanism is also available for tags to include parts of these tag into the data portion. The push back defines the amount of characters (not bytes). Trailing whitespace (e.g. if \W is used) is not included.

For the tags and delimiters, the CCSID must be specified. Case sensitive comparison can be disabled with a switch. For each tag and delimiter, an unlimited amount of alternatives (separated by a | character) can be defined. This can be used for aliases or is required for different possible next tags as part of the previous delimiter in conjunction with the push back mechanism. For example, to parse an XML attribute lists, the delimiter list could look like this: "W>W|"B

The first alternative matches if the attribute list ends at the closing bracket of the starting tag >. The other alternative matches if another attribute follows, separated by a whitespace character. In this case, the order is important. If you have alternatives where one is a prefix of the other (e.g. "abc" and "abcdef"), the longest must come first to ensure it matches if it occurs in the data.

If the minimum and/or maximum length of the data value is known, it is recommended to set these limits via the respective parameters as safety measure. For variable length data (i.e. only terminated by a delimiter) the default maximum length is 512.

The component uses root and path variables to describe the format. The data format can have an unlimited hierarchical depth. The longest root defines the start of the repeating part of the table. A shorter root can be used to read values which are the same for every table row and therefore appear only once before the repeating structure in the input data begins (as a kind of "header"). These (short root) values are copied to each table row.

Example use case: A file containing banking transactions starting with the currency which is valid for all the following transactions so that it doesn't have to be stored repeatedly in each transaction record. When transforming this file to a table, the currency can be copied to each table row using the "short root" approach to have the currency copied as a column to each row in the resulting table.

If a column is defined which is valid for more than one row, but this element was not defined with a short root, damaged data can be the result. In this case, the pointer into the input data can point to the next data read for upcoming rows and not to the previously read data. Be careful at this point in your design of row definitions.

The character sequences below can be used in root and path specifications.

Root/Path parsing:
    /   The path separator (starts the next hierarchical level)

At the beginning of a hierarchical level:
    ?   Marks an optional field (i.e. may not exist)
    -   Marks a field that is optional only when reading
        When writing, this field is always written (with a default value),
        even if it was missing in the input.
    +   Marks a field that is optional only when writing
        The field is optional in the sense that it is not written if it equals
        a defined default value. The field must be present in the input.
    !   at the beginning marks a mandatory field (default, not necessary to specify)

At the end of a tag or delimiter:
    <   push back operator at the end of a tag or delimiter
        May be followed by zero or more decimal digits.
        Performs push back of one or more characters. Pushed back characters
        are read again in a following parsing step. A missing number after the
        bracket or a value of 0 implies pushing back the whole tag / delimiter.
        Otherwise the given number of characters are pushed back.
        The number cannot be larger than the length of the tag or delimiter.
        Allows, for example, to read (a part of) a tag as column value or
        the a remaining part of an delimiter as beginning of the next tag.

Separator between tag and delimiter:
    *   This is a placeholder for a column value of variable length
        (including empty fields (use MIN/MAXLEN)).
    %   This is a placeholder for a column value of fixed length.
        The length of the fixed-length string can be specified by:
        - One % for each character or
        - The number of characters as decimal number after a single %
    &   This is a placeholder for a column value of variable length that is prefixed
        by a decimal number which contains the length of the value that immediately follows.
        The number if decimal digits that form the length field can be specified in two ways:
        - The number of & used is the number of decimal digits which are treated as length of the data
        - Decimal digits after a single & define the length of the length field
        The length field defines the length of the data to read after the length field,
        not including the length field itself in characters.

Separator between tags or delimiters:
    |   Separates alternative tags or delimiters (including push back amount) (i.e. works like OR)

Special character sequences to match in tags or delimiters:
    \n  new line (including carriage return, line feed, form feed)
    \t  tab (horizontal and vertical)
    \m  must be an alphanumeric character (e.g. A-Z|a-z|0-9)
    \a  must be an alphabetic character (e.g. a-z|A-Z)
    \l  must be a lower case alphabetic character (e.g. a-z)
    \u  must be an upper case alphabetic character (e.g. A-Z)
    \d  must be a digit (0-9)
    \x  must be a hexadecimal character (0-9|a-f|A-F)
    \p  must be a punctuation character
    \s  must be whitespace (including control characters)
    \z  matches any character (can be used as wildcard)

Special character sequences to terminate a tag or delimiter:
    \M  read and ignore everything until encountering an
        alphanumeric character 
    \A  read and ignore everything until encountering an
        alphabetic character (e.g. a-z|A-Z)
    \L  read and ignore everything until encountering a
        lower case alphabetic character (e.g. a-z)
    \U  read and ignore everything until encountering an
        upper case alphabetic character (e.g. A-Z)
    \D  read and ignore everything until encountering a digit (0-9)
    \X  read and ignore everything until encountering a
        hexadecimal digit (0-9|a-f|A-F)
    \P  matches any characters up to a punctuation character
    \S  read and ignore everything until encountering a whitespace character

Special character sequences to read over the input:
    \W  matches all whitespace (including control characters) and ignores it
        Zero or more whitespace characters are matched and ignored.
        The whitespace is treated as if it was not there (e.g. indentation).
        Nothing is written out when writing.
    \B  same as \W except that a whitespace character is output when writing
    \N  same as \W except that a newline character is output when writing

Special character sequences to handle separators:
    \C  matches a separator character and surrounding whitespace
        The separator is comma by default and can be configured via SepChr parameter.
        Can be used to split lists like this into separate values: abc, def , ghi,jkl
        Adds newline when writing.
    \c  like \C but adds whitespace after the separator when writing
    \k  like \C but adds nothing after the separator when writing

All character classes are determined by the supported Unicode standard. For example, if \u refers to all uppercase letters, then this applies not only to the Latin A-Z but also to the Cyrillic, Greek and all other letters that are supported in Unicode.

The upper case escape sequences to terminate a tag or delimiter in the above list match the corresponding character class (if found) but the internal reading position is not advanced by the number of characters matched. As a result, the push back operator (>) is not necessary. If defined more than once, all classes must match on this character.

The special escape sequences \C, \c and \k exist for formats like JSON where comma separators between arguments are used. The separator's character code point defaults to comma, but can be changed by setting the SEPCHR parameter. To separate the input, all closing parentheses are additionally supported here, whereby the read position remains in front of them, so that they can be read later as close delimiters. This makes it possible, among other things, to clearly recognize the end of numbers in JSON, which are not explicitly in quotes.

When writing, \C adds a newline character after the separator, \c adds an additional space character after the separator and \k only prints the separator if another element follows on the same hierarchical level. If no other element follows, separator and whitespace will not be printed. For \C only the newline character is printed in this case. This special character sequences was defined to support comma separated lists, where the last item is not terminated by a certain separator.

The special escape sequences \B, \N and \W ignore whitespace and control characters when reading. When writing, \B adds a whitespace character to the formatted output and \N a newline character.

To use any of the special characters or escape sequences as literal characters, they must be preceded by a backslash (e.g. \< or \\S):

COL(name='menu.id' root='\W\<menu<*\<\/menu\>\N' path='\<menu*\>\N/\Bid="*"\W')

We recommend to escape all punctuation characters. In the example above, the > does not have to be escaped because it does not have any special meaning, but this may change in a later version of FLAM.

If the * separator is used, then the data is read until the delimiter is encountered. In this case a default limit of 512 is used unless a different maximum length is defined by setting the MAXLEN parameter of the column.

This syntax can be used to parse quite complex text formats and transform the data into tables. Below is an example for a MT103 SWIFT transaction including the HSPF header and TSPF trailer. The complete definitions are provided together with other SWIFT-MT formats as example file in your FLAM installation.

name='MT103' FORMAT=TVD DEFAULTS(TYPE=STRING,FLTFMT=STR,ALWTRC)
                 root='!\WHSPF*TSPF\n\W'
COL(name='HSPF'  root=''  path='!\WHSPF*TSPF\n\W/*\n{<1'          )
COL(name='BHB'            path='!{1:*}'                           )
COL(name='AHB'            path='?{2:*}'                           )
COL(name='UHB-BPC'        path='?{3:*}/?{113:%%%%}'               )
COL(name='UHB-MUR'        path='?{3:*}/?{108:*}' maxlen=16        )
COL(name='CB-SENDREF'     path='!{4:\n*-}\n/!:20:*\n'  maxlen=16  )
COL(name='CB-TIMEIND'     path='!{4:\n*-}\n/?:13C:*\n'            )
COL(name='CB-BOPCODE'     path='!{4:\n*-}\n/!:23B:%%%%\n'         )
COL(name='CB-INSCODE'     path='!{4:\n*-}\n/?:23E:*\n' maxlen=35  )
COL(name='CB-TRACODE'     path='!{4:\n*-}\n/?:26T:%%%\n'          )
COL(name='CB-VDCUISA-VD'  path='!{4:\n*-}\n/!:32A:%6'             )
COL(name='CB-VDCUISA-CUR' path='!{4:\n*-}\n/!%%%'                 )
COL(name='CB-VDCUISA-AMT' path='!{4:\n*-}\n/!*\n' 
                          TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2))    )
COL(name='CB-CUINSAM-CUR' path='!{4:\n*-}\n/?:33B:%%%'            )
COL(name='CB-CUINSAM-AMT' path='!{4:\n*-}\n/?*\n'
                          TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2))    )
    ...
COL(name='CB-REMITNF'     path='!{4:\n*-}\n/?:70:*\n:<1'          )
COL(name='CB-DETCHRG'     path='!{4:\n*-}\n/!:71A:%3\n'           ) 
COL(name='CB-SNDCHRG-CUR' path='!{4:\n*-}\n/?:71F:%3'             ) 
COL(name='CB-SNDCHRG-AMT' path='!{4:\n*-}\n/?*\n'
                          TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2))    )
COL(name='CB-RECCHRG-CUR' path='!{4:\n*-}\n/?:71G:%3'             )
COL(name='CB-RECCHRG-AMT' path='!{4:\n*-}\n/?*\n'
                          TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2))    )
COL(name='CB-SNDRECI'     path='!{4:\n*-}\n/?:72:*\n:<1|\n-}<2'   )
COL(name='CB-REGUREP'     path='!{4:\n*-}\n/?:77B:*\n:<1|\n-}<2'  )
COL(name='OTB'            path='?{5:*}\n'                         )

To convert a SWIFT-MT transaction into a corresponding SEPA XML transaction, the field CB-VDCUISA (:32A:*n) must be split into three parts to get the date, the currency and the amount as separate columns. For such things, it is possible to define fixed-length fields without delimiter and also fixed or variable-length fields without a tag. If no tag is specified, then the next field must directly follow inside the data or a format error occurs. A special consideration is needed for :71x: fields. For example the field :77: could be at the end, is optional and can span multiple lines. In such a case the \n delimiter is not sufficient since there are 2 possible tags which can follow. Either the optional tag :77B: or the closing tag (delimiter) -}. In this case, the 2 different alternatives must be defined and can be separated with the OR operator.

If fields are enclosed by a root tag and delimiter, the order of the fields is arbitrary. If no enclosing tag and delimiter are found, the order must fit the order of the defined columns (optional fields may be missing). In SWIFT-MT messages, you can find repeating groups where a mandatory field signals that a new row begins. In such a case, the mandatory field (e.g. MT101 :21:) is the opening tag and the closing tag for the transaction, but these closing tags must be completely pushed back in order to be read as opening tag of the next transaction. See example SWFMT101 in your FLAM installation.

Another special case in MT101 is the Message Index field. This is divided into the current index and the number of transactions. The separator between the two numbers is therefore a delimiter, which is followed by another number with n as delimiter. Since this has no tag, the data must occur here in direct sequential order.

COL(name='CB-MSGIND-CUR'  root='!\WHSPF*TSPF\n\W/!{4:\n*-}\n'
                          path='!:28D:*\/' maxlen=11         )
COL(name='CB-MSGIND-TOT'  root='!\WHSPF*TSPF\n\W/!{4:\n*-}\n'
                          path='!*\n'      maxlen=11         )

In SWIFT-MT exist fields which have the same tag but different kinds of content. For example, there might be 3 options for the tag :hugo:: # The data starts with 6 digits followed by 2 upper case characters (A-Z). # 3 digits followed by 5 upper case characters (A-Z) # 4 upper case character followed by 4 digits In such a case, the path definition below can be used:

   col(name='hugo-1' path='-:hugo:\d\d\d\d\d\d\u\u<8*\n')
   col(name='hugo-2' path='?:hugo:\d\d\d\u\u\u\u\u<8*\n')
   col(name='hugo-3' path='?:hugo:\u\u\u\u\d\d\d\d<8*\n')

If :hugo: a required field, then it must be defined as optional when reading (-) because only one of the 3 options can apply, which implies the other ones must be optional. The match of character classes must be unique in such a case. This means that in the example above a push back of 4 should be enough. The character class \z can be used as wildcard which matches any character. This is useful if something can come in between.

Below a sample to parse a XML message with the TVD support, where a corresponding sample message are add as comment to the the row definition.

name='XML2' default(type=string)
COL(name='menu.id'                     root='\W\<menu<*\<\/menu\>\N'                           path='\<menu*\>\N/\Bid="*"\W')
COL(name='menu.value'                  root='\W\<menu<*\<\/menu\>\N'                           path='\<menu*\>\N/\Bvalue="*"\W')
COL(name='menu.popup.menuitem.value'   root='\W\<menu<*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bvalue="*"\W')
COL(name='menu.popup.menuitem.onclick' root='\W\<menu<*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bonclick="*"\W')
#
<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>
#

The attribute list in the XML tags are parsed as additional level, where the start tag (<menu) and start end tag (>) are used as tag and delimiter. The start tag (<menu) has also a corresponding end tag (</menu>) in the level below. For this to work, this start tag must be provided with a push back (\<menu\W<* (in full length (no digit behind <))), so that it can be read again in the next level for the attribute list.

A special case is an "end must list". This is data behind a closing tag and this can also be parsed as a sequential list of data. If such data has a root definition on a higher level in the hierarchy (short root), its value is not available as column data for preceding rows (since it has not been read, yet). Such trailer data is handled as optional by default, otherwise an error will occur because the column was not found in the input data. See the extended XML sample below

name='XML2' default(type=string)
COL(name='menu.id'                     root='\W\<menu<\W*\<\/menu\>\N'                           path='\<menu*\>\N/\Bid="*"\W')
COL(name='menu.value'                  root='\W\<menu<\W*\<\/menu\>\N'                           path='\<menu*\>\N/\Bvalue="*"\W')
COL(name='menu.popup.menuitem.value'   root='\W\<menu<\W*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bvalue="*"\W')
COL(name='menu.popup.menuitem.onclick' root='\W\<menu<\W*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bonclick="*"\W')
COL(name='menu.enddata'                root='\W\<menu<\W*\<\/menu\>\N'                           path='?*\W\<<')
#
<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
  hugo
</menu>
#

The last column specification is a trailer as must list delimited by < where the < will be push back to be read a begin of the delimiter for the XML root tag.

All column definitions of the TVD format, which are not interrupted by any other format, are read as a tree containing the possible next items or fixed length fields or variable length fields for each item up to a defined delimiter. This means that the TVD format can be used in conjunction with the FIX, LFD and DLM formats.

The main differences between TVD and explicit XML and JSON processors as part of the table support are:

When reading records (no delimiter) using the CONV command and defining TVD format as default and block handling as method, then a record to block conversion using the default delimiter is done before format parsing. With XCNV, such a record to block conversion must be added manually. This is required to parse for delimiters and use the same row definition for flat files with delimiters and host datasets with record length instead of delimiters.

Arguments