HELP: Read a data field between a tag and a corresponding delimiter TYPE: OBJECT SYNTAX: TVD(ROOT='str',PATH='str',VALUE='str',CCSID='str'/DEFAULT/ASCII/EBCDIC/SYSTEM/LOCAL,SEPCHR[num/COMMA/COLON/SEMICOLON/TABULATOR/BLANK...],NOCASE)
This object is used to read a column with data that is enclosed by a certain tag and a trailing delimiter. The format was mainly developed to parse SWIFT MT transaction data.
The format can have an arbitrary hierarchical depth. This is expressed via a path specification with an opening tag and a closing delimiter, with each level separated by a slash. If TVD data from an underlying tag and delimiter are included, they can occur in any order. If this is not the case, the order must correspond to the information in the row specification. In such hierarchical formats, the beginning of the table is determined by the beginning of the repeating data. This beginning is indicated by the longest root path. The repeating data can itself be structured arbitrarily deep, which is indicated by a further path specification. The repeating data can be preceded by so-called headers or followed by trailer data. This header data is distinguished by a shorter root specification which must be placed before the repeating data as a template in the row specifications. The same applies to trailer data, except that it must follow after the repeating data. A short root in between will result in an error. A change in the header or trailer data, leads to an interruption of the table to output the changed header data.
The complete or parts of the read delimiter can be pushed back to be
matched again in the next column as the starting tag. This push back
mechanism is also available for tags to include parts of these tag into
the data portion. The push back defines the amount of characters (not
bytes). Trailing whitespace (e.g. if \W
is used) is not included.
For the tags and delimiters, the CCSID must be specified. Case sensitive
comparison can be disabled with a switch. For each tag and delimiter,
an unlimited amount of alternatives (separated by a | character) can be
defined. This can be used for aliases or is required for different
possible next tags as part of the previous delimiter in conjunction with
the push back mechanism. For example, to parse an XML attribute lists,
the delimiter list could look like this: "W>W|"B
The first alternative matches if the attribute list ends at the closing
bracket of the starting tag >
. The other alternative matches if
another attribute follows, separated by a whitespace character. In this
case, the order is important. If you have alternatives where one is a
prefix of the other (e.g. "abc" and "abcdef"), the longest must come
first to ensure it matches if it occurs in the data.
If the minimum and/or maximum length of the data value is known, it is recommended to set these limits via the respective parameters as safety measure. For variable length data (i.e. only terminated by a delimiter) the default maximum length is 512.
The component uses root and path variables to describe the format. The data format can have an unlimited hierarchical depth. The longest root defines the start of the repeating part of the table. A shorter root can be used to read values which are the same for every table row and therefore appear only once before the repeating structure in the input data begins (as a kind of "header"). These (short root) values are copied to each table row.
Example use case: A file containing banking transactions starting with the currency which is valid for all the following transactions so that it doesn't have to be stored repeatedly in each transaction record. When transforming this file to a table, the currency can be copied to each table row using the "short root" approach to have the currency copied as a column to each row in the resulting table.
If a column is defined which is valid for more than one row, but this element was not defined with a short root, damaged data can be the result. In this case, the pointer into the input data can point to the next data read for upcoming rows and not to the previously read data. Be careful at this point in your design of row definitions.
The character sequences below can be used in root and path specifications.
Root/Path parsing: / The path separator (starts the next hierarchical level) At the beginning of a hierarchical level: ? Marks an optional field (i.e. may not exist) - Marks a field that is optional only when reading When writing, this field is always written (with a default value), even if it was missing in the input. + Marks a field that is optional only when writing The field is optional in the sense that it is not written if it equals a defined default value. The field must be present in the input. ! at the beginning marks a mandatory field (default, not necessary to specify) At the end of a tag or delimiter: < push back operator at the end of a tag or delimiter May be followed by zero or more decimal digits. Performs push back of one or more characters. Pushed back characters are read again in a following parsing step. A missing number after the bracket or a value of 0 implies pushing back the whole tag / delimiter. Otherwise the given number of characters are pushed back. The number cannot be larger than the length of the tag or delimiter. Allows, for example, to read (a part of) a tag as column value or the a remaining part of an delimiter as beginning of the next tag. Separator between tag and delimiter: * This is a placeholder for a column value of variable length (including empty fields (use MIN/MAXLEN)). % This is a placeholder for a column value of fixed length. The length of the fixed-length string can be specified by: - One % for each character or - The number of characters as decimal number after a single % & This is a placeholder for a column value of variable length that is prefixed by a decimal number which contains the length of the value that immediately follows. The number if decimal digits that form the length field can be specified in two ways: - The number of & used is the number of decimal digits which are treated as length of the data - Decimal digits after a single & define the length of the length field The length field defines the length of the data to read after the length field, not including the length field itself in characters. Separator between tags or delimiters: | Separates alternative tags or delimiters (including push back amount) (i.e. works like OR) Special character sequences to match in tags or delimiters: \n new line (including carriage return, line feed, form feed) \t tab (horizontal and vertical) \m must be an alphanumeric character (e.g. A-Z|a-z|0-9) \a must be an alphabetic character (e.g. a-z|A-Z) \l must be a lower case alphabetic character (e.g. a-z) \u must be an upper case alphabetic character (e.g. A-Z) \d must be a digit (0-9) \x must be a hexadecimal character (0-9|a-f|A-F) \p must be a punctuation character \s must be whitespace (including control characters) \z matches any character (can be used as wildcard) Special character sequences to terminate a tag or delimiter: \M read and ignore everything until encountering an alphanumeric character \A read and ignore everything until encountering an alphabetic character (e.g. a-z|A-Z) \L read and ignore everything until encountering a lower case alphabetic character (e.g. a-z) \U read and ignore everything until encountering an upper case alphabetic character (e.g. A-Z) \D read and ignore everything until encountering a digit (0-9) \X read and ignore everything until encountering a hexadecimal digit (0-9|a-f|A-F) \P matches any characters up to a punctuation character \S read and ignore everything until encountering a whitespace character Special character sequences to read over the input: \W matches all whitespace (including control characters) and ignores it Zero or more whitespace characters are matched and ignored. The whitespace is treated as if it was not there (e.g. indentation). Nothing is written out when writing. \B same as \W except that a whitespace character is output when writing \N same as \W except that a newline character is output when writing Special character sequences to handle separators: \C matches a separator character and surrounding whitespace The separator is comma by default and can be configured via SepChr parameter. Can be used to split lists like this into separate values: abc, def , ghi,jkl Adds newline when writing. \c like \C but adds whitespace after the separator when writing \k like \C but adds nothing after the separator when writing
All character classes are determined by the supported Unicode standard.
For example, if \u
refers to all uppercase letters, then this applies
not only to the Latin A-Z but also to the Cyrillic, Greek and all other
letters that are supported in Unicode.
The upper case escape sequences to terminate a tag or delimiter in the
above list match the corresponding character class (if found) but the
internal reading position is not advanced by the number of characters
matched. As a result, the push back operator (>
) is not necessary. If
defined more than once, all classes must match on this character.
The special escape sequences \C
, \c
and \k
exist for formats
like JSON where comma separators between arguments are used. The
separator's character code point defaults to comma, but can be changed
by setting the SEPCHR parameter. To separate the input, all closing
parentheses are additionally supported here, whereby the read position
remains in front of them, so that they can be read later as close
delimiters. This makes it possible, among other things, to clearly
recognize the end of numbers in JSON, which are not explicitly in
quotes.
When writing, \C
adds a newline character after the separator, \c
adds an additional space character after the separator and \k
only
prints the separator if another element follows on the same hierarchical
level. If no other element follows, separator and whitespace will not be
printed. For \C
only the newline character is printed in this case.
This special character sequences was defined to support comma separated
lists, where the last item is not terminated by a certain separator.
The special escape sequences \B
, \N
and \W
ignore whitespace and
control characters when reading. When writing, \B
adds a whitespace
character to the formatted output and \N
a newline character.
To use any of the special characters or escape sequences as literal
characters, they must be preceded by a backslash (e.g. \<
or \\S
):
COL(name='menu.id' root='\W\<menu<*\<\/menu\>\N' path='\<menu*\>\N/\Bid="*"\W')
We recommend to escape all punctuation characters. In the example above,
the >
does not have to be escaped because it does not have any special
meaning, but this may change in a later version of FLAM.
If the *
separator is used, then the data is read until the delimiter
is encountered. In this case a default limit of 512 is used unless a
different maximum length is defined by setting the MAXLEN
parameter
of the column.
This syntax can be used to parse quite complex text formats and transform the data into tables. Below is an example for a MT103 SWIFT transaction including the HSPF header and TSPF trailer. The complete definitions are provided together with other SWIFT-MT formats as example file in your FLAM installation.
name='MT103' FORMAT=TVD DEFAULTS(TYPE=STRING,FLTFMT=STR,ALWTRC) root='!\WHSPF*TSPF\n\W' COL(name='HSPF' root='' path='!\WHSPF*TSPF\n\W/*\n{<1' ) COL(name='BHB' path='!{1:*}' ) COL(name='AHB' path='?{2:*}' ) COL(name='UHB-BPC' path='?{3:*}/?{113:%%%%}' ) COL(name='UHB-MUR' path='?{3:*}/?{108:*}' maxlen=16 ) COL(name='CB-SENDREF' path='!{4:\n*-}\n/!:20:*\n' maxlen=16 ) COL(name='CB-TIMEIND' path='!{4:\n*-}\n/?:13C:*\n' ) COL(name='CB-BOPCODE' path='!{4:\n*-}\n/!:23B:%%%%\n' ) COL(name='CB-INSCODE' path='!{4:\n*-}\n/?:23E:*\n' maxlen=35 ) COL(name='CB-TRACODE' path='!{4:\n*-}\n/?:26T:%%%\n' ) COL(name='CB-VDCUISA-VD' path='!{4:\n*-}\n/!:32A:%6' ) COL(name='CB-VDCUISA-CUR' path='!{4:\n*-}\n/!%%%' ) COL(name='CB-VDCUISA-AMT' path='!{4:\n*-}\n/!*\n' TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2)) ) COL(name='CB-CUINSAM-CUR' path='!{4:\n*-}\n/?:33B:%%%' ) COL(name='CB-CUINSAM-AMT' path='!{4:\n*-}\n/?*\n' TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2)) ) ... COL(name='CB-REMITNF' path='!{4:\n*-}\n/?:70:*\n:<1' ) COL(name='CB-DETCHRG' path='!{4:\n*-}\n/!:71A:%3\n' ) COL(name='CB-SNDCHRG-CUR' path='!{4:\n*-}\n/?:71F:%3' ) COL(name='CB-SNDCHRG-AMT' path='!{4:\n*-}\n/?*\n' TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2)) ) COL(name='CB-RECCHRG-CUR' path='!{4:\n*-}\n/?:71G:%3' ) COL(name='CB-RECCHRG-AMT' path='!{4:\n*-}\n/?*\n' TYPE.FLOAT(FORMAT.STR(FRACDIGITS=2)) ) COL(name='CB-SNDRECI' path='!{4:\n*-}\n/?:72:*\n:<1|\n-}<2' ) COL(name='CB-REGUREP' path='!{4:\n*-}\n/?:77B:*\n:<1|\n-}<2' ) COL(name='OTB' path='?{5:*}\n' )
To convert a SWIFT-MT transaction into a corresponding SEPA XML
transaction, the field CB-VDCUISA (:32A:*n
) must be split into three
parts to get the date, the currency and the amount as separate columns.
For such things, it is possible to define fixed-length fields without
delimiter and also fixed or variable-length fields without a tag. If no
tag is specified, then the next field must directly follow inside the
data or a format error occurs. A special consideration is needed for
:71x:
fields. For example the field :77:
could be at the end, is
optional and can span multiple lines. In such a case the \n
delimiter
is not sufficient since there are 2 possible tags which can follow.
Either the optional tag :77B:
or the closing tag (delimiter) -}
.
In this case, the 2 different alternatives must be defined and can be
separated with the OR operator.
If fields are enclosed by a root tag and delimiter, the order of the
fields is arbitrary. If no enclosing tag and delimiter are found, the
order must fit the order of the defined columns (optional fields may be
missing). In SWIFT-MT messages, you can find repeating groups where a
mandatory field signals that a new row begins. In such a case, the
mandatory field (e.g. MT101 :21:
) is the opening tag and the closing
tag for the transaction, but these closing tags must be completely
pushed back in order to be read as opening tag of the next transaction.
See example SWFMT101 in your FLAM installation.
Another special case in MT101 is the Message Index field. This is divided into the current index and the number of transactions. The separator between the two numbers is therefore a delimiter, which is followed by another number with n as delimiter. Since this has no tag, the data must occur here in direct sequential order.
COL(name='CB-MSGIND-CUR' root='!\WHSPF*TSPF\n\W/!{4:\n*-}\n' path='!:28D:*\/' maxlen=11 ) COL(name='CB-MSGIND-TOT' root='!\WHSPF*TSPF\n\W/!{4:\n*-}\n' path='!*\n' maxlen=11 )
In SWIFT-MT exist fields which have the same tag but different kinds of
content. For example, there might be 3 options for the tag :hugo:
:
# The data starts with 6 digits followed by 2 upper case characters (A-Z).
# 3 digits followed by 5 upper case characters (A-Z)
# 4 upper case character followed by 4 digits
In such a case, the path definition below can be used:
col(name='hugo-1' path='-:hugo:\d\d\d\d\d\d\u\u<8*\n') col(name='hugo-2' path='?:hugo:\d\d\d\u\u\u\u\u<8*\n') col(name='hugo-3' path='?:hugo:\u\u\u\u\d\d\d\d<8*\n')
If :hugo:
a required field, then it must be defined as optional when
reading (-) because only one of the 3 options can apply, which implies
the other ones must be optional. The match of character classes must be
unique in such a case. This means that in the example above a push back
of 4 should be enough. The character class \z
can be used as wildcard
which matches any character. This is useful if something can come
in between.
Below a sample to parse a XML message with the TVD support, where a corresponding sample message are add as comment to the the row definition.
name='XML2' default(type=string) COL(name='menu.id' root='\W\<menu<*\<\/menu\>\N' path='\<menu*\>\N/\Bid="*"\W') COL(name='menu.value' root='\W\<menu<*\<\/menu\>\N' path='\<menu*\>\N/\Bvalue="*"\W') COL(name='menu.popup.menuitem.value' root='\W\<menu<*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bvalue="*"\W') COL(name='menu.popup.menuitem.onclick' root='\W\<menu<*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bonclick="*"\W') # <menu id="file" value="File"> <popup> <menuitem value="New" onclick="CreateNewDoc()" /> <menuitem value="Open" onclick="OpenDoc()" /> <menuitem value="Close" onclick="CloseDoc()" /> </popup> </menu> #
The attribute list in the XML tags are parsed as additional level, where
the start tag (<menu
) and start end tag (>
) are used as tag and
delimiter. The start tag (<menu
) has also a corresponding end tag
(</menu>
) in the level below. For this to work, this start tag must
be provided with a push back (\<menu\W<*
(in full length (no digit
behind <
))), so that it can be read again in the next level for the
attribute list.
A special case is an "end must list". This is data behind a closing tag and this can also be parsed as a sequential list of data. If such data has a root definition on a higher level in the hierarchy (short root), its value is not available as column data for preceding rows (since it has not been read, yet). Such trailer data is handled as optional by default, otherwise an error will occur because the column was not found in the input data. See the extended XML sample below
name='XML2' default(type=string) COL(name='menu.id' root='\W\<menu<\W*\<\/menu\>\N' path='\<menu*\>\N/\Bid="*"\W') COL(name='menu.value' root='\W\<menu<\W*\<\/menu\>\N' path='\<menu*\>\N/\Bvalue="*"\W') COL(name='menu.popup.menuitem.value' root='\W\<menu<\W*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bvalue="*"\W') COL(name='menu.popup.menuitem.onclick' root='\W\<menu<\W*\<\/menu\>\N/\<popup\>\N*\<\/popup\>\N' path='\<menuitem*\/\>\N/\Bonclick="*"\W') COL(name='menu.enddata' root='\W\<menu<\W*\<\/menu\>\N' path='?*\W\<<') # <menu id="file" value="File"> <popup> <menuitem value="New" onclick="CreateNewDoc()" /> <menuitem value="Open" onclick="OpenDoc()" /> <menuitem value="Close" onclick="CloseDoc()" /> </popup> hugo </menu> #
The last column specification is a trailer as must list delimited by <
where the <
will be push back to be read a begin of the delimiter for
the XML root tag.
All column definitions of the TVD format, which are not interrupted by any other format, are read as a tree containing the possible next items or fixed length fields or variable length fields for each item up to a defined delimiter. This means that the TVD format can be used in conjunction with the FIX, LFD and DLM formats.
The main differences between TVD and explicit XML and JSON processors as part of the table support are:
When reading records (no delimiter) using the CONV command and defining TVD format as default and block handling as method, then a record to block conversion using the default delimiter is done before format parsing. With XCNV, such a record to block conversion must be added manually. This is required to parse for delimiters and use the same row definition for flat files with delimiters and host datasets with record length instead of delimiters.
STRING: ROOT='str' - Root to XML element [taken from table/row/column]
STRING: PATH='str' - Path to XML element [taken from column]
STRING: VALUE='str' - Optional default value, if element not in data structure ['0' for float and integer else '']
STRING: CCSID='str'/DEFAULT/ASCII/EBCDIC/SYSTEM/LOCAL - Conversion from this CCSID to UTF-8 [system]
DEFAULT - Use default CCSID (environment)
ASCII - Use default ASCII CCSID (environment)
EBCDIC - Use default EBCDIC CCSID (environment)
SYSTEM - Use system character set (same as DEFAULT)
LOCAL - Use local character set (system/physical)
NUMBER: SEPCHR[num/COMMA/COLON/SEMICOLON/TABULATOR/BLANK...] - List of separator characters as Unicode codepoints [COMMA]
COMMA - Comma separated (default)
COLON - Separated by colon (':')
SEMICOLON - Separated by semicolon (';')
TABULATOR - Separated by tabulator (x'0009')
BLANK - Separated by blank (x'0020')
SWITCH: NOCASE - Case sensitive or not [OFF]