
TEXT

Synopsis

HELP:   Read text data from a file
TYPE:   OBJECT
SYNTAX: TEXT(NET.{},FILE['str'/STREAM/DUMMY...],BLKSIZE=num,RECLENGTH=num,SUPTWS,NELDLM,CCSID='str'/DEFAULT/ASCII/EBCDIC/BOMUTF/BOMUCS/SYSTEM/LOCAL,CHRMODE=STOP/IGNORE/SUBSTITUTE/IDENTITY/TRANSLIT,SKIPEQUAL,USRTABLE='str'/NPAS/SEPA/DELA/DLAX,ONEMAP,COMBINED=NFD/NFC/AUTO/ON/OFF,BOM,KEEPBOM,ENL2LF,CONVBIN,RPLFFD[=num],PADCHAR=num,REGEXP(),DECODE=NONE/FIODEC/CRYDEC/CMPDEC/ALWAYS,DECRYPT[{}...],PRNCONTROL=DETACH/RETAIN/ERASE/REPLACE,SUBSYSTEM(),FRCBLK,REMOVE,RENAME='str',BINERROR,CHRERROR,LANG='str',PLATFORM=WIN/UNX/ZOS/USS/VSE/BS2/MAC,OWNER='str',ENVID='str',HASH(),SIGNATURE.{},CHECK,TABLE(),AVSCAN(),NOARCH,PREPROCESS[()...],POSTPROCESS/PSTPRO[()...])

Description

Read text works on blocks of text data or records. First, the text data is converted to the UTF-8 character set. Then the text data is split into record and rest elements based on text delimiters or record length. The data must contain a valid text delimiter within the provided record/line length. If the data contains 4-byte length fields, the length fields are used to build a block as a list of records. If no delimiter is found, the record length is used to form the text records. If the data is block-oriented (no record length) and neither delimiters nor 4-byte length fields are found, the data is wrapped into UTF-8 records of the provided record length. Wrapping a UTF-8 character stream into records can leave an incomplete multibyte sequence at the end of one record and the remainder of that multibyte character at the beginning of the next record.
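The effect of wrapping a UTF-8 stream at a fixed record length can be illustrated with a short Python sketch. It is not FLAM code; the function names wrap_bytes and is_valid_utf8 and the record length of 3 are assumptions chosen only for the demonstration.

```python
from typing import List


def wrap_bytes(data: bytes, reclength: int) -> List[bytes]:
    """Cut a byte stream into chunks of reclength bytes, ignoring character boundaries."""
    return [data[i:i + reclength] for i in range(0, len(data), reclength)]


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the chunk decodes as complete UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 'ö' and 'ß' each occupy two bytes in UTF-8, so the stream is 7 bytes long.
data = "Größe".encode("utf-8")
records = wrap_bytes(data, 3)

# The first record ends in the middle of 'ö'; the second starts with its remainder.
validity = [is_valid_utf8(r) for r in records]
```

Joining the records again yields the original byte stream, so no data is lost; only the individual records are not guaranteed to be valid UTF-8 on their own.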

A text record contains all text between two delimiters, without trailing whitespace if SUPTWS is used. Trailing whitespace, the delimiter and possibly some padding characters make up the rest element. Taken together, the record and the rest elements represent the original data. The net text record (without the rest element) can be validated using Perl-compatible regular expressions. The reaction to records that do not match the regular expression can be configured: raise a validation error (default), raise a FMTERR for better detection, or ignore the non-matching records and remove them from the output.
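The split into record and rest element can be sketched as follows. This is a simplified Python illustration, not FLAM's implementation: the function name split_record is hypothetical, only CR and LF are treated as delimiters, and only blanks and tabs count as whitespace.

```python
from typing import Tuple


def split_record(line: str, suptws: bool = False) -> Tuple[str, str]:
    """Split one delimited line into (record, rest)."""
    body = line.rstrip("\r\n")          # text without the trailing delimiter
    delimiter = line[len(body):]        # the delimiter characters that were cut off
    if suptws:
        net = body.rstrip(" \t")        # drop trailing whitespace from the record
        return net, body[len(net):] + delimiter
    return body, delimiter


line = "abc  \r\n"
plain = split_record(line)
trimmed = split_record(line, suptws=True)
```

In both cases the record followed by the rest element reproduces the original line, which is the property described above.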

List of valid text delimiters (Unicode codepoints):

0x00 (padding character (can be changed))
0x0A (line feed (EBCDIC 0x25))
0x0D (carriage return (EBCDIC 0x0D))
0x0D, 0x0A (carriage return line feed)
0x0D, ..., 0x0D, 0x0A (dirty delimiters)
0x0D, 0x0A, ..., 0x0A (dirty delimiters)
0x0C (form feed (EBCDIC 0x0C))
0x85 (new/next line (EBCDIC 0x15, optional if single byte ASCII))

0x85 is used for UTF-8 (C285) and EBCDIC (0x15), but in single-byte ASCII code pages the NELDLM flag must be set to scan for 0x85, because 0x85 is normally used as a currency sign, except in ISO code pages. To prevent the conversion of EBCDIC 0x15 to Unicode 0x85 during character conversion, FLAM supports the ENL2LF and ELF2NL switches.
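The code point relationships for NEL can be checked with Python's built-in codecs. This is only an illustration: cp037, latin-1 and cp1252 are Python codec names picked here as examples of an EBCDIC, an ISO and a Windows code page, and FLAM's own conversion tables may differ in detail.

```python
NEL = "\u0085"  # Unicode next line (NEL)

# NEL is a two-byte sequence in UTF-8.
utf8_nel = NEL.encode("utf-8")

# In the EBCDIC code page cp037, byte 0x15 is NEL and byte 0x25 is line feed.
ebcdic_15 = b"\x15".decode("cp037")
ebcdic_25 = b"\x25".decode("cp037")

# In ISO-8859-1 the byte 0x85 is the NEL control character ...
latin1_85 = b"\x85".decode("latin-1")

# ... but in Windows-1252 the same byte is a printable character,
# which is why scanning for 0x85 has to be requested explicitly.
cp1252_85 = b"\x85".decode("cp1252")
```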

If the text data is detected as being compressed, it is decompressed automatically. For encodings like Base64, automated decoding is useful in most cases. However, if you want to read the base-encoded data as text, you must use the DECODE parameter to restrict the automatic decoding. With this parameter you can extend or limit the number of encoding layers that are removed. For example, to retrieve a Base64-encoded text from a GZIP file instead of the decoded version of this text, you can set DECODE=CRYDEC. To decode encoded XML data after decompression, you must define DECODE=CMPDEC. This is the default. The base decoding of text can result in valid but nonsensical text, and there is no unambiguous criterion for a clear decision in this case. To enforce decoding of possible base encodings where the result is still a text stream, you must set DECODE=ALWAYS.

This is the default because base decoding of text can result in valid but nonsensical text. There is no uniqueness for a clear decision in this case. To decode base encoded data after decompression, you must specify DECODE=CMPDEC or DECODE=ALWAYS. If the decompressed text is, for example, Base64 encoded XML, this must be explicitly activated to process the XML.
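The layering of compression and base encoding can be made concrete with a small Python sketch. It only uses the standard gzip and base64 modules to model the layers and does not call FLAM; the sample XML and all variable names are made up for the illustration.

```python
import base64
import gzip

xml = b"<doc>hello</doc>"

# Build the stored form: Base64 text wrapped in GZIP.
stored = gzip.compress(base64.b64encode(xml))

# Stopping after decompression yields the Base64 text itself.
decompressed = gzip.decompress(stored)

# Removing one more layer yields the XML document.
decoded = base64.b64decode(decompressed)

# Ordinary text can also be syntactically valid Base64, which is why
# decoding plain text automatically can produce nonsensical output.
ambiguous = base64.b64decode(b"Test")
```

The last line shows the ambiguity mentioned above: the word "Test" decodes without error into three bytes that carry no meaning as text.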

For transparent decryption, you must provide the required parameters. Decryption requires at least a key reference as a parameter. Several decryption methods can be enabled at once with the corresponding parameters. 4-byte length fields in the data are detected automatically because they cannot be part of valid text data. These length fields are therefore used to build a list of records.

The character data is converted from the provided CCSID to UTF-8. If no CCSID is provided, auto detection is applied. If this is not successful, a system-dependent default CCSID is used: If available, the CCSID stored in the file system is used. If this is unset or not supported, the environment variable LANG is used. If this is also not set to a valid value, ISO-8859-1 (Latin-1) is used on ASCII systems and IBM-1047 (Open Systems Latin-1) on EBCDIC systems.
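The fallback order for the CCSID can be written down as a small decision function. The following Python sketch is an assumption-laden illustration, not FLAM's logic: the names pick_ccsid and charset_from_lang are hypothetical, the set of supported CCSIDs is passed in by the caller, and the way the character set is taken from LANG (the part between '.' and '@') is a guess at a common convention.

```python
from typing import Optional, Set


def charset_from_lang(lang: Optional[str]) -> str:
    """Extract the charset part of a LANG value such as 'de_DE.ISO-8859-15@euro'."""
    if not lang:
        return ""
    return lang.partition(".")[2].partition("@")[0]


def pick_ccsid(provided: Optional[str], detected: Optional[str],
               file_ccsid: Optional[str], lang: Optional[str],
               supported: Set[str], ebcdic: bool = False) -> str:
    """Walk through the fallback chain described above and return a CCSID name."""
    if provided:                                   # explicit CCSID wins
        return provided
    if detected:                                   # result of auto detection
        return detected
    if file_ccsid and file_ccsid in supported:     # CCSID stored in the file system
        return file_ccsid
    charset = charset_from_lang(lang)              # LANG environment variable
    if charset and charset in supported:
        return charset
    return "IBM-1047" if ebcdic else "ISO-8859-1"  # platform default


supported = {"ISO-8859-1", "ISO-8859-15", "IBM-1047", "UTF-8"}
from_lang = pick_ccsid(None, None, None, "de_DE.ISO-8859-15@euro", supported)
ascii_default = pick_ccsid(None, None, None, "C", supported)
ebcdic_default = pick_ccsid(None, None, "UNKNOWN", None, supported, ebcdic=True)
```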

During text formatting, form feed characters may optionally be replaced by empty records.

A padding character may be defined which is regarded as a special delimiter. Consecutive occurrences of this padding character are regarded as one delimiter, i.e. the text "some text_____more text" will become two records "some text" and "more text". Not being a part of the original data, padding characters are simply dropped and are not part of the rest element.
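The padding example above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: split_on_padding is a hypothetical helper, the underscore stands in for the real padding character as in the example, and empty pieces caused by leading or trailing padding are simply discarded.

```python
import re
from typing import List


def split_on_padding(text: str, padchar: str) -> List[str]:
    """Treat each run of padchar as a single delimiter and drop the padding."""
    pattern = re.escape(padchar) + "+"     # one or more consecutive padding characters
    return [part for part in re.split(pattern, text) if part]


records = split_on_padding("some text_____more text", "_")
```

Because the whole run of padding characters matches as one delimiter, the five underscores produce two records and no empty records in between.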

With read.text() you can transparently read normal delimited text files in clear, compressed (GZIP, BZIP2, XZ(LZMA)), encrypted (FLAM, PGP) or encoded (Base64/32/16) form, normal record-oriented data sets (FB(A/M)/VB(A/M)/VSAM/...), FLAMFILEs, and ZIP or other archive formats from local or remote (SSH) locations.

If you provide row specifications through the table object, each text record is split into neutral FL5 table elements.

Arguments

STRING: FILE['str'/STREAM/DUMMY...] - Name/URL of file to read [''==stdin]
- STREAM - Read from stdin or write to stdout
- DUMMY - Read EOF or write nothing
NUMBER: RECLENGTH=num - Maximum record length (Host) / line length (Unix/Win) for text parsing [512]
SWITCH: SUPTWS - Suppress trailing whitespace [FALSE]
SWITCH: NELDLM - Activate NEL (0x85) as delimiter for ASCII character sets [FALSE]
SWITCH: SKIPEQUAL - Skip conversion if formats are equal (e.g. UTF-8 to UTF-8)
SWITCH: KEEPBOM - Keep byte order mark for faster conversion
SWITCH: CONVBIN - Don't skip character conversion if binary data is detected [FALSE]
NUMBER: RPLFFD=num - Replace form feeds, filling rest of page with blank lines assuming n lines per page [60]
NUMBER: PADCHAR=num - Padding character to be regarded as additional delimiter [0x00]
NUMBER: DECODE=NONE/FIODEC/CRYDEC/CMPDEC/ALWAYS - Decode encoded data first (remove base encodings)
- NONE - No automated decoding
- FIODEC - Base decoding after file IO, before decryption
- CRYDEC - Base decoding after decryption, before decompression
- CMPDEC - Base decoding after decompression, before formatting if XML
- ALWAYS - Always decode encoded data (like CMPDEC but also if TEXT)
SWITCH: FRCBLK - Enforce block orientation on record oriented devices [FALSE]
SWITCH: BINERROR - Enforce an error if binary data detected [FALSE]
SWITCH: CHRERROR - Enforce an error if character data without delimiter detected [FALSE]
SWITCH: NOARCH - Disable the attempt to read archives (prevent multiple opens to the same file) [FALSE]