TEXT

Synopsis

HELP:   Read text data from a file
TYPE:   OBJECT
SYNTAX: TEXT(NET.{},FILE['str'/STREAM/DUMMY...],BLKSIZE=num,RECLENGTH=num,SUPTWS,NELDLM,CCSID='str'/DEFAULT/ASCII/EBCDIC/BOMUTF/BOMUCS/SYSTEM/LOCAL,CHRMODE=STOP/IGNORE/SUBSTITUTE/IDENTITY/TRANSLIT,SKIPEQUAL,USRTABLE='str'/NPAS/SEPA/DELA/DLAX,ONEMAP,COMBINED=NFD/NFC/AUTO/ON/OFF,BOM,KEEPBOM,ENL2LF,CONVBIN,RPLFFD[=num],PADCHAR=num,REGEXP(),DECODE=NONE/FIODEC/CRYDEC/CMPDEC/ALWAYS,DECRYPT[{}...],PRNCONTROL=DETACH/RETAIN/ERASE/REPLACE,SUBSYSTEM(),FRCBLK,REMOVE,BINERROR,CHRERROR,LANG='str',PLATFORM=WIN/UNX/ZOS/USS/VSE/BS2/MAC,OWNER='str',ENVID='str',HASH(),SIGNATURE.{},CHECK,TABLE(),AVSCAN(),NOARCH,PREPROCESS[()...],POSTPROCESS/PSTPRO[()...])

Description

Read text works on blocks of text data or records. First, the text data is converted to the UTF-8 character set. Then the text data is split into record and rest elements based on text delimiters or the record length. The data must contain a valid text delimiter within the provided record/line length. If the data contains 4-byte length fields, the length fields are used to build a block as a list of records. If no delimiter is found, the record length is used to form the text records. If the data is block-oriented (no record length) and neither delimiters nor 4-byte length fields are found, the data is wrapped into UTF-8 records of the provided record length. Wrapping a UTF-8 character stream into records can leave an incomplete multibyte sequence at the end of one record and the remainder of that multibyte character at the beginning of the next record.
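The wrapping effect can be illustrated with a minimal Python sketch. It is not FLAM code; the helper names wrap_bytes and is_complete_utf8 are hypothetical and only show what happens when a UTF-8 byte stream is cut at a fixed record length without regard to character boundaries.

```python
def wrap_bytes(data: bytes, reclen: int) -> list:
    """Cut a byte stream into chunks of reclen bytes, ignoring character boundaries."""
    return [data[i:i + reclen] for i in range(0, len(data), reclen)]


def is_complete_utf8(chunk: bytes) -> bool:
    """Return True if the chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The characters U+00F6 and U+00DF take two bytes each in UTF-8.
data = "Gr\u00f6\u00dfe".encode("utf-8")   # 7 bytes
records = wrap_bytes(data, 3)              # assumed record length of 3 bytes

# The first record ends inside a multibyte character and the second one
# starts with the remaining continuation byte.
complete = [is_complete_utf8(r) for r in records]

# Concatenating the records restores the original byte stream.
restored = b"".join(records)
```

Each record on its own may be invalid UTF-8, but the concatenation of all records is still the original data.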

A text record contains all text between two delimiters, but without trailing whitespace if SUPTWS is used. Trailing whitespace, the delimiter and possibly some padding characters make up the rest element. Taken together, the record and the rest elements represent the original data. The net text record (without the rest element) can be validated using Perl-compatible regular expressions. The reaction to records that don't match the regular expression can be configured: raise a validation error (default), raise an FMTERR for better detection, or ignore the non-matching records and remove them from the output.
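The split into record and rest elements can be sketched in Python as follows. This is an illustration under assumptions, not the FLAM implementation: only CRLF, LF and CR are treated as delimiters, only blanks and tabs count as trailing whitespace, the function name split_records is hypothetical, and Python's re module stands in for a Perl-compatible regular expression engine.

```python
import re


def split_records(text, suptws=False):
    """Return (record, rest) pairs for a text with CRLF, LF or CR delimiters."""
    pairs = []
    # Each match is a run of non-delimiter text followed by a delimiter
    # or by the end of the input.
    for m in re.finditer(r"([^\r\n]*)(\r\n|\n|\r|\Z)", text):
        body, delim = m.group(1), m.group(2)
        if not body and not delim:
            continue  # skip the empty match at the very end of the input
        # With suptws the trailing whitespace moves from the record to the rest.
        record = body.rstrip(" \t") if suptws else body
        rest = body[len(record):] + delim
        pairs.append((record, rest))
    return pairs


data = "alpha  \nbeta42\r\ngamma"
pairs = split_records(data, suptws=True)

# Record and rest elements together represent the original data.
restored = "".join(record + rest for record, rest in pairs)

# Validate the net records (without rest) against a regular expression.
pattern = re.compile(r"[a-z]+")
non_matching = [record for record, _ in pairs if not pattern.fullmatch(record)]
```

Because the rest element keeps everything that was stripped from the record, no information is lost by the split.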

List of valid text delimiters (Unicode codepoints):

The codepoint 0x85 is used for UTF-8 (0xC285) and EBCDIC (0x15). In single-byte ASCII code pages, however, the NELDLM flag must be set to scan for 0x85, because 0x85 is normally used as a currency sign, except in ISO code pages. To prevent the conversion of EBCDIC (0x15) to Unicode (0x85) during character conversion, FLAM supports the ENL2LF and ELF2NL switches.
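The different byte representations of codepoint 0x85 can be checked with Python's built-in codecs. This is only an illustration: cp037 is used as a stand-in EBCDIC code page (an assumption, because IBM-1047 is not part of the Python standard library), and Windows-1252 serves as an example of a single-byte ASCII code page in which byte 0x85 is a printable character rather than a control code.

```python
NEL = "\u0085"  # Unicode codepoint 0x85

nel_view = {
    # U+0085 is encoded as the two bytes C2 85 in UTF-8.
    "utf-8": NEL.encode("utf-8"),
    # The EBCDIC byte 0x15 is converted to U+0085 (cp037 as stand-in).
    "cp037 0x15": b"\x15".decode("cp037"),
    # In ISO-8859-1 the byte 0x85 is the control character U+0085.
    "latin-1 0x85": b"\x85".decode("latin-1"),
    # In Windows-1252 the same byte is a printable character.
    "cp1252 0x85": b"\x85".decode("cp1252"),
}
```

This is why scanning for 0x85 as a delimiter in single-byte ASCII code pages has to be requested explicitly.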

If the text data is detected as being compressed, it is decompressed automatically. For encodings like Base64, automated decoding is useful in most cases. However, if you want to read the base-encoded data as text, you must set the DECODE parameter to the number of automatic decodings to perform. You can extend or limit the number of encoding layers. For example, to retrieve a Base64-encoded text from a GZIP file instead of the decoded version of this text, you can set DECODE=CRYDEC. To decode encoded XML data after decompression you must define DECODE=CMPDEC. This is the default. The base decoding of text can result in valid but nonsensical text. There is no uniqueness for a clear decision in this case. To enforce decoding of possible base encodings where the result is still a text stream, you must set DECODE=ALWAYS.

To decode base-encoded data after decompression, you must specify DECODE=CMPDEC or DECODE=ALWAYS. If the decompressed text is, for example, Base64-encoded XML, decoding must be explicitly activated to process the XML.
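Why base decoding of text is ambiguous can be shown with a small Python sketch. It only uses the standard base64 module and says nothing about how FLAM detects encodings; the helper name try_base64 is hypothetical.

```python
import base64
import binascii


def try_base64(text):
    """Return the decoded bytes if text is formally valid Base64, else None."""
    try:
        return base64.b64decode(text, validate=True)
    except binascii.Error:
        return None


# An ordinary word can be formally valid Base64 ...
decoded = try_base64("Test")

# ... and the result can again be read as text (here as ISO-8859-1),
# although it is nonsensical.
as_text = decoded.decode("latin-1")

# Text with characters outside the Base64 alphabet is rejected.
rejected = try_base64("Hello world")
```

Since both the input and the decoded result are valid text, no automatic decision can tell which of the two was intended.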

For transparent decryption, you must provide the required parameters. Decryption requires at least a key reference as a parameter. Several decryption methods can be enabled at once with the corresponding parameters. Because 4-byte length fields cannot be part of valid text data, they are detected in the data and used automatically to build a list of records.

The character data is converted from the provided CCSID to UTF-8. If no CCSID is provided, auto detection is applied. If this is not successful, a system-dependent default CCSID is used: If available, the CCSID stored in the file system is used. If this is unset or not supported, the environment variable LANG is used. If this is also not set to a valid value, ISO-8859-1 (Latin-1) is used on ASCII systems and IBM-1047 (Open Systems Latin-1) on EBCDIC systems.
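The fallback order described above can be written down as a short Python sketch. The function choose_ccsid and all of its parameters are hypothetical, and the way the character set is taken from LANG (the part after the dot, for example "de_DE.ISO-8859-15") is an assumption for illustration; how FLAM evaluates LANG and checks for supported values is not covered here.

```python
def choose_ccsid(provided=None, detected=None, fs_ccsid=None, lang=None, ebcdic=False):
    """Pick a CCSID following the documented fallback order."""
    if provided:
        return provided      # 1. CCSID given explicitly
    if detected:
        return detected      # 2. result of the auto detection
    if fs_ccsid:
        return fs_ccsid      # 3. CCSID stored in the file system
    if lang and "." in lang:
        # 4. character set part of LANG, without an optional "@modifier"
        charset = lang.split(".", 1)[1].split("@", 1)[0]
        if charset:
            return charset
    # 5. platform-dependent default
    return "IBM-1047" if ebcdic else "ISO-8859-1"


explicit = choose_ccsid(provided="IBM-1141", detected="UTF-8")
from_lang = choose_ccsid(lang="de_DE.ISO-8859-15@euro")
ascii_default = choose_ccsid(lang="C")
ebcdic_default = choose_ccsid(ebcdic=True)
```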

During text formatting, form feed characters can optionally be replaced by empty records.

A padding character may be defined which is regarded as a special delimiter. Consecutive occurrences of this padding character are regarded as one delimiter, i.e. the text "some text_____more text" will become two records "some text" and "more text". Not being a part of the original data, padding characters are simply dropped and are not part of the rest element.
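The padding behavior can be reproduced with a minimal Python sketch. The function name split_on_padding is hypothetical, and dropping empty pieces caused by leading or trailing padding is an assumption of this sketch.

```python
import re


def split_on_padding(text, padchar):
    """Split text at runs of padchar; the padding itself is dropped."""
    # One or more consecutive padding characters act as a single delimiter.
    parts = re.split(re.escape(padchar) + "+", text)
    # Assumption: empty pieces from leading or trailing padding are discarded.
    return [part for part in parts if part]


records = split_on_padding("some text_____more text", "_")
trailing = split_on_padding("some text___", "_")
```

With "_" as the padding character, the example text from above yields the two records "some text" and "more text".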

With read.text() you can transparently read normal text files with delimiters in clear, compressed (GZIP, BZIP2, XZ(LZMA)), encrypted (FLAM, PGP) or encoded (Base64/32/16) form, normal record-oriented data sets (FB(A/M)/VB(A/M)/VSAM/...), FLAMFILEs, ZIP or other archive formats from local or remote (SSH) locations.

If you provide row specifications through the table object, each text record is split into neutral FL5 table elements.

Arguments