
CSV Encoding Normalization Tool — Auto-Detect & Convert

Stop maintaining 40 normalization scripts. One rule file per supplier — version-controlled, drift-detecting, zero manual encoding fixes.

```shell
# Detect: scan supplier CSV, save schema
flfcsv IN='supplier_jan.csv' NORUN ROWOUT='rules/supplier.tab'

# Normalize: apply saved schema, write clean RFC-4180 UTF-8 CSV
flfcsv IN='supplier_feb.csv' ROWIN='rules/supplier.tab' \
  OUT='normalized/supplier_feb.csv'

# Check (CI/CD gate): exit non-zero if any field needs normalization
flfcsv IN='supplier_mar.csv' ROWIN='rules/supplier.tab' \
  OUT=DUMMY

# Detect with hints (known Windows-1252 semicolon file)
flfcsv IN='erp_export.csv' NORUN ROWOUT='rules/erp.tab' \
  RULES(ENCODING='Windows-1252' SEPARATOR=SEMICOLON DECIMAL=COMMA)

# Normalize in-place: delete original, rename output
flfcsv IN='supplier.csv' ROWIN='rules/supplier.tab' \
  OUT='supplier.tmp' REMOVE RENAME='supplier.csv'
```
Auto-detects 20+ encodings · EBCDIC supported · Git-friendly rule files

The 40-supplier problem

📁

40 suppliers, 40 Python scripts

Every supplier delivers CSV in a different encoding, date format, or decimal separator. The result: one fragile normalization script per data feed — impossible to maintain at scale.

⚠️

Supplier changes ERP, import fails silently

A supplier upgrades their ERP and starts delivering Windows-1252 instead of UTF-8. Your pipeline accepts the file but silently corrupts every umlaut and special character downstream.
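
The failure mode is easy to reproduce in a few lines of plain Python (illustrative only, no flfcsv involved): a Windows-1252 export pushed through a pipeline that assumes UTF-8.

```python
# Supplier's new ERP writes Windows-1252 instead of UTF-8:
raw = "Grüße aus Köln".encode("cp1252")

# Strict UTF-8 decoding fails loudly on the first umlaut byte:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)

# The common errors="replace" workaround "succeeds" -- by silently
# turning every umlaut into the U+FFFD replacement character:
mangled = raw.decode("utf-8", errors="replace")
print(mangled)
```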

🐛

pandas issue #55197 — open since 2023

"Auto-detect encoding" has been a requested pandas feature for years. There is no built-in fix. Workarounds require chardet, manual configuration, or accepting corrupted data in production.

🔍

chardet solves only half the problem

chardet detects character encoding — but it cannot detect date formats, decimal separators, thousands separators, or null representations. You still need custom logic for everything else.

How flfcsv works

1

Run flfcsv NORUN to detect and save the schema

flfcsv analyzes the CSV, detects encoding, date format, decimal and thousands separators, and null representations. It writes a compact binary schema file via ROWOUT, ready for version control.

```shell
# Detect encoding, separator, decimal sep, null sentinels, column types
flfcsv IN='supplier17_jan.csv' NORUN ROWOUT='rules/supplier17.tab'

# With explicit hints when encoding is known:
flfcsv IN='erp_export.csv' NORUN ROWOUT='rules/erp.tab' \
  RULES(ENCODING='Windows-1252' DECIMAL=COMMA)

# Output: rules/supplier17.tab (binary schema, like flbcsv ROWOUT)
# Detected parameters are printed to stdout:
#   encoding: Windows-1252   separator: ;   decimal_sep: , ...
```
2

Check the schema file into Git

The tab file is binary but small — version-controlled and reviewable in pull requests. Your team sees exactly what is expected from each supplier, and any format change detected at runtime shows up as a drift alert.

```shell
# Tab files are binary but small -- check them into version control
git add rules/supplier17.tab
git commit -m "Add encoding schema for supplier 17 (Windows-1252)"

# Future supplier format changes show up at runtime via CHECK:
# flfcsv OUT=DUMMY → exits non-zero when normalization is needed
```
3

Run flfcsv ROWIN=... in your pipeline

Every subsequent file from that supplier is normalized to UTF-8, ISO-8601 dates, dot decimal separator, and empty string for nulls — deterministic and reproducible from the schema file.

```shell
# Normalize all incoming files from supplier 17
flfcsv IN='supplier17_feb.csv' ROWIN='rules/supplier17.tab' \
  OUT='normalized/supplier17_feb.csv'

# Output is always:
#   Encoding:  UTF-8 (no BOM)
#   Separator: comma
#   Dates:     YYYY-MM-DD (ISO 8601)
#   Decimals:  dot separator, no thousands separator
#   Nulls:     empty field
```

Bonus: Drift Detection with OUT=DUMMY

Run flfcsv with OUT=DUMMY before normalization in CI/CD to catch supplier format changes before bad data enters your warehouse. Exits non-zero if any normalization would be needed — a clean file exits 0. 🕔 Structured JSON REPORT coming in next release

```shell
# CI/CD drift check -- exits non-zero if normalization would be needed
flfcsv IN='supplier17_mar.csv' ROWIN='rules/supplier17.tab' \
  OUT=DUMMY
```

What flfcsv detects and normalizes

Character Encoding

UTF-8 (with and without BOM), UTF-16 LE/BE, Windows-1252, ISO-8859-1, ISO-8859-15, and EBCDIC with all common IBM CCSIDs (037, 273, 500, 1047, and more). Detection is heuristic — no BOM or file extension hint required.
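
The ASCII-vs-EBCDIC half of that heuristic can be illustrated in a few lines of Python. This is a toy sketch, not flfcsv's actual detector: it exploits the fact that EBCDIC text clusters in the high byte range.

```python
def looks_like_ebcdic(data: bytes) -> bool:
    """Toy heuristic: is this byte stream EBCDIC rather than ASCII-family?

    In EBCDIC, space is 0x40 and letters sit roughly in 0x81-0xE9;
    in ASCII-family encodings, space is 0x20 and letters sit below 0x80.
    """
    ebcdic_score = sum(b == 0x40 or 0x81 <= b <= 0xE9 for b in data)
    ascii_score = sum(b == 0x20 or 0x41 <= b <= 0x7A for b in data)
    return ebcdic_score > ascii_score

sample = "Rechnung 4711;Betrag;Datum"
print(looks_like_ebcdic(sample.encode("cp037")))   # EBCDIC CCSID 037
print(looks_like_ebcdic(sample.encode("utf-8")))
```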

Date Formats

MM/DD/YYYY, DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY with English and German month names, and Excel serial date numbers in both the 1900 and 1904 date systems. All normalized to ISO-8601 (YYYY-MM-DD).
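
The Excel serial date convention can be sketched in a few lines of Python (the standard conversion, not flfcsv internals). The 1900 system counts from a phantom epoch of 1899-12-30 to absorb Excel's fictitious 1900-02-29 (serial 60); the 1904 system, used by classic Mac Excel, counts from 1904-01-01.

```python
from datetime import date, timedelta

def excel_serial_1900(serial: int) -> date:
    # Correct for serials >= 61, i.e. dates from 1900-03-01 onward
    # (below that, Excel's 1900 leap-year bug shifts results by a day).
    return date(1899, 12, 30) + timedelta(days=serial)

def excel_serial_1904(serial: int) -> date:
    return date(1904, 1, 1) + timedelta(days=serial)

print(excel_serial_1900(61).isoformat())    # 1900-03-01
# The two systems are exactly 1462 days apart:
print(excel_serial_1900(45292) == excel_serial_1904(45292 - 1462))
```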

Decimal and Thousands Separators

Detects dot, comma, apostrophe, and space as thousands separators, and dot or comma as decimal separator — including ambiguous files where context is required. Normalized to dot decimal, no thousands separator.
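
A minimal Python sketch of the disambiguation idea (not flfcsv's detector): treat the right-most of '.' and ',' as the decimal separator and strip everything else as grouping.

```python
def normalize_number(raw: str) -> str:
    """Toy separator disambiguation for a single numeric field."""
    raw = raw.replace("'", "").replace(" ", "")   # Swiss/French grouping
    last_dot, last_comma = raw.rfind("."), raw.rfind(",")
    if last_comma > last_dot:                     # German style: 1.234,56
        return raw.replace(".", "").replace(",", ".")
    return raw.replace(",", "")                   # US style: 1,234.56

# Genuinely ambiguous values like "1.234" (thousands or decimal?) need
# column-wide context, which this single-field sketch ignores.
print(normalize_number("1.234,56"))   # 1234.56
print(normalize_number("1,234.56"))   # 1234.56
print(normalize_number("1'234.56"))   # 1234.56
```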

Null Value Normalization

Recognizes empty string, "N/A", "NULL", "null", "–", "#N/A", "-999", and other common sentinel values as null. All are normalized to empty string or a configurable target representation.
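
In Python the sentinel-matching idea looks roughly like this (an illustration only; the real sentinel list and matching rules live in the rule file):

```python
# Lower-cased sentinel values, per the list above:
NULL_SENTINELS = {"", "n/a", "null", "–", "#n/a", "-999"}

def normalize_null(field: str, target: str = "") -> str:
    """Map known null sentinels to the target representation."""
    if field.strip().lower() in NULL_SENTINELS:
        return target
    return field

print(normalize_null("N/A"))    # ''
print(normalize_null("-999"))   # ''
print(normalize_null("42.5"))   # '42.5'
```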

Drift Detection

Compare any incoming file against its rule file and gate on the exit code: 0 on clean, non-zero on drift — ready for any CI/CD system. A structured JSON report of every detected deviation ships in the next release.
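
The gate logic can be pictured in a few lines of Python (illustrative; flfcsv performs this comparison internally against the binary rule file, and the parameter names below are made up for the sketch):

```python
def drift_exit_code(saved_rules: dict, detected: dict) -> int:
    """Return 0 when the file matches its rule file, 1 on drift."""
    drift = {k: (saved_rules[k], detected.get(k))
             for k in saved_rules if saved_rules[k] != detected.get(k)}
    for key, (expected, actual) in drift.items():
        print(f"drift: {key}: expected {expected!r}, got {actual!r}")
    return 1 if drift else 0

saved    = {"encoding": "Windows-1252", "separator": ";", "decimal": ","}
detected = {"encoding": "UTF-8",        "separator": ";", "decimal": ","}
print(drift_exit_code(saved, detected))   # 1 -- supplier switched encoding
print(drift_exit_code(saved, saved))      # 0 -- clean file
```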

Batch Mode

Process entire inbox directories in one command: IN='inbox/*.csv' with a rule directory applies the correct schema file to each supplier's files and writes all outputs to a target directory.
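
The file-to-rule mapping can be sketched in Python. The naming convention here is an assumption for illustration (supplier id is the part of the file name before the first underscore); flfcsv's actual matching rules may differ.

```python
from pathlib import Path

def rule_for(csv_path: Path, rules_dir: Path) -> Path:
    # Assumed convention: 'inbox/supplier17_feb.csv' -> 'rules/supplier17.tab'
    supplier = csv_path.stem.split("_")[0]
    return rules_dir / f"{supplier}.tab"

print(rule_for(Path("inbox/supplier17_feb.csv"), Path("rules")))
print(rule_for(Path("inbox/erp_export.csv"), Path("rules")))
```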

📋

REPORT — revision audit trail 🕔 next release

Every normalization action will be logged: which field, which row, original value, normalized value, and change type. Add REPORT= to any normalization run for a machine-readable audit trail. MAXDRIFT= controls the exit code threshold for CI/CD gates. Available in next release.
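
As a purely hypothetical sketch of what one machine-readable audit entry could look like (the REPORT format is not yet published; the field names below illustrate the five attributes listed above, they are not the real schema):

```python
import json

entry = {
    "row": 17,
    "field": "amount",
    "original": "1.234,56",
    "normalized": "1234.56",
    "change": "decimal_separator",
}
print(json.dumps(entry, sort_keys=True))
```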

🔄

In-place normalization (REMOVE + RENAME)

Process a file and atomically replace the original: write to a temp file, delete the original (REMOVE), rename the output to the original name (RENAME). Safe ordering guarantees no data loss even if RENAME fails — the temp file remains intact.
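
The ordering guarantee is the same one you would implement by hand. A plain-Python sketch (not flfcsv internals) of why no step can lose data:

```python
import os
import tempfile

def replace_in_place(path: str, new_content: str) -> None:
    """Write-temp / remove-original / rename ordering, as described above."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(new_content)          # 1. normalized output goes to a temp file
        f.flush()
        os.fsync(f.fileno())          #    ... and is forced to stable storage
    os.remove(path)                   # 2. REMOVE the original
    os.rename(tmp, path)              # 3. RENAME temp to the original name

# Demo on a throwaway file:
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "supplier.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("Betrag;Datum\n1.234,56;31.01.2024\n")
replace_in_place(csv_path, "Betrag,Datum\n1234.56,2024-01-31\n")
```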

How flfcsv compares

| Feature | pandas | chardet | OpenRefine | CSVNormalize.com | Easy Data Transform | flfcsv |
|---|---|---|---|---|---|---|
| Auto-detect character encoding | ✗ | Partial | ✗ | ✗ | ✗ | ✓ |
| EBCDIC all CCSIDs | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Auto-detect date format | Manual | ✗ | Partial | ✗ | Partial | ✓ |
| Detect decimal / thousands separator | ✗ | ✗ | ✗ | ✗ | Partial | ✓ |
| Null value normalization | na_values | ✗ | Manual | ✗ | Partial | ✓ |
| Git-friendly rule file | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Drift detection / CI/CD integration | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Excel serial date support | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Command-line / scriptable | Python only | Python only | ✗ | ✗ | ✗ | ✓ |
| Free tier (no credit card) | ✓ OSS | ✓ OSS | ✓ OSS | ✗ | ✗ | ✓ 50 files/mo |

Start for free

Free
€0 / month
  • 50 files per month
  • All encodings including EBCDIC
  • Full date and decimal normalization
  • Binary rule file output (ROWOUT)
  • Community support
Download Free

No credit card required

AIX / Midrange
€49 / month
  • Unlimited files
  • All Linux tier features
  • IBM AIX on Power (32- & 64-bit)
  • Solaris SPARC and Solaris x86
  • Custom EBCDIC CCSID profiles
  • Priority support (SLA 4h)
  • Invoice billing available
Contact Sales

Annual billing: €499/year (save 15%)

Frequently asked questions

What encodings does flfcsv detect automatically?
flfcsv detects over 20 character encodings without any configuration. On the Unicode side: UTF-8 (with and without BOM), UTF-16 LE, and UTF-16 BE. On the legacy side: Windows-1252, ISO-8859-1, ISO-8859-15, and Latin-1 variants. For mainframe data: EBCDIC with all common IBM CCSIDs including 037 (US), 273 (Germany/Austria), 500 (International), 1047 (Unix), and custom CCSID profiles available in the Business tier. Detection is heuristic and does not require a BOM or file extension hint.
What is the difference between NORUN mode and normalize mode?
flfcsv uses a single command with two modes. NORUN mode (equivalent to 'learn') scans a sample file, detects encoding, separator, decimal separator, date format, null sentinels and column types, and writes the result to a binary schema file via ROWOUT. This schema file can be checked into version control. Normalization mode (ROWIN + OUT) loads the saved schema and applies it to every subsequent file from the same supplier, producing deterministic UTF-8/RFC-4180 output. Using OUT=DUMMY is the 'check' mode: flfcsv applies normalization logic without writing output and exits with a non-zero code if any changes would have been needed — making it suitable as a CI/CD gate. Structured JSON reporting via REPORT= is available in the next release.
Can flfcsv handle EBCDIC files from mainframe systems?
Yes. flfcsv is built on the same EBCDIC conversion engine used by FLAM, which has handled mainframe data for over 30 years. It supports all standard IBM CCSIDs out of the box. For Pro and Business users, custom CCSID profiles can be defined for non-standard codepage mappings. The Business tier also supports deployment on z/OS itself, allowing normalization to happen directly on the mainframe before any data transfer takes place.
Can flfcsv detect the exact EBCDIC codepage automatically?
flfcsv reliably distinguishes ASCII from EBCDIC via byte-level patterns. Within each family, the specific codepage (IBM-1141 vs IBM-1047 for EBCDIC; Latin-1 vs Windows-1252 for ASCII) is not automatically detectable when only characters below 0x80 are present. Provide an explicit hint via RULES(ENCODING='IBM-1141') or set the FL5_EBCDIC_CCSID environment variable. When diacritic bytes are found and no explicit CCSID is set, flfcsv falls back to a system default and the affected columns are flagged in the normalization output. Files mixing multiple EBCDIC codepages cannot be automatically decoded and require explicit preprocessing.
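Python's built-in EBCDIC codecs make the ambiguity concrete (a demonstration of the limitation, not of flfcsv): text from the EBCDIC invariant repertoire encodes to identical bytes under the US (CCSID 037) and German (CCSID 273) codepages, so no detector can tell them apart until an umlaut appears.

```python
# Invariant characters (A-Z, digits, ; - .) encode identically:
text = "INVOICE;2024-01-31;1234.56"
print(text.encode("cp037") == text.encode("cp273"))   # True

# Only national characters such as umlauts disambiguate the codepage:
print("ä".encode("cp037") == "ä".encode("cp273"))     # False
```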
How does drift detection work?
Run flfcsv with OUT=DUMMY. flfcsv applies all normalization rules and discards the output. The exit code is non-zero if any normalization was needed (0 means the file was already clean). Structured JSON reporting and fine-grained MAXDRIFT thresholds are available in the next release.
Can I use flfcsv in a CI/CD pipeline?
Yes, flfcsv is designed for pipeline use. It is a command-line tool with deterministic exit codes: 0 on success, non-zero on error or drift. All output is written to files or stdout — no GUI, no interactive prompts. The JSON drift reports can be consumed by downstream steps or stored as build artifacts. Pro tier users receive a CI/CD integration guide with example configurations for GitHub Actions, GitLab CI, and Jenkins.
How is flfcsv different from chardet or pandas?
chardet detects character encoding only — it cannot detect date formats, decimal separators, thousands separators, or null representations. pandas has no built-in encoding auto-detection (see pandas issue #55197, open since 2023) and its na_values parameter requires manual configuration per file. Neither tool produces a versioned, reusable rule file, and neither has drift detection. flfcsv solves the entire normalization problem — encoding, dates, decimals, nulls — with a single scriptable command and a Git-friendly rule file that codifies what you expect from each data source.

Stop writing CSV normalization scripts

Free up to 50 files/month. No credit card required.

Start for Free