Stop maintaining 40 normalization scripts. One JSON rule file per supplier — version-controlled, drift-detecting, zero manual encoding fixes.
Every supplier delivers CSV in a different encoding, date format, or decimal separator. The result: one fragile normalization script per data feed — impossible to maintain at scale.
A supplier upgrades their ERP and starts delivering Windows-1252 instead of UTF-8. Your pipeline accepts the file but silently corrupts every umlaut and special character downstream.
"Auto-detect encoding" has been a requested pandas feature for years. There is no built-in fix. Workarounds require chardet, manual configuration, or accepting corrupted data in production.
chardet detects character encoding — but it cannot detect date formats, decimal separators, thousands separators, or null representations. You still need custom logic for everything else.
flfcsv NORUN to detect and save the schemaflfcsv analyzes the CSV, detects encoding, date format, decimal and thousands separators, and null representations. It writes a compact binary schema file via ROWOUT, ready for version control.
The tab file is binary but small — version-controlled and reviewable in pull requests. Your team sees exactly what is expected from each supplier, and any format change detected at runtime shows up as a drift alert.
flfcsv ROWIN=... in your pipelineEvery subsequent file from that supplier is normalized to UTF-8, ISO-8601 dates, dot decimal separator, and empty string for nulls — deterministic and reproducible from the schema file.
OUT=DUMMYRun flfcsv with OUT=DUMMY before normalization in CI/CD to catch supplier format changes before bad data enters your warehouse. Exits non-zero if any normalization would be needed — a clean file exits 0. 🕔 Structured JSON REPORT coming in next release
UTF-8 (with and without BOM), UTF-16 LE/BE, Windows-1252, ISO-8859-1, ISO-8859-15, and EBCDIC with all common IBM CCSIDs (037, 273, 500, 1047, and more). Detection is heuristic — no BOM or file extension hint required.
MM/DD/YYYY, DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY in English and German month names, and Excel serial date numbers in both 1900 and 1904 base. All normalized to ISO-8601 (YYYY-MM-DD).
Detects dot, comma, apostrophe, and space as thousands separators, and dot or comma as decimal separator — including ambiguous files where context is required. Normalized to dot decimal, no thousands separator.
Recognizes empty string, "N/A", "NULL", "null", "–", "#N/A", "-999", and other common sentinel values as null. All are normalized to empty string or a configurable target representation.
Compare any incoming file against its rule file and get a structured JSON report of every detected deviation. Exit code 0 on clean, non-zero on drift — ready for any CI/CD system.
Process entire inbox directories in one command: IN='inbox/*.csv' with a rule directory applies the correct schema file to each supplier's files and writes all outputs to a target directory.
Every normalization action will be logged: which field, which row, original value, normalized value, and change type. Add REPORT= to any normalization run for a machine-readable audit trail. MAXDRIFT= controls the exit code threshold for CI/CD gates. Available in next release.
Process a file and atomically replace the original: write to a temp file, delete the original (REMOVE), rename the output to the original name (RENAME). Safe ordering guarantees no data loss even if RENAME fails — the temp file remains intact.
| Feature | pandas | chardet | OpenRefine | CSVNormalize.com | Easy Data Transform | flfcsv |
|---|---|---|---|---|---|---|
| Auto-detect character encoding | Partial | ✓ | ✓ | ✓ | ✓ | ✓ |
| EBCDIC all CCSIDs | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Auto-detect date format | ✗ | ✗ | Manual | Partial | Partial | ✓ |
| Detect decimal / thousands separator | ✗ | ✗ | ✗ | Partial | ✓ | ✓ |
| Null value normalization | na_values | ✗ | Manual | Partial | ✓ | ✓ |
| Git-friendly rule file | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Drift detection / CI/CD integration | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Excel serial date support | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Command-line / scriptable | Python only | Python only | ✗ | ✗ | ✗ | ✓ |
| Free tier (no credit card) | ✓ OSS | ✓ OSS | ✓ OSS | ✗ | ✗ | ✓ 50 files/mo |
No credit card required
Annual billing: €296/year (save 15%)
Annual billing: €499/year (save 15%)
RULES(ENCODING='IBM-1141') or set the FL5_EBCDIC_CCSID environment variable. When diacritic bytes are found and no explicit CCSID is set, flfcsv falls back to a system default and the affected columns are flagged in the normalization output. Files mixing multiple EBCDIC codepages cannot be automatically decoded and require explicit preprocessing.
OUT=DUMMY. flfcsv applies all normalization rules and discards the output. The exit code is non-zero if any normalization was needed (0 means the file was already clean). Structured JSON reporting and fine-grained MAXDRIFT thresholds are available in the next release.
na_values parameter requires manual configuration per file. Neither tool produces a versioned, reusable rule file, and neither has drift detection. flfcsv solves the entire normalization problem — encoding, dates, decimals, nulls — with a single scriptable command and a Git-friendly rule file that codifies what you expect from each data source.
Free up to 50 files/month. No credit card required.
Start for Free