
About

Cucumber data tables follow a strict pipe-delimited format defined by the Gherkin syntax specification. Manual conversion from CSV introduces alignment errors, broken quoting, and inconsistent column widths that cause step definition failures at runtime. This tool parses CSV input using an RFC 4180-compliant finite state machine, correctly handling double-quoted fields, escaped quotes (""), embedded newlines, and mixed delimiters. It auto-detects whether your CSV uses commas, semicolons, or tabs, then outputs a column-aligned Cucumber DataTable where every pipe character sits in a vertical line. The parser processes each character in O(n) time with no backtracking.

Note: this tool assumes well-formed CSV. Malformed input with mismatched quotes will be handled gracefully by treating unmatched quotes as literal characters, but the result may not match your intent. Pro tip: if your CSV originates from Excel on European locales, expect semicolon delimiters rather than commas. The auto-detection handles this, but verify the output on the first run.
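The overall conversion can be sketched in a few lines of Python. This illustration leans on the standard library's csv module rather than the tool's own RFC 4180 state machine, and the function name csv_to_datatable is hypothetical, not the tool's API:

```python
import csv
import io


def csv_to_datatable(text: str) -> str:
    # Sketch only: stdlib csv parsing stands in for the tool's state machine.
    # Sanitize cells: embedded newlines become spaces, literal | becomes \|.
    rows = [[c.replace("\n", " ").replace("|", r"\|") for c in rec]
            for rec in csv.reader(io.StringIO(text))]
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]  # pad short rows
    # Column widths: longest cell in each column determines the padding.
    widths = [max(len(r[j]) for r in rows) for j in range(ncols)]
    return "\n".join(
        "| " + " | ".join(r[j].ljust(widths[j]) for j in range(ncols)) + " |"
        for r in rows
    )
```

Feeding it a two-record input such as `name,role` followed by a quoted cell containing a comma yields a two-row table with every pipe vertically aligned.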


Formulas

The conversion follows a two-phase pipeline: parse, then format.

parse(csv) → R (m × n)

Where csv is the raw input string, R is a two-dimensional array of m rows and n columns. The parser operates as a finite state machine with three states: FIELD_START, UNQUOTED, and QUOTED.
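A minimal sketch of such a three-state machine, assuming a fixed comma delimiter (the real tool auto-detects it) and treating unmatched quotes as literal characters:

```python
def parse(text: str, delim: str = ",") -> list[list[str]]:
    FIELD_START, UNQUOTED, QUOTED = range(3)  # the three parser states
    rows, row, field = [], [], []
    state = FIELD_START
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if state == QUOTED:
            if ch == '"':
                if text[i + 1:i + 2] == '"':   # escaped quote "" -> "
                    field.append('"')
                    i += 1
                else:
                    state = UNQUOTED           # closing quote
            else:
                field.append(ch)               # delimiters/newlines are literal here
        elif ch == '"' and state == FIELD_START:
            state = QUOTED
        elif ch == delim:
            row.append("".join(field))
            field = []
            state = FIELD_START
        elif ch in "\r\n":
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1                         # accept CR, LF, or CRLF
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
            state = FIELD_START
        else:
            field.append(ch)                   # stray " in UNQUOTED stays literal
            state = UNQUOTED
        i += 1
    if field or row:                           # flush last record (no trailing newline)
        row.append("".join(field))
        rows.append(row)
    return rows
```

Each character is examined exactly once with no backtracking, which is what gives the O(n) bound.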

w_j = max_{i=0..m-1} len(R[i][j])

For each column j, compute the maximum cell width w_j. Each cell in column j is then right-padded with spaces to width w_j.
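The width computation is a straightforward maximum over each column; a sketch in Python, assuming rows have already been normalized to equal length:

```python
def column_widths(rows: list[list[str]]) -> list[int]:
    # w_j = max over i of len(R[i][j]); rows are assumed equal length
    return [max(len(row[j]) for row in rows) for j in range(len(rows[0]))]
```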

row(i) = | R[i][0].pad(w_0) | R[i][1].pad(w_1) | ... | R[i][n-1].pad(w_{n-1}) |

Where pad is a left-aligned space-fill function. Any literal pipe character | inside a cell value is escaped to \| per the Gherkin specification to prevent parser ambiguity. Time complexity is O(m·n) for both phases.
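A sketch of the row formatter, with escaping applied before padding; for exact alignment the widths passed in are assumed to have been computed from the already-escaped cell values:

```python
def format_row(cells: list[str], widths: list[int]) -> str:
    # Escape literal pipes first so they cannot be read as column separators,
    # then left-align each cell to its column width.
    escaped = [c.replace("|", r"\|") for c in cells]
    return "| " + " | ".join(c.ljust(w) for c, w in zip(escaped, widths)) + " |"
```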

Reference Data

| CSV Feature            | RFC 4180 Rule                | This Tool                        |
| Comma delimiter        | Default separator            | Auto-detected                    |
| Semicolon delimiter    | Not in spec (locale variant) | Auto-detected                    |
| Tab delimiter          | Not in spec (TSV variant)    | Auto-detected                    |
| Double-quoted fields   | Fields MAY be enclosed in "  | Fully supported                  |
| Escaped quotes         | "" within quoted field       | Unescaped to single "            |
| Embedded newlines      | Allowed inside quoted fields | Preserved as space in output     |
| Trailing CRLF          | Optional on last record      | Trimmed                          |
| Empty fields           | Allowed (,,)                 | Rendered as empty padded cell    |
| Header row             | Optional first record        | Treated as first data row        |
| Whitespace padding     | Significant inside quotes    | Trimmed unless quoted            |
| BOM (Byte Order Mark)  | Not addressed                | Stripped if present              |
| Mixed line endings     | CRLF required                | Accepts CR, LF, or CRLF          |
| Column count mismatch  | Should be uniform            | Pads short rows with empty cells |
| Cucumber pipe escaping | N/A (Gherkin spec)           | Literal "|" escaped to "\|"      |
| Max columns            | No limit                     | No limit                         |
| Max rows               | No limit                     | Tested to 50,000 rows            |

Frequently Asked Questions

How is the delimiter auto-detected?

The parser counts occurrences of comma, semicolon, and tab characters in the first 5 lines of unquoted text. The character with the highest consistent count across those lines is selected as the delimiter. If all counts are zero or tied, comma is used as the RFC 4180 default.
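A sketch of that heuristic in Python; for brevity it counts over raw lines rather than restricting itself to unquoted text the way the tool does:

```python
def detect_delimiter(text: str, sample_lines: int = 5) -> str:
    # Count candidate delimiters over the first few non-empty lines.
    lines = [line for line in text.splitlines()[:sample_lines] if line]
    counts = {d: [line.count(d) for line in lines] for d in (",", ";", "\t")}
    best, best_count = ",", 0  # comma is the RFC 4180 default (also wins ties)
    for delim, per_line in counts.items():
        # "Consistent" means the same nonzero count on every sampled line.
        if per_line and min(per_line) == max(per_line) and per_line[0] > best_count:
            best, best_count = delim, per_line[0]
    return best
```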
How are newlines inside quoted fields handled?

Per RFC 4180, newlines within double-quoted fields are part of the field value, not record separators. This tool replaces embedded newlines with a single space in the Cucumber output, because Gherkin data tables do not support multi-line cell values.
Why are pipe characters escaped?

Literal pipe characters (|) are escaped to \| in the output. The Gherkin parser interprets unescaped pipes as column delimiters, so failing to escape them would break the data table structure and cause step definition binding errors.
Is the first row treated as a header?

All rows are output identically. In Cucumber, the first row of a data table is conventionally treated as a header by step definitions, but the table format itself does not distinguish headers. The tool preserves row order exactly as provided.
What happens if rows have different numbers of columns?

The tool normalizes all rows to the length of the longest row. Short rows are padded with empty cells on the right. This prevents alignment errors and ensures the Cucumber parser does not reject the table.
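That normalization is a one-liner per row; a sketch:

```python
def normalize_rows(rows: list[list[str]]) -> list[list[str]]:
    # Pad every row on the right with empty cells to the longest row's length.
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]
```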
Does the tool handle a byte order mark (BOM)?

Yes. The parser strips the UTF-8 BOM (byte order mark, U+FEFF) if present at position 0 of the input. Excel on Windows adds this by default when saving as "CSV UTF-8". Failing to strip it would corrupt the first cell value.
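The check itself is trivial; a sketch:

```python
def strip_bom(text: str) -> str:
    # Remove a leading U+FEFF, e.g. from Excel's "CSV UTF-8" export, so it
    # does not end up glued to the first cell value.
    return text[1:] if text.startswith("\ufeff") else text
```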