About

Counting rows in a CSV file is not equivalent to counting newline characters. A conformant CSV per RFC 4180 permits quoted fields containing embedded newlines, meaning a single logical row can span multiple physical lines. Naive line-counting tools (wc -l, text editor line numbers) will overcount in these cases, producing incorrect data inventories. This matters when validating ETL pipelines, checking data exports against source record counts, or estimating processing time for batch operations where the row count drives resource allocation.
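As a quick illustration of the gap between physical lines and logical rows, the sketch below compares a plain newline count with Python's standard `csv` module, which is quote-aware. The sample string is invented for this example.

```python
import csv
import io

# Three logical rows; the second row has a newline inside a quoted field.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# What a naive line counter sees: one "row" per newline character.
physical_lines = sample.count("\n")

# What a quote-aware parser sees: one row per logical record.
# newline="" stops StringIO from translating line endings before csv reads them.
logical_rows = sum(1 for _ in csv.reader(io.StringIO(sample, newline="")))

print(physical_lines, logical_rows)  # 4 3
```

The naive count is one higher than the logical count because the sample contains exactly one embedded newline.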

This tool implements a state-machine parser that tracks whether the cursor is inside a quoted field, so it can distinguish newlines that are part of the data from newlines that terminate a row. It auto-detects the delimiter (comma, semicolon, tab, pipe) by frequency analysis of the first 5 lines, strips BOM prefixes, and excludes trailing empty lines. For files exceeding 1 MB, parsing is offloaded to a Web Worker to keep the interface responsive. Limitation: this tool assumes UTF-8 or ASCII encoding. Files in UTF-16 or in multi-byte legacy encodings such as Shift-JIS may produce incorrect counts when part of a multi-byte sequence aliases a delimiter or quote character. Single-byte, ASCII-compatible encodings such as Windows-1252 encode quotes, delimiters, and line breaks exactly as ASCII does, so they should not change the row count.

Formulas

The row counting algorithm uses a finite state machine with two states: UNQUOTED and QUOTED. The transition rules determine whether a newline character increments the row counter R.

R = 0, state = UNQUOTED

For each character c in input:

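\[
\begin{cases}
\text{state} \to \text{QUOTED} & \text{if } c = \texttt{"} \land \text{state} = \text{UNQUOTED} \\
\text{state} \to \text{UNQUOTED} & \text{if } c = \texttt{"} \land \text{state} = \text{QUOTED} \land \text{next} \neq \texttt{"} \\
R = R + 1 & \text{if } c \in \{\text{LF}, \text{CRLF}\} \land \text{state} = \text{UNQUOTED} \\
\text{skip} & \text{if } c \in \{\text{LF}, \text{CRLF}\} \land \text{state} = \text{QUOTED}
\end{cases}
\]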

After full traversal, if the file does not end with a newline and the last row contains data, R is incremented by 1. The total data row count is then R − 1 if the header toggle is enabled, otherwise R.
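The rules above can be sketched in Python as follows. This is an illustration, not the tool's actual source: the name `count_rows`, the `has_header` parameter, and the bookkeeping variables are hypothetical. It also treats a standalone CR as a row terminator and strips a leading BOM, both of which this page describes elsewhere rather than in the formula.

```python
def count_rows(text: str, has_header: bool = False) -> int:
    """Count logical CSV rows with a two-state (UNQUOTED/QUOTED) machine."""
    if text.startswith("\ufeff"):
        text = text[1:]                  # drop a BOM decoded as U+FEFF

    rows = 0                # R in the formula
    in_quotes = False       # False = UNQUOTED, True = QUOTED
    row_has_data = False    # has the current row seen any character?
    trailing_empty = 0      # run of empty rows at the current position
    i, n = 0, len(text)

    while i < n:
        c = text[i]
        if c == '"':
            if in_quotes and i + 1 < n and text[i + 1] == '"':
                i += 1                   # escaped quote "": consume pair, stay QUOTED
            else:
                in_quotes = not in_quotes
            row_has_data = True
        elif c in "\r\n" and not in_quotes:
            if c == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                   # CRLF is a single terminator
            rows += 1
            trailing_empty = 0 if row_has_data else trailing_empty + 1
            row_has_data = False
        else:
            row_has_data = True          # includes newlines inside quoted fields
        i += 1

    if row_has_data:                     # last row without a closing newline
        rows += 1
        trailing_empty = 0
    rows -= trailing_empty               # exclude trailing empty lines only

    return rows - 1 if has_header and rows > 0 else rows


sample = 'id,note\r\n1,"line one\nline two"\r\n2,"say ""hi"""\r\n\r\n'
print(count_rows(sample), count_rows(sample, has_header=True))  # 3 2
```

The sample has five physical line breaks but only three logical rows: one break sits inside a quoted field and one belongs to a trailing empty line.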

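Delimiter auto-detection scores each candidate delimiter d by computing the standard deviation σ of its occurrence counts across the first 5 lines. Candidates whose mean count is zero are discarded. Among the rest, the delimiter with the lowest σ wins, and a higher mean count is preferred when σ is equal, since consistent frequency implies structural use rather than incidental appearance in data. A perfectly regular file yields σ = 0 for its true delimiter.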
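A minimal sketch of this scoring is shown below. Several details are assumptions rather than documented behaviour: the function name `detect_delimiter`, the use of the population standard deviation, reading "non-zero" as a condition on the mean (as the FAQ on this page states it), and using the mean only as a tie-breaker. For simplicity it samples physical lines and also counts delimiters that appear inside quoted fields.

```python
from statistics import mean, pstdev
from typing import Optional

CANDIDATES = [",", ";", "\t", "|"]


def detect_delimiter(text: str, sample_lines: int = 5) -> Optional[str]:
    """Return the candidate with the most consistent per-line count, or None."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    best: Optional[str] = None
    best_key = None
    for d in CANDIDATES:
        counts = [ln.count(d) for ln in lines]
        if not counts or mean(counts) == 0:
            continue                      # never appears: not a candidate
        # Sort key: lowest sigma first, then highest mean as a tie-breaker.
        key = (pstdev(counts), -mean(counts))
        if best_key is None or key < best_key:
            best, best_key = d, key
    return best


# Semicolon appears twice on every line; comma only as a decimal separator.
print(repr(detect_delimiter("a;b;c\n1,5;2;3\n4;5,1;6\n")))  # ';'
```

When this returns `None`, or when two candidates score identically, a manual override is the safer choice.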

Reference Data

Delimiter	Common Name	Symbol	File Extension	Auto-Detected	Notes
Comma	CSV	,	.csv	Yes	RFC 4180 standard
Semicolon	CSV (European)	;	.csv	Yes	Common when locale uses comma as decimal separator
Tab	TSV	\t	.tsv, .tab	Yes	Rarely appears inside field values
Pipe	PSV	|	.psv, .txt	Yes	Used in medical (HL7) and financial data
Caret	Caret-SV	^	.txt	No	Rare; use manual override
Tilde	Tilde-SV	~	.txt	No	Legacy mainframe exports

RFC 4180 Key Rules

Rule	Description
Rule 1	Each record is on a separate line, delimited by a line break (CRLF)
Rule 2	Last record may or may not have an ending line break
Rule 3	Optional header line with same format as records
Rule 4	Fields may be enclosed in double quotes
Rule 5	Fields containing line breaks, double quotes, or commas must be quoted
Rule 6	Double quote inside a quoted field is escaped as ""

Common Row Count Discrepancies

Cause	Effect on naive count	This tool
Quoted newlines	Overcounts	Correct
Trailing empty line	+1 phantom row	Excluded
BOM prefix	First field corrupted	Stripped
Mixed line endings (CR/LF/CRLF)	Undercounts or overcounts	Normalized

Frequently Asked Questions

If this tool reports fewer rows than your text editor shows lines, your CSV likely contains quoted fields with embedded newline characters. Per RFC 4180, a field wrapped in double quotes may contain line breaks as literal data. A text editor counts physical lines (every LF or CRLF), while this tool counts logical rows by tracking whether the parser is inside a quoted field. The difference equals the number of embedded newlines within your data.

The tool samples the first 5 lines and counts occurrences of each candidate delimiter (comma, semicolon, tab, pipe). It selects the delimiter with the most consistent count across lines (lowest standard deviation) and a non-zero mean. Override manually when your file has fewer than 3 rows (insufficient sample), when multiple delimiters appear with equal frequency, or when using an uncommon delimiter like caret or tilde.

The header toggle affects the reported data row count. When it is enabled, the tool subtracts 1 from the total row count and reports the header separately. The total number of parsed rows (including the header) is always displayed alongside. This distinction matters when comparing against database record counts, which exclude headers.

The parser normalizes all line ending styles before counting. It treats standalone CR (classic Mac OS), standalone LF (Unix/macOS), and CRLF (Windows) identically as row terminators. A CRLF sequence is consumed as a single row terminator, not two. This prevents the double-counting that affects naive parsers on files transferred between operating systems.
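One common way to normalize line endings is a single regular-expression pass, sketched below. The helper name `normalize_newlines` is hypothetical, and whether this tool normalizes in a separate pass or inside its state machine is not stated here. Note that this simple version also rewrites line breaks inside quoted fields, which changes field contents but not the row count.

```python
import re


def normalize_newlines(text: str) -> str:
    """Rewrite CRLF and lone CR as LF so every row terminator looks the same."""
    # The CRLF alternative comes first so the pair collapses to one LF, not two.
    return re.sub(r"\r\n|\r", "\n", text)


mixed = "a,b\r\n1,2\r3,4\n5,6"
print(normalize_newlines(mixed).count("\n"))  # 3
```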

There is no hard limit on file size; the practical ceiling depends on your browser's available memory. Files under 1 MB are parsed on the main thread. Files over 1 MB are offloaded to a Web Worker to prevent the UI from freezing. For files exceeding roughly 500 MB, you may experience memory pressure. In such cases, consider splitting the file or using a command-line tool like awk.
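For files too large for the browser, one option is a short script that reads the file in fixed-size chunks so memory use stays flat. The sketch below is an illustration under stated assumptions, not this tool's implementation: the name `count_rows_streaming` and the default chunk size are hypothetical. Instead of lookahead it relies on quote parity: an escaped `""` toggles the state twice and leaves it unchanged, so no character needs to be inspected across a chunk boundary.

```python
import io


def count_rows_streaming(stream, chunk_size: int = 1 << 20) -> int:
    """Count logical CSV rows from a text stream without loading it whole."""
    rows = 0
    in_quotes = False       # flips on every double quote (parity trick)
    row_has_data = False
    trailing_empty = 0      # empty rows seen since the last non-empty row
    prev_cr = False         # last terminator was CR; a following LF is its pair

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for c in chunk:
            if prev_cr:
                prev_cr = False
                if c == "\n":
                    continue             # LF half of a CRLF already counted
            if c == '"':
                in_quotes = not in_quotes
                row_has_data = True
            elif c in "\r\n" and not in_quotes:
                rows += 1
                trailing_empty = 0 if row_has_data else trailing_empty + 1
                row_has_data = False
                prev_cr = c == "\r"
            else:
                row_has_data = True

    if row_has_data:                     # final row without a closing newline
        rows += 1
        trailing_empty = 0
    return rows - trailing_empty


data = 'a,b\r\n1,"x\r\ny"\r\n2,"q""r"\r\n\r\n'
print(count_rows_streaming(io.StringIO(data, newline=""), chunk_size=2))  # 3
```

With a real file, opening it as `open(path, newline="", encoding="utf-8-sig")` keeps the original line endings intact and drops a UTF-8 BOM if one is present.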

Per RFC 4180 (Rule 6 in the Reference Data table), a double quote inside a quoted field is represented as two consecutive double quotes (""). The parser's state machine recognizes this pattern: when it is in the QUOTED state and encounters a double quote followed by another double quote, it treats the pair as a literal quote character and remains in the QUOTED state. Only a double quote followed by a delimiter, newline, or end-of-file transitions back to the UNQUOTED state.