About

Processing large documents for LLM context windows, RAG pipelines, or embedding databases requires precise text segmentation. Incorrect chunk boundaries destroy semantic coherence. Chunks that are too large exceed token limits and get truncated silently. Chunks that are too small lose context and produce garbage retrieval results. This tool splits text by characters, words, sentences, or paragraphs with configurable overlap to preserve boundary context. Overlap creates a sliding window: each chunk advances by size − overlap units, ensuring no information falls into a gap between segments.

Token estimates use the 1 token ≈ 4 characters heuristic common to GPT-family tokenizers. This approximation breaks down for non-Latin scripts and code. For production pipelines, validate against your actual tokenizer. The tool assumes UTF-8 input and treats all whitespace as equivalent for word splitting.
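The heuristic above can be sketched in a few lines of Python (the function name and sample text are illustrative, not the tool's actual code):

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough GPT-family token estimate: ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)

# A 41-character sentence estimates to ceil(41 / 4) = 11 tokens.
print(estimate_tokens("Chunking preserves context at boundaries."))  # 11
```

Remember this is only an average for English prose; always cross-check against the real tokenizer before committing to chunk sizes.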


Formulas

The sliding window chunking algorithm advances by a stride computed as the difference between chunk size and overlap. Given input segmented into N units:

stride = size − overlap
chunk_i = units[i · stride : i · stride + size]

The total number of chunks produced:

n_chunks = ceil((N − overlap) / (size − overlap)) = ceil((N − overlap) / stride)

Token estimation for GPT-family models:

tokens ≈ ceil(characters / 4)

Where size is the number of units per chunk, overlap is the number of units shared between consecutive chunks, stride is the forward step between chunk starts, N is total units in the input, and units are characters, words, sentences, or paragraphs depending on the selected method.
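The formulas above can be sketched as a sliding-window chunker in Python (a minimal illustration, assuming word units; names are not the tool's source):

```python
def chunk_units(units: list, size: int, overlap: int) -> list:
    """Slide a window of `size` units forward by stride = size - overlap."""
    stride = size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for start in range(0, len(units), stride):
        chunks.append(units[start:start + size])
        if start + size >= len(units):  # last window covers the end of input
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
print(chunk_units(words, size=4, overlap=1))  # 3 chunks; neighbors share 1 word
```

The loop produces exactly ceil((N − overlap) / stride) chunks, matching the formula above, because each iteration advances the start index by the stride and stops once a window reaches the end of the input.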

Reference Data

Chunk Method | Unit | Best For | Typical Size | Overlap Recommendation | Token Estimate Accuracy
Characters | Unicode chars | Fixed-width constraints, byte budgets | 500 - 2000 | 50 - 200 chars | High for Latin text
Words | Whitespace tokens | Readable chunks, summaries | 100 - 500 | 10 - 50 words | Moderate
Sentences | Sentence boundaries | Semantic search, Q&A systems | 3 - 10 | 1 - 2 sentences | Variable
Paragraphs | Double newlines | Document summaries, long-form RAG | 1 - 5 | 1 paragraph | Variable
Custom Delimiter | User string | CSV rows, log entries, custom formats | Varies | Not applicable | Depends on content
Common LLM Context Windows
Model | Typical Use | Context Window
GPT-3.5 | General tasks | 4,096 tokens ≈ 16,384 chars
GPT-4 | Complex reasoning | 8,192 tokens ≈ 32,768 chars
GPT-4 Turbo | Large documents | 128,000 tokens ≈ 512,000 chars
Claude 3 | Long context | 200,000 tokens ≈ 800,000 chars
Gemini 1.5 | Ultra-long context | 1,000,000 tokens ≈ 4,000,000 chars
Llama 3 | Open-source | 8,192 tokens ≈ 32,768 chars
Mistral Large | Enterprise | 32,000 tokens ≈ 128,000 chars
Overlap Strategy Guidelines
Strategy | Overlap | Use Case | Trade-off
No overlap | 0% | Independent chunks, deduplication-safe | Risk: boundary information loss
Light overlap | 10% | General retrieval | Good balance of redundancy and coverage
Heavy overlap | 25 - 50% | Dense semantic search | Higher storage cost, better recall

Frequently Asked Questions

How does overlap affect chunk count and storage?
Overlap increases chunk count. With size = 100 words and overlap = 20 words, the stride is 80 words. A 1000-word document produces ceil(980 ÷ 80) = 13 chunks instead of 10 with zero overlap. Storage grows roughly by a factor of size ÷ stride. For embedding databases, this means more vectors to index and higher retrieval costs, but better recall at chunk boundaries.
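The arithmetic above can be checked directly (a sketch; variable names are illustrative):

```python
import math

size, overlap, n_words = 100, 20, 1000
stride = size - overlap                       # 80 words per step
n_chunks = math.ceil((n_words - overlap) / stride)
print(n_chunks)                               # 13 chunks with overlap
print(math.ceil(n_words / size))              # 10 chunks with zero overlap
print(size / stride)                          # 1.25x storage growth factor
```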
How accurate is the token estimate?
The 1 token ≈ 4 characters rule is a rough average for English prose with GPT-family BPE tokenizers. Code, URLs, non-Latin scripts, and rare words tokenize less efficiently (sometimes 1 token per 1 - 2 characters). Chinese and Japanese text averages roughly 1 token per 1.5 characters. Always validate with your model's actual tokenizer (e.g., tiktoken for OpenAI models) before setting production chunk sizes.
What happens if overlap is greater than or equal to chunk size?
If overlap ≥ size, the stride becomes zero or negative, which means the window never advances. This tool clamps overlap to size − 1 to guarantee forward progress. You will see a warning toast if your overlap value is adjusted.
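The clamping rule can be expressed in one line (a sketch of the described behavior, not the tool's source):

```python
def clamp_overlap(size: int, overlap: int) -> int:
    """Guarantee a positive stride by capping overlap at size - 1."""
    return min(overlap, size - 1)

print(clamp_overlap(100, 150))  # 99, so the stride is 1 and the window advances
print(clamp_overlap(100, 20))   # 20, valid values pass through unchanged
```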
Which chunking method is best for RAG?
Sentence-based chunking with 1 - 2 sentence overlap generally produces the most semantically coherent chunks for retrieval-augmented generation. Paragraph chunking works well for structured documents with clear section breaks. Character and word chunking are faster but may split mid-sentence, degrading retrieval quality. For best results, chunk by sentences with a target of 3 - 8 sentences per chunk.
How are sentence boundaries detected?
The sentence splitter uses a regex that detects boundaries after ., !, or ? followed by whitespace and an uppercase letter or end of string. This is a heuristic. It will incorrectly split on abbreviations like "Dr. Smith" or "U.S. Army". For production NLP pipelines with abbreviation-heavy text, preprocess with a dedicated sentence tokenizer (spaCy, NLTK) before pasting into this tool.
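A comparable heuristic in Python's re module looks like this (it mirrors the described behavior, not the tool's exact regex), and it reproduces the abbreviation mis-split:

```python
import re

# Split after ., !, or ? when followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

text = "Chunking is useful. Dr. Smith disagrees! Why? Ask anyone."
print(SENTENCE_BOUNDARY.split(text))
# Note: "Dr." is wrongly treated as a sentence end, as the FAQ warns.
```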
Can I split on a custom delimiter?
Yes. The custom delimiter field accepts any string, including multi-character sequences like "## " or "---" or "\n\n". The tool splits on exact string matches. Note that the delimiter itself is consumed during splitting and does not appear in the output chunks. If you need to preserve delimiters, add them back manually or use a regex-capable preprocessor.
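Exact-string splitting with a consumed delimiter behaves like Python's built-in str.split; a capturing regex split is one way to keep the delimiters (a sketch; the sample log string is illustrative):

```python
import re

log = "entry1---entry2---entry3"

chunks = log.split("---")            # delimiter consumed, as in the tool
print(chunks)                        # ['entry1', 'entry2', 'entry3']

# To preserve delimiters, split on a capturing group instead:
parts = re.split(r'(---)', log)
print(parts)                         # ['entry1', '---', 'entry2', '---', 'entry3']
```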