About

Processing large documents for LLM context windows, RAG pipelines, or embedding databases requires precise text segmentation. Incorrect chunk boundaries destroy semantic coherence. Chunks that are too large exceed token limits and get truncated silently. Chunks that are too small lose context and produce garbage retrieval results. This tool splits text by characters, words, sentences, or paragraphs with configurable overlap to preserve boundary context. Overlap creates a sliding window: each chunk advances by size − overlap units, ensuring no information falls into a gap between segments.

Token estimates use the 1 token ≈ 4 characters heuristic common to GPT-family tokenizers. This approximation breaks down for non-Latin scripts and code. For production pipelines, validate against your actual tokenizer. The tool assumes UTF-8 input and treats all whitespace as equivalent for word splitting.
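The heuristic above can be sketched in a few lines of Python (the function name and sample text are illustrative, not the tool's actual code):

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough GPT-family token estimate: ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)

# A 41-character sentence estimates to ceil(41 / 4) = 11 tokens.
print(estimate_tokens("Chunking preserves context at boundaries."))  # 11
```

Remember this is only an average for English prose; always cross-check against the real tokenizer before committing to chunk sizes.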


Formulas

The sliding window chunking algorithm advances by a stride computed as the difference between chunk size and overlap. Given input segmented into N units:

stride = size − overlap
chunk_i = units[i · stride : i · stride + size]

The total number of chunks produced:

n_chunks = ceil((N − overlap) / (size − overlap)) = ceil((N − overlap) / stride)

Token estimation for GPT-family models:

tokens ≈ ceil(characters / 4)

Where size is the number of units per chunk, overlap is the number of units shared between consecutive chunks, stride is the forward step between chunk starts, N is total units in the input, and units are characters, words, sentences, or paragraphs depending on the selected method.
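The formulas above can be sketched as a sliding-window chunker in Python (a minimal illustration, assuming word units; names are not the tool's source):

```python
def chunk_units(units: list, size: int, overlap: int) -> list:
    """Slide a window of `size` units forward by stride = size - overlap."""
    stride = size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for start in range(0, len(units), stride):
        chunks.append(units[start:start + size])
        if start + size >= len(units):  # last window covers the end of input
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
print(chunk_units(words, size=4, overlap=1))  # 3 chunks; neighbors share 1 word
```

The loop produces exactly ceil((N − overlap) / stride) chunks, matching the formula above, because each iteration advances the start index by the stride and stops once a window reaches the end of the input.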

Reference Data

Chunk Method | Unit | Best For | Typical Size | Overlap Recommendation | Token Estimate Accuracy
Characters | Unicode chars | Fixed-width constraints, byte budgets | 500 - 2000 | 50 - 200 chars | High for Latin text
Words | Whitespace tokens | Readable chunks, summaries | 100 - 500 | 10 - 50 words | Moderate
Sentences | Sentence boundaries | Semantic search, Q&A systems | 3 - 10 | 1 - 2 sentences | Variable
Paragraphs | Double newlines | Document summaries, long-form RAG | 1 - 5 | 1 paragraph | Variable
Custom Delimiter | User string | CSV rows, log entries, custom formats | Varies | Not applicable | Depends on content
Common LLM Context Windows
Model | Typical Use | Context Window
GPT-3.5 | General tasks | 4,096 tokens ≈ 16,384 chars
GPT-4 | Complex reasoning | 8,192 tokens ≈ 32,768 chars
GPT-4 Turbo | Large documents | 128,000 tokens ≈ 512,000 chars
Claude 3 | Long context | 200,000 tokens ≈ 800,000 chars
Gemini 1.5 | Ultra-long context | 1,000,000 tokens ≈ 4,000,000 chars
Llama 3 | Open-source | 8,192 tokens ≈ 32,768 chars
Mistral Large | Enterprise | 32,000 tokens ≈ 128,000 chars
Overlap Strategy Guidelines
Strategy | Overlap | Use Case | Trade-off
No overlap | 0% | Independent chunks, deduplication-safe | Risk: boundary information loss
Light overlap | 10% | General retrieval | Good balance of redundancy and coverage
Heavy overlap | 25 - 50% | Dense semantic search | Higher storage cost, better recall

Frequently Asked Questions

How does overlap affect chunk count and storage?
Overlap increases chunk count. With size = 100 words and overlap = 20 words, the stride is 80 words. A 1000-word document produces ceil(980 ÷ 80) = 13 chunks instead of 10 with zero overlap. Storage grows roughly by a factor of size ÷ stride. For embedding databases, this means more vectors to index and higher retrieval costs, but better recall at chunk boundaries.
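The arithmetic above can be checked directly (a sketch; variable names are illustrative):

```python
import math

size, overlap, n_words = 100, 20, 1000
stride = size - overlap                       # 80 words per step
n_chunks = math.ceil((n_words - overlap) / stride)
print(n_chunks)                               # 13 chunks with overlap
print(math.ceil(n_words / size))              # 10 chunks with zero overlap
print(size / stride)                          # 1.25x storage growth factor
```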
How accurate is the token estimate?
The 1 token ≈ 4 characters rule is a rough average for English prose with GPT-family BPE tokenizers. Code, URLs, non-Latin scripts, and rare words tokenize less efficiently (sometimes 1 token per 1 - 2 characters). Chinese and Japanese text averages roughly 1 token per 1.5 characters. Always validate with your model's actual tokenizer (e.g., tiktoken for OpenAI models) before setting production chunk sizes.
What happens if overlap is greater than or equal to chunk size?
If overlap ≥ size, the stride becomes zero or negative, which means the window never advances. This tool clamps overlap to size − 1 to guarantee forward progress. You will see a warning toast if your overlap value is adjusted.
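The clamping rule can be expressed in one line (a sketch of the described behavior, not the tool's source):

```python
def clamp_overlap(size: int, overlap: int) -> int:
    """Guarantee a positive stride by capping overlap at size - 1."""
    return min(overlap, size - 1)

print(clamp_overlap(100, 150))  # 99, so the stride is 1 and the window advances
print(clamp_overlap(100, 20))   # 20, valid values pass through unchanged
```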
Which chunking method is best for RAG?
Sentence-based chunking with 1 - 2 sentence overlap generally produces the most semantically coherent chunks for retrieval-augmented generation. Paragraph chunking works well for structured documents with clear section breaks. Character and word chunking are faster but may split mid-sentence, degrading retrieval quality. For best results, chunk by sentences with a target of 3 - 8 sentences per chunk.
How are sentence boundaries detected?
The sentence splitter uses a regex that detects boundaries after ., !, or ? followed by whitespace and an uppercase letter or end of string. This is a heuristic. It will incorrectly split on abbreviations like "Dr. Smith" or "U.S. Army". For production NLP pipelines with abbreviation-heavy text, preprocess with a dedicated sentence tokenizer (spaCy, NLTK) before pasting into this tool.
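A comparable heuristic in Python's re module looks like this (it mirrors the described behavior, not the tool's exact regex), and it reproduces the abbreviation mis-split:

```python
import re

# Split after ., !, or ? when followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

text = "Chunking is useful. Dr. Smith disagrees! Why? Ask anyone."
print(SENTENCE_BOUNDARY.split(text))
# Note: "Dr." is wrongly treated as a sentence end, as the FAQ warns.
```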
Can I split on a custom delimiter?
Yes. The custom delimiter field accepts any string, including multi-character sequences like "## " or "---" or "\n\n". The tool splits on exact string matches. Note that the delimiter itself is consumed during splitting and does not appear in the output chunks. If you need to preserve delimiters, add them back manually or use a regex-capable preprocessor.
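Exact-string splitting with a consumed delimiter behaves like Python's built-in str.split; a capturing regex split is one way to keep the delimiters (a sketch; the sample log string is illustrative):

```python
import re

log = "entry1---entry2---entry3"

chunks = log.split("---")            # delimiter consumed, as in the tool
print(chunks)                        # ['entry1', 'entry2', 'entry3']

# To preserve delimiters, split on a capturing group instead:
parts = re.split(r'(---)', log)
print(parts)                         # ['entry1', '---', 'entry2', '---', 'entry3']
```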