Chunkify Text
Split and chunk text by characters, words, sentences, or paragraphs with configurable overlap. Export chunks as JSON or TXT.
About
Processing large documents for LLM context windows, RAG pipelines, or embedding databases requires precise text segmentation. Incorrect chunk boundaries destroy semantic coherence. Chunks that are too large exceed token limits and get truncated silently. Chunks that are too small lose context and produce garbage retrieval results. This tool splits text by characters, words, sentences, or paragraphs with configurable overlap to preserve boundary context. Overlap creates a sliding window: each chunk advances by size − overlap units, ensuring no information falls into a gap between segments.
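The sliding-window behavior described above can be sketched in a few lines of Python, here using words as the unit (the function name and parameters are illustrative, not the tool's actual API):

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()      # all whitespace treated as equivalent
    stride = size - overlap   # forward step between chunk starts
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break             # this chunk already reached the end of the input
    return chunks
```

With size=4 and overlap=1, a 10-word input produces chunks starting at words 0, 3, and 6, so the last word of each chunk reappears as the first word of the next.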
Token estimates use the 1 token ≈ 4 characters heuristic common to GPT-family tokenizers. This approximation breaks down for non-Latin scripts and code. For production pipelines, validate against your actual tokenizer. The tool assumes UTF-8 input and treats all whitespace as equivalent for word splitting.
Formulas
The sliding-window chunking algorithm advances by a stride computed as the difference between chunk size and overlap:

stride = size − overlap

The total number of chunks produced from an input of N units (for N > size; a shorter input yields a single chunk):

chunks = ⌈(N − overlap) / stride⌉

Token estimation for GPT-family models:

tokens ≈ characters / 4

Where size is the number of units per chunk, overlap is the number of units shared between consecutive chunks, stride is the forward step between chunk starts, N is the total number of units in the input, and a unit is a character, word, sentence, or paragraph depending on the selected method.
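These formulas translate directly into code. A minimal sketch (helper names are illustrative):

```python
import math

def chunk_count(n_units, size, overlap):
    """chunks = ceil((N - overlap) / stride), where stride = size - overlap."""
    stride = size - overlap
    if n_units <= size:
        return 1  # input fits in a single chunk
    return math.ceil((n_units - overlap) / stride)

def estimate_tokens(char_count):
    """GPT-family heuristic: 1 token is roughly 4 characters."""
    return char_count / 4
```

For example, 10 units with size 4 and overlap 1 give a stride of 3 and ⌈9 / 3⌉ = 3 chunks.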
Reference Data
| Chunk Method | Unit | Best For | Typical Size | Overlap Recommendation | Token Estimate Accuracy |
|---|---|---|---|---|---|
| Characters | Unicode chars | Fixed-width constraints, byte budgets | 500 - 2000 | 50 - 200 chars | High for Latin text |
| Words | Whitespace tokens | Readable chunks, summaries | 100 - 500 | 10 - 50 words | Moderate |
| Sentences | Sentence boundaries | Semantic search, Q&A systems | 3 - 10 | 1 - 2 sentences | Variable |
| Paragraphs | Double newlines | Document summaries, long-form RAG | 1 - 5 | 1 paragraph | Variable |
| Custom Delimiter | User string | CSV rows, log entries, custom formats | Varies | Not applicable | Depends on content |
Common LLM Context Windows

| Model | Context Window | Best For | ≈ Characters (4 chars/token) |
|---|---|---|---|
| GPT-3.5 | 4,096 tokens | General tasks | ≈ 16,384 |
| GPT-4 | 8,192 tokens | Complex reasoning | ≈ 32,768 |
| GPT-4 Turbo | 128,000 tokens | Large documents | ≈ 512,000 |
| Claude 3 | 200,000 tokens | Long context | ≈ 800,000 |
| Gemini 1.5 | 1,000,000 tokens | Ultra-long context | ≈ 4,000,000 |
| Llama 3 | 8,192 tokens | Open-source | ≈ 32,768 |
| Mistral Large | 32,000 tokens | Enterprise | ≈ 128,000 |
Overlap Strategy Guidelines

| Strategy | Overlap | Best For | Trade-off |
|---|---|---|---|
| No overlap | 0% | Independent chunks, deduplication-safe pipelines | Risk of boundary information loss |
| Light overlap | 10% | General retrieval | Good balance of redundancy and coverage |
| Heavy overlap | 25 - 50% | Dense semantic search | Higher storage cost, better recall |
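Chunks are typically exported with per-chunk metadata for downstream indexing. A minimal JSON export sketch (the schema and field names below are illustrative, not the tool's actual output format):

```python
import json

def export_chunks_json(chunks, method="words", overlap=20):
    """Serialize chunks with per-chunk metadata as a JSON string."""
    payload = {
        "method": method,
        "overlap": overlap,
        "chunks": [
            {
                "index": i,
                "text": chunk,
                "char_count": len(chunk),
                "token_estimate": len(chunk) // 4,  # 1 token ≈ 4 chars heuristic
            }
            for i, chunk in enumerate(chunks)
        ],
    }
    return json.dumps(payload, indent=2)
```

Storing the index and token estimate alongside each chunk makes it easy to filter oversized chunks before sending them to an embedding model.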