# Chonkie: Intelligent Text Chunking for AI

A fast, efficient text chunking library for processing large documents in AI and NLP applications.
Chonkie is a high-performance Python library designed for intelligent text chunking and document segmentation. It's optimized for AI and NLP applications that need to process large documents efficiently while maintaining semantic coherence.
## 🚀 Key Features

### Smart Chunking Algorithms

- **Semantic Chunking**: Preserves meaning across chunk boundaries
- **Sliding Window**: Overlapping chunks for context preservation (see the sketch after this list)
- **Sentence-Aware**: Respects sentence boundaries
- **Token-Based**: Precise token count control for LLM inputs
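The sliding-window strategy is the easiest to picture. Here is a minimal, library-agnostic sketch of the idea in plain Python (the function name and defaults are illustrative, not Chonkie's API): each chunk starts `step` tokens after the previous one, so consecutive chunks share `window - step` tokens of context.

```python
# Minimal sketch of sliding-window chunking (illustrative, not Chonkie's API).
# Consecutive windows overlap by `window - step` tokens.
def sliding_window(tokens, window=512, step=462):
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

# With window=4 and step=2, adjacent chunks share two words:
words = "the quick brown fox jumps over the lazy dog".split()
print(sliding_window(words, window=4, step=2))
```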
### Performance Optimized

- **Fast Processing**: Optimized algorithms for large documents
- **Memory Efficient**: Minimal memory footprint
- **Parallel Processing**: Multi-threaded chunking for speed
- **Streaming Support**: Process documents without loading them entirely into memory
### Flexible Configuration

- **Customizable Chunk Sizes**: Adapt to different model requirements
- **Multiple Strategies**: Choose the best chunking method for your use case
- **Format Support**: Handle various document formats (PDF, TXT, MD, HTML)
- **Language Agnostic**: Works with multiple languages
## 💡 Use Cases

### AI & Machine Learning

- **RAG Systems**: Prepare documents for retrieval-augmented generation
- **Fine-tuning**: Create training datasets from large documents
- **Embeddings**: Generate embeddings for document chunks
- **Question Answering**: Segment documents for QA systems
### Document Processing

- **Content Analysis**: Break down documents for analysis
- **Search Indexing**: Create searchable document segments
- **Translation**: Chunk documents for translation workflows
- **Summarization**: Prepare content for summarization models
## 🛠 Installation & Usage

### Quick Installation

```bash
# Install via pip
pip install chonkie

# Or install from source
git clone https://github.com/chonkie-ai/chonkie.git
cd chonkie
pip install -e .
```
### Basic Usage

```python
from chonkie import TextChunker

# Initialize chunker
chunker = TextChunker(
    chunk_size=512,
    overlap=50,
    strategy='semantic'
)

# Chunk a document
with open('document.txt', 'r') as f:
    text = f.read()

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk.text)} characters")
    print(f"Tokens: {chunk.token_count}")
    print(f"Content: {chunk.text[:100]}...")
```
### Advanced Configuration

```python
from chonkie import SemanticChunker, SlidingWindowChunker

# Semantic chunking for better coherence
semantic_chunker = SemanticChunker(
    model='sentence-transformers/all-MiniLM-L6-v2',
    similarity_threshold=0.7,
    max_chunk_size=1000,
    min_chunk_size=100
)

# Sliding window for overlapping context
sliding_chunker = SlidingWindowChunker(
    window_size=512,
    step_size=256,
    preserve_sentences=True
)

# Process documents
chunks = semantic_chunker.chunk_document('research_paper.pdf')
```
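Threshold-based semantic chunking boils down to embedding adjacent units and splitting where similarity drops. The sketch below shows that idea with `sentence-transformers` directly (assumed installed); it illustrates what a `similarity_threshold` controls, not Chonkie's internals.

```python
from sentence_transformers import SentenceTransformer, util

def semantic_split(sentences, threshold=0.7):
    """Group consecutive sentences, starting a new chunk wherever
    adjacent-sentence cosine similarity falls below the threshold."""
    if not sentences:
        return []
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if util.cos_sim(prev, cur).item() < threshold:
            chunks.append(' '.join(current))  # similarity dropped: close chunk
            current = []
        current.append(sentence)
    chunks.append(' '.join(current))
    return chunks
```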
## 🌟 Advanced Features

### Multiple Chunking Strategies

```python
from chonkie import TokenChunker, ParagraphChunker, CustomChunker

# Token-based chunking
token_chunker = TokenChunker(
    tokenizer='gpt-4',
    max_tokens=2048,
    overlap_tokens=100
)

# Paragraph-based chunking
para_chunker = ParagraphChunker(
    max_paragraphs=5,
    preserve_structure=True
)

# Custom chunking logic
def custom_split_logic(text):
    # Your custom splitting logic here
    return text.split('\n\n')

custom_chunker = CustomChunker(split_function=custom_split_logic)
```
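For comparison, here is what token-based chunking looks like when done by hand with `tiktoken` (assumed installed; a rough, library-agnostic sketch, not `TokenChunker`'s implementation): encode once, slice the token ids with overlap, and decode each slice back to text.

```python
import tiktoken  # assumption: pip install tiktoken

def chunk_by_tokens(text, max_tokens=2048, overlap_tokens=100):
    """Slice a text into overlapping windows measured in tokens."""
    enc = tiktoken.get_encoding('cl100k_base')
    ids = enc.encode(text)
    step = max_tokens - overlap_tokens  # advance by less than a full window
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]
```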
### Batch Processing

```python
from chonkie import BatchProcessor

processor = BatchProcessor(
    chunker=semantic_chunker,
    batch_size=10,
    num_workers=4
)

# Process multiple documents
documents = ['doc1.txt', 'doc2.pdf', 'doc3.md']
results = processor.process_batch(documents)

for doc_name, chunks in results.items():
    print(f"{doc_name}: {len(chunks)} chunks created")
```
## 📊 Performance Benchmarks

### Speed Comparison
| Document Size | Chonkie | LangChain | Custom Script |
|---|---|---|---|
| 1MB | 0.5s | 2.1s | 1.8s |
| 10MB | 3.2s | 18.7s | 15.3s |
| 100MB | 28.1s | 185.4s | 142.7s |
### Memory Usage

- **Streaming Mode**: Constant memory usage regardless of document size (see the sketch below)
- **Batch Mode**: Memory scales linearly with batch size, with configurable limits
- **Optimization**: Roughly 60% less memory than comparable libraries
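The constant-memory claim follows from how streaming chunkers are typically built: read fixed-size blocks, emit chunks from a rolling buffer, and never hold the full document. A generator-based sketch of that pattern (illustrative, not Chonkie's streaming API):

```python
def stream_chunks(path, chunk_chars=2000, overlap=200):
    """Yield overlapping character chunks while holding only a small buffer."""
    buffer = ''
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            block = f.read(8192)  # memory is bounded by buffer + one block
            if not block:
                break
            buffer += block
            while len(buffer) >= chunk_chars:
                yield buffer[:chunk_chars]
                buffer = buffer[chunk_chars - overlap:]  # keep the overlap tail
    if buffer:
        yield buffer  # flush whatever remains
```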
## 🔧 Integration Examples

### With LangChain

```python
from chonkie import TextChunker
from langchain.text_splitter import ChonkieTextSplitter

# Use Chonkie with LangChain
splitter = ChonkieTextSplitter(
    chunker=TextChunker(chunk_size=1000, overlap=100)
)
documents = splitter.split_documents(docs)
```
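If `ChonkieTextSplitter` is not available in your LangChain version, a thin adapter is easy to write by subclassing LangChain's `TextSplitter` base class, which only requires `split_text` (a sketch; the `ChonkieSplitter` class name is ours):

```python
from langchain.text_splitter import TextSplitter

class ChonkieSplitter(TextSplitter):
    """Adapter exposing a Chonkie chunker through LangChain's splitter interface."""

    def __init__(self, chunker, **kwargs):
        super().__init__(**kwargs)
        self._chunker = chunker

    def split_text(self, text):
        # split_documents comes for free from the base class once this exists.
        return [chunk.text for chunk in self._chunker.chunk(text)]
```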
### With Haystack

```python
from chonkie import SemanticChunker
from haystack import Document

chunker = SemanticChunker()

def preprocess_documents(docs):
    processed = []
    for doc in docs:
        chunks = chunker.chunk(doc.content)
        for chunk in chunks:
            processed.append(Document(content=chunk.text))
    return processed
```
## 🤝 Contributing

Chonkie is open source and welcomes contributions!

### Development Setup

```bash
git clone https://github.com/chonkie-ai/chonkie.git
cd chonkie

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run benchmarks
python benchmarks/run_benchmarks.py
```
### Contribution Areas

- **New Chunking Strategies**: Implement novel chunking algorithms
- **Performance Optimization**: Improve speed and memory efficiency
- **Format Support**: Add support for new document formats
- **Integration**: Build connectors for popular frameworks
## 📈 Roadmap

- **GPU Acceleration**: CUDA support for faster processing
- **Cloud Integration**: Native cloud storage support
- **Advanced Semantics**: Better semantic understanding
- **Real-time Processing**: Streaming document processing
Chonkie makes intelligent text chunking accessible and efficient for AI applications at any scale.

Ready to optimize your text processing? Try Chonkie today!