Chonkie

A fast and efficient text chunking library for processing large documents in AI and NLP applications

Author: Chonkie Inc
Stars: 320
Language: Python
Updated: October 25, 2024
License: Apache-2.0

Chonkie: Intelligent Text Chunking for AI

Chonkie is a high-performance Python library designed for intelligent text chunking and document segmentation. It's optimized for AI and NLP applications that need to process large documents efficiently while maintaining semantic coherence.

🚀 Key Features

Smart Chunking Algorithms

  • Semantic Chunking: Preserves meaning across chunk boundaries
  • Sliding Window: Overlapping chunks for context preservation
  • Sentence-Aware: Respects sentence boundaries
  • Token-Based: Precise token count control for LLM inputs
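
The sliding-window idea above can be sketched in plain Python. This is an illustrative stand-in, not Chonkie's implementation: the function name `sliding_window_chunks` and its parameters are hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], window_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `window_size` that share `overlap` tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    step = window_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace "tokens" keep the example dependency-free.
words = "the quick brown fox jumps over the lazy dog".split()
for chunk in sliding_window_chunks(words, window_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last `overlap` tokens of the previous one, so text that straddles a boundary appears with context on both sides. A larger overlap preserves more context at the cost of more chunks to store and embed.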

Performance Optimized

  • Fast Processing: Optimized algorithms for large documents
  • Memory Efficient: Minimal memory footprint
  • Parallel Processing: Multi-threaded chunking for speed
  • Streaming Support: Process documents without loading them entirely into memory
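
As a rough illustration of what streaming support means in practice, the generator below reads a file-like object in small blocks and yields fixed-size chunks as soon as they are complete. It is a minimal sketch under stated assumptions: `stream_chunks`, `chunk_chars`, and `read_size` are hypothetical names, and chunks are cut by character count rather than by tokens or sentences.

```python
import io
from typing import Iterator, TextIO


def stream_chunks(handle: TextIO, chunk_chars: int = 1000, read_size: int = 4096) -> Iterator[str]:
    """Yield chunks of at most `chunk_chars` characters, reading the input incrementally."""
    if chunk_chars <= 0 or read_size <= 0:
        raise ValueError("chunk_chars and read_size must be positive")

    buffer = ""
    while True:
        block = handle.read(read_size)
        if not block:
            break  # end of input
        buffer += block
        # Emit every complete chunk currently sitting in the buffer.
        while len(buffer) >= chunk_chars:
            yield buffer[:chunk_chars]
            buffer = buffer[chunk_chars:]
    if buffer:
        yield buffer  # final, possibly shorter, chunk


# io.StringIO stands in for an open text file.
sample = io.StringIO("x" * 2500)
print([len(chunk) for chunk in stream_chunks(sample, chunk_chars=1000, read_size=512)])
```

The buffer never holds more than roughly `chunk_chars + read_size` characters, which is why memory use stays flat as the document grows.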

Flexible Configuration

  • Customizable Chunk Sizes: Adapt to different model requirements
  • Multiple Strategies: Choose the best chunking method for your use case
  • Format Support: Handle various document formats (PDF, TXT, MD, HTML)
  • Language Agnostic: Works with multiple languages

💡 Use Cases

AI & Machine Learning

  • RAG Systems: Prepare documents for retrieval-augmented generation
  • Fine-tuning: Create training datasets from large documents
  • Embeddings: Generate embeddings for document chunks
  • Question Answering: Segment documents for QA systems

Document Processing

  • Content Analysis: Break down documents for analysis
  • Search Indexing: Create searchable document segments
  • Translation: Chunk documents for translation workflows
  • Summarization: Prepare content for summarization models

🛠 Installation & Usage

Quick Installation

# Install via pip
pip install chonkie

# Or install from source
git clone https://github.com/chonkie-ai/chonkie.git
cd chonkie
pip install -e .

Basic Usage

from chonkie import TextChunker

# Initialize chunker
chunker = TextChunker(
    chunk_size=512,
    overlap=50,
    strategy='semantic'
)

# Chunk a document (explicit encoding avoids platform-dependent defaults)
with open('document.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk.text)} characters")
    print(f"Tokens: {chunk.token_count}")
    print(f"Content: {chunk.text[:100]}...")

Advanced Configuration

from chonkie import SemanticChunker, SlidingWindowChunker

# Semantic chunking for better coherence
semantic_chunker = SemanticChunker(
    model='sentence-transformers/all-MiniLM-L6-v2',
    similarity_threshold=0.7,
    max_chunk_size=1000,
    min_chunk_size=100
)

# Sliding window for overlapping context
sliding_chunker = SlidingWindowChunker(
    window_size=512,
    step_size=256,
    preserve_sentences=True
)

# Process documents
chunks = semantic_chunker.chunk_document('research_paper.pdf')
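
To make the role of a similarity threshold concrete, here is a dependency-free sketch of threshold-based grouping. It is not how Chonkie or any embedding model computes similarity: bag-of-words cosine similarity stands in for sentence embeddings, and `bag_of_words`, `cosine`, and `group_by_similarity` are hypothetical helper names.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(sentence: str) -> Counter:
    """Crude stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_by_similarity(sentences: List[str], threshold: float) -> List[List[str]]:
    """Start a new group whenever adjacent sentences fall below `threshold`."""
    groups: List[List[str]] = []
    for sentence in sentences:
        if groups and cosine(bag_of_words(groups[-1][-1]), bag_of_words(sentence)) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups


sentences = ["Cats chase mice.", "Cats chase birds.", "Stocks fell sharply today."]
for group in group_by_similarity(sentences, threshold=0.5):
    print(group)
```

Raising the threshold produces more, smaller groups; lowering it merges loosely related sentences. A real semantic chunker would also enforce minimum and maximum chunk sizes on top of this grouping.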

🌟 Advanced Features

Multiple Chunking Strategies

# Token-based chunking
from chonkie import TokenChunker

token_chunker = TokenChunker(
    tokenizer='gpt-4',
    max_tokens=2048,
    overlap_tokens=100
)

# Paragraph-based chunking
from chonkie import ParagraphChunker

para_chunker = ParagraphChunker(
    max_paragraphs=5,
    preserve_structure=True
)

# Custom chunking logic
from chonkie import CustomChunker

def custom_split_logic(text):
    # Your custom splitting logic here
    return text.split('\n\n')

custom_chunker = CustomChunker(split_function=custom_split_logic)
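
For comparison, paragraph-based grouping can be approximated in a few lines of standard-library Python. This is a sketch only: `group_paragraphs` is a hypothetical helper, and it treats one or more blank lines as the paragraph separator.

```python
import re
from typing import List


def group_paragraphs(text: str, max_paragraphs: int) -> List[str]:
    """Join consecutive paragraphs into chunks of at most `max_paragraphs` each."""
    if max_paragraphs <= 0:
        raise ValueError("max_paragraphs must be positive")

    # Split on blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + max_paragraphs])
        for i in range(0, len(paragraphs), max_paragraphs)
    ]


document = "Intro paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
for chunk in group_paragraphs(document, max_paragraphs=2):
    print(repr(chunk))
```

A function with this shape, taking text and returning a list of strings, is also the kind of callable a custom split function would need to be.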

Batch Processing

from chonkie import BatchProcessor

processor = BatchProcessor(
    chunker=semantic_chunker,
    batch_size=10,
    num_workers=4
)

# Process multiple documents
documents = ['doc1.txt', 'doc2.pdf', 'doc3.md']
results = processor.process_batch(documents)

for doc_name, chunks in results.items():
    print(f"{doc_name}: {len(chunks)} chunks created")
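
The worker-pool pattern behind batch processing can be sketched with the standard library alone. The names `chunk_many` and `split_on_blank_lines` are hypothetical, and documents are passed as in-memory strings to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def chunk_many(
    documents: Dict[str, str],
    chunk_fn: Callable[[str], List[str]],
    num_workers: int = 4,
) -> Dict[str, List[str]]:
    """Apply `chunk_fn` to every document using a pool of worker threads."""
    names = list(documents)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as its inputs.
        results = pool.map(chunk_fn, (documents[name] for name in names))
        return dict(zip(names, results))


def split_on_blank_lines(text: str) -> List[str]:
    """Toy chunker: one chunk per blank-line-separated block."""
    return [part for part in text.split("\n\n") if part]


docs = {"doc1.txt": "alpha\n\nbeta", "doc2.txt": "gamma"}
for name, chunks in chunk_many(docs, split_on_blank_lines, num_workers=2).items():
    print(f"{name}: {len(chunks)} chunks created")
```

Threads help most when the work is I/O-bound or releases the GIL; for CPU-bound chunking written in pure Python, `ProcessPoolExecutor` is usually the better fit.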

📊 Performance Benchmarks

Speed Comparison

Document Size   Chonkie   LangChain   Custom Script
1 MB            0.5 s     2.1 s       1.8 s
10 MB           3.2 s     18.7 s      15.3 s
100 MB          28.1 s    185.4 s     142.7 s
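
Timings like these depend heavily on hardware and input, so it is worth measuring on your own documents. The harness below is a minimal sketch, not the benchmark script used for the table: `time_chunker` and `fixed_size_chunks` are hypothetical names, and the toy chunker should be swapped for whichever chunker you want to measure.

```python
import time
from typing import Callable, List


def time_chunker(chunk_fn: Callable[[str], List[str]], text: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        chunk_fn(text)
        best = min(best, time.perf_counter() - start)
    return best


def fixed_size_chunks(text: str, size: int = 2000) -> List[str]:
    """Toy baseline chunker: fixed-width character slices."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Roughly 1 MB of synthetic ASCII text.
synthetic = "lorem ipsum dolor sit amet " * 40_000
elapsed = time_chunker(fixed_size_chunks, synthetic)
print(f"{len(synthetic) / 1e6:.2f} MB chunked in {elapsed:.4f} s")
```

Taking the best of several runs reduces noise from other processes; reporting the input size alongside the time keeps results comparable across runs.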

Memory Usage

  • Streaming Mode: Constant memory usage regardless of document size
  • Batch Mode: Linear scaling with configurable limits
  • Optimization: 60% less memory than comparable libraries
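
The difference between constant and linear memory scaling can be observed with Python's `tracemalloc` module. The sketch below does not measure Chonkie itself: `peak_memory`, `count_eager`, and `count_lazy` are hypothetical helpers that compare building every chunk up front against consuming chunks one at a time.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def peak_memory(fn: Callable[[], T]) -> Tuple[T, int]:
    """Run `fn` and return its result together with peak traced memory in bytes."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_eager(n: int) -> int:
    """Materialize all chunks in a list before counting them."""
    return len([("x" * 100) + str(i) for i in range(n)])


def count_lazy(n: int) -> int:
    """Produce chunks one at a time and discard each after counting."""
    return sum(1 for _ in (("x" * 100) + str(i) for i in range(n)))


_, eager_peak = peak_memory(lambda: count_eager(50_000))
_, lazy_peak = peak_memory(lambda: count_lazy(50_000))
print(f"eager peak: {eager_peak / 1e6:.2f} MB, lazy peak: {lazy_peak / 1e6:.2f} MB")
```

The eager version's peak grows with the number of chunks, while the lazy version's peak stays close to the size of a single chunk.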

🔧 Integration Examples

With LangChain

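from chonkie import TextChunker
from langchain_text_splitters import TextSplitter

# Adapter defined here (an assumption: LangChain does not ship this class)
class ChonkieTextSplitter(TextSplitter):
    def __init__(self, chunker, **kwargs):
        super().__init__(**kwargs)
        self._chunker = chunker

    def split_text(self, text):
        # Delegate splitting to the wrapped Chonkie chunker
        return [chunk.text for chunk in self._chunker.chunk(text)]

# Use Chonkie with LangChain
splitter = ChonkieTextSplitter(chunker=TextChunker(chunk_size=1000, overlap=100))

# docs: a list of LangChain Document objects loaded beforehand
documents = splitter.split_documents(docs)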

With Haystack

from chonkie import SemanticChunker
from haystack import Document

chunker = SemanticChunker()

def preprocess_documents(docs):
    processed = []
    for doc in docs:
        chunks = chunker.chunk(doc.content)
        for chunk in chunks:
            processed.append(Document(content=chunk.text))
    return processed

🤝 Contributing

Chonkie is open source and welcomes contributions!

Development Setup

git clone https://github.com/chonkie-ai/chonkie.git
cd chonkie

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run benchmarks
python benchmarks/run_benchmarks.py

Contribution Areas

  • New Chunking Strategies: Implement novel chunking algorithms
  • Performance Optimization: Improve speed and memory efficiency
  • Format Support: Add support for new document formats
  • Integration: Build connectors for popular frameworks

📈 Roadmap

  • GPU Acceleration: CUDA support for faster processing
  • Cloud Integration: Native cloud storage support
  • Advanced Semantics: Better semantic understanding
  • Real-time Processing: Streaming document processing

Chonkie makes intelligent text chunking accessible and efficient for AI applications of all scales.

Ready to optimize your text processing? Try Chonkie today!
