
The Art of Context Pruning: How to Send Less and Get More from LLMs
Here's a counterintuitive truth about working with LLMs:
Sending less context often produces better results.
Most developers do the opposite. They think: "More context = more information = better answers."
So they dump everything into the context window:
- Full conversation history
- Entire documentation
- All available tool descriptions
- Detailed examples
- Verbose instructions
Then they wonder why:
- Responses are less accurate
- Performance is degrading
- Costs are skyrocketing
- The model seems "confused"
The fix: Context pruning — the art of sending only what matters.
What Is Context Pruning?
Context pruning is the practice of selectively removing information from the context window to improve performance.
Think of it like editing:
- First draft: Everything you might need (bloated)
- Final draft: Only what serves the purpose (lean)
The goal: Maximum signal, minimum noise.
Why Less Context Works Better
Reason #1: Attention Is Limited
LLMs use "attention mechanisms" to decide which parts of the input are relevant.
Problem: The more information you provide, the more the model has to juggle.
Analogy: Asking someone to find a name in a 10-page phone book vs. a 1,000-page phone book. Same task, but one is much harder.
Real impact:
- Small context (2K tokens): Model focuses precisely
- Large context (50K tokens): Attention is diluted
Reason #2: Noise Obscures Signal
Irrelevant information isn't just ignored — it actively interferes.
Example:
Query: "What's our return policy for electronics?"
Bloated context (includes entire employee handbook):
- Company history
- Employee benefits
- Return policy ✓ (what we need)
- Shipping policies
- Office locations
- IT policies
Result: Model might blend information, mentioning employee-related return policies or other irrelevant details.
Pruned context (only customer-facing return policy):
- Return policy ✓
Result: Accurate, focused answer.
Reason #3: Cost Compounds
You pay for every token, every time.
Scenario: Customer support bot handling 10,000 queries/day
Bloated approach (5,000 tokens per query):
- 10,000 × 5,000 = 50,000,000 tokens/day
- At $0.003 per 1K tokens = $150/day = $4,500/month
Pruned approach (800 tokens per query):
- 10,000 × 800 = 8,000,000 tokens/day
- At $0.003 per 1K tokens = $24/day = $720/month
Savings: $3,780/month ($45K/year)
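If you want to sanity-check those numbers against your own traffic, a quick back-of-the-envelope calculator does it (the per-token price here is illustrative; plug in your provider's actual rates):
def monthly_cost(queries_per_day, tokens_per_query, price_per_1k=0.003, days=30):
    # Input-token cost only; output tokens are billed separately
    daily_tokens = queries_per_day * tokens_per_query
    daily_cost = daily_tokens / 1000 * price_per_1k
    return daily_cost * days

bloated = monthly_cost(10_000, 5_000)  # $4,500/month
pruned = monthly_cost(10_000, 800)     # $720/month
print(f"Monthly savings: ${bloated - pruned:,.0f}")  # $3,780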
Context Pruning Strategies
Strategy #1: Conversation History Pruning
Problem: Keeping entire conversation history in context.
After 50 turns, you're sending thousands of irrelevant tokens.
Approach: Sliding Window
Keep only the last N turns:
def get_context_for_turn(messages, window_size=10):
    # Keep only the last 10 exchanges
    recent_messages = messages[-window_size:]
    return recent_messages
Example:
Turn 1-40: Discussion about product features
Turn 41-50: Discussion about pricing
Turn 51: User asks about returns
Context sent:
→ Only turns 41-51 (last 10)
→ Early product discussion is irrelevant to returns
Approach: Semantic Filtering
Keep only messages relevant to the current topic:
def get_relevant_context(messages, current_query):
    # Calculate relevance of each past message to the current query
    relevant_messages = []
    for msg in messages:
        similarity = calculate_similarity(msg, current_query)
        if similarity > 0.7:  # Relevance threshold
            relevant_messages.append(msg)
    return relevant_messages[-10:]  # Cap at 10
Example:
Turn 5: User asked about returns
Turn 12: User asked about shipping
Turn 28: User asked about warranty
Turn 51: User asks about returns again
Context sent:
→ Turn 5 (previous return question - highly relevant)
→ Turn 28 (warranty relates to returns - somewhat relevant)
→ Skip turn 12 (shipping not relevant)
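The calculate_similarity helper above is deliberately abstract. A common way to implement it is cosine similarity over embeddings; here's a minimal sketch, assuming an embed() function from whatever embedding model you use:
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def calculate_similarity(message, query):
    # embed() is a placeholder for your embedding provider's API
    return cosine_similarity(embed(message), embed(query))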
Approach: Summarization
Summarize old context instead of sending verbatim:
def get_context_with_summary(messages, recent_count=5):
    recent = messages[-recent_count:]
    if len(messages) > recent_count:
        old_messages = messages[:-recent_count]
        summary = llm.summarize(old_messages)
        return [summary] + recent
    else:
        return messages
Example:
Original: 40 messages (8,000 tokens)
Pruned context:
→ Summary: "User was inquiring about product features,
discussed pricing for Pro tier, asked about bulk
discounts" (150 tokens)
→ Last 5 messages: Full detail (600 tokens)
→ Total: 750 tokens (90% reduction)
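The llm.summarize() call is also a stand-in. One way to implement it, assuming an OpenAI-style chat client (the model name and prompt here are just examples):
from openai import OpenAI

client = OpenAI()

def summarize_messages(old_messages):
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any inexpensive model works for summaries
        messages=[{
            "role": "user",
            "content": "Summarize this conversation in 2-3 sentences, "
                       "keeping decisions and open questions:\n\n" + transcript,
        }],
    )
    summary_text = response.choices[0].message.content
    return {"role": "system", "content": f"Summary of earlier conversation: {summary_text}"}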
Strategy #2: Document Retrieval Pruning
Problem: Sending entire documents when only excerpts are needed.
Approach: Chunk and Retrieve
- Break documents into chunks (500-1000 tokens each)
- Embed chunks in vector database
- Retrieve only top-k relevant chunks
def retrieve_relevant_context(query, k=3):
    # Get the top 3 most relevant chunks
    relevant_chunks = vector_db.similarity_search(
        query=query,
        top_k=k
    )
    return relevant_chunks
Example:
Full document: 50-page product manual (25,000 tokens)
Query: "How do I reset my password?"
Retrieved context:
- Chunk #47: "Password Reset" section (400 tokens)
- Chunk #23: "Account Security" section (350 tokens)
- Chunk #8: "Login Troubleshooting" section (300 tokens)
Total sent: 1,050 tokens (96% reduction)
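The chunking step itself can be simple: split on a fixed size with a little overlap so answers that straddle a boundary aren't cut in half. A rough sketch (word counts stand in for real token counts):
def chunk_document(text, chunk_size=800, overlap=100):
    # Sizes are in words here; use a real tokenizer in production
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
Each chunk then gets embedded and stored in the vector database so similarity_search can pull it back later.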
Approach: Hierarchical Retrieval
Start with summaries, drill down only if needed:
def hierarchical_retrieve(query):
    # Level 1: Check summaries
    summary_results = search_summaries(query)
    if summary_results.confidence > 0.9:
        return summary_results
    # Level 2: Drill into relevant sections
    section_results = search_sections(
        sections=summary_results.relevant_sections,
        query=query
    )
    return section_results
Example:
Query: "What's your return policy?"
Level 1: Search policy summaries
→ Found: "Returns accepted within 30 days" (high confidence)
→ Stop here, no need to drill down (200 tokens)
Query: "Can I return an opened electronic device?"
Level 1: Search policy summaries
→ Found: "Returns accepted within 30 days" (incomplete answer)
→ Level 2: Retrieve full electronics return policy
→ Total context: 800 tokens
Strategy #3: Tool Description Pruning
Problem: Loading all tool descriptions when only a few are needed.
Approach: Dynamic Tool Loading
Load only tools relevant to current task:
def get_relevant_tools(query, all_tools):
    # Classify intent
    intent = classify_intent(query)
    # Map intent to relevant tools
    tool_map = {
        "order_inquiry": ["lookup_order", "check_shipping"],
        "refund_request": ["lookup_order", "process_refund"],
        "product_question": ["search_products", "get_details"],
    }
    relevant_tool_names = tool_map.get(intent, [])
    relevant_tools = [t for t in all_tools if t.name in relevant_tool_names]
    return relevant_tools
Example:
All tools (30 tools): 3,500 tokens
Query: "Where's my order?"
Intent: order_inquiry
Relevant tools (2 tools):
- lookup_order (100 tokens)
- check_shipping (80 tokens)
- Total: 180 tokens (95% reduction)
Approach: Tiered Tool Access
Group tools by usage frequency:
# Tier 1: Always loaded (common tools)
core_tools = ["clarify_question", "format_response"]  # 150 tokens

# Tier 2: Load on demand (domain-specific)
domain_tools = {
    "orders": ["lookup_order", "track_shipping"],
    "refunds": ["process_refund", "check_eligibility"],
    # ...
}

# Tier 3: Rarely used (specialized)
specialized_tools = ["bulk_import", "generate_report"]  # Load only when explicitly needed
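Putting the tiers together, a loader might always include Tier 1, add the Tier 2 group that matches the detected domain, and pull in Tier 3 only when the query explicitly asks for it. A sketch built on the dictionaries above (classify_domain and the keyword check are placeholders):
def load_tools_for_query(query):
    tools = list(core_tools)                  # Tier 1: always present
    domain = classify_domain(query)           # e.g. "orders", "refunds"
    tools += domain_tools.get(domain, [])     # Tier 2: domain-specific
    if any(word in query.lower() for word in ("report", "bulk import")):
        tools += specialized_tools            # Tier 3: only when explicitly needed
    return tools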
Strategy #4: Example Pruning
Problem: Sending 10 examples when 2 would suffice.
Approach: Few-Shot to Zero-Shot Progression
Start with examples, then remove them once measured accuracy shows they're no longer needed:
def get_examples_for_query(query, model_performance):
    if model_performance.accuracy < 0.7:
        # Model struggles, send 5 examples
        return get_examples(count=5)
    elif model_performance.accuracy < 0.9:
        # Model is learning, send 2 examples
        return get_examples(count=2)
    else:
        # Model is good, no examples needed
        return []
Approach: Dynamic Example Selection
Choose examples similar to the current query:
def get_relevant_examples(query, example_bank):
    # Find examples similar to the current query
    similar_examples = []
    for example in example_bank:
        similarity = calculate_similarity(query, example.input)
        if similarity > 0.8:
            similar_examples.append((similarity, example))
    # Return the 2 most similar (sort on the score only, so ties don't try to compare example objects)
    similar_examples.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in similar_examples[:2]]
Strategy #5: Instruction Compression
Problem: Verbose instructions waste tokens.
Bad (verbose):
Please analyze the following customer feedback carefully and
thoroughly. Take into account the sentiment, key themes, and
any actionable insights. Make sure to structure your response
in a clear and organized manner. Be concise but comprehensive.
Good (compressed):
Analyze this feedback:
- Sentiment
- Key themes
- Actionable insights
Format: Concise bullet points.
Token count: 45 → 15 (67% reduction)
Approach: Create instruction templates:
templates = {
    "analyze_feedback": """
Analyze feedback:
- Sentiment: [positive/negative/neutral]
- Themes: [list]
- Actions: [list]
""",
    "summarize_conversation": """
Summarize:
- Main topic
- Key decisions
- Next steps
""",
}
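Using a template is then just a lookup plus the task input, so every request starts from the same tight instruction set:
def build_prompt(task, content):
    instructions = templates[task]
    return f"{instructions.strip()}\n\nInput:\n{content}"

prompt = build_prompt("analyze_feedback", "The app keeps crashing when I upload photos.")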
Dynamic Context Budgets
Advanced technique: Allocate a token budget and prioritize what fills it.
def build_context(query, max_tokens=4000):
    context = []
    tokens_used = 0

    # Priority 1: Always include system prompt (200 tokens)
    context.append(system_prompt)
    tokens_used += 200

    # Priority 2: Current query (variable)
    context.append(query)
    tokens_used += count_tokens(query)

    # Priority 3: Essential tools (300 tokens)
    essential_tools = get_essential_tools()
    context.append(essential_tools)
    tokens_used += 300

    # Priority 4: Recent conversation (budget: remaining / 2)
    conversation_budget = (max_tokens - tokens_used) // 2
    recent_messages = get_recent_messages(max_tokens=conversation_budget)
    context.append(recent_messages)
    tokens_used += count_tokens(recent_messages)

    # Priority 5: Retrieved documents (budget: remaining)
    retrieval_budget = max_tokens - tokens_used
    relevant_docs = retrieve_documents(query, max_tokens=retrieval_budget)
    context.append(relevant_docs)

    return context
Result: Context never exceeds budget, highest-priority info always included.
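The count_tokens() helper is assumed above. For OpenAI-family models, the tiktoken library gives exact counts; other providers ship their own tokenizers:
import tiktoken

_encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models

def count_tokens(text):
    return len(_encoding.encode(text))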
Pruning Heuristics
Rule #1: Recency Over Age
Recent information is usually more relevant than old information.
Exception: User explicitly references something from earlier ("Like you said before...")
Rule #2: Specificity Over Generality
Specific, detailed information beats generic summaries when answering specific questions.
Example: User asks about "electronics return policy" → Send electronics-specific policy, not general return policy.
Rule #3: Task-Relevant Over Comprehensive
Include only what's needed for the current task.
Example: Formatting a response doesn't need full conversation history.
Rule #4: Quality Over Quantity
3 highly relevant documents beat 10 moderately relevant documents.
Rule #5: Fast Over Perfect
Better to prune aggressively and miss 5% of relevant context than send everything and degrade performance by 20%.
Measuring Pruning Effectiveness
Track these metrics:
1. Context Size
avg_context_tokens = sum(context_sizes) / num_requests
Goal: Minimize without hurting accuracy
2. Cost Per Request
cost_per_request = (input_tokens * input_price + output_tokens * output_price)
Goal: Reduce by 50-80% through pruning
3. Response Accuracy
accuracy = correct_responses / total_responses
Goal: Maintain or improve (pruning should help, not hurt)
4. Response Time
avg_response_time = sum(response_times) / num_requests
Goal: Reduce (smaller context = faster processing)
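You don't need a framework to track these; logging four numbers per request and aggregating later is enough. A minimal sketch (prices and file path are illustrative):
import json
import time

def log_request(context_tokens, output_tokens, correct, started_at,
                input_price=0.003, output_price=0.006):
    record = {
        "context_tokens": context_tokens,
        "cost": context_tokens / 1000 * input_price + output_tokens / 1000 * output_price,
        "correct": correct,                   # from evals or spot-check review
        "latency_s": round(time.time() - started_at, 2),
    }
    with open("pruning_metrics.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")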
Common Mistakes
Mistake #1: Pruning Too Aggressively
Problem: Removing critical context breaks functionality.
Example: Pruning conversation history but user says "like I mentioned earlier..."
Solution: Use semantic analysis to detect references to prior context before pruning.
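A crude but effective guard: check whether the new message points back at earlier conversation before applying the window. A real version would use embeddings, but even a phrase list catches most cases (the phrases below are illustrative):
BACK_REFERENCES = ("as i mentioned", "like i said", "earlier", "you said before", "we discussed")

def safe_window(messages, new_message, window_size=10):
    if any(phrase in new_message.lower() for phrase in BACK_REFERENCES):
        return messages             # user is referencing history, keep it all
    return messages[-window_size:]  # otherwise apply the sliding window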
Mistake #2: Static Pruning Rules
Problem: Same pruning logic for all queries.
Example: Always keeping last 10 messages, even when current query references turn 15.
Solution: Dynamic pruning based on current query.
Mistake #3: Not Monitoring Impact
Problem: Prune context but don't measure if accuracy drops.
Solution: A/B test pruning strategies, track accuracy metrics.
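The A/B test can be as light as a stable bucket assignment so each conversation stays on one strategy, then comparing the logged metrics per bucket (a sketch, not a full experiment framework):
import hashlib

def choose_strategy(conversation_id):
    # Stable 50/50 split: the same conversation always gets the same strategy
    bucket = int(hashlib.md5(conversation_id.encode()).hexdigest(), 16) % 2
    return "pruned" if bucket == 0 else "full_context"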
Mistake #4: Ignoring Cost/Accuracy Tradeoff
Problem: Over-optimizing for cost, sacrificing too much accuracy.
Solution: Find the sweet spot (e.g., 60% cost reduction with 98% of accuracy maintained).
The Bottom Line
Context pruning is essential for production LLM systems.
Benefits:
- 50-80% cost reduction
- 20-40% faster responses
- 10-20% accuracy improvement (from reduced noise)
Core principle: Send only what matters for the current task.
Strategies:
- Prune conversation history (sliding window, semantic filtering, summarization)
- Retrieve relevant chunks (not full documents)
- Load tools dynamically (based on intent)
- Compress instructions
- Use dynamic context budgets
Start simple: Implement sliding window for conversation history, measure impact, iterate.
Getting Started
Quick wins (implement today):
- Limit conversation history to last 10 turns
- Compress system instructions (remove verbosity)
- Load only relevant tools (not all 30)
- Measure context size per request
Expected impact: 40-60% token reduction immediately
Need help optimizing your LLM context usage? We've helped teams reduce token costs by 70%+ while improving accuracy.
About the Author
DomAIn Labs Team
The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.