The Art of Context Pruning: How to Send Less and Get More from LLMs

DomAIn Labs Team
May 22, 2025
10 min read

Here's a counterintuitive truth about working with LLMs:

Sending less context often produces better results.

Most developers do the opposite. They think: "More context = more information = better answers."

So they dump everything into the context window:

  • Full conversation history
  • Entire documentation
  • All available tool descriptions
  • Detailed examples
  • Verbose instructions

Then they wonder why:

  • Responses are less accurate
  • Performance is degrading
  • Costs are skyrocketing
  • The model seems "confused"

The fix: Context pruning — the art of sending only what matters.

What Is Context Pruning?

Context pruning is selectively removing information from the context window to improve performance.

Think of it like editing:

  • First draft: Everything you might need (bloated)
  • Final draft: Only what serves the purpose (lean)

The goal: Maximum signal, minimum noise.

Why Less Context Works Better

Reason #1: Attention Is Limited

LLMs use "attention mechanisms" to decide which parts of the input are relevant.

Problem: The more information you provide, the more the model has to juggle.

Analogy: Asking someone to find a name in a 10-page phone book vs. a 1,000-page phone book. Same task, but one is much harder.

Real impact:

  • Small context (2K tokens): Model focuses precisely
  • Large context (50K tokens): Attention is diluted

Reason #2: Noise Obscures Signal

Irrelevant information isn't just ignored — it actively interferes.

Example:

Query: "What's our return policy for electronics?"

Bloated context (includes entire employee handbook):

- Company history
- Employee benefits
- Return policy ✓ (what we need)
- Shipping policies
- Office locations
- IT policies

Result: Model might blend information, mentioning employee-related return policies or other irrelevant details.

Pruned context (only customer-facing return policy):

- Return policy ✓

Result: Accurate, focused answer.

Reason #3: Cost Compounds

You pay for every token, every time.

Scenario: Customer support bot handling 10,000 queries/day

Bloated approach (5,000 tokens per query):

  • 10,000 × 5,000 = 50,000,000 tokens/day
  • At $0.003 per 1K tokens = $150/day = $4,500/month

Pruned approach (800 tokens per query):

  • 10,000 × 800 = 8,000,000 tokens/day
  • At $0.003 per 1K tokens = $24/day = $720/month

Savings: $3,780/month ($45K/year)
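
As a quick sanity check, a few lines of Python reproduce the math (the per-token price and query volume are the illustrative numbers above, not any specific provider's pricing):

# Reproduce the cost comparison above (illustrative numbers only)
QUERIES_PER_DAY = 10_000
PRICE_PER_1K_TOKENS = 0.003  # dollars

def monthly_cost(tokens_per_query):
    daily_tokens = QUERIES_PER_DAY * tokens_per_query
    daily_cost = daily_tokens / 1_000 * PRICE_PER_1K_TOKENS
    return daily_cost * 30

print(monthly_cost(5_000))  # 4500.0 -> bloated approach
print(monthly_cost(800))    #  720.0 -> pruned approach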

Context Pruning Strategies

Strategy #1: Conversation History Pruning

Problem: Keeping entire conversation history in context.

After 50 turns, you're sending thousands of irrelevant tokens.

Approach: Sliding Window

Keep only the last N turns:

def get_context_for_turn(messages, window_size=10):
    # Keep only the last window_size messages
    recent_messages = messages[-window_size:]
    return recent_messages

Example:

Turns 1-40: Discussion about product features
Turns 41-50: Discussion about pricing
Turn 51: User asks about returns

Context sent:
→ Only turns 42-51 (the last 10)
→ Early product discussion is irrelevant to returns

Approach: Semantic Filtering

Keep only messages relevant to the current topic:

def get_relevant_context(messages, current_query):
    # Calculate relevance of each past message to current query
    relevant_messages = []

    for msg in messages:
        similarity = calculate_similarity(msg, current_query)
        if similarity > 0.7:  # Threshold
            relevant_messages.append(msg)

    return relevant_messages[-10:]  # Cap at 10

Example:

Turn 5: User asked about returns
Turn 12: User asked about shipping
Turn 28: User asked about warranty
Turn 51: User asks about returns again

Context sent:
→ Turn 5 (previous return question - highly relevant)
→ Turn 28 (warranty relates to returns - somewhat relevant)
→ Skip turn 12 (shipping not relevant)

Approach: Summarization

Summarize old context instead of sending verbatim:

def get_context_with_summary(messages, recent_count=5):
    recent = messages[-recent_count:]

    if len(messages) > recent_count:
        old_messages = messages[:-recent_count]
        summary = llm.summarize(old_messages)
        return [summary] + recent
    else:
        return messages

Example:

Original: 40 messages (8,000 tokens)

Pruned context:
→ Summary: "User was inquiring about product features,
    discussed pricing for Pro tier, asked about bulk
    discounts" (150 tokens)
→ Last 5 messages: Full detail (600 tokens)
→ Total: 750 tokens (90% reduction)

Strategy #2: Document Retrieval Pruning

Problem: Sending entire documents when only excerpts are needed.

Approach: Chunk and Retrieve

  1. Break documents into chunks (500-1000 tokens each)
  2. Embed chunks in vector database
  3. Retrieve only top-k relevant chunks

def retrieve_relevant_context(query, k=3):
    # Get top 3 most relevant chunks
    relevant_chunks = vector_db.similarity_search(
        query=query,
        top_k=k
    )

    return relevant_chunks

Example:

Full document: 50-page product manual (25,000 tokens)

Query: "How do I reset my password?"

Retrieved context:

  • Chunk #47: "Password Reset" section (400 tokens)
  • Chunk #23: "Account Security" section (350 tokens)
  • Chunk #8: "Login Troubleshooting" section (300 tokens)

Total sent: 1,050 tokens (96% reduction)
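
Steps 1 and 2 (chunking and embedding) aren't shown above; a minimal sketch might look like this, where embed and vector_db.add are placeholders for your embedding model and vector store:

def index_document(text, chunk_size=800):
    # Step 1: split the document into roughly fixed-size chunks
    # (naive character-based split; production systems usually split on
    # headings or sentences and count tokens instead of characters)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Step 2: embed each chunk and store it in the vector database
    for chunk in chunks:
        vector = embed(chunk)                        # placeholder embedding call
        vector_db.add(text=chunk, embedding=vector)  # placeholder vector store API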

Approach: Hierarchical Retrieval

Start with summaries, drill down only if needed:

def hierarchical_retrieve(query):
    # Level 1: Check summaries
    summary_results = search_summaries(query)

    if summary_results.confidence > 0.9:
        return summary_results

    # Level 2: Drill into relevant sections
    section_results = search_sections(
        sections=summary_results.relevant_sections,
        query=query
    )

    return section_results

Example:

Query: "What's your return policy?"

Level 1: Search policy summaries
→ Found: "Returns accepted within 30 days" (high confidence)
→ Stop here, no need to drill down (200 tokens)

Query: "Can I return an opened electronic device?"

Level 1: Search policy summaries
→ Found: "Returns accepted within 30 days" (incomplete answer)
→ Level 2: Retrieve full electronics return policy
→ Total context: 800 tokens

Strategy #3: Tool Description Pruning

Problem: Loading all tool descriptions when only a few are needed.

Approach: Dynamic Tool Loading

Load only tools relevant to current task:

def get_relevant_tools(query, all_tools):
    # Classify intent
    intent = classify_intent(query)

    # Map intent to relevant tools
    tool_map = {
        "order_inquiry": ["lookup_order", "check_shipping"],
        "refund_request": ["lookup_order", "process_refund"],
        "product_question": ["search_products", "get_details"],
    }

    relevant_tool_names = tool_map.get(intent, [])
    relevant_tools = [t for t in all_tools if t.name in relevant_tool_names]

    return relevant_tools

Example:

All tools (30 tools): 3,500 tokens

Query: "Where's my order?"

Intent: order_inquiry

Relevant tools (2 tools):

  • lookup_order (100 tokens)
  • check_shipping (80 tokens)
  • Total: 180 tokens (95% reduction)

Approach: Tiered Tool Access

Group tools by usage frequency:

# Tier 1: Always loaded (common tools)
core_tools = ["clarify_question", "format_response"]  # 150 tokens

# Tier 2: Load on demand (domain-specific)
domain_tools = {
    "orders": ["lookup_order", "track_shipping"],
    "refunds": ["process_refund", "check_eligibility"],
    # ...
}

# Tier 3: Rarely used (specialized)
specialized_tools = ["bulk_import", "generate_report"]  # Load only when explicitly needed
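
At request time, the tiers might be combined like this sketch, where classify_domain is a placeholder classifier (analogous to classify_intent above):

def get_tools_for_query(query):
    # Tier 1: always include the core tools
    tools = list(core_tools)

    # Tier 2: add domain-specific tools based on a lightweight classifier
    domain = classify_domain(query)        # placeholder, e.g. "orders" or "refunds"
    tools += domain_tools.get(domain, [])

    # Tier 3: pull in specialized tools only when explicitly requested
    if "report" in query.lower() or "bulk import" in query.lower():
        tools += specialized_tools

    return tools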

Strategy #4: Example Pruning

Problem: Sending 10 examples when 2 would suffice.

Approach: Few-Shot to Zero-Shot Progression

Start with examples, remove them as the model learns your pattern:

def get_examples_for_query(query, model_performance):
    if model_performance.accuracy < 0.7:
        # Model struggles, send 5 examples
        return get_examples(count=5)
    elif model_performance.accuracy < 0.9:
        # Model is learning, send 2 examples
        return get_examples(count=2)
    else:
        # Model is good, no examples needed
        return []

Approach: Dynamic Example Selection

Choose examples similar to the current query:

def get_relevant_examples(query, example_bank):
    # Find examples similar to current query
    similar_examples = []

    for example in example_bank:
        similarity = calculate_similarity(query, example.input)
        if similarity > 0.8:
            similar_examples.append((similarity, example))

    # Return the top 2 most similar
    similar_examples.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in similar_examples[:2]]

Strategy #5: Instruction Compression

Problem: Verbose instructions waste tokens.

Bad (verbose):

Please analyze the following customer feedback carefully and
thoroughly. Take into account the sentiment, key themes, and
any actionable insights. Make sure to structure your response
in a clear and organized manner. Be concise but comprehensive.

Good (compressed):

Analyze this feedback:
- Sentiment
- Key themes
- Actionable insights

Format: Concise bullet points.

Tokens: 45 → 15 (67% reduction)

Approach: Create instruction templates:

templates = {
    "analyze_feedback": """
Analyze feedback:
- Sentiment: [positive/negative/neutral]
- Themes: [list]
- Actions: [list]
""",

    "summarize_conversation": """
Summarize:
- Main topic
- Key decisions
- Next steps
""",
}
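
Applying a template is then just a lookup plus the user's input; a minimal sketch:

def build_prompt(task, user_input):
    # Look up the compressed instruction template for this task
    instructions = templates[task]
    return f"{instructions}\nInput:\n{user_input}"

prompt = build_prompt("analyze_feedback", "Checkout kept timing out, but support was great.")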

Dynamic Context Budgets

Advanced technique: Allocate a token budget and prioritize what fills it.

def build_context(query, max_tokens=4000):
    context = []
    tokens_used = 0

    # Priority 1: Always include system prompt (200 tokens)
    context.append(system_prompt)
    tokens_used += 200

    # Priority 2: Current query (variable)
    context.append(query)
    tokens_used += count_tokens(query)

    # Priority 3: Essential tools (300 tokens)
    essential_tools = get_essential_tools()
    context.append(essential_tools)
    tokens_used += 300

    # Priority 4: Recent conversation (budget: remaining / 2)
    conversation_budget = (max_tokens - tokens_used) // 2
    recent_messages = get_recent_messages(max_tokens=conversation_budget)
    context.append(recent_messages)
    tokens_used += count_tokens(recent_messages)

    # Priority 5: Retrieved documents (budget: remaining)
    retrieval_budget = max_tokens - tokens_used
    relevant_docs = retrieve_documents(query, max_tokens=retrieval_budget)
    context.append(relevant_docs)

    return context

Result: Context never exceeds budget, highest-priority info always included.

Pruning Heuristics

Rule #1: Recency Over Age

Recent information is usually more relevant than old information.

Exception: User explicitly references something from earlier ("Like you said before...")

Rule #2: Specificity Over Generality

Specific, detailed information beats generic summaries when answering specific questions.

Example: User asks about "electronics return policy" → Send electronics-specific policy, not general return policy.

Rule #3: Task-Relevant Over Comprehensive

Include only what's needed for the current task.

Example: Formatting a response doesn't need full conversation history.

Rule #4: Quality Over Quantity

3 highly relevant documents beat 10 moderately relevant documents.

Rule #5: Fast Over Perfect

Better to prune aggressively and miss 5% of relevant context than send everything and degrade performance by 20%.

Measuring Pruning Effectiveness

Track these metrics:

1. Context Size

avg_context_tokens = sum(context_sizes) / num_requests

Goal: Minimize without hurting accuracy

2. Cost Per Request

cost_per_request = (input_tokens * input_price + output_tokens * output_price)

Goal: Reduce by 50-80% through pruning

3. Response Accuracy

accuracy = correct_responses / total_responses

Goal: Maintain or improve (pruning should help, not hurt)

4. Response Time

avg_response_time = sum(response_times) / num_requests

Goal: Reduce (smaller context = faster processing)
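
A minimal sketch for tracking all four metrics per request (the field names are illustrative, not a specific library):

from dataclasses import dataclass, field

@dataclass
class PruningMetrics:
    context_tokens: list = field(default_factory=list)
    costs: list = field(default_factory=list)
    response_times: list = field(default_factory=list)
    correct: int = 0
    total: int = 0

    def record(self, tokens, cost, was_correct, seconds):
        # Call once per request after the response has been evaluated
        self.context_tokens.append(tokens)
        self.costs.append(cost)
        self.response_times.append(seconds)
        self.correct += int(was_correct)
        self.total += 1

    def report(self):
        n = max(self.total, 1)
        return {
            "avg_context_tokens": sum(self.context_tokens) / n,
            "cost_per_request": sum(self.costs) / n,
            "accuracy": self.correct / n,
            "avg_response_time": sum(self.response_times) / n,
        }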

Common Mistakes

Mistake #1: Pruning Too Aggressively

Problem: Removing critical context breaks functionality.

Example: Pruning conversation history but user says "like I mentioned earlier..."

Solution: Use semantic analysis to detect references to prior context before pruning.
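
One simple way to do that check (a sketch; a production system might use embeddings or a small classifier instead of keyword patterns):

import re

# Phrases that suggest the user is pointing back to earlier context
BACKREFERENCE_PATTERNS = [
    r"\b(as|like) (i|you) (said|mentioned)\b",
    r"\bearlier\b",
    r"\bpreviously\b",
    r"\bthat (one|option|plan)\b",
]

def references_prior_context(query):
    query = query.lower()
    return any(re.search(p, query) for p in BACKREFERENCE_PATTERNS)

def prune_history(messages, query, window_size=10):
    # Skip aggressive pruning when the query points back to earlier turns
    if references_prior_context(query):
        return messages  # or fall back to summarizing the older turns
    return messages[-window_size:]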

Mistake #2: Static Pruning Rules

Problem: Same pruning logic for all queries.

Example: Always keeping last 10 messages, even when current query references turn 15.

Solution: Dynamic pruning based on current query.

Mistake #3: Not Monitoring Impact

Problem: Prune context but don't measure if accuracy drops.

Solution: A/B test pruning strategies, track accuracy metrics.
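
A minimal A/B harness might look like this sketch, where build_full_context, build_pruned_context, llm.generate, count_tokens, and evaluate are placeholders for your own pipeline:

import random

results = {"baseline": [], "pruned": []}

def handle_query(query):
    # Randomly assign each request to a context strategy
    strategy = random.choice(["baseline", "pruned"])
    if strategy == "baseline":
        context = build_full_context(query)
    else:
        context = build_pruned_context(query)

    response = llm.generate(context)  # same model call for both arms
    results[strategy].append({
        "context_tokens": count_tokens(context),
        "correct": evaluate(response),  # human label or automated check
    })
    return response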

Mistake #4: Ignoring Cost/Accuracy Tradeoff

Problem: Over-optimizing for cost, sacrificing too much accuracy.

Solution: Find the sweet spot (e.g., 60% cost reduction with 98% of accuracy maintained).

The Bottom Line

Context pruning is essential for production LLM systems.

Benefits:

  • 50-80% cost reduction
  • 20-40% faster responses
  • 10-20% accuracy improvement (from reduced noise)

Core principle: Send only what matters for the current task.

Strategies:

  • Prune conversation history (sliding window, semantic filtering, summarization)
  • Retrieve relevant chunks (not full documents)
  • Load tools dynamically (based on intent)
  • Compress instructions
  • Use dynamic context budgets

Start simple: Implement sliding window for conversation history, measure impact, iterate.

Getting Started

Quick wins (implement today):

  1. Limit conversation history to last 10 turns
  2. Compress system instructions (remove verbosity)
  3. Load only relevant tools (not all 30)
  4. Measure context size per request (see the sketch below)

Expected impact: 40-60% token reduction immediately
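
For quick win #4, a sketch using the tiktoken tokenizer (swap in whatever tokenizer matches your model):

import logging
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many OpenAI models

def count_tokens(text):
    return len(enc.encode(text))

def log_context_size(context):
    # Log how many tokens each request actually sends
    logging.info("context_tokens=%d", count_tokens(context))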

Need help optimizing your LLM context usage? We've helped teams reduce token costs by 70%+ while improving accuracy.

Get a context optimization audit →

Tags: Context Engineering, Optimization, Prompting, Performance

About the Author

DomAIn Labs Team

The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.