
When LLMs Hallucinate Your Workflow: Debugging Agent Chains Gone Rogue
You built an AI agent. It worked perfectly in testing.
Then you deployed it. And things got... weird:
- The agent calls tools in random order
- It invents tool names that don't exist
- It gets stuck in loops
- It returns confident answers that are completely wrong
- Sometimes it just stops mid-workflow
Welcome to the frustrating world of agent debugging.
Unlike traditional code where you get stack traces and clear error messages, debugging LLM agents means deciphering why an AI "decided" to do something completely unexpected.
Let me show you how to debug agent workflows effectively.
The Core Problem: Non-Deterministic Execution
Traditional code:
def process_order(order_id):
    order = lookup_order(order_id)   # Always step 1
    validate(order)                  # Always step 2
    charge_payment(order)            # Always step 3
    return "Success"                 # Always returns this
Debugging: If step 2 fails, you know exactly where and why.
AI agents:
agent.run("Process order #12345")
# Agent decides:
# → Maybe I should lookup the order?
# → Or should I validate the user first?
# → Actually, let me check inventory...
# → Wait, what was I doing again?
Debugging: Why did it check inventory before looking up the order? Who knows. The LLM made that choice.
Common Failure Patterns
Pattern #1: Tool Hallucination
What happens: Agent invents tools that don't exist.
Example:
Agent: I'll use the get_customer_lifetime_value tool
System: Error - tool not found
Agent: Let me try calculate_customer_worth instead
System: Error - tool not found
Agent: How about customer_value_estimator?
System: Error - tool not found
Why it happens:
- Tool descriptions are vague or incomplete
- Agent "reasons" that such a tool should exist
- LLM fills gaps with plausible-sounding names
How to debug:
from langchain.callbacks import StdOutCallbackHandler
# See what tools agent is attempting
agent.run("query", callbacks=[StdOutCallbackHandler()])
# Output shows:
# > Entering new AgentExecutor chain...
# Thought: I need to calculate customer value
# Action: get_customer_lifetime_value ← Hallucinated!
# Action Input: {"customer_id": "12345"}
# Observation: Error - tool not found
Fix:
- Make tool names explicit and descriptive
- Add examples of valid tools in the system prompt (see the second sketch below)
- Implement tool validation that suggests alternatives
import difflib

def validate_tool_call(tool_name, available_tools):
    if tool_name not in available_tools:
        # Suggest the closest matches so the agent can self-correct
        similar = difflib.get_close_matches(tool_name, list(available_tools), n=3)
        raise ValueError(f"Tool '{tool_name}' not found. Did you mean: {similar}?")
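For the second fix, a minimal sketch of spelling the real tools out in the system prompt (the wording is illustrative; tools is whatever tool list you already pass to the agent):

# Build a system-prompt section that lists exactly which tools exist
def render_tool_list(tools):
    lines = ["You may ONLY use these tools; any other tool name is invalid:"]
    for t in tools:
        lines.append(f"- {t.name}: {t.description}")
    lines.append("If none of these tools fit, say so instead of inventing one.")
    return "\n".join(lines)

system_prompt = render_tool_list(tools)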
Pattern #2: Infinite Loops
What happens: Agent calls the same tools repeatedly, never completing.
Example:
Turn 1: lookup_order(12345) → Returns order data
Turn 2: lookup_order(12345) → Returns same data
Turn 3: lookup_order(12345) → Returns same data again
...
Turn 50: [timeout]
Why it happens:
- Agent doesn't realize it already has the information
- Tool output isn't being properly added to context
- Agent doesn't know when to stop
How to debug:
import time

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Track tool calls
class ToolTracker(StreamingStdOutCallbackHandler):
    def __init__(self):
        super().__init__()
        self.tool_calls = []

    def on_tool_start(self, serialized, input_str, **kwargs):
        self.tool_calls.append({
            "tool": serialized.get("name"),
            "input": input_str,
            "timestamp": time.time(),
        })
        # Detect loops: the same tool called 5 times in a row
        if len(self.tool_calls) >= 5:
            recent = self.tool_calls[-5:]
            if all(t["tool"] == recent[0]["tool"] for t in recent):
                raise Exception(f"Loop detected: {recent[0]['tool']} called 5x in a row")

tracker = ToolTracker()
agent.run("query", callbacks=[tracker])
Fix:
- Add max iterations limit
- Track tool call history and detect repetition (see the second sketch below)
- Include "task completed" signal in tool outputs
from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=10,                # Hard limit
    early_stopping_method="generate"  # Stop gracefully
)
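One way to act on the repetition item above: a small wrapper that returns the cached result with an explicit reminder instead of re-running the tool. A sketch, assuming tool inputs are plain strings; dedup_tool and the cache are illustrative helpers, not LangChain APIs:

from langchain.tools import Tool

# Cache tool results and tell the agent when it repeats itself
_call_cache = {}

def dedup_tool(name, func):
    def wrapper(tool_input):
        key = (name, str(tool_input))
        if key in _call_cache:
            return (f"NOTE: you already called {name} with this input. "
                    f"Previous result: {_call_cache[key]}. "
                    "If you have what you need, give your final answer.")
        result = func(tool_input)
        _call_cache[key] = result
        return result
    return wrapper

lookup_order_tool = Tool(
    name="lookup_order",
    func=dedup_tool("lookup_order", lookup_order),
    description="Look up an order by its ID"
)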
Pattern #3: Context Confusion
What happens: Agent "forgets" information from earlier in the conversation.
Example:
User: "My order number is 12345"
Agent: "Got it! Order 12345..."
[5 turns later]
User: "Can you check the shipping status?"
Agent: "Sure! What's your order number?" ← Forgot!
Why it happens:
- Context window getting full
- Important info pushed out by verbose tool outputs
- Poor conversation history management
How to debug:
def debug_context(agent_executor):
    # Inspect what's actually in the agent's memory
    messages = agent_executor.memory.chat_memory.messages
    print("=== Current Context ===")
    print(f"Total tokens: {count_tokens(messages)}")
    print(f"Messages: {len(messages)}")
    for i, msg in enumerate(messages):
        print(f"\n[{i}] {msg.type}: {msg.content[:100]}...")

# Run this periodically during the conversation
debug_context(agent_executor)
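count_tokens above isn't built in; a minimal sketch using tiktoken (the cl100k_base encoding is an assumption that fits most recent OpenAI chat models; swap in your own tokenizer otherwise):

import tiktoken

def count_tokens(messages):
    # Approximate: encode each message's text and sum the token counts
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(m.content)) for m in messages)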
Fix:
- Implement conversation summarization
- Extract and persist key facts
- Prune verbose tool outputs (sketched after the memory snippet below)
from langchain.memory import ConversationSummaryBufferMemory

# Keeps recent turns verbatim and summarizes older ones once the token limit is hit
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory  # Older turns are summarized automatically
)
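For the third fix, a sketch of pruning verbose tool outputs before they land in the conversation history (the 500-character cutoff is arbitrary):

# Truncate long tool outputs so they don't crowd out earlier turns
def prune_output(text, max_chars=500):
    text = str(text)
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"

# Wrap each tool's func so observations are pruned before they enter memory,
# e.g. func=lambda order_id: prune_output(lookup_order(order_id))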
Pattern #4: Tool Sequencing Errors
What happens: Agent calls tools in wrong order, causing failures.
Example:
Agent: Let me process the refund first
[Calls process_refund before checking eligibility]
System: Error - cannot refund ineligible order
Agent: [confused] Let me try again
[Calls process_refund again]
System: Error - cannot refund ineligible order
Why it happens:
- No explicit dependencies between tools
- Agent doesn't understand prerequisites
- Tool descriptions don't mention requirements
How to debug:
# Add dependency tracking
class ToolWithDeps:
    def __init__(self, name, func, requires=None):
        self.name = name
        self.func = func
        self.requires = requires or []

    def can_execute(self, executed_tools):
        return all(req in executed_tools for req in self.requires)

# Track execution
executed_tools = set()

def execute_tool(tool, *args):
    if not tool.can_execute(executed_tools):
        missing = [r for r in tool.requires if r not in executed_tools]
        raise Exception(f"Cannot execute {tool.name}. Missing: {missing}")
    result = tool.func(*args)
    executed_tools.add(tool.name)
    return result
Fix:
- Add tool descriptions that mention prerequisites
- Implement validation that checks dependencies
- Use structured workflows (LangGraph) instead of fully autonomous agents
# Better tool description
Tool(
    name="process_refund",
    description="Process refund for eligible orders. REQUIRES: Must call check_refund_eligibility first.",
    func=process_refund
)
Pattern #5: Silent Failures
What happens: Tool fails, but agent continues as if it succeeded.
Example:
Agent: I'll look up your order
[Tool call fails silently - network error]
Agent: Your order status is "shipped" ← Made this up!
Why it happens:
- Poor error handling in tools
- Agent hallucinates responses when it doesn't get expected data
- No validation of tool outputs
How to debug:
import logging

from langchain.tools import Tool

logger = logging.getLogger(__name__)

# Wrap tools with error logging
def logged_tool(func):
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
            logger.info(f"{func.__name__} succeeded: {result}")
            return result
        except Exception as e:
            logger.error(f"{func.__name__} failed: {e}")
            raise  # Don't swallow errors
    return wrapper

# Apply to all tools
lookup_order_tool = Tool(
    name="lookup_order",
    func=logged_tool(lookup_order),
    description="..."
)
Fix:
- Never swallow exceptions in tools
- Return explicit error messages to agent
- Validate tool outputs before agent sees them
def safe_tool_execution(tool, *args):
    try:
        result = tool.func(*args)
        # Validate the result
        if result is None:
            return {"error": f"{tool.name} returned no data"}
        if isinstance(result, dict) and result.get("error"):
            return result  # Pass the error to the agent explicitly
        return {"success": True, "data": result}
    except Exception as e:
        return {
            "error": f"{tool.name} failed: {str(e)}",
            "success": False
        }
LangChain Debug Tools
Tool #1: Verbose Mode
Simplest debugging:
from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True  # Prints every step
)
agent_executor.run("What's the status of order 12345?")
# Output:
# > Entering new AgentExecutor chain...
# Thought: I need to look up the order
# Action: lookup_order
# Action Input: {"order_id": "12345"}
# Observation: Order 12345 status is "shipped"
# Thought: I now know the answer
# Final Answer: Your order 12345 has been shipped
# > Finished chain.
Benefit: See agent's reasoning at each step.
Tool #2: Callbacks
More control over debugging:
from langchain.callbacks import BaseCallbackHandler
class DebugCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print("\n=== LLM Called ===")
        print(f"Prompt: {prompts[0][:200]}...")

    def on_llm_end(self, response, **kwargs):
        print(f"Response: {response.generations[0][0].text[:200]}...")

    def on_tool_start(self, serialized, input_str, **kwargs):
        print(f"\n=== Tool: {serialized.get('name')} ===")
        print(f"Input: {input_str}")

    def on_tool_end(self, output, **kwargs):
        print(f"Output: {output[:200]}...")

    def on_agent_action(self, action, **kwargs):
        print("\n=== Agent Action ===")
        print(f"Tool: {action.tool}")
        print(f"Input: {action.tool_input}")

    def on_agent_finish(self, finish, **kwargs):
        print("\n=== Agent Finished ===")
        print(f"Output: {finish.return_values}")

# Use the callback
debug_callback = DebugCallback()
agent_executor.run("query", callbacks=[debug_callback])
Tool #3: LangSmith
Production debugging (requires LangSmith account):
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"
# Now all agent runs are automatically traced
agent_executor.run("query")
# View traces in LangSmith dashboard:
# - Full conversation history
# - Token usage per call
# - Latency per step
# - Error rates
# - Cost tracking
Benefit: Persistent traces you can analyze later, share with team.
Tool #4: Custom Logging
Log everything for analysis:
import json
import time

class AgentLogger:
    def __init__(self, log_file="agent_debug.jsonl"):
        self.log_file = log_file

    def log_run(self, query, result, metadata):
        log_entry = {
            "timestamp": time.time(),
            "query": query,
            "result": result,
            "metadata": metadata,
            "tool_calls": metadata.get("intermediate_steps", []),
            "token_usage": metadata.get("token_usage", {}),
            "duration": metadata.get("duration", 0),
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry, default=str) + "\n")

agent_logger = AgentLogger()

# After each run, pass in whatever metadata you collect yourself
# (intermediate steps, token counts, timings from your callbacks)
start = time.time()
result = agent_executor.run(query)
agent_logger.log_run(query, result, metadata={"duration": time.time() - start})
Benefit: Analyze patterns across many runs, identify systemic issues.
Debugging Checklist
When an agent misbehaves, work through this checklist:
1. Check Tool Definitions
- Are tool names descriptive?
- Do descriptions explain WHAT the tool does?
- Do descriptions mention prerequisites?
- Are parameter types specified clearly?
2. Inspect Context
- Print current context size (tokens)
- Check if important info is being pushed out
- Verify tool outputs are in context
- Look for redundant/verbose messages
3. Trace Execution
- Enable verbose mode
- Log all tool calls
- Track tool call sequence
- Measure latency per step
4. Validate Tool Outputs
- Check for None/empty returns
- Verify error handling
- Ensure outputs are JSON-parseable (if expected)
- Look for silent failures
5. Test Edge Cases (see the sketch after this checklist)
- Invalid inputs
- API failures
- Timeout scenarios
- Missing data
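A minimal sketch of one such edge-case test, runnable under pytest. build_agent stands in for however you construct your agent, and the assertions on the answer text are illustrative only:

from langchain.tools import Tool

def flaky_lookup_order(order_id):
    # Stub that simulates an upstream API failure
    raise TimeoutError("order service did not respond")

def test_agent_reports_tool_failure_instead_of_guessing():
    agent = build_agent(tools=[Tool(
        name="lookup_order",
        func=flaky_lookup_order,
        description="Look up an order by its ID"
    )])
    answer = agent.run("What's the status of order 12345?")
    # The agent should surface the failure, not invent a status
    assert "shipped" not in answer.lower()
    assert any(word in answer.lower() for word in ("unable", "error", "couldn't"))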
Prevention Strategies
Better than debugging: Design agents that fail gracefully.
Strategy #1: Constrained Workflows
Use LangGraph instead of fully autonomous agents:
from langgraph.graph import StateGraph, END

# Define an explicit workflow (State is your TypedDict describing the workflow state)
workflow = StateGraph(State)
workflow.add_node("validate", validate_node)
workflow.add_node("process", process_node)
workflow.set_entry_point("validate")
workflow.add_edge("validate", "process")
workflow.add_edge("process", END)

# Agent can't skip steps or invent new ones
app = workflow.compile()
Strategy #2: Tool Validation Layer
Validate before letting agent call tools:
def validate_tool_call(tool_name, tool_input, available_tools):
    if tool_name not in available_tools:
        raise ValueError(f"Invalid tool: {tool_name}")
    tool = available_tools[tool_name]
    # Validate the input schema (validate_input is your own check,
    # e.g. built on the tool's args_schema)
    if not tool.validate_input(tool_input):
        raise ValueError(f"Invalid input for {tool_name}")
    return True
Strategy #3: Semantic Monitoring
Monitor for nonsensical outputs:
def validate_response(response, expected_type):
    # Check for hallucination patterns
    hallucination_indicators = [
        "I apologize, but I don't have access to",
        "As an AI, I cannot",
        "[placeholder]",
        "[TODO]"
    ]
    for indicator in hallucination_indicators:
        if indicator.lower() in response.lower():
            raise ValueError(f"Hallucination detected: {indicator}")
    return True
Strategy #4: Fallback Mechanisms
Always have a backup plan:
def agent_with_fallback(query):
    try:
        return agent_executor.run(query)
    except Exception as e:
        logger.error(f"Agent failed: {e}")
        # Fallback: simple keyword-based response
        return fallback_handler(query)
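fallback_handler above is left undefined; a minimal keyword-based sketch (the keywords and canned replies are placeholders):

# Crude keyword routing, used only when the agent itself fails
def fallback_handler(query):
    q = query.lower()
    if "refund" in q:
        return "I couldn't process that automatically. A teammate will follow up about your refund."
    if "order" in q:
        return "I couldn't look that up right now. Please try again shortly with your order number."
    return "Something went wrong on our side. Please try again or contact support."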
The Bottom Line
Debugging LLM agents is hard because execution is non-deterministic.
Common failure patterns:
- Tool hallucination (inventing non-existent tools)
- Infinite loops (repeating same actions)
- Context confusion (forgetting info)
- Tool sequencing errors (wrong order)
- Silent failures (continuing after errors)
Debug tools:
- Verbose mode (see reasoning)
- Callbacks (track execution)
- LangSmith (production traces)
- Custom logging (analyze patterns)
Prevention:
- Use constrained workflows (LangGraph)
- Validate tool calls
- Monitor for hallucinations
- Implement fallbacks
Pro tip: Design for debuggability from day one. Logging is cheap, debugging production failures is expensive.
Getting Started
Quick debugging setup (< 30 min), with a starter sketch after the list:
- Enable verbose mode
- Add custom callback to log tool calls
- Track token usage per request
- Implement max iterations limit
- Add tool call validation
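Putting the first few items together, a minimal starter setup might look like this (assuming an OpenAI-backed agent; agent and tools are whatever you already have):

from langchain.agents import AgentExecutor
from langchain.callbacks import get_openai_callback

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,        # see every thought/action/observation
    max_iterations=10,   # hard stop on runaway loops
)

with get_openai_callback() as cb:  # token usage and cost per request
    result = agent_executor.run("What's the status of order 12345?")

print(f"Tokens: {cb.total_tokens}, cost: ${cb.total_cost:.4f}")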
Need help debugging a production agent issue? We've debugged hundreds of agent failures.
Related reading:
- LangChain debugging docs: https://python.langchain.com/docs/guides/debugging
- LangSmith tracing: https://smith.langchain.com
About the Author
DomAIn Labs Team
The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.