When LLMs Hallucinate Your Workflow: Debugging Agent Chains Gone Rogue

DomAIn Labs Team
August 30, 2025
11 min read

You built an AI agent. It worked perfectly in testing.

Then you deployed it. And things got... weird:

  • The agent calls tools in random order
  • It invents tool names that don't exist
  • It gets stuck in loops
  • It returns confident answers that are completely wrong
  • Sometimes it just stops mid-workflow

Welcome to the frustrating world of agent debugging.

Unlike traditional code where you get stack traces and clear error messages, debugging LLM agents means deciphering why an AI "decided" to do something completely unexpected.

Let me show you how to debug agent workflows effectively.

The Core Problem: Non-Deterministic Execution

Traditional code:

def process_order(order_id):
    order = lookup_order(order_id)  # Always step 1
    validate(order)                  # Always step 2
    charge_payment(order)             # Always step 3
    return "Success"                  # Always returns this

Debugging: If step 2 fails, you know exactly where and why.

AI agents:

agent.run("Process order #12345")
# Agent decides:
# → Maybe I should lookup the order?
# → Or should I validate the user first?
# → Actually, let me check inventory...
# → Wait, what was I doing again?

Debugging: Why did it check inventory before looking up the order? Who knows. The LLM made that choice.

Common Failure Patterns

Pattern #1: Tool Hallucination

What happens: Agent invents tools that don't exist.

Example:

Agent: I'll use the get_customer_lifetime_value tool
System: Error - tool not found
Agent: Let me try calculate_customer_worth instead
System: Error - tool not found
Agent: How about customer_value_estimator?
System: Error - tool not found

Why it happens:

  • Tool descriptions are vague or incomplete
  • Agent "reasons" that such a tool should exist
  • LLM fills gaps with plausible-sounding names

How to debug:

from langchain.callbacks import StdOutCallbackHandler

# See what tools agent is attempting
agent.run("query", callbacks=[StdOutCallbackHandler()])

# Output shows:
# > Entering new AgentExecutor chain...
# Thought: I need to calculate customer value
# Action: get_customer_lifetime_value  ← Hallucinated!
# Action Input: {"customer_id": "12345"}
# Observation: Error - tool not found

Fix:

  1. Make tool names explicit and descriptive
  2. Add examples of valid tools in system prompt
  3. Implement tool validation that suggests alternatives
import difflib

def validate_tool_call(tool_name, available_tools):
    if tool_name not in available_tools:
        # Suggest the closest matches so the agent can self-correct
        similar = difflib.get_close_matches(tool_name, list(available_tools), n=3)
        raise ValueError(f"Tool '{tool_name}' not found. Did you mean: {similar}?")

Pattern #2: Infinite Loops

What happens: Agent calls the same tools repeatedly, never completing.

Example:

Turn 1: lookup_order(12345) → Returns order data
Turn 2: lookup_order(12345) → Returns same data
Turn 3: lookup_order(12345) → Returns same data again
...
Turn 50: [timeout]

Why it happens:

  • Agent doesn't realize it already has the information
  • Tool output isn't being properly added to context
  • Agent doesn't know when to stop

How to debug:

import time

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Track tool calls
class ToolTracker(StreamingStdOutCallbackHandler):
    def __init__(self):
        super().__init__()
        self.tool_calls = []

    def on_tool_start(self, serialized, input_str, **kwargs):
        self.tool_calls.append({
            "tool": serialized.get("name"),
            "input": input_str,
            "timestamp": time.time()
        })

        # Detect loops: same tool called 5 times in a row
        if len(self.tool_calls) >= 5:
            recent = self.tool_calls[-5:]
            if all(t["tool"] == recent[0]["tool"] for t in recent):
                raise Exception(f"Loop detected: {recent[0]['tool']} called 5x in a row")

tracker = ToolTracker()
agent.run("query", callbacks=[tracker])

Fix:

  1. Add max iterations limit
  2. Track tool call history, detect repetition
  3. Include a "task completed" signal in tool outputs (see the sketch after the code below)
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=10,  # Hard limit
    early_stopping_method="generate"  # Stop gracefully
)
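
For the third fix, one lightweight option is to wrap tool functions so their output carries an explicit completion flag that your system prompt tells the agent to watch for. A minimal sketch, assuming plain Python tool functions like lookup_order and send_confirmation; the task_complete flag is a convention you define, not a LangChain feature:

def with_completion_signal(func, completes_task=False):
    """Wrap a tool so its output tells the agent whether the overall task is done."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return {
            "data": result,
            # Hypothetical flag: prompt the agent to stop and give its final
            # answer once it sees task_complete=True
            "task_complete": completes_task,
        }
    return wrapper

# Intermediate steps keep the agent going; the final step signals completion
lookup_order = with_completion_signal(lookup_order)
send_confirmation = with_completion_signal(send_confirmation, completes_task=True)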

Pattern #3: Context Confusion

What happens: Agent "forgets" information from earlier in the conversation.

Example:

User: "My order number is 12345"
Agent: "Got it! Order 12345..."
[5 turns later]
User: "Can you check the shipping status?"
Agent: "Sure! What's your order number?"  ← Forgot!

Why it happens:

  • Context window getting full
  • Important info pushed out by verbose tool outputs
  • Poor conversation history management

How to debug:

def debug_context(agent_executor):
    # Inspect what's actually in the conversation memory
    messages = agent_executor.memory.chat_memory.messages

    print("=== Current Context ===")
    print(f"Messages: {len(messages)}")
    # Rough size check; use a real tokenizer (e.g. tiktoken) for exact counts
    print(f"Approx. characters: {sum(len(m.content) for m in messages)}")

    for i, msg in enumerate(messages):
        print(f"\n[{i}] {msg.type}: {msg.content[:100]}...")

# Run this periodically during conversation
debug_context(agent_executor)

Fix:

  1. Implement conversation summarization
  2. Extract and persist key facts (see the sketch after the code below)
  3. Prune verbose tool outputs
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory  # Automatically summarizes old turns
)
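
For the second fix, durable facts (like an order number) can be captured as they appear and pinned back into every prompt so summarization or pruning can't drop them. A rough sketch, assuming you prepend the returned prefix to the agent's input yourself; the regex is illustrative:

import re

key_facts = {}

def remember_key_facts(user_message):
    # Capture anything that must never fall out of context (pattern is illustrative)
    match = re.search(r"order (?:number |#)?(\d+)", user_message, re.IGNORECASE)
    if match:
        key_facts["order_number"] = match.group(1)

def pinned_facts_prefix():
    # Prepend this to each turn so the facts survive summarization and pruning
    if not key_facts:
        return ""
    facts = "; ".join(f"{k}: {v}" for k, v in key_facts.items())
    return f"Known facts (keep using these): {facts}\n"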

Pattern #4: Tool Sequencing Errors

What happens: Agent calls tools in wrong order, causing failures.

Example:

Agent: Let me process the refund first
[Calls process_refund before checking eligibility]
System: Error - cannot refund ineligible order
Agent: [confused] Let me try again
[Calls process_refund again]
System: Error - cannot refund ineligible order

Why it happens:

  • No explicit dependencies between tools
  • Agent doesn't understand prerequisites
  • Tool descriptions don't mention requirements

How to debug:

# Add dependency tracking
class ToolWithDeps:
    def __init__(self, name, func, requires=None):
        self.name = name
        self.func = func
        self.requires = requires or []

    def can_execute(self, executed_tools):
        return all(req in executed_tools for req in self.requires)

# Track execution
executed_tools = set()

def execute_tool(tool, *args):
    if not tool.can_execute(executed_tools):
        missing = [r for r in tool.requires if r not in executed_tools]
        raise Exception(f"Cannot execute {tool.name}. Missing: {missing}")

    result = tool.func(*args)
    executed_tools.add(tool.name)
    return result

Fix:

  1. Add tool descriptions that mention prerequisites
  2. Implement validation that checks dependencies
  3. Use structured workflows (LangGraph) instead of fully autonomous agents
# Better tool description
Tool(
    name="process_refund",
    description="Process refund for eligible orders. REQUIRES: Must call check_refund_eligibility first.",
    func=process_refund
)

Pattern #5: Silent Failures

What happens: Tool fails, but agent continues as if it succeeded.

Example:

Agent: I'll look up your order
[Tool call fails silently - network error]
Agent: Your order status is "shipped"  ← Made this up!

Why it happens:

  • Poor error handling in tools
  • Agent hallucinates responses when it doesn't get expected data
  • No validation of tool outputs

How to debug:

import logging

from langchain.tools import Tool

logger = logging.getLogger(__name__)

# Wrap tools with error logging
def logged_tool(func):
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
            logger.info(f"{func.__name__} succeeded: {result}")
            return result
        except Exception as e:
            logger.error(f"{func.__name__} failed: {e}")
            raise  # Don't swallow errors

    return wrapper

# Apply to all tools
lookup_order_tool = Tool(
    name="lookup_order",
    func=logged_tool(lookup_order),
    description="..."
)

Fix:

  1. Never swallow exceptions in tools
  2. Return explicit error messages to agent
  3. Validate tool outputs before agent sees them
def safe_tool_execution(tool, *args):
    try:
        result = tool.func(*args)

        # Validate result
        if result is None:
            return {"error": f"{tool.name} returned no data"}

        if isinstance(result, dict) and result.get("error"):
            return result  # Pass error to agent explicitly

        return {"success": True, "data": result}

    except Exception as e:
        return {
            "error": f"{tool.name} failed: {str(e)}",
            "success": False
        }

LangChain Debug Tools

Tool #1: Verbose Mode

Simplest debugging:

from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True  # Prints every step
)

agent_executor.run("What's the status of order 12345?")

# Output:
# > Entering new AgentExecutor chain...
# Thought: I need to look up the order
# Action: lookup_order
# Action Input: {"order_id": "12345"}
# Observation: Order 12345 status is "shipped"
# Thought: I now know the answer
# Final Answer: Your order 12345 has been shipped
# > Finished chain.

Benefit: See agent's reasoning at each step.

Tool #2: Callbacks

More control over debugging:

from langchain.callbacks import BaseCallbackHandler

class DebugCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print(f"\n=== LLM Called ===")
        print(f"Prompt: {prompts[0][:200]}...")

    def on_llm_end(self, response, **kwargs):
        print(f"Response: {response.generations[0][0].text[:200]}...")

    def on_tool_start(self, serialized, input_str, **kwargs):
        print(f"\n=== Tool: {serialized.get('name')} ===")
        print(f"Input: {input_str}")

    def on_tool_end(self, output, **kwargs):
        print(f"Output: {output[:200]}...")

    def on_agent_action(self, action, **kwargs):
        print(f"\n=== Agent Action ===")
        print(f"Tool: {action.tool}")
        print(f"Input: {action.tool_input}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"\n=== Agent Finished ===")
        print(f"Output: {finish.return_values}")

# Use callback
debug_callback = DebugCallback()
agent_executor.run("query", callbacks=[debug_callback])

Tool #3: LangSmith

Production debugging (requires LangSmith account):

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"

# Now all agent runs are automatically traced
agent_executor.run("query")

# View traces in LangSmith dashboard:
# - Full conversation history
# - Token usage per call
# - Latency per step
# - Error rates
# - Cost tracking

Benefit: Persistent traces you can analyze later, share with team.

Tool #4: Custom Logging

Log everything for analysis:

import json
import time

class AgentLogger:
    def __init__(self, log_file="agent_debug.jsonl"):
        self.log_file = log_file

    def log_run(self, query, result, metadata):
        log_entry = {
            "timestamp": time.time(),
            "query": query,
            "result": result,
            "metadata": metadata,
            "tool_calls": metadata.get("intermediate_steps", []),
            "token_usage": metadata.get("token_usage", {}),
            "duration": metadata.get("duration", 0),
        }

        with open(self.log_file, "a") as f:
            # default=str keeps non-serializable objects (e.g. AgentAction) loggable
            f.write(json.dumps(log_entry, default=str) + "\n")

logger = AgentLogger()

# After each run (assumes return_intermediate_steps=True on the executor)
result = agent_executor.invoke({"input": query})
logger.log_run(query, result["output"], metadata={"intermediate_steps": result["intermediate_steps"]})

Benefit: Analyze patterns across many runs, identify systemic issues.

Debugging Checklist

When an agent misbehaves, work through this checklist:

1. Check Tool Definitions

  • Are tool names descriptive?
  • Do descriptions explain WHAT the tool does?
  • Do descriptions mention prerequisites?
  • Are parameter types specified clearly?
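
For comparison, a tool definition that satisfies this checklist might look like the following sketch (the schema class and wording are illustrative; depending on your LangChain version you may need the pydantic_v1 shim for the imports):

from langchain.tools import StructuredTool
from pydantic import BaseModel, Field

class RefundInput(BaseModel):
    order_id: str = Field(description="Numeric order ID, e.g. '12345'")

refund_tool = StructuredTool.from_function(
    func=process_refund,
    name="process_refund",
    description=(
        "Process a refund for a single order. "
        "REQUIRES: check_refund_eligibility must be called for this order first. "
        "Returns a refund confirmation ID on success."
    ),
    args_schema=RefundInput,
)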

2. Inspect Context

  • Print current context size (tokens)
  • Check if important info is being pushed out
  • Verify tool outputs are in context
  • Look for redundant/verbose messages
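
A quick way to check the context size is to count the tokens in memory directly, for example with tiktoken (the encoding name here is an assumption; pick the one that matches your model):

import tiktoken

def count_context_tokens(messages, encoding_name="cl100k_base"):
    # Rough per-message count; enough to spot a context window that's filling up
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(m.content)) for m in messages)

print(count_context_tokens(agent_executor.memory.chat_memory.messages))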

3. Trace Execution

  • Enable verbose mode
  • Log all tool calls
  • Track tool call sequence
  • Measure latency per step
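
For the last two points, a small callback can record the tool sequence and the latency of each step (a sketch, using the same callback hooks shown earlier):

import time

from langchain.callbacks import BaseCallbackHandler

class LatencyTracker(BaseCallbackHandler):
    def __init__(self):
        super().__init__()
        self.timings = []  # (tool_name, seconds), in call order
        self._current_tool = None
        self._start = None

    def on_tool_start(self, serialized, input_str, **kwargs):
        self._current_tool = serialized.get("name")
        self._start = time.time()

    def on_tool_end(self, output, **kwargs):
        self.timings.append((self._current_tool, time.time() - self._start))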

4. Validate Tool Outputs

  • Check for None/empty returns
  • Verify error handling
  • Ensure outputs are JSON-parseable (if expected)
  • Look for silent failures
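
The empty-output and JSON checks can be wrapped into a small helper that runs on every tool result before the agent sees it (a sketch; what counts as valid depends on your tools):

import json

def check_tool_output(tool_name, output, expect_json=False):
    if output is None or output == "":
        return f"WARNING: {tool_name} returned no data"
    if expect_json:
        try:
            json.loads(output)
        except (TypeError, json.JSONDecodeError):
            return f"WARNING: {tool_name} did not return valid JSON: {str(output)[:80]}"
    return "OK"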

5. Test Edge Cases

  • Invalid inputs
  • API failures
  • Timeout scenarios
  • Missing data
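
Capturing these cases as tests keeps regressions from reaching production. A pytest-style sketch (the queries and assertion are placeholders for your own cases):

import pytest

EDGE_CASES = [
    "Process order #not-a-number",     # invalid input
    "Check status of order 99999999",  # missing data
    "",                                # empty query
]

@pytest.mark.parametrize("query", EDGE_CASES)
def test_agent_fails_gracefully(query):
    # The agent should surface a clear error, never a confident made-up answer
    try:
        result = agent_executor.run(query)
    except Exception as e:
        result = str(e)
    assert "[placeholder]" not in str(result)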

Prevention Strategies

Better than debugging: Design agents that fail gracefully.

Strategy #1: Constrained Workflows

Use LangGraph instead of fully autonomous agents:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# Shared state the workflow nodes read and write (fields are illustrative)
class State(TypedDict):
    order_id: str
    validated: bool

# Define explicit workflow
workflow = StateGraph(State)
workflow.add_node("validate", validate_node)
workflow.add_node("process", process_node)
workflow.add_edge(START, "validate")
workflow.add_edge("validate", "process")
workflow.add_edge("process", END)

# Agent can't skip steps or invent new ones
app = workflow.compile()

Strategy #2: Tool Validation Layer

Validate before letting agent call tools:

def validate_tool_call(tool_name, tool_input, available_tools):
    if tool_name not in available_tools:
        raise ValueError(f"Invalid tool: {tool_name}")

    tool = available_tools[tool_name]

    # Validate the input against the tool's schema
    # (validate_input stands in for whatever check your tool wrapper exposes,
    #  e.g. validating against a Pydantic args_schema)
    if not tool.validate_input(tool_input):
        raise ValueError(f"Invalid input for {tool_name}")

    return True

Strategy #3: Semantic Monitoring

Monitor for nonsensical outputs:

def validate_response(response):
    # Check for refusal and placeholder patterns that signal a broken or made-up answer
    hallucination_indicators = [
        "I apologize, but I don't have access to",
        "As an AI, I cannot",
        "[placeholder]",
        "[TODO]"
    ]

    for indicator in hallucination_indicators:
        if indicator.lower() in response.lower():
            raise ValueError(f"Hallucination detected: {indicator}")

    return True

Strategy #4: Fallback Mechanisms

Always have a backup plan:

def agent_with_fallback(query):
    try:
        return agent_executor.run(query)
    except Exception as e:
        logger.error(f"Agent failed: {e}")

        # Fallback: Simple keyword-based response
        return fallback_handler(query)

The Bottom Line

Debugging LLM agents is hard because execution is non-deterministic.

Common failure patterns:

  • Tool hallucination (inventing non-existent tools)
  • Infinite loops (repeating same actions)
  • Context confusion (forgetting info)
  • Tool sequencing errors (wrong order)
  • Silent failures (continuing after errors)

Debug tools:

  • Verbose mode (see reasoning)
  • Callbacks (track execution)
  • LangSmith (production traces)
  • Custom logging (analyze patterns)

Prevention:

  • Use constrained workflows (LangGraph)
  • Validate tool calls
  • Monitor for hallucinations
  • Implement fallbacks

Pro tip: Design for debuggability from day one. Logging is cheap; debugging production failures is expensive.

Getting Started

Quick debugging setup (< 30 min):

  1. Enable verbose mode
  2. Add custom callback to log tool calls
  3. Track token usage per request (see the sketch below)
  4. Implement max iterations limit
  5. Add tool call validation
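
For step 3, LangChain's get_openai_callback reports token counts and estimated cost per request when you're calling an OpenAI model:

from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = agent_executor.run("What's the status of order 12345?")

print(f"Tokens: {cb.total_tokens} (prompt {cb.prompt_tokens}, completion {cb.completion_tokens})")
print(f"Estimated cost: ${cb.total_cost:.4f}")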

Need help debugging a production agent issue? We've debugged hundreds of agent failures.

Get agent debugging help →


Tags: Debugging, LangChain, Agents, Troubleshooting

About the Author

DomAIn Labs Team

The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.