PromptOps Is the New DevOps: Managing Token Budgets, Skill Graphs, and Logs

DomAIn Labs Team
May 8, 2025
9 min read

DevOps transformed how we deploy and manage traditional software. Build pipelines, monitoring, logging, infrastructure as code — all became standard practice.

Now we're seeing the same evolution for AI systems. But the primitives are different:

DevOps manages: Servers, databases, deployments
PromptOps manages: Prompts, tokens, model versions, LLM calls

Welcome to PromptOps — the operational discipline for AI systems.

What Is PromptOps?

PromptOps = Prompt Engineering + Operations

It's the practice of managing AI systems in production:

  • Version control for prompts
  • Token budget management
  • Monitoring LLM calls
  • Optimizing costs
  • Testing prompt changes
  • Managing skill/tool configurations
  • Analyzing performance

Why it's emerging now: Companies are moving from AI prototypes to production systems. And production AI has operational requirements that traditional DevOps doesn't cover.

The PromptOps Stack

Layer 1: Prompt Management

The problem: Prompts scattered across codebase, no version control, no testing.

PromptOps solution: Centralized prompt management.

Tools:

  • LangSmith: Prompt versioning and testing
  • Promptfoo: Prompt evaluation framework
  • Custom solutions: Prompt registry in your codebase

Example:

# prompts/customer_support.py

VERSION = "2.1.0"
LAST_UPDATED = "2025-05-01"
TESTED_ON = ["claude-3-5-sonnet", "gpt-4"]

SYSTEM_PROMPT = """
You are a customer support agent for {company_name}.

Guidelines:
- Be concise and helpful
- Always verify customer identity
- Escalate to human if unsure

Available tools: {tool_list}
"""

def get_prompt(company_name: str, tools: list) -> str:
    tool_list = ", ".join([t.name for t in tools])
    return SYSTEM_PROMPT.format(
        company_name=company_name,
        tool_list=tool_list
    )

Benefits:

  • Prompts are version-controlled
  • Changes can be reviewed/tested
  • Rollback is possible
  • Documentation is embedded

Layer 2: Token Budget Management

The problem: Token costs spiral out of control, no visibility into where tokens go.

PromptOps solution: Token budgets per request/user/feature.

Implementation:

class TokenBudgetManager:
    def __init__(self):
        self.budgets = {
            "customer_support": 5000,  # Max tokens per request
            "data_analysis": 15000,
            "content_generation": 8000,
        }

    def check_budget(self, feature: str, estimated_tokens: int):
        budget = self.budgets.get(feature)
        if budget is None:
            return  # no budget configured for this feature

        if estimated_tokens > budget:
            raise TokenBudgetExceeded(
                f"{feature} estimated {estimated_tokens} tokens, "
                f"budget is {budget}"
            )

    def track_usage(self, feature: str, actual_tokens: int):
        # Log to monitoring system
        metrics.gauge(f"token_usage.{feature}", actual_tokens)

        # Alert if approaching budget
        budget = self.budgets.get(feature)
        if budget is not None and actual_tokens > budget * 0.9:
            alert(f"{feature} used {actual_tokens}/{budget} tokens (90%)")

Monitoring:

# Track per feature
metrics.gauge("tokens.customer_support.input", input_tokens)
metrics.gauge("tokens.customer_support.output", output_tokens)

# Track per user
metrics.gauge(f"tokens.user.{user_id}", total_tokens)

# Track per model
metrics.gauge(f"tokens.model.{model_name}", total_tokens)

Layer 3: Skill/Tool Configuration Management

The problem: Agents have 30+ tools, no clear ownership, configuration drift.

PromptOps solution: Skill graphs with clear dependencies.

Skill registry:

# skills/registry.py

class SkillRegistry:
    def __init__(self):
        self.skills = {}

    def register(self, skill_class):
        skill = skill_class()

        self.skills[skill.name] = {
            "class": skill_class,
            "version": skill.version,
            "dependencies": skill.dependencies,
            "owner": skill.owner,
            "enabled": skill.enabled,
        }

    def get_skill_graph(self):
        """Generate dependency graph of skills"""
        graph = {}

        for name, config in self.skills.items():
            graph[name] = {
                "depends_on": config["dependencies"],
                "enabled": config["enabled"]
            }

        return graph

# Register skills
registry = SkillRegistry()
registry.register(OrderManagementSkill)
registry.register(RefundProcessingSkill)
registry.register(ProductCatalogSkill)

# Visualize skill graph
skill_graph = registry.get_skill_graph()
# Output:
# {
#   "order_management": {"depends_on": [], "enabled": True},
#   "refund_processing": {"depends_on": ["order_management"], "enabled": True},
#   "product_catalog": {"depends_on": [], "enabled": True}
# }

Configuration as code:

# config/skills.yaml

skills:
  order_management:
    version: "2.1.0"
    enabled: true
    max_tokens: 2000
    tools:
      - lookup_order
      - track_shipping
      - update_order

  refund_processing:
    version: "1.5.0"
    enabled: true
    max_tokens: 3000
    dependencies:
      - order_management
    tools:
      - check_eligibility
      - process_refund
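
Loading this file at startup keeps the running agent in sync with the config. A minimal sketch, assuming the PyYAML package; the dependency validation step is an illustration, not something the registry above requires:

# config/loader.py (hypothetical module)
import yaml  # requires PyYAML

def load_skill_config(path: str = "config/skills.yaml") -> dict:
    """Load skill configuration and check that dependencies exist."""
    with open(path) as f:
        config = yaml.safe_load(f)

    skills = config.get("skills", {})

    # Every declared dependency must refer to a configured skill
    for name, spec in skills.items():
        for dep in spec.get("dependencies", []):
            if dep not in skills:
                raise ValueError(f"{name} depends on unknown skill: {dep}")

    return skills

# Usage: toggle skills without touching code
skills = load_skill_config()
enabled = {name: spec for name, spec in skills.items() if spec.get("enabled", False)}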

Layer 4: LLM Call Logging & Tracing

The problem: LLM calls are black boxes, debugging is impossible.

PromptOps solution: Comprehensive logging of all LLM interactions.

What to log:

import json
import time

class LLMCallLogger:
    def log_call(self, **kwargs):
        log_entry = {
            # Request metadata
            "timestamp": time.time(),
            "request_id": kwargs["request_id"],
            "user_id": kwargs["user_id"],
            "feature": kwargs["feature"],

            # Model info
            "model": kwargs["model"],
            "temperature": kwargs.get("temperature", 0.7),

            # Input
            "prompt": kwargs["prompt"],
            "input_tokens": count_tokens(kwargs["prompt"]),

            # Output
            "response": kwargs["response"],
            "output_tokens": count_tokens(kwargs["response"]),

            # Performance
            "latency_ms": kwargs["latency_ms"],
            "cost": kwargs["cost"],

            # Tools (if applicable)
            "tools_available": kwargs.get("tools", []),
            "tools_called": kwargs.get("tools_called", []),

            # Outcome
            "success": kwargs["success"],
            "error": kwargs.get("error"),
        }

        # Send to logging system
        logger.info(json.dumps(log_entry))

        # Send to analytics
        analytics.track("llm_call", log_entry)

Distributed tracing:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def llm_call_with_tracing(prompt):
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("model", "claude-3-5-sonnet")
        span.set_attribute("input_tokens", count_tokens(prompt))

        response = llm.generate(prompt)

        span.set_attribute("output_tokens", count_tokens(response))
        span.set_attribute("success", True)

        return response

Layer 5: Cost Monitoring & Optimization

The problem: AI costs are opaque, optimization is guesswork.

PromptOps solution: Real-time cost tracking and alerts.

Cost tracking:

class CostTracker:
    def __init__(self):
        self.model_pricing = {
            "claude-3-5-sonnet": {
                "input": 0.003,   # per 1K tokens
                "output": 0.015   # per 1K tokens
            },
            "gpt-4": {
                "input": 0.03,
                "output": 0.06
            }
        }

    def calculate_cost(self, model, input_tokens, output_tokens):
        pricing = self.model_pricing[model]

        input_cost = (input_tokens / 1000) * pricing["input"]
        output_cost = (output_tokens / 1000) * pricing["output"]

        return input_cost + output_cost

    def track_daily_spend(self):
        today_spend = db.query("""
            SELECT SUM(cost) as total
            FROM llm_calls
            WHERE date = CURRENT_DATE
        """).total

        daily_budget = 1000  # $1000/day

        if today_spend > daily_budget * 0.9:
            alert(f"Daily AI spend: ${today_spend:.2f} / ${daily_budget} (90%)")

        metrics.gauge("ai_cost.daily", today_spend)

Cost dashboard:

Daily Cost: $847.32 / $1,000 (85%)

By Feature:
- customer_support: $423.15 (50%)
- data_analysis: $254.21 (30%)
- content_gen: $169.96 (20%)

By Model:
- claude-3-5-sonnet: $635.49 (75%)
- gpt-4: $211.83 (25%)

Top Cost Drivers:
1. User "enterprise_123": $127.44
2. Feature "bulk_analysis": $89.12
3. Prompt "detailed_summary": $67.89
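
The "By Feature" numbers above can come straight from the logged calls. A sketch, assuming the same db helper and llm_calls table used by CostTracker, with feature and cost columns written by the call logger:

def cost_breakdown_by_feature(db):
    """Aggregate today's spend per feature for the dashboard."""
    rows = db.query("""
        SELECT feature, SUM(cost) AS total
        FROM llm_calls
        WHERE date = CURRENT_DATE
        GROUP BY feature
        ORDER BY total DESC
    """)

    grand_total = sum(row.total for row in rows)
    return [
        {
            "feature": row.feature,
            "cost": round(row.total, 2),
            "share": row.total / grand_total if grand_total else 0.0,
        }
        for row in rows
    ]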

Layer 6: Testing & Evaluation

The problem: Prompt changes break things in production.

PromptOps solution: Automated prompt testing.

Test suite:

# tests/prompts/test_customer_support.py

import pytest
from prompts.customer_support import get_prompt

class TestCustomerSupportPrompt:
    def test_prompt_includes_company_name(self):
        prompt = get_prompt(company_name="Acme Corp", tools=[])
        assert "Acme Corp" in prompt

    def test_prompt_lists_tools(self):
        tools = [MockTool("lookup_order"), MockTool("track_shipping")]
        prompt = get_prompt(company_name="Acme", tools=tools)
        assert "lookup_order" in prompt
        assert "track_shipping" in prompt

    def test_model_response_quality(self):
        """Test actual model responses with this prompt"""
        prompt = get_prompt(company_name="Acme", tools=[])

        # Test with various inputs
        test_cases = [
            {
                "input": "Where's my order?",
                "expected_tool": "lookup_order",
                "should_not_contain": ["refund", "cancel"]
            },
            {
                "input": "I want a refund",
                "expected_tool": "process_refund",
                "should_ask_for": "order number"
            }
        ]

        for case in test_cases:
            response = llm.generate(prompt + "\n" + case["input"])
            assert case["expected_tool"] in response

            # Also check the negative and follow-up expectations defined above
            for phrase in case.get("should_not_contain", []):
                assert phrase not in response
            if "should_ask_for" in case:
                assert case["should_ask_for"] in response

Evaluation with LangSmith:

from langsmith import evaluate

# Define evaluators
def check_tool_selection(run, example):
    """Verify correct tool was called"""
    expected_tool = example.expected_tool
    actual_tools = run.outputs.get("tools_called", [])

    return expected_tool in actual_tools

# Run evaluation
results = evaluate(
    customer_support_agent,
    data="customer_support_test_set",
    evaluators=[check_tool_selection]
)

# Results:
# Test Set: 100 examples
# Accuracy: 94% (94/100)
# Failures: 6 cases where wrong tool was selected

PromptOps Workflows

Workflow #1: Prompt Change Process

1. Developer updates prompt in version control
2. Automated tests run (unit + integration)
3. Evaluation suite runs on test dataset
4. Code review (includes prompt review)
5. Deploy to staging
6. A/B test in staging (compare old vs new)
7. Monitor key metrics (accuracy, cost, latency)
8. Graduate to production, or roll back if metrics degrade (see the gating sketch below)
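
A minimal sketch of that gate, with illustrative metric names and thresholds; promote and rollback are hypothetical deployment helpers:

def should_promote(old: dict, new: dict) -> bool:
    """Step 8 gate: ship the new prompt version only if key metrics hold up."""
    if new["accuracy"] < old["accuracy"] - 0.02:                   # > 2-point accuracy drop
        return False
    if new["cost_per_request"] > old["cost_per_request"] * 1.10:   # > 10% cost increase
        return False
    if new["p95_latency_ms"] > old["p95_latency_ms"] * 1.20:       # > 20% latency increase
        return False
    return True

# After the staging A/B test (steps 6-7):
# promote("2.1.0") if should_promote(metrics["2.0.0"], metrics["2.1.0"]) else rollback()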

Workflow #2: Cost Optimization

1. Identify high-cost features (dashboard)
2. Analyze token usage (where are tokens going?)
3. Implement optimizations:
   - Prune context
   - Compress prompts
   - Use smaller models for simple tasks (see the routing sketch after this list)
   - Cache frequent responses
4. A/B test optimizations
5. Measure impact (cost ⬇️, accuracy maintained?)
6. Roll out if successful
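
For the "smaller models" optimization, a routing sketch; the task categories, token threshold, and model names are illustrative, and count_tokens is the same helper used in the logging examples:

def pick_model(task: str, prompt: str) -> str:
    """Route short, simple requests to a cheaper model."""
    SIMPLE_TASKS = {"classification", "extraction", "short_answer"}

    if task in SIMPLE_TASKS and count_tokens(prompt) < 1000:
        return "claude-3-5-haiku"   # cheaper model for simple work

    return "claude-3-5-sonnet"      # default for complex requests

A/B test the router against the single-model baseline (step 4) and confirm accuracy holds (step 5) before rolling it out.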

Workflow #3: Incident Response

User reports: "AI gave wrong answer"

1. Look up the request ID in the logs (see the lookup sketch after this list)
2. View full LLM call trace:
   - Input prompt
   - Output response
   - Tools called
   - Token usage
3. Reproduce issue in staging
4. Identify root cause:
   - Prompt ambiguity?
   - Wrong tool selected?
   - Context missing?
   - Model hallucination?
5. Implement fix
6. Test fix against incident case
7. Deploy fix
8. Add test case to prevent regression
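
Steps 1-3 fall straight out of the Layer 4 logs. A lookup sketch, assuming the llm_calls table written by LLMCallLogger and the same db helper as before; the column names and request ID are placeholders:

def get_call_trace(db, request_id: str):
    """Pull the full logged call for a reported request."""
    row = db.query("""
        SELECT prompt, response, tools_called, input_tokens,
               output_tokens, model, cost, success, error
        FROM llm_calls
        WHERE request_id = %s
    """, (request_id,))

    if row is None:
        raise ValueError(f"No logged call found for request {request_id}")

    return row

# Step 3: reproduce in staging by replaying the exact prompt
# trace = get_call_trace(db, "req_abc123")   # placeholder request ID
# staging_response = llm.generate(trace.prompt)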

PromptOps Metrics

Key metrics to track:

Metric                    | What It Measures        | Target
Token usage per request   | Context efficiency      | < 5K tokens
Cost per request          | $ per user interaction  | < $0.05
Response latency          | Time to first token     | < 2 seconds
Tool selection accuracy   | % correct tool chosen   | > 95%
Prompt success rate       | % requests completed    | > 98%
Daily AI spend            | Total cost per day      | < budget
Cost per feature          | $ by feature area       | Track trends
Model accuracy            | % correct responses     | > 90%
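
These targets can be checked automatically. A sketch, assuming a window_metrics dict computed from the logged calls over some rolling window; the metric keys and thresholds mirror the table and are assumptions, not a fixed schema:

def check_metric_targets(window_metrics: dict) -> list:
    """Compare rolling metrics against the targets above; return any violations."""
    targets = {
        "avg_tokens_per_request":  ("max", 5000),
        "avg_cost_per_request":    ("max", 0.05),
        "p95_latency_seconds":     ("max", 2.0),
        "tool_selection_accuracy": ("min", 0.95),
        "prompt_success_rate":     ("min", 0.98),
        "model_accuracy":          ("min", 0.90),
    }

    violations = []
    for name, (kind, target) in targets.items():
        value = window_metrics.get(name)
        if value is None:
            continue  # metric not computed for this window
        if (kind == "max" and value > target) or (kind == "min" and value < target):
            violations.append(f"{name} = {value} (target: {kind} {target})")

    return violations

# Hypothetical usage in a scheduled job:
# for violation in check_metric_targets(compute_window_metrics(hours=24)):
#     alert(violation)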

Tools & Platforms

PromptOps platforms:

  • LangSmith: Debugging, tracing, evaluation
  • LangFuse: Open-source LLM observability
  • Weights & Biases: Prompt tracking and evaluation
  • Helicone: LLM observability and caching
  • Custom: Build your own logging/monitoring

Integration example (LangSmith):

from langsmith import traceable

@traceable(run_type="llm", name="customer_support_agent")
def handle_customer_query(query, user_id):
    # Automatically traced in LangSmith
    response = agent.run(query)
    return response

# View in LangSmith dashboard:
# - Full trace of LLM calls
# - Token usage
# - Latency breakdown
# - Cost
# - User feedback

The Bottom Line

PromptOps is emerging because AI systems have operational requirements that traditional DevOps doesn't cover.

Core practices:

  • Version control for prompts
  • Token budget management
  • Comprehensive logging of LLM calls
  • Cost monitoring and optimization
  • Automated testing of prompt changes
  • Skill/tool configuration management

Tools: LangSmith, LangFuse, custom logging

Expected impact:

  • 50-70% cost reduction
  • 10-20% accuracy improvement
  • Faster debugging
  • Safer deployments

Start with:

  1. Log all LLM calls (request, response, tokens, cost)
  2. Track daily spend
  3. Version control your prompts
  4. Set token budgets per feature

PromptOps is the new DevOps. If you're running AI in production, you need it.

Getting Started

Week 1: Set up logging

  • Log every LLM call
  • Track tokens and cost
  • Set up basic dashboard

Week 2: Add monitoring

  • Daily cost tracking
  • Token budget alerts
  • Latency monitoring

Week 3: Version control prompts

  • Move prompts to version control
  • Add basic tests
  • Document prompt versions

Week 4: Implement evaluation

  • Create test dataset
  • Run evaluation suite
  • Track accuracy over time

Need help setting up PromptOps for your AI system? We've built production LLM infrastructure for dozens of companies.

Get PromptOps consultation →


Tags: PromptOps, LLMOps, Best Practices, Infrastructure

About the Author

DomAIn Labs Team

The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.