
PromptOps Is the New DevOps: Managing Token Budgets, Skill Graphs, and Logs
DevOps transformed how we deploy and manage traditional software. Build pipelines, monitoring, logging, infrastructure as code — all became standard practice.
Now we're seeing the same evolution for AI systems. But the primitives are different:
DevOps manages: servers, databases, deployments.
PromptOps manages: prompts, tokens, model versions, LLM calls.
Welcome to PromptOps — the operational discipline for AI systems.
What Is PromptOps?
PromptOps = Prompt Engineering + Operations
It's the practice of managing AI systems in production:
- Version control for prompts
- Token budget management
- Monitoring LLM calls
- Optimizing costs
- Testing prompt changes
- Managing skill/tool configurations
- Analyzing performance
Why it's emerging now: Companies are moving from AI prototypes to production systems. And production AI has operational requirements that traditional DevOps doesn't cover.
The PromptOps Stack
Layer 1: Prompt Management
The problem: Prompts scattered across codebase, no version control, no testing.
PromptOps solution: Centralized prompt management.
Tools:
- LangSmith: Prompt versioning and testing
- Promptfoo: Prompt evaluation framework
- Custom solutions: Prompt registry in your codebase
Example:
# prompts/customer_support.py
VERSION = "2.1.0"
LAST_UPDATED = "2025-05-01"
TESTED_ON = ["claude-3-5-sonnet", "gpt-4"]

SYSTEM_PROMPT = """
You are a customer support agent for {company_name}.

Guidelines:
- Be concise and helpful
- Always verify customer identity
- Escalate to human if unsure

Available tools: {tool_list}
"""

def get_prompt(company_name: str, tools: list) -> str:
    tool_list = ", ".join([t.name for t in tools])
    return SYSTEM_PROMPT.format(
        company_name=company_name,
        tool_list=tool_list,
    )
Benefits:
- Prompts are version-controlled
- Changes can be reviewed/tested
- Rollback is possible
- Documentation is embedded
Layer 2: Token Budget Management
The problem: Token costs spiral out of control, no visibility into where tokens go.
PromptOps solution: Token budgets per request/user/feature.
Implementation:
class TokenBudgetManager:
    def __init__(self):
        # Max tokens allowed per request, per feature
        self.budgets = {
            "customer_support": 5000,
            "data_analysis": 15000,
            "content_generation": 8000,
        }

    def check_budget(self, feature: str, estimated_tokens: int):
        budget = self.budgets.get(feature)
        if budget is not None and estimated_tokens > budget:
            raise TokenBudgetExceeded(
                f"{feature} estimated {estimated_tokens} tokens, "
                f"budget is {budget}"
            )

    def track_usage(self, feature: str, actual_tokens: int):
        # Log to monitoring system
        metrics.gauge(f"token_usage.{feature}", actual_tokens)

        # Alert if approaching budget
        budget = self.budgets.get(feature)
        if budget is not None and actual_tokens > budget * 0.9:
            alert(f"{feature} used {actual_tokens}/{budget} tokens (90%)")
Monitoring:
# Track per feature
metrics.gauge("tokens.customer_support.input", input_tokens)
metrics.gauge("tokens.customer_support.output", output_tokens)
# Track per user
metrics.gauge(f"tokens.user.{user_id}", total_tokens)
# Track per model
metrics.gauge(f"tokens.model.{model_name}", total_tokens)
Layer 3: Skill/Tool Configuration Management
The problem: Agents have 30+ tools, no clear ownership, configuration drift.
PromptOps solution: Skill graphs with clear dependencies.
Skill registry:
# skills/registry.py
class SkillRegistry:
    def __init__(self):
        self.skills = {}

    def register(self, skill_class):
        skill = skill_class()
        self.skills[skill.name] = {
            "class": skill_class,
            "version": skill.version,
            "dependencies": skill.dependencies,
            "owner": skill.owner,
            "enabled": skill.enabled,
        }

    def get_skill_graph(self):
        """Generate a dependency graph of skills"""
        graph = {}
        for name, config in self.skills.items():
            graph[name] = {
                "depends_on": config["dependencies"],
                "enabled": config["enabled"],
            }
        return graph

# Register skills
registry = SkillRegistry()
registry.register(OrderManagementSkill)
registry.register(RefundProcessingSkill)
registry.register(ProductCatalogSkill)

# Visualize skill graph
skill_graph = registry.get_skill_graph()

# Output:
# {
#     "order_management": {"depends_on": [], "enabled": True},
#     "refund_processing": {"depends_on": ["order_management"], "enabled": True},
#     "product_catalog": {"depends_on": [], "enabled": True}
# }
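Once you have the graph, validate it on startup; missing dependencies and circular references are the usual symptoms of configuration drift. A minimal sketch that works against the graph shape shown above:

def validate_skill_graph(graph: dict) -> None:
    """Fail fast if a skill depends on something unregistered or circular."""
    for name, cfg in graph.items():
        for dep in cfg["depends_on"]:
            if dep not in graph:
                raise ValueError(f"{name} depends on unknown skill: {dep}")

    # Detect cycles with a depth-first search
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"Circular skill dependency involving: {node}")
        visiting.add(node)
        for dep in graph[node]["depends_on"]:
            visit(dep)
        visiting.remove(node)
        done.add(node)

    for node in graph:
        visit(node)

validate_skill_graph(skill_graph)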
Configuration as code:
# config/skills.yaml
skills:
  order_management:
    version: "2.1.0"
    enabled: true
    max_tokens: 2000
    tools:
      - lookup_order
      - track_shipping
      - update_order

  refund_processing:
    version: "1.5.0"
    enabled: true
    max_tokens: 3000
    dependencies:
      - order_management
    tools:
      - check_eligibility
      - process_refund
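To keep the YAML and the registry from drifting apart, apply the file at startup. A minimal sketch assuming PyYAML and the SkillRegistry above (field names mirror the YAML keys):

import yaml

def load_skill_config(path: str = "config/skills.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)["skills"]

def apply_config(registry: SkillRegistry, config: dict) -> None:
    # Flag unknown skills early; disable anything the config turns off
    for name, settings in config.items():
        if name not in registry.skills:
            raise ValueError(f"Config references unregistered skill: {name}")
        registry.skills[name]["enabled"] = settings.get("enabled", True)
        registry.skills[name]["max_tokens"] = settings.get("max_tokens")

apply_config(registry, load_skill_config())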
Layer 4: LLM Call Logging & Tracing
The problem: LLM calls are black boxes, debugging is impossible.
PromptOps solution: Comprehensive logging of all LLM interactions.
What to log:
import json
import time

class LLMCallLogger:
    def log_call(self, **kwargs):
        log_entry = {
            # Request metadata
            "timestamp": time.time(),
            "request_id": kwargs["request_id"],
            "user_id": kwargs["user_id"],
            "feature": kwargs["feature"],

            # Model info
            "model": kwargs["model"],
            "temperature": kwargs.get("temperature", 0.7),

            # Input
            "prompt": kwargs["prompt"],
            "input_tokens": count_tokens(kwargs["prompt"]),

            # Output
            "response": kwargs["response"],
            "output_tokens": count_tokens(kwargs["response"]),

            # Performance
            "latency_ms": kwargs["latency_ms"],
            "cost": kwargs["cost"],

            # Tools (if applicable)
            "tools_available": kwargs.get("tools", []),
            "tools_called": kwargs.get("tools_called", []),

            # Outcome
            "success": kwargs["success"],
            "error": kwargs.get("error"),
        }

        # Send to logging system
        logger.info(json.dumps(log_entry))

        # Send to analytics
        analytics.track("llm_call", log_entry)
Distributed tracing:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def llm_call_with_tracing(prompt):
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("model", "claude-3-5-sonnet")
        span.set_attribute("input_tokens", count_tokens(prompt))

        response = llm.generate(prompt)

        span.set_attribute("output_tokens", count_tokens(response))
        span.set_attribute("success", True)
        return response
Layer 5: Cost Monitoring & Optimization
The problem: AI costs are opaque, optimization is guesswork.
PromptOps solution: Real-time cost tracking and alerts.
Cost tracking:
class CostTracker:
    def __init__(self):
        self.model_pricing = {
            "claude-3-5-sonnet": {
                "input": 0.003,   # per 1K tokens
                "output": 0.015,  # per 1K tokens
            },
            "gpt-4": {
                "input": 0.03,
                "output": 0.06,
            },
        }

    def calculate_cost(self, model, input_tokens, output_tokens):
        pricing = self.model_pricing[model]
        input_cost = (input_tokens / 1000) * pricing["input"]
        output_cost = (output_tokens / 1000) * pricing["output"]
        return input_cost + output_cost

    def track_daily_spend(self):
        today_spend = db.query("""
            SELECT SUM(cost) AS total
            FROM llm_calls
            WHERE date = CURRENT_DATE
        """).total

        daily_budget = 1000  # $1,000/day
        if today_spend > daily_budget * 0.9:
            alert(f"Daily AI spend: ${today_spend:.2f} / ${daily_budget} (90%)")

        metrics.gauge("ai_cost.daily", today_spend)
Cost dashboard:
Daily Cost: $847.32 / $1,000 (85%)
By Feature:
- customer_support: $423.15 (50%)
- data_analysis: $254.21 (30%)
- content_gen: $169.96 (20%)
By Model:
- claude-3-5-sonnet: $635.49 (75%)
- gpt-4: $211.83 (25%)
Top Cost Drivers:
1. User "enterprise_123": $127.44
2. Feature "bulk_analysis": $89.12
3. Prompt "detailed_summary": $67.89
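The breakdown above falls straight out of the call logs. A minimal sketch, assuming each entry is a dict with the feature, model, and cost fields that LLMCallLogger writes:

from collections import defaultdict

def cost_breakdown(log_entries: list[dict]) -> dict:
    """Aggregate logged calls into the per-feature / per-model numbers."""
    by_feature = defaultdict(float)
    by_model = defaultdict(float)
    for entry in log_entries:
        by_feature[entry["feature"]] += entry["cost"]
        by_model[entry["model"]] += entry["cost"]
    return {
        "total": sum(e["cost"] for e in log_entries),
        "by_feature": dict(by_feature),
        "by_model": dict(by_model),
    }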
Layer 6: Testing & Evaluation
The problem: Prompt changes break things in production.
PromptOps solution: Automated prompt testing.
Test suite:
# tests/prompts/test_customer_support.py
import pytest
from prompts.customer_support import get_prompt

# MockTool and llm are provided by the test harness

class TestCustomerSupportPrompt:
    def test_prompt_includes_company_name(self):
        prompt = get_prompt(company_name="Acme Corp", tools=[])
        assert "Acme Corp" in prompt

    def test_prompt_lists_tools(self):
        tools = [MockTool("lookup_order"), MockTool("track_shipping")]
        prompt = get_prompt(company_name="Acme", tools=tools)
        assert "lookup_order" in prompt
        assert "track_shipping" in prompt

    def test_model_response_quality(self):
        """Test actual model responses with this prompt"""
        prompt = get_prompt(company_name="Acme", tools=[])

        # Test with various inputs
        test_cases = [
            {
                "input": "Where's my order?",
                "expected_tool": "lookup_order",
                "should_not_contain": ["refund", "cancel"],
            },
            {
                "input": "I want a refund",
                "expected_tool": "process_refund",
                "should_ask_for": "order number",
            },
        ]

        for case in test_cases:
            response = llm.generate(prompt + "\n" + case["input"])
            assert case["expected_tool"] in response
            for phrase in case.get("should_not_contain", []):
                assert phrase not in response.lower()
Evaluation with LangSmith:
from langsmith import evaluate

# Define evaluators
def check_tool_selection(run, example):
    """Verify the correct tool was called"""
    expected_tool = example.expected_tool
    actual_tools = run.outputs.get("tools_called", [])
    return expected_tool in actual_tools

# Run evaluation
results = evaluate(
    customer_support_agent,
    data="customer_support_test_set",
    evaluators=[check_tool_selection],
)

# Results:
# Test Set: 100 examples
# Accuracy: 94% (94/100)
# Failures: 6 cases where the wrong tool was selected
PromptOps Workflows
Workflow #1: Prompt Change Process
1. Developer updates prompt in version control
2. Automated tests run (unit + integration)
3. Evaluation suite runs on test dataset
4. Code review (includes prompt review)
5. Deploy to staging
6. A/B test in staging (compare old vs new; see the comparison sketch after this list)
7. Monitor key metrics (accuracy, cost, latency)
8. Graduate to production (or rollback if metrics degrade)
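A minimal sketch of the comparison gate in step 6. The eval set format and the run_agent(prompt_version, user_input) helper (which returns the list of tools the agent called) are illustrative, not part of any specific framework:

def compare_prompt_versions(eval_set, old_version, new_version, run_agent):
    """Return tool-selection accuracy for each prompt version."""
    scores = {}
    for version in (old_version, new_version):
        correct = sum(
            1 for case in eval_set
            if case["expected_tool"] in run_agent(version, case["input"])
        )
        scores[version] = correct / len(eval_set)
    return scores

# Usage (eval_set and run_agent are illustrative stand-ins):
# scores = compare_prompt_versions(eval_set, "2.0.0", "2.1.0", run_agent)
# promote = scores["2.1.0"] >= scores["2.0.0"]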
Workflow #2: Cost Optimization
1. Identify high-cost features (dashboard)
2. Analyze token usage (where are tokens going?)
3. Implement optimizations:
- Prune context
- Compress prompts
- Use smaller models for simple tasks
- Cache frequent responses (see the sketch after this list)
4. A/B test optimizations
5. Measure impact (cost ⬇️, accuracy maintained?)
6. Roll out if successful
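Two of the step 3 optimizations, routing simple tasks to a smaller model and caching frequent responses, are cheap to sketch. The routing heuristic and model names are illustrative; tune them against your own traffic:

import hashlib

_response_cache: dict[str, str] = {}

def cached_llm_call(model: str, prompt: str, llm_call) -> str:
    """Serve repeated prompts from cache instead of re-calling the model."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = llm_call(model, prompt)
    return _response_cache[key]

def route_model(prompt: str) -> str:
    """Send short, simple requests to a cheaper model; keep the large
    model for long or analysis-heavy prompts. Thresholds are illustrative."""
    if len(prompt) < 2000 and "analyze" not in prompt.lower():
        return "claude-3-5-haiku"
    return "claude-3-5-sonnet"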
Workflow #3: Incident Response
User reports: "AI gave wrong answer"
1. Look up request ID in logs
2. View full LLM call trace:
- Input prompt
- Output response
- Tools called
- Token usage
3. Reproduce issue in staging
4. Identify root cause:
- Prompt ambiguity?
- Wrong tool selected?
- Context missing?
- Model hallucination?
5. Implement fix
6. Test fix against incident case
7. Deploy fix
8. Add test case to prevent regression (see the sketch below)
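A sketch of steps 1 and 8: pull the logged call by request ID, then freeze the incident as a regression test. The fetch_log_entry() helper and the request ID are illustrative; wire them to wherever LLMCallLogger writes:

def fetch_log_entry(request_id: str) -> dict:
    # Illustrative: query your log store for the entry LLMCallLogger
    # wrote with this request_id.
    raise NotImplementedError("wire this to your log store")

def test_incident_wrong_tool_selected():
    """Regression test built from a real incident's logged prompt."""
    incident = fetch_log_entry("req_abc123")  # request ID is illustrative
    response = llm.generate(incident["prompt"])
    # The incident: the agent called process_refund instead of lookup_order
    assert "lookup_order" in response
    assert "process_refund" not in response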
PromptOps Metrics
Key metrics to track:
| Metric | What It Measures | Target |
|---|---|---|
| Token usage per request | Context efficiency | < 5K tokens |
| Cost per request | $ per user interaction | < $0.05 |
| Response latency | Time to first token | < 2 seconds |
| Tool selection accuracy | % correct tool chosen | > 95% |
| Prompt success rate | % requests completed | > 98% |
| Daily AI spend | Total cost per day | < budget |
| Cost per feature | $ by feature area | Track trends |
| Model accuracy | % correct responses | > 90% |
Tools & Platforms
PromptOps platforms:
- LangSmith: Debugging, tracing, evaluation
- LangFuse: Open-source LLM observability
- Weights & Biases: Prompt tracking and evaluation
- Helicone: LLM observability and caching
- Custom: Build your own logging/monitoring
Integration example (LangSmith):
from langsmith import traceable

@traceable(run_type="llm", name="customer_support_agent")
def handle_customer_query(query, user_id):
    # Automatically traced in LangSmith
    response = agent.run(query)
    return response
# View in LangSmith dashboard:
# - Full trace of LLM calls
# - Token usage
# - Latency breakdown
# - Cost
# - User feedback
The Bottom Line
PromptOps is emerging because AI systems have operational requirements that traditional DevOps doesn't cover.
Core practices:
- Version control for prompts
- Token budget management
- Comprehensive logging of LLM calls
- Cost monitoring and optimization
- Automated testing of prompt changes
- Skill/tool configuration management
Tools: LangSmith, LangFuse, custom logging
Expected impact:
- 50-70% cost reduction
- 10-20% accuracy improvement
- Faster debugging
- Safer deployments
Start with:
- Log all LLM calls (request, response, tokens, cost)
- Track daily spend
- Version control your prompts
- Set token budgets per feature
PromptOps is the new DevOps. If you're running AI in production, you need it.
Getting Started
Week 1: Set up logging
- Log every LLM call
- Track tokens and cost
- Set up basic dashboard
Week 2: Add monitoring
- Daily cost tracking
- Token budget alerts
- Latency monitoring
Week 3: Version control prompts
- Move prompts to version control
- Add basic tests
- Document prompt versions
Week 4: Implement evaluation
- Create test dataset
- Run evaluation suite
- Track accuracy over time
Need help setting up PromptOps for your AI system? We've built production LLM infrastructure for dozens of companies.
Related reading:
- LangSmith docs: https://smith.langchain.com
- LangGraph docs: https://www.langgraph.dev
About the Author
DomAIn Labs Team
The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.