
Building Multi-Step Workflow Agents That Don't Break
A simple AI agent answers questions. A powerful AI agent executes workflows.
Imagine an agent that doesn't just respond to "I need to return this order"—it checks the order status, validates return eligibility, generates a return label, emails it to the customer, updates inventory, and schedules the refund. All automatically.
That's a multi-step workflow agent. And building one that actually works reliably requires proper architecture.
This guide shows you how to build workflow agents that handle complexity, recover from errors, and don't leave customers in broken states.
What Makes Workflow Agents Different
Simple Agent (Single Step)
// One action, done
async function simpleAgent(input: string) {
const response = await llm.generate(input)
return response
}
Use case: Answer FAQ, provide information, classify data
Workflow Agent (Multi-Step)
// Multiple coordinated actions with state
async function workflowAgent(input: string) {
let state = initializeState()
// Step 1: Understand intent
state = await planWorkflow(input, state)
// Step 2-N: Execute each step
for (const step of state.plan) {
state = await executeStep(step, state)
if (state.error) {
state = await handleError(state)
}
if (state.complete) break
}
return state.result
}
Use case: Process returns, onboard customers, fulfill orders, escalate support tickets
Core Architecture Pattern
Here's a reliable pattern for workflow agents:
interface WorkflowState {
// What we're trying to accomplish
goal: string
// Current execution plan
plan: WorkflowStep[]
// What step are we on?
currentStep: number
// Data accumulated during workflow
context: Record<string, any>
// Errors encountered
errors: Error[]
// Final result
result?: any
// Status
status: 'planning' | 'executing' | 'requires_human' | 'completed' | 'failed'
}
interface WorkflowStep {
name: string
action: (state: WorkflowState) => Promise<WorkflowState>
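requiresHuman?: boolean
retryable?: boolean
timeout?: number // milliseconds
// Optional undo action, run if a later step fails (see Compensating Actions)
compensation?: (state: WorkflowState) => Promise<unknown>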
}
Why this structure?
- State: Everything needed to resume if interrupted
- Plan: Clear sequence of actions (debuggable, modifiable)
- Context: Data flows between steps
- Errors: Tracked for recovery and reporting
- Status: Know exactly where we are
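Every step in this guide repeats the same two moves: merge new data into `context` and advance `currentStep`, or record an error and mark the workflow failed. A minimal sketch of those two transitions as pure helpers follows. The names `SketchState`, `advance`, and `fail` are hypothetical and not part of the original design, and the state shape is a trimmed copy of `WorkflowState` with the plan left out.

```typescript
// Trimmed copy of the article's state shape (plan omitted for brevity).
// 'requires_human' is included because later steps return that status.
type SketchStatus = 'planning' | 'executing' | 'requires_human' | 'completed' | 'failed'

interface SketchState {
  goal: string
  currentStep: number
  context: Record<string, unknown>
  errors: Error[]
  status: SketchStatus
}

// Hypothetical helper: merge new data into context and move to the next step.
// Returns a new object, so the previous state can still be persisted or compared.
function advance(state: SketchState, patch: Record<string, unknown>): SketchState {
  return {
    ...state,
    context: { ...state.context, ...patch },
    currentStep: state.currentStep + 1
  }
}

// Hypothetical helper: record an error and mark the workflow as failed.
function fail(state: SketchState, error: unknown): SketchState {
  const err = error instanceof Error ? error : new Error(String(error))
  return { ...state, errors: [...state.errors, err], status: 'failed' }
}

const initial: SketchState = {
  goal: 'process_return',
  currentStep: 0,
  context: { orderNumber: 'ORD-12345' },
  errors: [],
  status: 'executing'
}

const next = advance(initial, { customerId: 'cust-1' })
const failed = fail(next, new Error('Order not found'))
```

Because neither helper mutates its input, `initial` is unchanged after both calls, which is what makes saving a snapshot after every step safe.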
Real Example: Order Return Workflow
Let's build a complete return processing agent.
Step 1: Define the Workflow
const returnOrderWorkflow: WorkflowStep[] = [
{
name: 'validate_order',
action: validateOrder,
retryable: false,
timeout: 5000
},
{
name: 'check_return_eligibility',
action: checkReturnEligibility,
retryable: true,
timeout: 5000
},
{
name: 'generate_return_label',
action: generateReturnLabel,
retryable: true,
timeout: 10000
},
{
name: 'send_confirmation_email',
action: sendConfirmationEmail,
retryable: true,
timeout: 5000
},
{
name: 'update_inventory',
action: updateInventory,
retryable: true,
timeout: 5000
},
{
name: 'schedule_refund',
action: scheduleRefund,
retryable: true,
timeout: 5000
}
]
Step 2: Implement Each Step
async function validateOrder(state: WorkflowState): Promise<WorkflowState> {
try {
const { orderNumber } = state.context
// Query order database
const order = await db.orders.findUnique({
where: { orderNumber }
})
if (!order) {
throw new Error(`Order ${orderNumber} not found`)
}
// Add order data to context for next steps
return {
...state,
context: {
...state.context,
order: order,
customerId: order.customerId,
orderDate: order.createdAt
},
currentStep: state.currentStep + 1
}
} catch (error) {
return {
...state,
errors: [...state.errors, error],
status: 'failed'
}
}
}
async function checkReturnEligibility(state: WorkflowState): Promise<WorkflowState> {
try {
const { order, returnReason } = state.context
// Calculate days since purchase
const daysSincePurchase = daysBetween(order.createdAt, new Date())
// Check return window (60 days)
if (daysSincePurchase > 60) {
// Escalate to human for approval
return {
...state,
status: 'requires_human',
context: {
...state.context,
escalationReason: 'Return requested outside 60-day window',
escalationMessage: `Order from ${order.createdAt} is ${daysSincePurchase} days old. Requires manager approval.`
}
}
}
// Check if items are returnable
const nonReturnableItems = order.items.filter(
item => item.category === 'final_sale'
)
if (nonReturnableItems.length > 0) {
throw new Error(
`Order contains non-returnable items: ${nonReturnableItems.map(i => i.name).join(', ')}`
)
}
// Eligible - continue
return {
...state,
context: {
...state.context,
eligible: true,
returnWindow: 60 - daysSincePurchase // Days remaining
},
currentStep: state.currentStep + 1
}
} catch (error) {
return {
...state,
errors: [...state.errors, error],
status: 'failed'
}
}
}
async function generateReturnLabel(state: WorkflowState): Promise<WorkflowState> {
try {
const { order } = state.context
// Call shipping API
const label = await shippingAPI.createReturnLabel({
orderId: order.id,
fromAddress: order.shippingAddress,
toAddress: WAREHOUSE_ADDRESS,
weight: calculateWeight(order.items),
serviceLevel: 'ground'
})
return {
...state,
context: {
...state.context,
returnLabel: label,
trackingNumber: label.trackingNumber,
labelUrl: label.pdfUrl
},
currentStep: state.currentStep + 1
}
} catch (error) {
// Shipping API failure - retry logic handled by executor
return {
...state,
errors: [...state.errors, error],
status: 'failed'
}
}
}
async function sendConfirmationEmail(state: WorkflowState): Promise<WorkflowState> {
try {
const { order, returnLabel, customerId } = state.context
const customer = await db.customers.findUnique({
where: { id: customerId }
})
await emailService.send({
to: customer.email,
template: 'return_confirmation',
data: {
customerName: customer.name,
orderNumber: order.orderNumber,
returnLabelUrl: returnLabel.pdfUrl,
trackingNumber: returnLabel.trackingNumber,
estimatedRefundDate: addDays(new Date(), 7)
}
})
return {
...state,
context: {
...state.context,
emailSent: true,
emailSentAt: new Date()
},
currentStep: state.currentStep + 1
}
} catch (error) {
// Email failure shouldn't block workflow
// Log error but continue
console.error('Email send failed:', error)
return {
...state,
errors: [...state.errors, error],
context: {
...state.context,
emailSent: false,
emailError: error instanceof Error ? error.message : String(error)
},
currentStep: state.currentStep + 1 // Continue anyway
}
}
}
async function updateInventory(state: WorkflowState): Promise<WorkflowState> {
try {
const { order } = state.context
// Mark items as returning
await db.inventory.updateMany({
where: {
itemId: { in: order.items.map(item => item.id) }
},
data: {
status: 'returning',
expectedReturnDate: addDays(new Date(), 7)
}
})
return {
...state,
context: {
...state.context,
inventoryUpdated: true
},
currentStep: state.currentStep + 1
}
} catch (error) {
return {
...state,
errors: [...state.errors, error],
status: 'failed'
}
}
}
async function scheduleRefund(state: WorkflowState): Promise<WorkflowState> {
try {
const { order } = state.context
// Schedule refund for 7 days from now (when return received)
const refund = await paymentProcessor.scheduleRefund({
orderId: order.id,
amount: order.total,
scheduledFor: addDays(new Date(), 7),
reason: 'return_requested'
})
return {
...state,
context: {
...state.context,
refundScheduled: true,
refundId: refund.id,
refundAmount: refund.amount,
refundDate: refund.scheduledFor
},
currentStep: state.currentStep + 1,
status: 'completed',
result: {
success: true,
message: 'Return processed successfully',
trackingNumber: state.context.trackingNumber,
refundAmount: refund.amount,
refundDate: refund.scheduledFor
}
}
} catch (error) {
return {
...state,
errors: [...state.errors, error],
status: 'failed'
}
}
}
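The steps above call `daysBetween` and `addDays`, which this guide never defines. A minimal sketch of both is below, assuming "days" means whole 24-hour periods measured on UTC timestamps; that convention is an assumption of this sketch, not something the original specifies. Date libraries such as date-fns ship equivalents (`addDays`, `subDays`, `differenceInDays`) if you would rather not hand-roll them.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Whole 24-hour periods elapsed from `start` to `end`.
// Partial days are dropped; the result is negative if `end` is earlier.
function daysBetween(start: Date, end: Date): number {
  return Math.floor((end.getTime() - start.getTime()) / MS_PER_DAY)
}

// Returns a new Date `days` days after `date`, leaving the input untouched.
function addDays(date: Date, days: number): Date {
  const result = new Date(date.getTime())
  result.setUTCDate(result.getUTCDate() + days)
  return result
}

const purchased = new Date('2024-01-01T00:00:00Z')
const requested = new Date('2024-03-15T00:00:00Z')

const orderAgeDays = daysBetween(purchased, requested) // 74, outside the 60-day window
const refundDate = addDays(requested, 7)               // 2024-03-22T00:00:00.000Z
```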
Step 3: Build the Workflow Executor
This is the engine that runs workflows with error handling:
class WorkflowExecutor {
async execute(
workflow: WorkflowStep[],
initialContext: Record<string, any>
): Promise<WorkflowState> {
let state: WorkflowState = {
goal: 'process_return',
plan: workflow,
currentStep: 0,
context: initialContext,
errors: [],
status: 'executing'
}
// Save initial state for recovery
await this.saveState(state)
// Execute each step
for (let i = 0; i < workflow.length; i++) {
const step = workflow[i]
console.log(`Executing step ${i + 1}/${workflow.length}: ${step.name}`)
// Execute with retry logic
state = await this.executeWithRetry(step, state)
// Save state after each step (for recovery)
await this.saveState(state)
// Check if workflow should stop
if (state.status === 'failed') {
await this.handleFailure(state)
break
}
if (state.status === 'requires_human') {
await this.escalateToHuman(state)
break
}
if (state.status === 'completed') {
await this.handleSuccess(state)
break
}
}
return state
}
private async executeWithRetry(
step: WorkflowStep,
state: WorkflowState
): Promise<WorkflowState> {
const maxAttempts = step.retryable ? 3 : 1
let lastError: Error | null = null
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
// Execute with timeout
const result = await this.executeWithTimeout(
step.action(state),
step.timeout ?? 30000
)
// Steps catch their own errors and report them as status 'failed',
// so a returned failure must count as a failed attempt or retries never fire
if (result.status !== 'failed') {
return result
}
lastError = result.errors[result.errors.length - 1] ?? new Error(`Step ${step.name} failed`)
} catch (error) {
// Thrown errors and timeouts land here
lastError = error instanceof Error ? error : new Error(String(error))
}
console.error(
`Step ${step.name} failed (attempt ${attempt}/${maxAttempts}):`,
lastError
)
// If attempts remain, back off exponentially (2s, then 4s) and retry
if (attempt < maxAttempts) {
await this.sleep(Math.pow(2, attempt) * 1000)
}
}
// All attempts failed
return {
...state,
errors: [...state.errors, lastError!],
status: 'failed'
}
}
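private async executeWithTimeout<T>(
promise: Promise<T>,
timeout: number
): Promise<T> {
let timer: ReturnType<typeof setTimeout> | undefined
const timeoutPromise = new Promise<T>((_, reject) => {
timer = setTimeout(() => reject(new Error('Step timeout')), timeout)
})
try {
// Promise.race only stops waiting; a timed-out step may still finish in the background
return await Promise.race([promise, timeoutPromise])
} finally {
// Clear the timer so it does not linger after the step settles
clearTimeout(timer)
}
}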
private async saveState(state: WorkflowState): Promise<void> {
// Persist state to database for recovery
await db.workflowStates.upsert({
where: { id: state.context.workflowId },
create: {
id: state.context.workflowId,
state: JSON.stringify(state),
updatedAt: new Date()
},
update: {
state: JSON.stringify(state),
updatedAt: new Date()
}
})
}
private async handleFailure(state: WorkflowState): Promise<void> {
// Log failure
console.error('Workflow failed:', state.errors)
// Notify monitoring
await monitoring.alert({
type: 'workflow_failure',
workflowId: state.context.workflowId,
goal: state.goal,
failedStep: state.plan[state.currentStep]?.name,
errors: state.errors.map(e => e.message)
})
// Notify customer if applicable
if (state.context.customerId) {
await this.notifyCustomerOfFailure(state)
}
}
private async escalateToHuman(state: WorkflowState): Promise<void> {
// Create support ticket for human review
await db.tickets.create({
data: {
type: 'workflow_escalation',
priority: 'high',
workflowId: state.context.workflowId,
reason: state.context.escalationReason,
context: JSON.stringify(state.context),
assignedTo: 'support_team'
}
})
// Notify team
await notificationService.send({
channel: 'slack',
message: `Workflow escalated: ${state.context.escalationMessage}`,
data: state.context
})
}
private async handleSuccess(state: WorkflowState): Promise<void> {
console.log('Workflow completed successfully:', state.result)
// Log success metric
await analytics.track('workflow_success', {
goal: state.goal,
duration: Date.now() - state.context.startTime,
stepsCompleted: state.currentStep // each step already incremented this on success
})
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms))
}
}
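The executor waits `Math.pow(2, attempt) * 1000` milliseconds between attempts, so with three attempts the waits are 2 s and then 4 s. The sketch below works that arithmetic out as a standalone function. The cap (`capMs`) and the random jitter are additions for illustration, not something the executor above does; they are common refinements when many workflows retry against the same API at once.

```typescript
// Delay in milliseconds before the retry that follows failed attempt `attempt` (1-based).
// Same formula as the executor, plus a hypothetical upper bound `capMs`.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, Math.pow(2, attempt) * baseMs)
}

// "Full jitter" variant: a random delay between 0 and the computed backoff,
// so simultaneous failures do not all retry at the same instant.
// `random` is injectable to keep the function testable.
function jitteredDelay(attempt: number, random: () => number = Math.random): number {
  return Math.floor(random() * backoffDelay(attempt))
}

const schedule = [1, 2, 3, 4, 5].map(attempt => backoffDelay(attempt))
// schedule is [2000, 4000, 8000, 16000, 30000]; the fifth value hits the cap
```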
Step 4: Usage
// Initialize executor
const executor = new WorkflowExecutor()
// Process a return
const result = await executor.execute(
returnOrderWorkflow,
{
workflowId: generateId(),
orderNumber: 'ORD-12345',
returnReason: 'Changed mind',
startTime: Date.now()
}
)
if (result.status === 'completed') {
console.log('Return processed:', result.result)
} else if (result.status === 'failed') {
console.error('Return failed:', result.errors)
} else if (result.status === 'requires_human') {
console.log('Escalated to human:', result.context.escalationReason)
}
Error Handling Strategies
1. Graceful Degradation
Not all failures should stop the workflow:
// Critical step - failure stops workflow
async function processPayment(state: WorkflowState) {
try {
const result = await paymentAPI.charge(state.context.amount)
return { ...state, context: { ...state.context, paymentId: result.id } }
} catch (error) {
return { ...state, status: 'failed', errors: [...state.errors, error] }
}
}
// Non-critical step - failure logged but workflow continues
async function sendReceipt(state: WorkflowState) {
try {
await emailService.send(state.context.receipt)
return { ...state, context: { ...state.context, receiptSent: true } }
} catch (error) {
console.error('Receipt send failed:', error)
// Continue anyway - we can resend later
return {
...state,
errors: [...state.errors, error],
context: { ...state.context, receiptSent: false }
}
}
}
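Rather than hand-writing the "log and continue" branch in every non-critical step, the pattern can be factored into a wrapper. This is a sketch under assumptions: `nonCritical`, `StepState`, and the `<name>_skipped` context flag are hypothetical names, and the wrapped step is assumed to signal failure by throwing.

```typescript
interface StepState {
  status: 'executing' | 'completed' | 'failed'
  currentStep: number
  context: Record<string, unknown>
  errors: Error[]
}
type StepAction = (state: StepState) => Promise<StepState>

// Hypothetical wrapper: if `action` throws, record the error, flag the step
// as skipped in context, and advance so the workflow keeps going.
function nonCritical(name: string, action: StepAction): StepAction {
  return async (state) => {
    try {
      return await action(state)
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error))
      console.error(`Non-critical step ${name} failed:`, err.message)
      return {
        ...state,
        errors: [...state.errors, err],
        context: { ...state.context, [`${name}_skipped`]: true },
        currentStep: state.currentStep + 1
      }
    }
  }
}

// Example: a receipt step whose email provider is down
const failingReceiptStep: StepAction = async () => {
  throw new Error('SMTP unavailable')
}
const safeReceiptStep = nonCritical('send_receipt', failingReceiptStep)
```

Calling `safeReceiptStep(state)` resolves with `status` unchanged, one recorded error, and `send_receipt_skipped: true` in context, so a later job can find and resend the receipt.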
2. Compensating Actions (Rollback)
When a late step fails, undo earlier steps:
const workflowWithCompensation: WorkflowStep[] = [
{
name: 'reserve_inventory',
action: reserveInventory,
compensation: releaseInventory // Undo if workflow fails
},
{
name: 'charge_customer',
action: chargeCustomer,
compensation: refundCustomer
},
{
name: 'ship_order',
action: shipOrder,
compensation: cancelShipment
}
]
// Assumes step actions throw on failure (rather than returning status 'failed')
async function executeWithCompensation(workflow: WorkflowStep[], initialState: WorkflowState) {
let state = initialState
const completedSteps: WorkflowStep[] = []
for (const step of workflow) {
try {
state = await step.action(state)
completedSteps.push(step)
} catch (error) {
// Failure - roll back completed steps in reverse order
console.error(`Step ${step.name} failed. Rolling back...`)
// Copy before reversing: Array.prototype.reverse mutates in place
for (const completedStep of [...completedSteps].reverse()) {
if (completedStep.compensation) {
await completedStep.compensation(state)
}
}
throw error
}
}
return state
}
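To see the rollback order concretely, here is a self-contained sketch with fake steps that only append to a log. `SagaStep`, `runSaga`, and the demo steps are hypothetical stand-ins for the inventory, payment, and shipping calls above. Unlike the listing above, this version also keeps rolling back if one compensation itself fails, which is a design choice of the sketch.

```typescript
interface SagaStep {
  name: string
  action: () => Promise<void>
  compensation?: () => Promise<void>
}

// Runs steps in order; on failure, undoes the completed steps newest-first
// and then rethrows the original error.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = []
  for (const step of steps) {
    try {
      await step.action()
      completed.push(step)
    } catch (error) {
      for (const done of [...completed].reverse()) {
        try {
          await done.compensation?.()
        } catch (undoError) {
          // A failed undo needs manual follow-up; keep rolling back the rest
          console.error(`Compensation for ${done.name} failed:`, undoError)
        }
      }
      throw error
    }
  }
}

const sagaLog: string[] = []
const demoSteps: SagaStep[] = [
  {
    name: 'reserve_inventory',
    action: async () => { sagaLog.push('reserve') },
    compensation: async () => { sagaLog.push('release') }
  },
  {
    name: 'charge_customer',
    action: async () => { sagaLog.push('charge') },
    compensation: async () => { sagaLog.push('refund') }
  },
  {
    name: 'ship_order',
    action: async () => { throw new Error('carrier unavailable') },
    compensation: async () => { sagaLog.push('cancel_shipment') }
  }
]
```

Running `runSaga(demoSteps)` rejects with "carrier unavailable" and leaves `sagaLog` as reserve, charge, refund, release. The failed step's own compensation never runs because the step never completed.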
3. Human-in-the-Loop
Some decisions require human judgment:
async function requiresHumanReview(state: WorkflowState): Promise<WorkflowState> {
// Check if automated decision is confident
const confidence = await mlModel.predictConfidence(state.context.data)
if (confidence < 0.8) {
// Low confidence - escalate
return {
...state,
status: 'requires_human',
context: {
...state.context,
reviewReason: `Model confidence ${confidence} below threshold`,
reviewData: state.context.data
}
}
}
// High confidence - proceed automatically
return {
...state,
currentStep: state.currentStep + 1
}
}
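Escalation is only half of the loop: once a reviewer decides, the workflow has to move again. The sketch below shows one way to fold that decision back into state. `applyHumanDecision`, `HumanDecision`, and the `reviewedBy` / `reviewNote` context keys are hypothetical and not defined elsewhere in this guide.

```typescript
type ReviewStatus = 'executing' | 'requires_human' | 'completed' | 'failed'

interface ReviewState {
  status: ReviewStatus
  currentStep: number
  context: Record<string, unknown>
  errors: Error[]
}

interface HumanDecision {
  approved: boolean
  reviewer: string
  note?: string
}

// Hypothetical transition: turn a reviewer's decision into the next state.
function applyHumanDecision(state: ReviewState, decision: HumanDecision): ReviewState {
  if (state.status !== 'requires_human') {
    throw new Error(`Workflow is not waiting for review (status: ${state.status})`)
  }
  const context = {
    ...state.context,
    reviewedBy: decision.reviewer,
    reviewNote: decision.note
  }
  if (!decision.approved) {
    return {
      ...state,
      context,
      status: 'failed',
      errors: [...state.errors, new Error(`Rejected by ${decision.reviewer}`)]
    }
  }
  // Approved: move past the step that escalated and hand back to the executor
  return { ...state, context, status: 'executing', currentStep: state.currentStep + 1 }
}

const waiting: ReviewState = {
  status: 'requires_human',
  currentStep: 1,
  context: { escalationReason: 'Return requested outside 60-day window' },
  errors: []
}
const approved = applyHumanDecision(waiting, { approved: true, reviewer: 'manager-7' })
const rejected = applyHumanDecision(waiting, { approved: false, reviewer: 'manager-7', note: 'Item used' })
```

Recording who approved and why in `context` keeps the audit trail in the same persisted state as everything else.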
State Management Best Practices
1. Idempotency
Steps should be safe to retry:
// BAD - not idempotent
async function sendEmail(state: WorkflowState) {
await emailService.send(state.context.email) // Sends duplicate if retried
}
// GOOD - idempotent
async function sendEmail(state: WorkflowState) {
// Check if already sent
if (state.context.emailSent) {
return state // Skip if already done
}
await emailService.send(state.context.email)
return {
...state,
context: {
...state.context,
emailSent: true,
emailSentAt: new Date()
}
}
}
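The `emailSent` flag protects against retries only if the state was saved after the send. If the process crashes between sending and saving, the resumed workflow sends again. One common guard is an idempotency key recorded outside the workflow state. The sketch below assumes a key of `workflowId:stepName` and uses an in-memory `Map` as a stand-in for a durable store; `runOnce`, `idempotencyKey`, and `sendReturnEmail` are hypothetical names.

```typescript
// Stand-in for a durable store (in practice, a table with a unique key column)
const completedEffects = new Map<string, unknown>()

// Deterministic key: the same workflow and step always map to the same key
function idempotencyKey(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`
}

// Runs `effect` at most once per key; later calls return the recorded result.
async function runOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
  if (completedEffects.has(key)) {
    return completedEffects.get(key) as T
  }
  const result = await effect()
  completedEffects.set(key, result)
  return result
}

let emailsSent = 0

// Example side effect: pretend to send an email and return a message id
async function sendReturnEmail(workflowId: string): Promise<string> {
  return runOnce(idempotencyKey(workflowId, 'send_confirmation_email'), async () => {
    emailsSent += 1
    return `message-${emailsSent}`
  })
}
```

This sketch is not safe against two callers racing on the same key, because the check and the write are separate operations. A real store would need an atomic insert, or the key could be passed to a provider that deduplicates requests itself.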
2. State Persistence
Save state after every step for recovery:
// If workflow crashes mid-execution, we can resume
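async function resumeWorkflow(workflowId: string, workflow: WorkflowStep[]) {
// Load saved state from database
const savedState = await db.workflowStates.findUnique({
where: { id: workflowId }
})
if (!savedState) {
throw new Error(`No saved state for workflow ${workflowId}`)
}
const state: WorkflowState = JSON.parse(savedState.state)
// JSON.stringify drops functions, so the saved plan has no actions.
// Take the remaining steps from the in-code workflow definition instead.
// Note: execute() restarts currentStep at 0 for the sliced plan.
const executor = new WorkflowExecutor()
return executor.execute(
workflow.slice(state.currentStep), // Remaining steps
state.context
)
}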
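One detail matters when persisting a plan: `JSON.stringify` silently drops function-valued properties, so step actions do not survive a round trip through the database. A sketch of one workaround is to persist step names and rebuild the steps from a registry on load. `stepRegistry`, `serializeState`, and `loadRemainingSteps` are hypothetical names, and the step shape is simplified to take and return a plain context object.

```typescript
type Context = Record<string, unknown>

interface Step {
  name: string
  action: (context: Context) => Promise<Context>
}

interface SavedState {
  planNames: string[]
  currentStep: number
  context: Context
}

// Registry of known steps, keyed by name (hypothetical; populated at startup)
const stepRegistry = new Map<string, Step>()

function registerSteps(steps: Step[]): void {
  for (const step of steps) stepRegistry.set(step.name, step)
}

// Persist step names instead of step objects, since functions do not survive JSON
function serializeState(plan: Step[], currentStep: number, context: Context): string {
  const saved: SavedState = { planNames: plan.map(s => s.name), currentStep, context }
  return JSON.stringify(saved)
}

// Rebuild the steps that still need to run from the saved names
function loadRemainingSteps(json: string): { steps: Step[]; context: Context } {
  const saved = JSON.parse(json) as SavedState
  const steps = saved.planNames.slice(saved.currentStep).map(stepName => {
    const step = stepRegistry.get(stepName)
    if (!step) throw new Error(`Unknown step in saved plan: ${stepName}`)
    return step
  })
  return { steps, context: saved.context }
}

const demoPlan: Step[] = ['validate_order', 'generate_return_label', 'schedule_refund'].map(
  stepName => ({
    name: stepName,
    action: async (context: Context) => ({ ...context, [stepName]: 'done' })
  })
)
registerSteps(demoPlan)

// Crash after the first step: one step done, two remaining
const snapshot = serializeState(demoPlan, 1, { orderNumber: 'ORD-12345' })
const resumed = loadRemainingSteps(snapshot)
```

Failing loudly on an unknown step name catches the case where a deployment renamed or removed a step while workflows were still in flight.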
3. Timeout Management
Prevent indefinite waiting:
const step: WorkflowStep = {
name: 'call_external_api',
action: callExternalAPI,
timeout: 10000, // 10 seconds max
retryable: true
}
// If API call takes > 10s, timeout and retry
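A timeout built on `Promise.race` stops the workflow from waiting, but the underlying call keeps running unless it is told to stop. The sketch below pairs the timer with an `AbortController` so that work which accepts an `AbortSignal` can cancel itself. `withTimeout` and `slowCall` are hypothetical names; `slowCall` stands in for an HTTP request or similar.

```typescript
// Runs `work` with an AbortSignal that fires once `ms` milliseconds have passed.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      const error = new Error(`Step timed out after ${ms}ms`)
      error.name = 'TimeoutError'
      reject(error)      // reject first so the caller sees the timeout error
      controller.abort() // then tell the underlying work to stop
    }, ms)
  })
  try {
    return await Promise.race([work(controller.signal), deadline])
  } finally {
    clearTimeout(timer) // no stray timer once the race settles
  }
}

// Example work that honours the signal (stand-in for an external API call)
function slowCall(delayMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const pending = setTimeout(() => resolve('done'), delayMs)
    signal.addEventListener('abort', () => {
      clearTimeout(pending)
      reject(new Error('aborted'))
    })
  })
}
```

`withTimeout(signal => slowCall(5, signal), 200)` resolves with `'done'`, while a 500 ms call under a 20 ms limit rejects with an error named `TimeoutError`. Work that ignores the signal still times out from the caller's point of view; it just keeps running in the background.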
Testing Workflow Agents
Unit Test Individual Steps
describe('validateOrder', () => {
// Complete starting state so the step can spread `errors` and increment `currentStep`
const baseState: WorkflowState = {
goal: 'process_return',
plan: [],
currentStep: 0,
context: {},
errors: [],
status: 'executing'
}
it('should add order to context if found', async () => {
// Assumes the test database is seeded with order ORD-123
const state = { ...baseState, context: { orderNumber: 'ORD-123' } }
const result = await validateOrder(state)
expect(result.context.order).toBeDefined()
expect(result.currentStep).toBe(state.currentStep + 1)
})
it('should fail if order not found', async () => {
const state = { ...baseState, context: { orderNumber: 'INVALID' } }
const result = await validateOrder(state)
expect(result.status).toBe('failed')
expect(result.errors).toHaveLength(1)
})
})
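The unit tests above depend on a real or seeded `db`. One way to keep step tests fast and isolated is to pass the data source into the step instead of importing a global. The sketch below assumes that refactoring; `OrderStore`, `makeValidateOrder`, and `fakeOrderStore` are hypothetical names and not part of the code shown earlier.

```typescript
interface Order {
  orderNumber: string
  customerId: string
}

// The only capability the step needs from the database
interface OrderStore {
  findByNumber(orderNumber: string): Promise<Order | null>
}

interface TestState {
  status: string
  currentStep: number
  context: Record<string, unknown>
  errors: Error[]
}

// Factory: the step receives its data source instead of reaching for a global `db`
function makeValidateOrder(orders: OrderStore) {
  return async (state: TestState): Promise<TestState> => {
    const orderNumber = String(state.context.orderNumber)
    const order = await orders.findByNumber(orderNumber)
    if (!order) {
      return {
        ...state,
        status: 'failed',
        errors: [...state.errors, new Error(`Order ${orderNumber} not found`)]
      }
    }
    return {
      ...state,
      context: { ...state.context, order, customerId: order.customerId },
      currentStep: state.currentStep + 1
    }
  }
}

// In-memory fake for tests: no database, no network
function fakeOrderStore(seed: Order[]): OrderStore {
  const byNumber = new Map(seed.map(o => [o.orderNumber, o] as const))
  return { findByNumber: async orderNumber => byNumber.get(orderNumber) ?? null }
}

const validateOrderForTest = makeValidateOrder(
  fakeOrderStore([{ orderNumber: 'ORD-123', customerId: 'cust-1' }])
)
```

In production the same factory would be given a thin adapter over the real database client, so the step logic under test is exactly the logic that ships.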
Integration Test Full Workflows
describe('returnOrderWorkflow', () => {
it('should complete successfully for valid return', async () => {
const executor = new WorkflowExecutor()
const result = await executor.execute(returnOrderWorkflow, {
workflowId: 'test-123',
orderNumber: 'ORD-VALID',
returnReason: 'Defective'
})
expect(result.status).toBe('completed')
expect(result.result.success).toBe(true)
expect(result.result.trackingNumber).toBeDefined()
})
it('should escalate returns outside 60-day window', async () => {
// Order from 90 days ago
await db.orders.create({
data: {
orderNumber: 'ORD-OLD',
createdAt: subDays(new Date(), 90)
}
})
const executor = new WorkflowExecutor()
const result = await executor.execute(returnOrderWorkflow, {
workflowId: 'test-124',
orderNumber: 'ORD-OLD',
returnReason: 'Changed mind'
})
expect(result.status).toBe('requires_human')
expect(result.context.escalationReason).toContain('outside 60-day window')
})
})
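Integration tests like these cover the happy path and one escalation, but retry behavior only shows up when something actually fails. A small fault-injection helper makes that testable. In this sketch, `flaky` and `retry` are hypothetical names; `retry` is a stripped-down stand-in for the executor's retry loop with no backoff, so tests stay fast.

```typescript
type AsyncFn<A, R> = (arg: A) => Promise<R>

// Hypothetical test helper: fail the first `failures` calls, then delegate to `fn`
function flaky<A, R>(fn: AsyncFn<A, R>, failures: number) {
  let calls = 0
  const wrapped: AsyncFn<A, R> = async (arg) => {
    calls += 1
    if (calls <= failures) {
      throw new Error(`Injected failure ${calls}/${failures}`)
    }
    return fn(arg)
  }
  return { wrapped, callCount: () => calls }
}

// Minimal retry loop standing in for the executor (no backoff)
async function retry<A, R>(fn: AsyncFn<A, R>, arg: A, maxAttempts: number): Promise<R> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(arg)
    } catch (error) {
      lastError = error
    }
  }
  throw lastError
}

// Example target: a label service that would normally call a shipping API
const createLabel: AsyncFn<string, string> = async orderId => `label-for-${orderId}`
```

With two injected failures and three attempts, `retry(flaky(createLabel, 2).wrapped, 'ORD-1', 3)` resolves on the third call. With three injected failures it rejects with the last injected error, which is the case where the workflow should end up `failed`.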
Monitoring & Observability
Track workflow performance:
// Log every step execution
await analytics.track('workflow_step_executed', {
workflowId: state.context.workflowId,
step: step.name,
duration: executionTime,
success: !state.errors.length
})
// Monitor common failure points
if (state.errors.length > 0) {
await monitoring.increment(`workflow.step.${step.name}.errors`)
}
// Track end-to-end metrics
await analytics.track('workflow_completed', {
goal: state.goal,
totalSteps: state.plan.length,
completedSteps: state.currentStep, // already incremented once per finished step
duration: Date.now() - state.context.startTime,
status: state.status
})
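The snippets above reference `executionTime` without showing where it comes from. One way to capture it is a wrapper that times any async step and reports a single metric whether the step succeeds or throws. This is a sketch: `timed`, `StepMetric`, and the injectable `now` clock are hypothetical, and `report` stands in for a call such as `analytics.track`.

```typescript
interface StepMetric {
  step: string
  durationMs: number
  success: boolean
}

// Hypothetical wrapper: time `run` and report one metric per execution.
// `now` is injectable so tests can use a fake clock.
async function timed<T>(
  step: string,
  run: () => Promise<T>,
  report: (metric: StepMetric) => void,
  now: () => number = Date.now
): Promise<T> {
  const startedAt = now()
  try {
    const result = await run()
    report({ step, durationMs: now() - startedAt, success: true })
    return result
  } catch (error) {
    report({ step, durationMs: now() - startedAt, success: false })
    throw error // timing must not swallow the failure
  }
}

// Example reporter that collects metrics in memory
const collectedMetrics: StepMetric[] = []
const collect = (metric: StepMetric): void => {
  collectedMetrics.push(metric)
}
```

Inside the executor, the call would look like `timed(step.name, () => step.action(state), collect)`, which keeps the measurement out of the step functions themselves.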
The Bottom Line
Building reliable workflow agents requires:
Architecture:
- Clear state management
- Well-defined steps
- Error handling at every level
- State persistence for recovery
Error Handling:
- Retry logic for transient failures
- Graceful degradation for non-critical steps
- Compensating actions for rollback
- Human escalation for edge cases
Testing:
- Unit tests for individual steps
- Integration tests for full workflows
- Chaos testing for error scenarios
Monitoring:
- Track execution at step level
- Alert on failures
- Measure performance
- Log for debugging
Investment: roughly 3-6 weeks to build a robust workflow agent system
Returns: end-to-end automation of complex business processes, with 80-95% of workflows completing successfully
Next Steps
- Map your workflow: Document the steps needed end-to-end
- Identify decision points: Where might errors occur? Where's human judgment needed?
- Design error handling: How should each failure type be handled?
- Build incrementally: Start with happy path, add error handling, then edge cases
- Test thoroughly: Simulate failures, test recovery, validate idempotency
Remember: The first version doesn't need to handle every edge case. Start simple, deploy, learn from real usage, and iterate.
About the Author
DomAIn Labs Team
The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.