Building Multi-Step Workflow Agents That Don't Break

DomAIn Labs Team
January 22, 2025
14 min read

A simple AI agent answers questions. A powerful AI agent executes workflows.

Imagine an agent that doesn't just respond to "I need to return this order"—it checks the order status, validates return eligibility, generates a return label, emails it to the customer, updates inventory, and schedules the refund. All automatically.

That's a multi-step workflow agent. And building one that actually works reliably requires proper architecture.

This guide shows you how to build workflow agents that handle complexity, recover from errors, and don't leave customers in broken states.

What Makes Workflow Agents Different

Simple Agent (Single Step)

// One action, done
async function simpleAgent(input: string) {
  const response = await llm.generate(input)
  return response
}

Use case: Answer FAQ, provide information, classify data

Workflow Agent (Multi-Step)

// Multiple coordinated actions with state
async function workflowAgent(input: string) {
  let state = initializeState()

  // Step 1: Understand intent
  state = await planWorkflow(input, state)

  // Step 2-N: Execute each step
  for (const step of state.plan) {
    state = await executeStep(step, state)

    if (state.status === 'failed') {
      state = await handleError(state)
    }

    if (state.status === 'completed') break
  }

  return state.result
}

Use case: Process returns, onboard customers, fulfill orders, escalate support tickets

Core Architecture Pattern

Here's the reliable pattern for workflow agents:

interface WorkflowState {
  // What we're trying to accomplish
  goal: string

  // Current execution plan
  plan: WorkflowStep[]

  // What step are we on?
  currentStep: number

  // Data accumulated during workflow
  context: Record<string, any>

  // Errors encountered
  errors: Error[]

  // Final result
  result?: any

  // Status
  status: 'planning' | 'executing' | 'requires_human' | 'completed' | 'failed'
}

interface WorkflowStep {
  name: string
  action: (state: WorkflowState) => Promise<WorkflowState>
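  requiresHuman?: boolean
  retryable?: boolean
  timeout?: number  // Milliseconds
  // Optional undo action, run if a later step fails (see Compensating Actions)
  compensation?: (state: WorkflowState) => Promise<WorkflowState>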
}

Why this structure?

  • State: Everything needed to resume if interrupted
  • Plan: Clear sequence of actions (debuggable, modifiable)
  • Context: Data flows between steps
  • Errors: Tracked for recovery and reporting
  • Status: Know exactly where we are
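
One way to keep every step consistent with this structure is to route all state changes through a few small helpers. The sketch below is illustrative only: `SketchState`, `createState`, `advance`, and `fail` are hypothetical names that do not appear elsewhere in this guide, and errors are stored as strings (an assumption) so the state stays JSON-serializable.

```typescript
// Simplified stand-in for WorkflowState (hypothetical). Errors are kept as
// strings so the whole object survives JSON.stringify.
type SketchStatus = 'planning' | 'executing' | 'completed' | 'failed'

interface SketchState {
  goal: string
  currentStep: number
  context: Record<string, unknown>
  errors: string[]
  status: SketchStatus
}

// Build the starting state for a workflow run
function createState(goal: string, context: Record<string, unknown>): SketchState {
  return { goal, currentStep: 0, context, errors: [], status: 'executing' }
}

// Merge new data into context and move to the next step, without mutating the input
function advance(state: SketchState, patch: Record<string, unknown>): SketchState {
  return {
    ...state,
    context: { ...state.context, ...patch },
    currentStep: state.currentStep + 1
  }
}

// Record an error and mark the workflow as failed; currentStep stays on the failing step
function fail(state: SketchState, error: unknown): SketchState {
  const message = error instanceof Error ? error.message : String(error)
  return { ...state, errors: [...state.errors, message], status: 'failed' }
}
```

Because each helper returns a new object, a saved copy of an earlier state is never changed by later steps, which is what makes resuming from a snapshot safe.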

Real Example: Order Return Workflow

Let's build a complete return processing agent.

Step 1: Define the Workflow

const returnOrderWorkflow: WorkflowStep[] = [
  {
    name: 'validate_order',
    action: validateOrder,
    retryable: false,
    timeout: 5000
  },
  {
    name: 'check_return_eligibility',
    action: checkReturnEligibility,
    retryable: true,
    timeout: 5000
  },
  {
    name: 'generate_return_label',
    action: generateReturnLabel,
    retryable: true,
    timeout: 10000
  },
  {
    name: 'send_confirmation_email',
    action: sendConfirmationEmail,
    retryable: true,
    timeout: 5000
  },
  {
    name: 'update_inventory',
    action: updateInventory,
    retryable: true,
    timeout: 5000
  },
  {
    name: 'schedule_refund',
    action: scheduleRefund,
    retryable: true,
    timeout: 5000
  }
]

Step 2: Implement Each Step

async function validateOrder(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { orderNumber } = state.context

    // Query order database
    const order = await db.orders.findUnique({
      where: { orderNumber }
    })

    if (!order) {
      throw new Error(`Order ${orderNumber} not found`)
    }

    // Add order data to context for next steps
    return {
      ...state,
      context: {
        ...state.context,
        order: order,
        customerId: order.customerId,
        orderDate: order.createdAt
      },
      currentStep: state.currentStep + 1
    }

  } catch (error) {
    return {
      ...state,
      errors: [...state.errors, error],
      status: 'failed'
    }
  }
}

async function checkReturnEligibility(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { order, returnReason } = state.context

    // Calculate days since purchase
    const daysSincePurchase = daysBetween(order.createdAt, new Date())

    // Check return window (60 days)
    if (daysSincePurchase > 60) {
      // Escalate to human for approval
      return {
        ...state,
        status: 'requires_human',
        context: {
          ...state.context,
          escalationReason: 'Return requested outside 60-day window',
          escalationMessage: `Order from ${order.createdAt} is ${daysSincePurchase} days old. Requires manager approval.`
        }
      }
    }

    // Check if items are returnable
    const nonReturnableItems = order.items.filter(
      item => item.category === 'final_sale'
    )

    if (nonReturnableItems.length > 0) {
      throw new Error(
        `Order contains non-returnable items: ${nonReturnableItems.map(i => i.name).join(', ')}`
      )
    }

    // Eligible - continue
    return {
      ...state,
      context: {
        ...state.context,
        eligible: true,
        returnWindow: 60 - daysSincePurchase  // Days remaining
      },
      currentStep: state.currentStep + 1
    }

  } catch (error) {
    return {
      ...state,
      errors: [...state.errors, error],
      status: 'failed'
    }
  }
}

async function generateReturnLabel(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { order } = state.context

    // Call shipping API
    const label = await shippingAPI.createReturnLabel({
      orderId: order.id,
      fromAddress: order.shippingAddress,
      toAddress: WAREHOUSE_ADDRESS,
      weight: calculateWeight(order.items),
      serviceLevel: 'ground'
    })

    return {
      ...state,
      context: {
        ...state.context,
        returnLabel: label,
        trackingNumber: label.trackingNumber,
        labelUrl: label.pdfUrl
      },
      currentStep: state.currentStep + 1
    }

  } catch (error) {
    // Shipping API failure - retry logic handled by executor
    return {
      ...state,
      errors: [...state.errors, error],
      status: 'failed'
    }
  }
}

async function sendConfirmationEmail(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { order, returnLabel, customerId } = state.context

    const customer = await db.customers.findUnique({
      where: { id: customerId }
    })

    await emailService.send({
      to: customer.email,
      template: 'return_confirmation',
      data: {
        customerName: customer.name,
        orderNumber: order.orderNumber,
        returnLabelUrl: returnLabel.pdfUrl,
        trackingNumber: returnLabel.trackingNumber,
        estimatedRefundDate: addDays(new Date(), 7)
      }
    })

    return {
      ...state,
      context: {
        ...state.context,
        emailSent: true,
        emailSentAt: new Date()
      },
      currentStep: state.currentStep + 1
    }

  } catch (error) {
    // Email failure shouldn't block workflow
    // Log error but continue
    console.error('Email send failed:', error)

    return {
      ...state,
      errors: [...state.errors, error],
      context: {
        ...state.context,
        emailSent: false,
        emailError: error instanceof Error ? error.message : String(error)
      },
      currentStep: state.currentStep + 1  // Continue anyway
    }
  }
}

async function updateInventory(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { order } = state.context

    // Mark items as returning
    await db.inventory.updateMany({
      where: {
        itemId: { in: order.items.map(item => item.id) }
      },
      data: {
        status: 'returning',
        expectedReturnDate: addDays(new Date(), 7)
      }
    })

    return {
      ...state,
      context: {
        ...state.context,
        inventoryUpdated: true
      },
      currentStep: state.currentStep + 1
    }

  } catch (error) {
    return {
      ...state,
      errors: [...state.errors, error],
      status: 'failed'
    }
  }
}

async function scheduleRefund(state: WorkflowState): Promise<WorkflowState> {
  try {
    const { order } = state.context

    // Schedule refund for 7 days from now (when return received)
    const refund = await paymentProcessor.scheduleRefund({
      orderId: order.id,
      amount: order.total,
      scheduledFor: addDays(new Date(), 7),
      reason: 'return_requested'
    })

    return {
      ...state,
      context: {
        ...state.context,
        refundScheduled: true,
        refundId: refund.id,
        refundAmount: refund.amount,
        refundDate: refund.scheduledFor
      },
      currentStep: state.currentStep + 1,
      status: 'completed',
      result: {
        success: true,
        message: 'Return processed successfully',
        trackingNumber: state.context.trackingNumber,
        refundAmount: refund.amount,
        refundDate: refund.scheduledFor
      }
    }

  } catch (error) {
    return {
      ...state,
      errors: [...state.errors, error],
      status: 'failed'
    }
  }
}
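
The steps above call `daysBetween` and `addDays`, and the tests later use `subDays`, but none of them is defined in this guide. Libraries such as date-fns provide `addDays` and `subDays`; `daysBetween` looks like a custom helper. A minimal sketch, assuming fixed 24-hour days (which ignores daylight-saving shifts), could look like this:

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Whole days elapsed from `start` to `end` (negative if `end` is earlier)
function daysBetween(start: Date, end: Date): number {
  return Math.floor((end.getTime() - start.getTime()) / MS_PER_DAY)
}

// New Date shifted forward by `days`; the input Date is not modified
function addDays(date: Date, days: number): Date {
  return new Date(date.getTime() + days * MS_PER_DAY)
}

// New Date shifted backward by `days`
function subDays(date: Date, days: number): Date {
  return addDays(date, -days)
}

// An order placed on 1 January 2025 and returned on 2 March 2025 is 60 days old
const ageInDays = daysBetween(
  new Date('2025-01-01T00:00:00Z'),
  new Date('2025-03-02T00:00:00Z')
)
console.log(ageInDays) // 60
```

With the `> 60` check used in `checkReturnEligibility`, an order that is exactly 60 days old is still inside the return window.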

Step 3: Build the Workflow Executor

This is the engine that runs workflows with error handling:

class WorkflowExecutor {
  async execute(
    workflow: WorkflowStep[],
    initialContext: Record<string, any>
  ): Promise<WorkflowState> {

    let state: WorkflowState = {
      goal: 'process_return',
      plan: workflow,
      currentStep: 0,
      context: initialContext,
      errors: [],
      status: 'executing'
    }

    // Save initial state for recovery
    await this.saveState(state)

    // Execute each step
    for (let i = 0; i < workflow.length; i++) {
      const step = workflow[i]

      console.log(`Executing step ${i + 1}/${workflow.length}: ${step.name}`)

      // Execute with retry logic
      state = await this.executeWithRetry(step, state)

      // Save state after each step (for recovery)
      await this.saveState(state)

      // Check if workflow should stop
      if (state.status === 'failed') {
        await this.handleFailure(state)
        break
      }

      if (state.status === 'requires_human') {
        await this.escalateToHuman(state)
        break
      }

      if (state.status === 'completed') {
        await this.handleSuccess(state)
        break
      }
    }

    return state
  }

  private async executeWithRetry(
    step: WorkflowStep,
    state: WorkflowState
  ): Promise<WorkflowState> {

    const maxRetries = step.retryable ? 3 : 1
    let lastError: Error | null = null

    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        // Execute with timeout
        const result = await this.executeWithTimeout(
          step.action(state),
          step.timeout || 30000
        )

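        // Steps catch their own errors and return status 'failed' rather than
        // throwing, so surface that as an exception for the retry loop below
        if (result.status === 'failed') {
          throw result.errors[result.errors.length - 1] ?? new Error(`Step ${step.name} failed`)
        }

        // Success - return result
        return result

      } catch (error) {
        // Normalize non-Error throwables so lastError is always an Error
        lastError = error instanceof Error ? error : new Error(String(error))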
        console.error(
          `Step ${step.name} failed (attempt ${attempt}/${maxRetries}):`,
          error
        )

        // If retryable and not last attempt, wait and retry
        if (step.retryable && attempt < maxRetries) {
          const backoff = Math.pow(2, attempt) * 1000  // Exponential backoff
          await this.sleep(backoff)
          continue
        }
      }
    }

    // All retries failed
    return {
      ...state,
      errors: [...state.errors, lastError!],
      status: 'failed'
    }
  }

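  private async executeWithTimeout<T>(
    promise: Promise<T>,
    timeout: number
  ): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined

    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Step timeout')), timeout)
    })

    try {
      return await Promise.race([promise, timeoutPromise])
    } finally {
      // Clear the timer so a finished step does not leave it pending
      clearTimeout(timer)
    }
  }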

  private async saveState(state: WorkflowState): Promise<void> {
    // Persist state to database for recovery
    await db.workflowStates.upsert({
      where: { id: state.context.workflowId },
      create: {
        id: state.context.workflowId,
        state: JSON.stringify(state),
        updatedAt: new Date()
      },
      update: {
        state: JSON.stringify(state),
        updatedAt: new Date()
      }
    })
  }

  private async handleFailure(state: WorkflowState): Promise<void> {
    // Log failure
    console.error('Workflow failed:', state.errors)

    // Notify monitoring
    await monitoring.alert({
      type: 'workflow_failure',
      workflowId: state.context.workflowId,
      goal: state.goal,
      failedStep: state.plan[state.currentStep]?.name,
      errors: state.errors.map(e => e.message)
    })

    // Notify customer if applicable
    if (state.context.customerId) {
      await this.notifyCustomerOfFailure(state)
    }
  }

  private async escalateToHuman(state: WorkflowState): Promise<void> {
    // Create support ticket for human review
    await db.tickets.create({
      data: {
        type: 'workflow_escalation',
        priority: 'high',
        workflowId: state.context.workflowId,
        reason: state.context.escalationReason,
        context: JSON.stringify(state.context),
        assignedTo: 'support_team'
      }
    })

    // Notify team
    await notificationService.send({
      channel: 'slack',
      message: `Workflow escalated: ${state.context.escalationMessage}`,
      data: state.context
    })
  }

  private async handleSuccess(state: WorkflowState): Promise<void> {
    console.log('Workflow completed successfully:', state.result)

    // Log success metric
    await analytics.track('workflow_success', {
      goal: state.goal,
      duration: Date.now() - state.context.startTime,
      stepsCompleted: state.currentStep  // Each successful step already incremented this
    })
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms))
  }
}
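
The executor above waits `2^attempt * 1000` ms between attempts, with no upper bound and no randomness. If many workflows fail at the same moment, they will all retry at the same moment too. A common refinement is to cap the delay and add jitter. The function below is a sketch under those assumptions: `backoffDelay`, its parameter names, and the 30-second cap are illustrative choices, not part of the executor.

```typescript
// Delay in ms before the next attempt. With the defaults this reproduces the
// executor's formula (2s, 4s, 8s, ...) but never exceeds maxMs.
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = () => 1 // pass Math.random to enable jitter
): number {
  const exponential = Math.min(maxMs, Math.pow(2, attempt) * baseMs)
  // "Full jitter": scale the delay by a random factor between 0 and 1
  return Math.floor(exponential * random())
}

console.log(backoffDelay(1))  // 2000
console.log(backoffDelay(2))  // 4000
console.log(backoffDelay(10)) // 30000 (capped)
```

Passing the random source as a parameter keeps the function deterministic in tests while still allowing `backoffDelay(attempt, 1000, 30000, Math.random)` in production code.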

Step 4: Usage

// Initialize executor
const executor = new WorkflowExecutor()

// Process a return
const result = await executor.execute(
  returnOrderWorkflow,
  {
    workflowId: generateId(),
    orderNumber: 'ORD-12345',
    returnReason: 'Changed mind',
    startTime: Date.now()
  }
)

if (result.status === 'completed') {
  console.log('Return processed:', result.result)
} else if (result.status === 'failed') {
  console.error('Return failed:', result.errors)
} else if (result.status === 'requires_human') {
  console.log('Escalated to human:', result.context.escalationReason)
}

Error Handling Strategies

1. Graceful Degradation

Not all failures should stop the workflow:

// Critical step - failure stops workflow
async function processPayment(state: WorkflowState) {
  try {
    const result = await paymentAPI.charge(state.context.amount)
    return { ...state, context: { ...state.context, paymentId: result.id } }
  } catch (error) {
    return { ...state, status: 'failed', errors: [...state.errors, error] }
  }
}

// Non-critical step - failure logged but workflow continues
async function sendReceipt(state: WorkflowState) {
  try {
    await emailService.send(state.context.receipt)
    return { ...state, context: { ...state.context, receiptSent: true } }
  } catch (error) {
    console.error('Receipt send failed:', error)
    // Continue anyway - we can resend later
    return {
      ...state,
      errors: [...state.errors, error],
      context: { ...state.context, receiptSent: false }
    }
  }
}

2. Compensating Actions (Rollback)

When a late step fails, undo earlier steps:

const workflowWithCompensation: WorkflowStep[] = [
  {
    name: 'reserve_inventory',
    action: reserveInventory,
    compensation: releaseInventory  // Undo if workflow fails
  },
  {
    name: 'charge_customer',
    action: chargeCustomer,
    compensation: refundCustomer
  },
  {
    name: 'ship_order',
    action: shipOrder,
    compensation: cancelShipment
  }
]

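async function executeWithCompensation(workflow: WorkflowStep[], initial: WorkflowState) {
  const completedSteps: WorkflowStep[] = []
  let state = initial

  for (const step of workflow) {
    try {
      state = await step.action(state)
      // Steps may report failure via status instead of throwing
      if (state.status === 'failed') {
        throw state.errors[state.errors.length - 1] ?? new Error(`Step ${step.name} failed`)
      }
      completedSteps.push(step)
    } catch (error) {
      // Failure - rollback completed steps in reverse order
      console.error(`Step ${step.name} failed. Rolling back...`)

      // Copy before reversing so completedSteps itself is not mutated
      for (const completedStep of [...completedSteps].reverse()) {
        if (completedStep.compensation) {
          try {
            state = await completedStep.compensation(state)
          } catch (rollbackError) {
            // Keep rolling back the remaining steps; this one needs manual cleanup
            console.error(`Compensation for ${completedStep.name} failed:`, rollbackError)
          }
        }
      }

      throw error
    }
  }

  return state
}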

3. Human-in-the-Loop

Some decisions require human judgment:

async function requiresHumanReview(state: WorkflowState): Promise<WorkflowState> {
  // Check if automated decision is confident
  const confidence = await mlModel.predictConfidence(state.context.data)

  if (confidence < 0.8) {
    // Low confidence - escalate
    return {
      ...state,
      status: 'requires_human',
      context: {
        ...state.context,
        reviewReason: `Model confidence ${confidence} below threshold`,
        reviewData: state.context.data
      }
    }
  }

  // High confidence - proceed automatically
  return {
    ...state,
    currentStep: state.currentStep + 1
  }
}
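
The snippet above pauses the workflow, but the guide does not show how a reviewer's answer gets back into it. One possible shape is a pure function that folds the decision into the paused state. Everything here is hypothetical (`ReviewState`, `HumanDecision`, `applyHumanDecision`), and it assumes that an approval stands in for the rest of the step that escalated, so execution resumes at the following step.

```typescript
type ReviewStatus = 'executing' | 'requires_human' | 'failed'

// Simplified stand-in for WorkflowState
interface ReviewState {
  currentStep: number
  status: ReviewStatus
  context: Record<string, unknown>
  errors: string[]
}

interface HumanDecision {
  approved: boolean
  reviewer: string
  note?: string
}

// Fold a reviewer's decision back into a paused workflow state
function applyHumanDecision(state: ReviewState, decision: HumanDecision): ReviewState {
  if (state.status !== 'requires_human') {
    throw new Error(`Workflow is not awaiting review (status: ${state.status})`)
  }

  // Keep the decision in context so later steps and audits can see it
  const context = { ...state.context, review: decision }

  if (!decision.approved) {
    return {
      ...state,
      context,
      status: 'failed',
      errors: [...state.errors, `Rejected by ${decision.reviewer}`]
    }
  }

  // Approved: move past the step that escalated and resume execution
  return { ...state, context, status: 'executing', currentStep: state.currentStep + 1 }
}
```

The resumed state can then be handed to whatever resume mechanism the system uses. The status guard makes a duplicate approval (for example, a double click in the review tool) fail loudly instead of skipping an extra step.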

State Management Best Practices

1. Idempotency

Steps should be safe to retry:

// BAD - not idempotent
async function sendEmail(state: WorkflowState) {
  await emailService.send(state.context.email)  // Sends duplicate if retried
}

// GOOD - idempotent
async function sendEmail(state: WorkflowState) {
  // Check if already sent
  if (state.context.emailSent) {
    return state  // Skip if already done
  }

  await emailService.send(state.context.email)

  return {
    ...state,
    context: {
      ...state.context,
      emailSent: true,
      emailSentAt: new Date()
    }
  }
}
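
The `emailSent` flag above only helps if the updated state was saved before the retry. If the process crashes between sending the email and persisting state, the flag is lost. A complementary approach is to record each side effect under an idempotency key. The sketch below is an in-memory illustration only: `runOnce`, `idempotencyKey`, and the `Map` store are hypothetical, and a real system would need a durable store (a database table, or an idempotency key passed to the external API if it supports one).

```typescript
// Side effects that are in flight or already finished, keyed by idempotency key.
// In-memory only: this does not survive a process restart.
const effects = new Map<string, Promise<unknown>>()

// Stable key for "this step of this workflow run"
function idempotencyKey(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`
}

// Run `effect` at most once per key. Concurrent and later callers share the
// same promise. A failed effect is forgotten so a retry can run it again.
function runOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
  const existing = effects.get(key)
  if (existing) {
    return existing as Promise<T>
  }

  const pending = effect().catch((error: unknown) => {
    effects.delete(key)
    throw error
  })
  effects.set(key, pending)
  return pending
}
```

Inside a step this might be used as `await runOnce(idempotencyKey(workflowId, 'send_confirmation_email'), () => emailService.send(...))`, so a retried step reuses the first result instead of sending a second email.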

2. State Persistence

Save state after every step for recovery:

// If workflow crashes mid-execution, we can resume
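async function resumeWorkflow(workflowId: string) {
  // Load saved state from database
  const savedState = await db.workflowStates.findUnique({
    where: { id: workflowId }
  })

  if (!savedState) {
    throw new Error(`No saved state for workflow ${workflowId}`)
  }

  const state: WorkflowState = JSON.parse(savedState.state)

  // JSON.stringify drops functions, so the saved plan has no `action`
  // callbacks. Rebuild the remaining steps from the workflow definition
  // in code, using the saved step index.
  const remainingSteps = returnOrderWorkflow.slice(state.currentStep)

  // Resume from where we left off. Note that execute() starts a fresh state,
  // so currentStep restarts at 0 relative to the remaining steps.
  const executor = new WorkflowExecutor()
  return executor.execute(remainingSteps, state.context)
}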
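
A detail worth planning for: `JSON.stringify` silently drops function properties and turns `Error` objects into `{}`, so a state that holds step functions or raw errors will not round-trip. One way around this is to persist only plain data plus the workflow's name, and look the steps up again on load. The sketch below assumes a hypothetical `workflowRegistry` and simplified `StepDef` and `PersistedRun` types; the placeholder actions do nothing.

```typescript
// A step as defined in code: the action is a function and cannot be serialized
interface StepDef {
  name: string
  action: (context: Record<string, unknown>) => Promise<Record<string, unknown>>
}

// What actually gets written to storage: plain data only
interface PersistedRun {
  workflowName: string
  currentStep: number
  context: Record<string, unknown>
  errors: string[] // messages, not Error objects
}

// Hypothetical registry mapping a workflow name to its steps (placeholder actions)
const workflowRegistry = new Map<string, StepDef[]>([
  ['process_return', [
    { name: 'validate_order', action: async (ctx) => ctx },
    { name: 'check_return_eligibility', action: async (ctx) => ctx },
    { name: 'generate_return_label', action: async (ctx) => ctx }
  ]]
])

function serializeRun(run: PersistedRun): string {
  return JSON.stringify(run)
}

// Parse a saved run and look up the steps that still need to execute
function restoreRun(json: string): { run: PersistedRun; remaining: StepDef[] } {
  const run = JSON.parse(json) as PersistedRun
  const steps = workflowRegistry.get(run.workflowName)
  if (!steps) {
    throw new Error(`Unknown workflow: ${run.workflowName}`)
  }
  return { run, remaining: steps.slice(run.currentStep) }
}

// Functions do not survive serialization:
console.log(JSON.stringify({ name: 'validate_order', action: () => 1 }))
// {"name":"validate_order"}
```

Dates in `context` have the same problem in a milder form: they come back as ISO strings, so steps that do date arithmetic on restored context need to convert them with `new Date(...)` first.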

3. Timeout Management

Prevent indefinite waiting:

const step: WorkflowStep = {
  name: 'call_external_api',
  action: callExternalAPI,
  timeout: 10000,  // 10 seconds max
  retryable: true
}

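// If the API call takes longer than 10 seconds, the executor rejects with a timeout and retries.
// The timed-out call is not cancelled and may still finish, so the action should be idempotent.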

Testing Workflow Agents

Unit Test Individual Steps

describe('validateOrder', () => {
  // Build a complete WorkflowState so the step can read every field it spreads
  const makeState = (context: Record<string, any>): WorkflowState => ({
    goal: 'process_return',
    plan: [],
    currentStep: 0,
    context,
    errors: [],
    status: 'executing'
  })

  it('should add order to context if found', async () => {
    const state = makeState({ orderNumber: 'ORD-123' })

    const result = await validateOrder(state)

    expect(result.context.order).toBeDefined()
    expect(result.currentStep).toBe(state.currentStep + 1)
  })

  it('should fail if order not found', async () => {
    const state = makeState({ orderNumber: 'INVALID' })

    const result = await validateOrder(state)

    expect(result.status).toBe('failed')
    expect(result.errors).toHaveLength(1)
  })
})

Integration Test Full Workflows

describe('returnOrderWorkflow', () => {
  // Shared by every test in this suite
  const executor = new WorkflowExecutor()

  it('should complete successfully for valid return', async () => {

    const result = await executor.execute(returnOrderWorkflow, {
      workflowId: 'test-123',
      orderNumber: 'ORD-VALID',
      returnReason: 'Defective'
    })

    expect(result.status).toBe('completed')
    expect(result.result.success).toBe(true)
    expect(result.result.trackingNumber).toBeDefined()
  })

  it('should escalate returns outside 60-day window', async () => {
    // Order from 90 days ago
    await db.orders.create({
      data: {
        orderNumber: 'ORD-OLD',
        createdAt: subDays(new Date(), 90)
      }
    })

    const result = await executor.execute(returnOrderWorkflow, {
      workflowId: 'test-124',
      orderNumber: 'ORD-OLD',
      returnReason: 'Changed mind'
    })

    expect(result.status).toBe('requires_human')
    expect(result.context.escalationReason).toContain('outside 60-day window')
  })
})
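
The tests above cover the happy path and the escalation path, but not transient failures. To exercise retry logic, a dependency can be wrapped so that it fails a fixed number of times before succeeding. `failFirst` below is a hypothetical helper, not a function from any test framework.

```typescript
// Wrap an async function so its first `failures` calls reject,
// after which calls are passed through to the real function
function failFirst<A extends unknown[], R>(
  failures: number,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  let calls = 0
  return async (...args: A): Promise<R> => {
    calls += 1
    if (calls <= failures) {
      throw new Error(`Injected failure ${calls}/${failures}`)
    }
    return fn(...args)
  }
}

// Example: a fake label service that fails twice, then works
const createLabel = failFirst(2, async (orderId: string) => ({
  trackingNumber: `TRACK-${orderId}`
}))
```

In a test, the wrapped function would replace the real client (for example via the test framework's mocking facility) before running the workflow. With two injected failures and three attempts allowed, the step should succeed on the last attempt. With three injected failures, the workflow should end in `failed`.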

Monitoring & Observability

Track workflow performance:

// Log every step execution
await analytics.track('workflow_step_executed', {
  workflowId: state.context.workflowId,
  step: step.name,
  duration: executionTime,
  success: !state.errors.length
})

// Monitor common failure points
if (state.errors.length > 0) {
  await monitoring.increment(`workflow.step.${step.name}.errors`)
}

// Track end-to-end metrics
await analytics.track('workflow_completed', {
  goal: state.goal,
  totalSteps: state.plan.length,
  completedSteps: state.currentStep,  // Incremented once per successful step
  duration: Date.now() - state.context.startTime,
  status: state.status
})
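
Raw events become useful once they are aggregated per step, because that shows which step fails most often and which one is slowest. The sketch below assumes a hypothetical `StepEvent` shape loosely based on the `workflow_step_executed` payload above. In practice this aggregation usually happens in the analytics or monitoring tool rather than in application code.

```typescript
interface StepEvent {
  step: string
  durationMs: number
  success: boolean
}

interface StepStats {
  runs: number
  failures: number
  failureRate: number   // failures / runs, between 0 and 1
  avgDurationMs: number
}

// Aggregate step-level events into per-step failure rate and average duration
function summarizeSteps(events: StepEvent[]): Map<string, StepStats> {
  const totals = new Map<string, { runs: number; failures: number; totalMs: number }>()

  for (const event of events) {
    const current = totals.get(event.step) ?? { runs: 0, failures: 0, totalMs: 0 }
    totals.set(event.step, {
      runs: current.runs + 1,
      failures: current.failures + (event.success ? 0 : 1),
      totalMs: current.totalMs + event.durationMs
    })
  }

  const stats = new Map<string, StepStats>()
  totals.forEach((t, step) => {
    stats.set(step, {
      runs: t.runs,
      failures: t.failures,
      failureRate: t.failures / t.runs,
      avgDurationMs: t.totalMs / t.runs
    })
  })
  return stats
}
```

A step with a high failure rate but a low average duration usually points at validation or data problems, while a high average duration suggests a slow dependency whose timeout may need tuning.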

The Bottom Line

Building reliable workflow agents requires:

Architecture:

  • Clear state management
  • Well-defined steps
  • Error handling at every level
  • State persistence for recovery

Error Handling:

  • Retry logic for transient failures
  • Graceful degradation for non-critical steps
  • Compensating actions for rollback
  • Human escalation for edge cases

Testing:

  • Unit tests for individual steps
  • Integration tests for full workflows
  • Chaos testing for error scenarios

Monitoring:

  • Track execution at step level
  • Alert on failures
  • Measure performance
  • Log for debugging

Investment: 3-6 weeks to build a robust workflow agent system

Returns: Automate complex business processes end-to-end, 80-95% success rate

Next Steps

  1. Map your workflow: Document the steps needed end-to-end
  2. Identify decision points: Where might errors occur? Where's human judgment needed?
  3. Design error handling: How should each failure type be handled?
  4. Build incrementally: Start with happy path, add error handling, then edge cases
  5. Test thoroughly: Simulate failures, test recovery, validate idempotency

Need help building a workflow agent? Schedule a consultation to discuss your specific use case, or check out our case studies to see workflow agents in action.

Remember: The first version doesn't need to handle every edge case. Start simple, deploy, learn from real usage, and iterate.

Tags: workflows, state management, error handling, architecture, reliability

About the Author

DomAIn Labs Team

The DomAIn Labs team consists of AI engineers, strategists, and educators passionate about demystifying AI for small businesses.