Moving AI workflows from prototype to production requires careful planning. Here’s everything you need to know to build reliable, maintainable workflows that scale.

1. Error Handling and Retries

The most important principle: assume everything will fail.

Configure Retry Logic

Every external API call can fail. Configure appropriate retry behavior:

steps:
  - type: http_call
    url: https://api.example.com/data
    retries: 3
    backoff_seconds: 5
    drop_on_failure: false

Best Practices:

  • Set retries: 3 for most external APIs
  • Use exponential backoff to avoid overwhelming services
  • Decide if failures should stop the workflow or continue
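
A fixed delay can hammer a service that is already struggling, which is why the second bullet matters. A sketch of what an exponential-backoff configuration might look like is below; note that backoff_strategy and max_backoff_seconds are illustrative field names, so check your platform's retry options for the exact syntax:

steps:
  - type: http_call
    url: https://api.example.com/data
    retries: 3
    backoff_seconds: 5             # First retry after 5s
    backoff_strategy: exponential  # Then 10s, then 20s (hypothetical field)
    max_backoff_seconds: 60        # Cap the longest delay (hypothetical field)
    drop_on_failure: false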

Handle Different Failure Modes

steps:
  - type: openai_completion
    api_key: ${env:OPENAI_API_KEY}
    model: gpt-4
    swallow_on_error: true  # Continue even if this fails
    raw_on_error: true       # Keep the error details

  - type: if_else
    if:
      conditions:
        - field: openai.error
          op: exists
      steps:
        - type: slack_webhook
          text: "AI processing failed: ${openai.error}"
    else:
      steps: []

2. Environment Variables Management

Never hardcode secrets. Ever.

Secure Configuration

steps:
  - type: openai_completion
    api_key: ${env:OPENAI_API_KEY}  # ✅ Good
    # api_key: sk-abc123              # ❌ Never do this

  - type: salesforce
    username: ${env:SF_USERNAME}
    password: ${env:SF_PASSWORD}
    security_token: ${env:SF_TOKEN}

Environment-Specific Settings

Use different variables for staging and production:

steps:
  - type: http_call
    url: ${env:API_BASE_URL}/users  # Different per environment

Pro Tip: Use a naming convention like PROD_ prefix for production secrets.
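
For example, the same step can point at stage-specific variables (the variable names here are illustrative):

# Staging workflow
- type: http_call
  url: ${env:STAGING_API_BASE_URL}/users

# Production workflow
- type: http_call
  url: ${env:PROD_API_BASE_URL}/users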

3. State Management

Design your state structure carefully—it’s the backbone of your workflow.

Keep State Clean

steps:
  - type: http_call
    url: https://api.example.com/users
    output_to: raw_users  # Store in specific location

  - type: jq
    query: '[.[] | {id, name, email}]'
    input_from: raw_users
    output_to: users  # Clean, filtered data

Avoid State Pollution

Don’t dump everything into the root state:

# ❌ Bad - pollutes state
- type: http_call
  inject: true  # Spreads all fields into root

# ✅ Good - organised
- type: http_call
  output_to: api_response

4. Monitoring and Observability

You can’t fix what you can’t see.

Add Checkpoints

Use the print step strategically:

steps:
  - type: http_call
    url: https://api.example.com/data
    output_to: api_data

  - type: print
    prefix: "[Checkpoint: After API]"
    fields:
      - api_data.status
      - api_data.record_count

  - type: for_each
    input_from: api_data.records
    var: record
    steps:
      - type: openai_completion
        prompt: "Process: ${record}"

Track Important Metrics

steps:
  - type: add_timestamp
    output_to: metrics.start_time

  # ... your workflow steps ...

  - type: add_timestamp
    output_to: metrics.end_time

  # Derive the duration from the two recorded timestamps
  - type: python_function
    code: |
      def process(event):
          m = event['metrics']
          m['duration'] = m['end_time'] - m['start_time']
          return event
    handler: process

  - type: http_call
    url: https://api.yourapp.com/metrics
    method: POST
    body:
      workflow: ${workflow.name}
      duration: ${metrics.duration}
      records_processed: ${state.count}

5. Input Validation

Validate input early: bad data caught at the start of a workflow is far cheaper to handle than a failure halfway through.

Use Filter Steps

steps:
  - type: filter
    conditions:
      - field: email
        operator: contains
        value: "@"
      - field: age
        operator: gte
        value: 18
    drop_on_failure: true  # Reject invalid input

Validate Required Fields

steps:
  - type: if_else
    if:
      conditions:
        - field: input.email
          op: exists
        - field: input.name
          op: exists
      steps:
        # Process the data
        - type: print
          text: "Valid input received"
    else:
      steps:
        - type: slack_webhook
          text: "Invalid input received: ${input}"
        - type: http_call
          url: https://api.yourapp.com/errors

6. Testing Strategies

Test workflows thoroughly before deploying to production.

Test with Sample Data

Create test events that cover:

  • Happy path scenarios
  • Edge cases (empty arrays, null values)
  • Error scenarios (API failures, timeouts)
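
A small set of sample events along these lines covers all three cases (the field names are placeholders for your own schema):

# Happy path
happy_event:
  email: "alice@example.com"
  name: "Alice"
  records:
    - {id: 1}
    - {id: 2}

# Edge case: empty array, missing optional field
edge_event:
  email: "bob@example.com"
  records: []

# Error scenario: malformed input
error_event:
  email: null
  name: ""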

Use Conditional Logic for Testing

steps:
  # Add environment marker to event during testing
  - type: python_function
    code: |
      def process(event):
          import os
          event['is_test'] = os.getenv('etlr_stage') == 'test'
          return event
    handler: process

  - type: if_else
    if:
      conditions:
        - field: is_test
          op: eq
          value: true
      steps:
        - type: print
          prefix: "[TEST MODE]"
          text: "Running in test mode"
    else:
      steps:
        - type: slack_webhook
          webhook_url: ${env:SLACK_WEBHOOK}
          text: "Production workflow executed"

7. Performance Optimisation

Make your workflows fast and efficient.

Batch Operations

Instead of calling APIs in a loop:

# ❌ Slow - many API calls
- type: for_each
  input_from: users
  var: user
  steps:
    - type: http_call
      url: https://api.example.com/user/${user.id}

# ✅ Fast - batch request
- type: jq
  query: '[.[] | .id]'
  input_from: users
  output_to: user_ids

- type: http_call
  url: https://api.example.com/users/batch
  method: POST
  body:
    ids: ${user_ids}

Parallel Processing

Use for_each with an isolated item scope so each item is processed independently and one failing record doesn't abort the rest of the batch:

- type: for_each
  input_from: records
  var: record
  item_scope: isolated  # Isolate each item's processing
  steps:
    - type: openai_completion
      prompt: "Analyse: ${record}"

8. Documentation

Document your workflows as you build them.

Add Comments

name: customer-onboarding

# This workflow runs when a new customer signs up
# It enriches their data and sends welcome emails

steps:
  # Fetch company data from Clearbit
  - type: http_call
    url: https://api.clearbit.com/v2/companies/find
    # ... config ...

  # Generate personalised welcome message using AI
  - type: openai_completion
    # ... config ...

Maintain a Changelog

Track changes to production workflows:

# v1.2.0 - 2024-11-25
# - Added error handling for API timeouts
# - Improved AI prompt for better results
# - Added metrics tracking

9. Cost Management

AI workflows can get expensive. Monitor and optimise costs.

Choose the Right Model

steps:
  # Use cheaper models for simple tasks
  - type: openai_completion
    model: gpt-3.5-turbo  # Cost-effective
    prompt: "Summarise in one sentence: ${input}"

  # Reserve expensive models for complex tasks
  - type: openai_completion
    model: gpt-4  # More capable but costly
    prompt: "Analyse legal implications of: ${contract}"

Cache Expensive Operations

steps:
  - type: http_call
    url: https://api.yourapp.com/cache/${input.id}
    output_to: cached_result

  - type: if_else
    if:
      conditions:
        - field: cached_result.exists
          op: eq
          value: true
      steps:
        - type: print
          text: "Using cached result"
    else:
      steps:
        - type: openai_completion
          api_key: ${env:OPENAI_API_KEY}
          model: gpt-4
          prompt: "Process: ${input.data}"
          output_to: ai_result

        # Write the result back so future runs hit the cache
        - type: http_call
          url: https://api.yourapp.com/cache/${input.id}
          method: POST
          body: ${ai_result}

10. Security Best Practices

Protect your workflows and data.

Keep Webhook URLs Private

# Your webhook URL contains a security hash
# https://pipeline.etlr.io/v1/webhooks/{org_id}/{workflow_id}/{security_hash}

# Keep this URL secret - treat it like a password
input:
  type: http_webhook

Sanitise User Input

steps:
  - type: jq
    query: |
      {
        email: .email | gsub("[^a-zA-Z0-9@._-]"; ""),
        name: .name | gsub("[^a-zA-Z0-9 ]"; "")
      }
    input_from: input
    output_to: sanitised

Limit Sensitive Data Exposure

steps:
  - type: print
    fields:
      - user.id
      - user.email
    # Don't print: user.password_hash, user.api_key

Production Checklist

Before deploying to production, ensure:

  • ✅ All secrets use environment variables
  • ✅ Error handling configured for all external calls
  • ✅ Monitoring and alerting in place
  • ✅ Input validation implemented
  • ✅ Tested with production-like data
  • ✅ Documentation complete
  • ✅ Team has access to logs and traces
  • ✅ Rollback plan documented

Conclusion

Building production AI workflows requires attention to detail, but following these best practices will save you countless hours of debugging and maintenance.

Start with these patterns, iterate based on your needs, and your workflows will be reliable, maintainable, and ready to scale.

Ready to build production-grade AI workflows? Get started with ETLR and leverage our built-in best practices.

Additional Resources