Moving AI workflows from prototype to production requires careful planning. Here’s everything you need to know to build reliable, maintainable workflows that scale.
1. Error Handling and Retries
The most important principle: assume everything will fail.
Configure Retry Logic
Every external API call can fail. Configure appropriate retry behavior:
```yaml
steps:
  - type: http_call
    url: https://api.example.com/data
    retries: 3
    backoff_seconds: 5
    drop_on_failure: false
```
Best Practices:
- Set `retries: 3` for most external APIs
- Use exponential backoff to avoid overwhelming services
- Decide whether failures should stop the workflow or continue
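As a rough sketch of what the exponential-backoff recommendation means in practice, here is the delay schedule in plain Python. The doubling formula, the cap, and the 10% jitter are illustrative assumptions, not documented engine behaviour; the `retries: 3` and `backoff_seconds: 5` values match the config above.

```python
import random

def backoff_delays(base_seconds=5, retries=3, cap=60):
    """Exponential backoff: base * 2^attempt, capped, plus small jitter."""
    delays = []
    for attempt in range(retries):
        delay = min(base_seconds * (2 ** attempt), cap)
        # Up to 10% jitter so concurrent retries don't all fire at once
        delays.append(delay + random.uniform(0, delay * 0.1))
    return delays
```

With the defaults this produces waits of roughly 5, 10, and 20 seconds, so three retries spread load instead of hammering a struggling service.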
Handle Different Failure Modes
```yaml
steps:
  - type: openai_completion
    api_key: ${env:OPENAI_API_KEY}
    model: gpt-4
    swallow_on_error: true  # Continue even if this fails
    raw_on_error: true      # Keep the error details
  - type: if_else
    if:
      conditions:
        - field: openai.error
          op: exists
      steps:
        - type: slack_webhook
          text: "AI processing failed: ${openai.error}"
    else:
      steps: []
```
2. Environment Variables Management
Never hardcode secrets. Ever.
Secure Configuration
```yaml
steps:
  - type: openai_completion
    api_key: ${env:OPENAI_API_KEY}  # ✅ Good
    # api_key: sk-abc123            # ❌ Never do this
  - type: salesforce
    username: ${env:SF_USERNAME}
    password: ${env:SF_PASSWORD}
    security_token: ${env:SF_TOKEN}
```
Environment-Specific Settings
Use different variables for staging and production:
```yaml
steps:
  - type: http_call
    url: ${env:API_BASE_URL}/users  # Different per environment
```
Pro Tip: Use a naming convention such as a `PROD_` prefix for production secrets.
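The prefix convention can be sketched as a small lookup helper; this is illustrative Python, not an ETLR feature, and the `PROD_` prefix and variable names are assumptions:

```python
import os

def resolve_secret(name, stage="PROD"):
    """Look up a stage-prefixed secret (e.g. PROD_OPENAI_API_KEY),
    falling back to the unprefixed name if no stage variant exists."""
    return os.environ.get(f"{stage}_{name}") or os.environ.get(name)
```

Swapping `stage` between `STAGING` and `PROD` then selects the right credentials without touching workflow definitions.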
3. State Management
Design your state structure carefully; it is the backbone of your workflow.
Keep State Clean
```yaml
steps:
  - type: http_call
    url: https://api.example.com/users
    output_to: raw_users  # Store in specific location
  - type: jq
    query: '[.[] | {id, name, email}]'
    input_from: raw_users
    output_to: users      # Clean, filtered data
```
Avoid State Pollution
Don’t dump everything into the root state:
```yaml
# ❌ Bad - pollutes state
- type: http_call
  inject: true  # Spreads all fields into root

# ✅ Good - organised
- type: http_call
  output_to: api_response
```
4. Monitoring and Observability
You can’t fix what you can’t see.
Add Checkpoints
Use the print step strategically:
```yaml
steps:
  - type: http_call
    url: https://api.example.com/data
    output_to: api_data
  - type: print
    prefix: "[Checkpoint: After API]"
    fields:
      - api_data.status
      - api_data.record_count
  - type: for_each
    input_from: api_data.records
    var: record
    steps:
      - type: openai_completion
        prompt: "Process: ${record}"
```
Track Important Metrics
```yaml
steps:
  - type: add_timestamp
    output_to: metrics.start_time
  # ... your workflow steps ...
  - type: add_timestamp
    output_to: metrics.end_time
  - type: http_call
    url: https://api.yourapp.com/metrics
    method: POST
    body:
      workflow: ${workflow.name}
      duration: ${metrics.duration}
      records_processed: ${state.count}
```
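If your timestamps arrive as ISO-8601 strings, the duration metric reduces to a subtraction; a minimal sketch in Python (assuming ISO-8601 input, which may differ from your engine's timestamp format):

```python
from datetime import datetime

def duration_seconds(start_iso, end_iso):
    """Workflow duration in seconds from two ISO-8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds()
```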
5. Input Validation
Validate data early to catch issues before they cause problems.
Use Filter Steps
```yaml
steps:
  - type: filter
    conditions:
      - field: email
        operator: contains
        value: "@"
      - field: age
        operator: gte
        value: 18
    drop_on_failure: true  # Reject invalid input
```
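The same predicate, written out in plain Python, makes it easy to unit-test your validation rules before wiring them into a filter step (an illustrative mirror of the conditions above, not engine code):

```python
def is_valid(record):
    """Mirror of the filter step: email must contain '@', age >= 18."""
    email = record.get("email") or ""
    age = record.get("age") or 0  # Treat missing/None age as invalid
    return "@" in email and age >= 18
```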
Validate Required Fields
```yaml
steps:
  - type: if_else
    if:
      conditions:
        - field: input.email
          op: exists
        - field: input.name
          op: exists
      steps:
        # Process the data
        - type: print
          text: "Valid input received"
    else:
      steps:
        - type: slack_webhook
          text: "Invalid input received: ${input}"
        - type: http_call
          url: https://api.yourapp.com/errors
```
6. Testing Strategies
Test workflows thoroughly before deploying to production.
Test with Sample Data
Create test events that cover:
- Happy path scenarios
- Edge cases (empty arrays, null values)
- Error scenarios (API failures, timeouts)
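A concrete fixture covering those three categories might look like this (the event shapes and field names are illustrative assumptions; adapt them to your workflow's actual input schema):

```python
def make_test_events():
    """Sample events: happy path, edge cases, and simulated errors."""
    return [
        {"name": "happy_path", "users": [{"id": 1, "email": "a@example.com"}]},
        {"name": "empty_array", "users": []},
        {"name": "null_value", "users": [{"id": 2, "email": None}]},
        {"name": "api_timeout", "simulate_error": "timeout"},
    ]
```

Replaying each event against a staging deployment of the workflow exercises every branch before real traffic does.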
Use Conditional Logic for Testing
```yaml
steps:
  # Add environment marker to event during testing
  - type: python_function
    code: |
      def process(event):
          import os
          event['is_test'] = os.getenv('etlr_stage') == 'test'
          return event
    handler: process
  - type: if_else
    if:
      conditions:
        - field: is_test
          op: eq
          value: true
      steps:
        - type: print
          prefix: "[TEST MODE]"
          text: "Running in test mode"
    else:
      steps:
        - type: slack_webhook
          webhook_url: ${env:SLACK_WEBHOOK}
          text: "Production workflow executed"
```
7. Performance Optimisation
Make your workflows fast and efficient.
Batch Operations
Instead of calling APIs in a loop:
```yaml
# ❌ Slow - many API calls
- type: for_each
  input_from: users
  var: user
  steps:
    - type: http_call
      url: https://api.example.com/user/${user.id}

# ✅ Fast - batch request
- type: jq
  query: '[.[] | .id]'
  input_from: users
  output_to: user_ids
- type: http_call
  url: https://api.example.com/users/batch
  method: POST
  body:
    ids: ${user_ids}
```
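When the batch endpoint caps how many ids it accepts per request, the usual fix is to chunk the list first; a minimal sketch (the batch size of 100 is an assumption, check your API's limit):

```python
def chunk(ids, size=100):
    """Split a list of ids into fixed-size batches, one bulk request each."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

For 450 ids this yields 5 requests instead of 450, which is where the latency win comes from.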
Parallel Processing
Use for_each with an isolated item scope so one failing item does not break the others:
```yaml
- type: for_each
  input_from: records
  var: record
  item_scope: isolated  # Isolate each item's processing
  steps:
    - type: openai_completion
      prompt: "Analyse: ${record}"
```
8. Documentation
Document your workflows as you build them.
Add Comments
```yaml
name: customer-onboarding
# This workflow runs when a new customer signs up.
# It enriches their data and sends welcome emails.
steps:
  # Fetch company data from Clearbit
  - type: http_call
    url: https://api.clearbit.com/v2/companies/find
    # ... config ...
  # Generate personalised welcome message using AI
  - type: openai_completion
    # ... config ...
```
Maintain a Changelog
Track changes to production workflows:
```yaml
# v1.2.0 - 2024-11-25
# - Added error handling for API timeouts
# - Improved AI prompt for better results
# - Added metrics tracking
```
9. Cost Management
AI workflows can get expensive. Monitor and optimise costs.
Choose the Right Model
```yaml
steps:
  # Use cheaper models for simple tasks
  - type: openai_completion
    model: gpt-3.5-turbo  # Cost-effective
    prompt: "Summarise in one sentence: ${input}"
  # Reserve expensive models for complex tasks
  - type: openai_completion
    model: gpt-4  # More capable but costly
    prompt: "Analyse legal implications of: ${contract}"
```
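To see why model choice dominates cost, a back-of-envelope estimator helps. The per-1K-token prices below are illustrative placeholders only; check your provider's current pricing page before relying on them:

```python
# Illustrative per-1K-token prices; NOT current vendor pricing.
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}

def estimate_cost(model, tokens):
    """Rough cost of a completion consuming `tokens` tokens."""
    return tokens / 1000 * PRICE_PER_1K[model]
```

At these sample rates the capable model costs roughly 30x more per token, so routing simple summarisation to the cheaper model pays off quickly at volume.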
Cache Expensive Operations
```yaml
steps:
  - type: http_call
    url: https://api.yourapp.com/cache/${input.id}
    output_to: cached_result
  - type: if_else
    if:
      conditions:
        - field: cached_result.exists
          op: eq
          value: true
      steps:
        - type: print
          text: "Using cached result"
    else:
      steps:
        - type: openai_completion
          api_key: ${env:OPENAI_API_KEY}
          model: gpt-4
          prompt: "Process: ${input.data}"
```
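Caching only works if identical requests map to the same key. One common approach, shown here as an illustrative sketch rather than anything ETLR-specific, is to hash the model name and prompt together:

```python
import hashlib

def cache_key(model, prompt):
    """Deterministic cache key: same model + prompt -> same key."""
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
```

Using the key as the cache lookup id means a repeated prompt hits the cache instead of triggering another paid completion.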
10. Security Best Practices
Protect your workflows and data.
Keep Webhook URLs Private
```yaml
# Your webhook URL contains a security hash:
# https://pipeline.etlr.io/v1/webhooks/{org_id}/{workflow_id}/{security_hash}
# Keep this URL secret - treat it like a password.
input:
  type: http_webhook
```
Sanitise User Input
```yaml
steps:
  - type: jq
    query: |
      {
        email: .email | gsub("[^a-zA-Z0-9@._-]"; ""),
        name: .name | gsub("[^a-zA-Z0-9 ]"; "")
      }
    input_from: input
    output_to: sanitised
```
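The same allow-list sanitisation is easy to verify outside the workflow; here is a Python mirror of the jq filter above (illustrative, using the same character classes):

```python
import re

def sanitise(payload):
    """Allow-list sanitiser mirroring the jq step: strip anything
    outside the permitted character sets for email and name."""
    return {
        "email": re.sub(r"[^a-zA-Z0-9@._-]", "", payload.get("email", "")),
        "name": re.sub(r"[^a-zA-Z0-9 ]", "", payload.get("name", "")),
    }
```

Note that this strips disallowed characters rather than rejecting the input; combine it with the validation patterns from section 5 if malformed input should fail outright.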
Limit Sensitive Data Exposure
```yaml
steps:
  - type: print
    fields:
      - user.id
      - user.email
    # Don't print: user.password_hash, user.api_key
```
Production Checklist
Before deploying to production, ensure:
- ✅ All secrets use environment variables
- ✅ Error handling configured for all external calls
- ✅ Monitoring and alerting in place
- ✅ Input validation implemented
- ✅ Tested with production-like data
- ✅ Documentation complete
- ✅ Team has access to logs and traces
- ✅ Rollback plan documented
Conclusion
Building production AI workflows requires attention to detail, but following these best practices will save you countless hours of debugging and maintenance.
Start with these patterns, iterate based on your needs, and your workflows will be reliable, maintainable, and ready to scale.
Ready to build production-grade AI workflows? Get started with ETLR and leverage our built-in best practices.