Monitoring your website or service is critical for maintaining uptime and catching issues before they impact users. But setting up monitoring shouldn’t require complex infrastructure or expensive third-party tools.

With ETLR, you can build production-grade monitoring workflows using simple YAML files. Schedule health checks, track response times, filter for errors, and send alerts to your team—all with infrastructure you can version, test, and deploy like code.

Why ETLR for Monitoring?

Traditional monitoring solutions often require:

  • Complex setup and configuration UIs
  • Vendor lock-in with proprietary alerting
  • Limited customisation options
  • High costs at scale

ETLR gives you:

  • Infrastructure as Code - Keep monitoring workflows in Git alongside your application code
  • Flexible Alerting - Send alerts to Slack, Discord, email, or any webhook
  • Custom Logic - Filter, transform, and enrich monitoring data however you need
  • Simple Deployment - One command to deploy or update monitoring rules

Let’s explore real-world monitoring examples you can deploy today.

Example 1: Basic HTTP Health Check

The simplest monitoring workflow: ping your service every minute and alert on failures.

workflow:
  name: "website_healthcheck"
  description: "Monitor website uptime and response time"
  input:
    type: cron
    cron: "*/1 * * * *"  # Every minute
  
  steps:
    - type: http_call
      url: https://yourapp.com/health
      method: GET
      timeout: 10
      include_status: true
      output_to: health_check
    
    - type: filter
      groups:
        - conditions:
            - field: health_check.status
              op: ne
              value: 200
    
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: "🚨 Health check failed!\nURL: https://yourapp.com/health\nStatus: ${health_check.status}\nDuration: ${health_check.duration_ms}ms"

How it works:

  1. Cron trigger - Runs every minute automatically
  2. HTTP call - Makes a GET request to your health endpoint with a 10-second timeout
  3. Filter - Only continues if the status code is NOT 200 (drops successful checks)
  4. Slack alert - Sends a formatted message with status code and response time

Key parameters:

  • include_status: true - Adds the HTTP status code to the stored response
  • output_to - Stores the response under health_check in workflow state
  • filter with ne (not equal) operator - Drops events when the status is 200, so only failed checks reach the alert step (a broader variant is sketched below)
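
The filter above passes an event only when the status is not 200. If you also want an alert when the endpoint responds with 200 but is slow, one option is a second condition group with a latency threshold. This is a minimal sketch that assumes groups are OR-combined (an event passes when any group matches); check the filter step documentation for the exact semantics:

- type: filter
  groups:
    # Any non-200 status...
    - conditions:
        - field: health_check.status
          op: ne
          value: 200
    # ...or a response slower than 2 seconds
    - conditions:
        - field: health_check.duration_ms
          op: gt
          value: 2000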

Example 2: Multi-Endpoint Monitoring

Monitor multiple services and track which ones are down:

workflow:
  name: "multi_service_monitor"
  description: "Monitor multiple endpoints and report failures"
  input:
    type: cron
    cron: "*/5 * * * *"  # Every 5 minutes
  
  steps:
    # Check API
    - type: http_call
      url: https://api.yourapp.com/health
      method: GET
      timeout: 10
      include_status: true
      drop_on_failure: false
      output_to: api_health
    
    # Check Web App
    - type: http_call
      url: https://app.yourapp.com
      method: GET
      timeout: 10
      include_status: true
      drop_on_failure: false
      output_to: web_health
    
    # Check Database API
    - type: http_call
      url: https://db.yourapp.com/status
      method: GET
      timeout: 10
      include_status: true
      drop_on_failure: false
      output_to: db_health
    
    # Add timestamp
    - type: add_timestamp
      format: ISO-8601
      field: checked_at
    
    # Send comprehensive status report
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: |
        📊 Service Status Report
        Time: ${checked_at}
        
        API: ${api_health.status} (${api_health.duration_ms}ms)
        Web: ${web_health.status} (${web_health.duration_ms}ms)
        DB: ${db_health.status} (${db_health.duration_ms}ms)

Key features:

  • drop_on_failure: false - Continues the workflow even if a request fails, so every result is available in state
  • Multiple http_call steps with different output_to values
  • add_timestamp - Adds the current time for reporting
  • text_template with a multiline string - Builds the formatted report (a failures-only variant is sketched below)
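
As written, the report above posts every five minutes whether or not anything is wrong. To send it only when at least one endpoint returns a non-200 status, a filter can be placed just before the slack_webhook step. The sketch below relies on the same OR-combined-groups assumption as the variant in Example 1, with one group per endpoint:

- type: filter
  groups:
    - conditions:
        - field: api_health.status
          op: ne
          value: 200
    - conditions:
        - field: web_health.status
          op: ne
          value: 200
    - conditions:
        - field: db_health.status
          op: ne
          value: 200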

Example 3: Response Time Monitoring with Thresholds

Alert when response times exceed acceptable thresholds:

workflow:
  name: "response_time_monitor"
  description: "Alert on slow response times"
  input:
    type: cron
    cron: "*/2 * * * *"  # Every 2 minutes
  
  steps:
    - type: http_call
      url: https://api.yourapp.com/users
      method: GET
      headers:
        Authorization: "Bearer ${env:API_TOKEN}"
      timeout: 30
      include_status: true
      output_to: api_response
    
    # Filter for slow responses (> 2000ms)
    - type: filter
      groups:
        - conditions:
            - field: api_response.duration_ms
              op: gt
              value: 2000
    
    # Send detailed alert
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: |
        ⚠️ SLOW API RESPONSE DETECTED
        
        Endpoint: /users
        Response Time: ${api_response.duration_ms}ms
        Status: ${api_response.status}
        Threshold: 2000ms
        
        Action Required: Investigate performance degradation

Key concepts:

  • gt operator - Greater-than comparison in the filter step (combined with a status check in the sketch below)
  • Response time threshold filtering
  • Custom alert messages with context
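
A slow success and a failing request are different problems. To alert only when the endpoint responds successfully but slowly (leaving hard failures to a basic health-check workflow), the duration and status conditions can sit in the same group. This sketch assumes that all conditions within a single group must match, and it uses only the gt and lt operators shown elsewhere in this post:

- type: filter
  groups:
    - conditions:
        # Slower than the 2-second threshold...
        - field: api_response.duration_ms
          op: gt
          value: 2000
        # ...but still a non-error status
        - field: api_response.status
          op: lt
          value: 400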

Example 4: API Error Rate Monitoring

Track API errors and alert when error rates spike:

workflow:
  name: "api_error_monitor"
  description: "Monitor API for 4xx and 5xx errors"
  input:
    type: cron
    cron: "*/1 * * * *"
  
  steps:
    - type: http_call
      url: https://api.yourapp.com/analytics/errors
      method: GET
      headers:
        X-API-Key: "${env:API_KEY}"
      timeout: 15
      output_to: error_stats
    
    # Filter for error rate > 5%
    - type: filter
      groups:
        - conditions:
            - field: error_stats.body.error_rate
              op: gt
              value: 5.0
    
    - type: add_timestamp
      field: alert_time
    
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: |
        🔴 HIGH ERROR RATE ALERT
        
        Error Rate: ${error_stats.body.error_rate}%
        Total Requests: ${error_stats.body.total_requests}
        Failed Requests: ${error_stats.body.failed_requests}
        Time: ${alert_time}
        
        Dashboard: https://dashboard.yourapp.com/errors

Advanced features:

  • Accessing nested response data with state paths (an assumed response shape is shown below)
  • Percentage-based thresholds
  • Including dashboard links in alerts
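
For the error_stats.body.* paths above to resolve, the analytics endpoint needs to return a JSON body with matching fields. The shape below is an assumption built from the field names used in this example's templates; adapt it to whatever your own analytics API returns:

# Illustrative state after the http_call step (shape assumed)
error_stats:
  body:
    error_rate: 7.4          # read as ${error_stats.body.error_rate}
    total_requests: 12250    # ${error_stats.body.total_requests}
    failed_requests: 907     # ${error_stats.body.failed_requests}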

Example 5: Certificate Expiry Monitoring

Monitor SSL certificate expiry dates:

workflow:
  name: "ssl_certificate_monitor"
  description: "Alert on expiring SSL certificates"
  input:
    type: cron
    cron: "0 9 * * *"  # Daily at 9 AM UTC
  
  steps:
    - type: http_call
      url: https://api.sslchecker.com/check
      method: POST
      headers:
        Authorization: "Bearer ${env:SSL_CHECKER_TOKEN}"
      body:
        domains:
          - yourapp.com
          - api.yourapp.com
          - app.yourapp.com
      output_to: cert_status
    
    # Filter for certificates expiring in < 30 days
    - type: filter
      groups:
        - conditions:
            - field: cert_status.body.days_until_expiry
              op: lt
              value: 30
    
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: |
        ⚠️ SSL CERTIFICATE EXPIRING SOON
        
        Domain: ${cert_status.body.domain}
        Expires: ${cert_status.body.expiry_date}
        Days Remaining: ${cert_status.body.days_until_expiry}
        
        Action Required: Renew certificate before expiry

Example 6: Advanced Multi-Channel Alerting

Send alerts to multiple channels based on severity:

workflow:
  name: "advanced_monitoring"
  description: "Multi-service monitoring with severity-based routing"
  input:
    type: cron
    cron: "*/3 * * * *"
  
  steps:
    # Check critical endpoint
    - type: http_call
      url: https://api.yourapp.com/critical-service
      method: GET
      timeout: 10
      include_status: true
      output_to: critical_check
    
    # Add metadata
    - type: add_timestamp
      field: check_time
    
    - type: add_uuid
      field: incident_id
    
    # Critical failure path
    - type: filter
      groups:
        - conditions:
            - field: critical_check.status
              op: gte
              value: 500
    
    # Alert via Slack
    - type: slack_webhook
      webhook_url: "${env:SLACK_CRITICAL_WEBHOOK}"
      text_template: |
        🚨 CRITICAL SERVICE FAILURE
        
        Incident ID: ${incident_id}
        Service: critical-service
        Status: ${critical_check.status}
        Response Time: ${critical_check.duration_ms}ms
        Time: ${check_time}
        
        @channel - Immediate attention required!
    
    # Also send email for critical issues
    - type: resend_email
      api_key: "${env:RESEND_API_KEY}"
      from_email: "[email protected]"
      to: "[email protected]"
      subject_template: "🚨 Critical Service Failure - ${incident_id}"
      html_template: |
        <h2>Critical Service Failure Detected</h2>
        <p><strong>Incident ID:</strong> ${incident_id}</p>
        <p><strong>Service:</strong> critical-service</p>
        <p><strong>Status Code:</strong> ${critical_check.status}</p>
        <p><strong>Response Time:</strong> ${critical_check.duration_ms}ms</p>
        <p><strong>Timestamp:</strong> ${check_time}</p>
        <hr>
        <p>Please investigate immediately.</p>

Enterprise features:

  • add_uuid - Attaches a unique incident ID that ties the Slack and email alerts to the same event
  • gte operator - Passes only status codes of 500 and above, so alerts fire on server errors rather than every non-200
  • Multi-channel alerting - A single failure drives both a Slack message and a Resend email
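
Because the filter drops any event that does not match, one workflow handles one severity tier. A lower-severity tier, such as 4xx responses going to a non-urgent channel, would typically live in a parallel workflow with its own filter. Here is a sketch of just the steps that differ, assuming all conditions within a group must match and using a hypothetical SLACK_WARNINGS_WEBHOOK variable:

- type: filter
  groups:
    - conditions:
        - field: critical_check.status
          op: gte
          value: 400
        - field: critical_check.status
          op: lt
          value: 500

# Hypothetical lower-severity Slack channel
- type: slack_webhook
  webhook_url: "${env:SLACK_WARNINGS_WEBHOOK}"
  text_template: |
    ⚠️ Client errors on critical-service
    Incident ID: ${incident_id}
    Status: ${critical_check.status} at ${check_time}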

Example 7: Uptime Tracking with Logging

Log all health check results for uptime analysis:

workflow:
  name: "uptime_tracker"
  description: "Track uptime and log all checks"
  input:
    type: cron
    cron: "*/1 * * * *"
  
  steps:
    - type: http_call
      url: https://yourapp.com/health
      method: GET
      timeout: 10
      include_status: true
      drop_on_failure: false
      output_to: health
    
    - type: add_timestamp
      format: ISO-8601
      field: timestamp
    
    # Log to external service
    - type: http_call
      url: https://logging.yourapp.com/healthchecks
      method: POST
      headers:
        Authorization: "Bearer ${env:LOGGING_TOKEN}"
      body:
        timestamp: "${timestamp}"
        status: "${health.status}"
        duration_ms: "${health.duration_ms}"
        success: "${health.status == 200}"
      output_to: log_result
    
    # Alert only on failure
    - type: filter
      groups:
        - conditions:
            - field: health.status
              op: ne
              value: 200
    
    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: "🚨 Service down! Status: ${health.status}"

Key patterns:

  • Logging all checks regardless of outcome
  • Separating alert logic from logging
  • Structured data collection for analysis (see the follow-up summary sketch below)
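
Once every check is logged as structured data, a second workflow can turn the raw records into a periodic summary. The sketch below assumes a hypothetical aggregation endpoint on the same logging service that returns fields such as uptime_percent; swap in whatever your logging backend actually exposes:

workflow:
  name: "daily_uptime_summary"
  description: "Post a daily uptime summary to Slack"
  input:
    type: cron
    cron: "0 8 * * *"  # Daily at 8 AM UTC

  steps:
    # Hypothetical aggregation endpoint on the logging service
    - type: http_call
      url: https://logging.yourapp.com/healthchecks/summary?window=24h
      method: GET
      headers:
        Authorization: "Bearer ${env:LOGGING_TOKEN}"
      timeout: 15
      output_to: summary

    - type: slack_webhook
      webhook_url: "${env:SLACK_WEBHOOK_URL}"
      text_template: |
        📈 Daily Uptime Summary
        Uptime: ${summary.body.uptime_percent}%
        Average Response Time: ${summary.body.avg_duration_ms}ms
        Failed Checks: ${summary.body.failed_checks}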

Best Practices

1. Use Environment Variables for Secrets

Never hardcode tokens or webhook URLs. Use environment variables instead:

# ✅ Good
webhook_url: "${env:SLACK_WEBHOOK_URL}"
api_key: "${env:API_KEY}"

# ❌ Bad
webhook_url: "https://hooks.slack.com/services/T00/B00/XXX"

2. Set Appropriate Timeouts

Strike a balance between catching slow responses and avoiding false alarms:

# Fast API endpoint
timeout: 5

# Slower backend service
timeout: 30

# Long-running operation
timeout: 120

3. Use drop_on_failure: false for Comprehensive Monitoring

When monitoring multiple services, continue the workflow even if one fails:

- type: http_call
  url: https://service1.com/health
  drop_on_failure: false
  output_to: service1

- type: http_call
  url: https://service2.com/health
  drop_on_failure: false
  output_to: service2

4. Add Context to Alerts

Include actionable information in alert messages:

text_template: |
  🚨 Alert: ${service_name} Down
  
  Status: ${response.status}
  Response Time: ${response.duration_ms}ms
  Endpoint: ${endpoint_url}
  Time: ${timestamp}
  
  Dashboard: https://dashboard.yourapp.com
  Runbook: https://docs.yourapp.com/runbooks/service-down

5. Test Before Deploying

Use start_now: true on cron inputs to test immediately:

input:
  type: cron
  cron: "*/5 * * * *"
  start_now: true  # Runs once immediately on deployment

Deployment

Deploy any of these workflows with the ETLR CLI:

# Deploy from YAML file
etlr deploy monitoring.yaml

# Deploy from workflow.yaml in current directory
etlr deploy

# Deploy with environment variables
etlr deploy monitoring.yaml -e SLACK_WEBHOOK_URL=https://hooks.slack.com/...

The CLI will automatically:

  • Create or update the workflow
  • Start it running
  • Display the webhook URL (for http_webhook inputs)

Monitor your workflows in the ETLR dashboard where you can view:

  • Execution history and logs
  • Response times and success rates
  • State data from each run
  • Version history and rollback options

Conclusion

With ETLR, you can build comprehensive monitoring solutions using simple YAML workflows. From basic health checks to advanced multi-service monitoring with custom alerting logic, everything is code that you can version, test, and deploy with confidence.

The examples in this post are production-ready and based on actual ETLR step documentation. Start with a simple health check, then expand to monitor multiple services, track response times, and route alerts based on severity.

Your monitoring infrastructure should be as reliable and maintainable as the services you’re monitoring. With ETLR, it finally can be.

Next Steps

Ready to build your first monitoring workflow? The ETLR documentation covers the core concepts, the individual steps used in this post, and the available input types, along with more example workflows to help you get started.
