
Failover & Retry

Build resilient workflows that handle node failures, network outages, and transient errors gracefully.


Retry Configuration

Basic Retry

Retry failed tool executions:

{
  "step_id": "risky_operation",
  "tool_name": "external_api_call",
  "retry": {
    "max_attempts": 3,
    "backoff_ms": 1000
  }
}

Behavior:

  • Attempt 1: Execute immediately
  • Attempt 2: Wait 1 second, then retry
  • Attempt 3: Wait 2 seconds (exponential backoff), then retry
  • Failure: Return an error after 3 failed attempts

Exponential Backoff

{
  "retry": {
    "max_attempts": 5,
    "backoff_ms": 500,
    "backoff_multiplier": 2.0,
    "max_backoff_ms": 10000
  }
}

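Timing:

  • Attempt 1: 0ms delay
  • Attempt 2: 500ms delay
  • Attempt 3: 1000ms delay (500 × 2)
  • Attempt 4: 2000ms delay (1000 × 2)
  • Attempt 5: 4000ms delay (2000 × 2)

The 10000ms cap (max_backoff_ms) is not reached in this example; it only takes effect once a computed delay would exceed it.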
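The timing above can be reproduced with a few lines of Python. This is a minimal sketch, not the gateway's implementation: the function name backoff_delays is hypothetical, and the default multiplier of 2.0 is an assumption inferred from the Basic Retry example, where the delay doubles without a multiplier being set.

```python
def backoff_delays(max_attempts, backoff_ms, backoff_multiplier=2.0, max_backoff_ms=None):
    """Return the delay in milliseconds applied before each attempt.

    Attempt 1 runs immediately; each later delay is the previous one
    multiplied by backoff_multiplier, capped at max_backoff_ms if given.
    """
    delays = []
    for attempt in range(1, max_attempts + 1):
        if attempt == 1:
            delay = 0
        else:
            delay = backoff_ms * backoff_multiplier ** (attempt - 2)
            if max_backoff_ms is not None:
                delay = min(delay, max_backoff_ms)
        delays.append(int(delay))
    return delays


print(backoff_delays(5, 500, 2.0, 10000))  # [0, 500, 1000, 2000, 4000]
print(backoff_delays(3, 1000))             # [0, 1000, 2000]
```

With more attempts the cap becomes visible: an eighth attempt would compute 32000ms and be clamped to 10000ms.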


Failover Strategies

1. Automatic Failover

Try alternative nodes when primary fails:

{
  "target": { "kind": "local" },
  "retry": {
    "max_attempts": 3,
    "failover": true
  }
}

Flow:

  1. Gateway selects Node A (best match)
  2. Node A offline → Gateway selects Node B
  3. Node B fails → Gateway selects Node C
  4. Node C succeeds → Return result

2. Explicit Failover Chain

Specify fallback nodes:

{
  "target": {
    "kind": "failover_chain",
    "nodes": [
      "production:primary",
      "production:secondary",
      "development:backup"
    ]
  }
}

The gateway tries the nodes in the listed order until one succeeds.
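The "try nodes in order" behavior can be sketched as a simple loop. Everything here is illustrative: run_with_failover and fake_execute are hypothetical names, and the returned fields only mirror the execution_path and degraded fields shown under Execution Evidence. A real gateway would presumably catch only retryable errors rather than every exception.

```python
def run_with_failover(nodes, execute):
    """Try each node in order and return the first successful result.

    `execute` is any callable that takes a node name and either returns
    a result or raises an exception.
    """
    execution_path = []
    for node in nodes:
        try:
            result = execute(node)
        except Exception as exc:  # assumption: every error triggers failover
            execution_path.append(f"{node} ({exc})")
            continue
        execution_path.append(f"{node} (success)")
        return {
            "ok": True,
            "result": result,
            "execution_path": execution_path,
            # More than one entry means at least one node was skipped.
            "degraded": len(execution_path) > 1,
        }
    return {"ok": False, "execution_path": execution_path}


def fake_execute(node):
    """Stand-in for a real node call: the primary always times out."""
    if node == "production:primary":
        raise TimeoutError("timeout")
    return f"handled by {node}"


chain = ["production:primary", "production:secondary", "development:backup"]
outcome = run_with_failover(chain, fake_execute)
print(outcome["execution_path"])
# ['production:primary (timeout)', 'production:secondary (success)']
```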

3. Load-Balanced Failover

Distribute across multiple nodes with automatic failover:

{
  "target": {
    "kind": "load_balanced",
    "min_nodes": 2,
    "failover": true
  }
}

If any node fails, the gateway redistributes its work to the remaining nodes.
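One way to picture redistribution is a round-robin assignment over the nodes that are currently healthy. This sketch rests on two assumptions that the text above does not confirm: that work is spread round-robin, and that min_nodes means the minimum number of healthy nodes required before work is accepted. The function name distribute is hypothetical.

```python
def distribute(items, nodes, is_healthy, min_nodes=2):
    """Assign items round-robin across the healthy subset of nodes."""
    active = [node for node in nodes if is_healthy(node)]
    if len(active) < min_nodes:
        # Assumed meaning of min_nodes: refuse to run below this count.
        raise RuntimeError(f"only {len(active)} healthy nodes, need {min_nodes}")
    assignment = {node: [] for node in active}
    for index, item in enumerate(items):
        assignment[active[index % len(active)]].append(item)
    return assignment


nodes = ["node_a", "node_b", "node_c"]
down = {"node_b"}  # simulate one failed node
print(distribute(["t1", "t2", "t3", "t4"], nodes, lambda n: n not in down))
# {'node_a': ['t1', 't3'], 'node_c': ['t2', 't4']}
```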


Offline Operation

Store-and-Forward

When nodes are offline, requests are queued:

{
  "target": { "kind": "local" },
  "offline_behavior": {
    "mode": "queue",
    "ttl_seconds": 86400,  // 24 hours
    "retry_interval_seconds": 300  // 5 minutes
  }
}

Behavior:

  1. Request arrives while all nodes are offline
  2. Gateway stores it in the mailbox
  3. Background worker retries every 5 minutes
  4. Node comes online → Request is delivered automatically
  5. Request expires after 24 hours if still undelivered
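The queueing behavior above can be modeled with a small in-memory mailbox. This is a sketch under stated assumptions, not the gateway's storage layer: the Mailbox class and its flush method are hypothetical, and a real background worker would call flush once per retry_interval_seconds. The clock is injectable so the example runs without waiting.

```python
import time
from collections import deque


class Mailbox:
    """In-memory store-and-forward queue with per-request expiry."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._queue = deque()  # entries are (enqueued_at, request)

    def enqueue(self, request):
        self._queue.append((self.clock(), request))

    def flush(self, deliver):
        """Run one retry pass and return (delivered, expired) requests.

        `deliver` returns True when a node accepted the request.
        """
        delivered, expired = [], []
        remaining = deque()
        now = self.clock()
        while self._queue:
            enqueued_at, request = self._queue.popleft()
            if now - enqueued_at >= self.ttl_seconds:
                expired.append(request)
            elif deliver(request):
                delivered.append(request)
            else:
                remaining.append((enqueued_at, request))  # keep for next pass
        self._queue = remaining
        return delivered, expired


now = [0.0]  # fake clock so the example does not sleep
box = Mailbox(ttl_seconds=86400, clock=lambda: now[0])
box.enqueue("req-1")
print(box.flush(lambda request: False))  # ([], [])  node still offline
now[0] = 300                             # five minutes later
print(box.flush(lambda request: True))   # (['req-1'], [])  node is back
```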

Immediate Failure Mode

Fail fast instead of queuing:

{
  "offline_behavior": {
    "mode": "fail_fast"
  }
}

Use when:

  • Real-time execution is required
  • Stale results are unacceptable
  • The client can handle retry logic itself


Execution Evidence

Degraded Execution

When failover occurs, the result is marked as degraded:

{
  "ok": true,
  "verified": true,
  "result": { /* ... */ },
  "execution_path": [
    "local:node_a (timeout)",
    "local:node_b (success)"
  ],
  "degraded": true,
  "degraded_reason": "Primary node unavailable, failed over to backup"
}

Execution Timeline

Detailed timing for audit:

{
  "timeline": [
    {
      "step": "redact",
      "node": "local:node_a",
      "timestamp": "2024-01-27T10:30:00Z",
      "duration_ms": 45,
      "status": "success"
    },
    {
      "step": "summarize",
      "node": "local:node_a",
      "timestamp": "2024-01-27T10:30:01Z",
      "duration_ms": 0,
      "status": "timeout",
      "error": "Node unreachable"
    },
    {
      "step": "summarize",
      "node": "local:node_b",
      "timestamp": "2024-01-27T10:30:06Z",
      "duration_ms": 123,
      "status": "success"
    }
  ]
}
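A timeline like the one above can be used to measure how much time a failover cost. The sketch below is one possible analysis, not part of the product: failover_delays is a hypothetical helper that reports, for each step that did not succeed on its first attempt, the seconds between its first attempt and the start of its successful attempt.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO 8601 timestamp with a trailing Z as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def failover_delays(timeline):
    """Map step name to seconds lost before its successful attempt began."""
    first_seen = {}
    delays = {}
    for entry in timeline:
        step = entry["step"]
        started = parse_ts(entry["timestamp"])
        first_seen.setdefault(step, started)
        if entry["status"] == "success" and started != first_seen[step]:
            delays[step] = (started - first_seen[step]).total_seconds()
    return delays


timeline = [
    {"step": "redact", "node": "local:node_a",
     "timestamp": "2024-01-27T10:30:00Z", "status": "success"},
    {"step": "summarize", "node": "local:node_a",
     "timestamp": "2024-01-27T10:30:01Z", "status": "timeout"},
    {"step": "summarize", "node": "local:node_b",
     "timestamp": "2024-01-27T10:30:06Z", "status": "success"},
]
print(failover_delays(timeline))  # {'summarize': 5.0}
```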

Error Handling

Retryable Errors

These errors trigger automatic retry:

  • Network timeouts
  • Node unreachable
  • Temporary capacity issues
  • HTTP 503 (Service Unavailable)
  • HTTP 429 (Rate Limit)

Non-Retryable Errors

These errors fail immediately:

  • Invalid tool arguments (HTTP 400)
  • Missing capabilities (HTTP 403)
  • Tool not found (HTTP 404)
  • Schema validation failure

Custom Error Handling

{
  "retry": {
    "retryable_errors": [
      "NetworkTimeout",
      "CapacityExceeded"
    ],
    "non_retryable_errors": [
      "InvalidInput",
      "SchemaValidationFailed"
    ]
  }
}
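How the two lists interact is not spelled out above, so the following sketch picks one plausible rule and labels it as an assumption: non_retryable_errors takes precedence, and any error named in neither list is treated as non-retryable. The function name is_retryable is hypothetical.

```python
def is_retryable(error_name, retry_config):
    """Decide whether an error should trigger a retry.

    Assumptions: the non-retryable list wins on conflict, and errors
    that appear in neither list are not retried.
    """
    if error_name in retry_config.get("non_retryable_errors", []):
        return False
    return error_name in retry_config.get("retryable_errors", [])


retry_config = {
    "retryable_errors": ["NetworkTimeout", "CapacityExceeded"],
    "non_retryable_errors": ["InvalidInput", "SchemaValidationFailed"],
}
print(is_retryable("NetworkTimeout", retry_config))  # True
print(is_retryable("InvalidInput", retry_config))    # False
print(is_retryable("SomethingElse", retry_config))   # False
```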

Idempotency

Job IDs

Workflows use stable job IDs to prevent duplicate execution:

{
  "job_id": "job-2024-01-27-xyz",
  "workflow_id": "feedback_safe_summary",
  "input": { /* ... */ }
}

If the same job_id is submitted twice:

  • The first execution proceeds normally
  • The second execution returns the cached result
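The deduplication rule can be illustrated with a dictionary keyed by job_id. This is a minimal in-memory sketch; JobCache and its submit method are hypothetical names, and a real gateway would need persistent storage and handling for jobs that are still running.

```python
class JobCache:
    """Return the stored result when a job_id has already been executed."""

    def __init__(self):
        self._results = {}

    def submit(self, job_id, run):
        if job_id in self._results:
            return self._results[job_id]  # duplicate: do not execute again
        result = run()
        self._results[job_id] = result
        return result


calls = []


def workflow():
    """Stand-in workflow that records each real execution."""
    calls.append(1)
    return {"ok": True}


cache = JobCache()
first = cache.submit("job-2024-01-27-xyz", workflow)
second = cache.submit("job-2024-01-27-xyz", workflow)
print(first == second, len(calls))  # True 1
```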

Deterministic Execution

Tools marked as deterministic guarantee the same output for the same input:

{
  "tool_id": "tool:unit_convert",
  "deterministic": true
}

This enables safe retries without side effects: re-running the tool with the same input returns the same result.


Circuit Breaker

Preventing Cascading Failures

A circuit breaker stops retrying after repeated failures:

{
  "circuit_breaker": {
    "failure_threshold": 5,
    "timeout_seconds": 60,
    "half_open_requests": 1
  }
}

States:

  1. Closed (normal operation)
     • Requests flow normally
     • Track failure rate

  2. Open (after 5 failures)
     • Fail fast for 60 seconds
     • Prevent overwhelming the failing node

  3. Half-Open (after timeout)
     • Allow 1 test request
     • If it succeeds → Close circuit
     • If it fails → Re-open circuit
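The three states can be expressed as a small state machine. This is a sketch of the general circuit breaker pattern, not the gateway's code: the CircuitBreaker class is hypothetical, it counts consecutive failures (an assumption, since the text mentions both a failure rate and a fixed threshold), and it models half_open_requests as a single in-flight test request.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed
        self.probing = False   # True while the half-open test request is in flight

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.timeout_seconds:
            return "half_open"
        return "open"

    def allow_request(self):
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and not self.probing:
            self.probing = True  # let exactly one test request through
            return True
        return False  # open, or a test request is already in flight

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        self.probing = False
        if self.opened_at is not None:
            self.opened_at = self.clock()  # failed test request: re-open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


now = [0.0]  # fake clock so the example does not sleep
breaker = CircuitBreaker(failure_threshold=5, timeout_seconds=60, clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.state)            # open
now[0] = 60
print(breaker.state)            # half_open
print(breaker.allow_request())  # True
print(breaker.allow_request())  # False
breaker.record_success()
print(breaker.state)            # closed
```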

Monitoring & Alerts

Metrics to Track

# Node availability
curl http://127.0.0.1:8787/v1/metrics | jq '.node_availability'

# Failover rate
curl http://127.0.0.1:8787/v1/metrics | jq '.failover_rate'

# Degraded execution percentage
curl http://127.0.0.1:8787/v1/metrics | jq '.degraded_percentage'

Alerting on Degraded Execution

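# Monitor audit log for degraded workflows
# Assumes one JSON record per line in audit/audit.log
import json

with open('audit/audit.log') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get('degraded'):
            # Use .get() so a record with missing fields does not stop the scan
            print(f"ALERT: Degraded execution in {record.get('workflow_id', 'unknown')}")
            print(f"Reason: {record.get('degraded_reason', 'not recorded')}")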

Best Practices

1. Configure Retries for All External Tools

{
  "tool_name": "external_api",
  "retry": {
    "max_attempts": 3,
    "backoff_ms": 1000
  }
}

2. Enable Failover for Critical Workflows

{
  "retry": {
    "failover": true
  }
}

3. Set Reasonable TTLs for Queued Requests

{
  "offline_behavior": {
    "ttl_seconds": 3600  // 1 hour, not 1 week
  }
}

4. Monitor Degraded Execution Rate

If more than 10% of workflows are degraded, investigate the root cause.
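The 10% check can be computed from the same kind of records the audit log holds. This is a minimal sketch: degraded_rate is a hypothetical helper, and the sample records only carry the degraded field.

```python
def degraded_rate(records):
    """Return the fraction of records whose 'degraded' field is truthy."""
    if not records:
        return 0.0
    degraded = sum(1 for record in records if record.get("degraded"))
    return degraded / len(records)


records = [{"degraded": False}] * 8 + [{"degraded": True}] * 2
rate = degraded_rate(records)
print(rate)  # 0.2
if rate > 0.10:
    print("Degraded rate above 10%: investigate root cause")
```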

5. Use Circuit Breakers for Flaky Services

{
  "circuit_breaker": {
    "failure_threshold": 3,
    "timeout_seconds": 30
  }
}

Testing Resilience

Simulate Node Failures

# Stop edge node mid-workflow
kill -STOP $(pgrep -f "node.py")

# Resume after 10 seconds
sleep 10
kill -CONT $(pgrep -f "node.py")

Verify Failover

python demos/failover_demo.py

Expected output:

{
  "ok": true,
  "execution_path": [
    "local:node_a (timeout)",
    "local:node_b (success)"
  ],
  "degraded": true
}


Next Steps