Failover & Retry¶

Build resilient workflows that handle node failures, network outages, and transient errors gracefully.

Retry Configuration¶

Basic Retry¶

Retry failed tool executions:

{
  "step_id": "risky_operation",
  "tool_name": "external_api_call",
  "retry": {
    "max_attempts": 3,
    "backoff_ms": 1000
  }
}

Behavior:

- Attempt 1: Execute immediately
- Attempt 2: Wait 1 second, retry
- Attempt 3: Wait 2 seconds (exponential backoff), retry
- Failure: Return error after 3 attempts

Exponential Backoff¶

{
  "retry": {
    "max_attempts": 5,
    "backoff_ms": 500,
    "backoff_multiplier": 2.0,
    "max_backoff_ms": 10000
  }
}

Timing:

- Attempt 1: 0ms delay
- Attempt 2: 500ms delay
- Attempt 3: 1000ms delay (500 × 2)
- Attempt 4: 2000ms delay (1000 × 2)
- Attempt 5: 4000ms delay (2000 × 2)
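The timing above can be reproduced with a few lines of Python. This is a minimal sketch, not the gateway's implementation: the function name `backoff_schedule` is hypothetical, and it assumes that `max_backoff_ms` simply caps each delay.

```python
def backoff_schedule(max_attempts, backoff_ms, backoff_multiplier=2.0, max_backoff_ms=None):
    """Return the delay in ms before each attempt (hypothetical helper).

    Assumes the first attempt runs immediately and that max_backoff_ms,
    when given, caps every later delay.
    """
    delays = [0]  # attempt 1 has no delay
    delay = backoff_ms
    for _ in range(max_attempts - 1):
        capped = delay if max_backoff_ms is None else min(delay, max_backoff_ms)
        delays.append(int(capped))
        delay *= backoff_multiplier
    return delays


print(backoff_schedule(5, 500, 2.0, 10000))  # [0, 500, 1000, 2000, 4000]
```

With these settings the 10000ms cap is never reached in 5 attempts; under the stated assumption it would first apply on attempt 7 (16000ms capped to 10000ms).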

Failover Strategies¶

1. Automatic Failover¶

Try alternative nodes when the primary node fails:

{
  "target": { "kind": "local" },
  "retry": {
    "max_attempts": 3,
    "failover": true
  }
}

Flow:

1. Gateway selects Node A (best match)
2. Node A offline → Gateway selects Node B
3. Node B fails → Gateway selects Node C
4. Node C succeeds → Return result

2. Explicit Failover Chain¶

Specify fallback nodes:

{
  "target": {
    "kind": "failover_chain",
    "nodes": [
      "production:primary",
      "production:secondary",
      "development:backup"
    ]
  }
}

The gateway tries the nodes in the order listed until one succeeds.
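The in-order behavior can be sketched as a simple loop. Everything here is illustrative: `run_with_failover` and `fake_execute` are hypothetical names, and treating `TimeoutError` and `ConnectionError` as the failure signals is an assumption, not the gateway's documented behavior.

```python
def run_with_failover(nodes, execute):
    """Try each node in order and return the first success (hypothetical helper)."""
    path = []
    for node in nodes:
        try:
            result = execute(node)
        except (TimeoutError, ConnectionError) as exc:
            # Record the failed attempt and move on to the next node in the chain.
            path.append(f"{node} ({exc})")
            continue
        path.append(f"{node} (success)")
        # More than one entry in the path means a fallback node was used.
        return {"ok": True, "result": result,
                "execution_path": path, "degraded": len(path) > 1}
    return {"ok": False, "execution_path": path}


def fake_execute(node):
    """Stand-in executor: the primary times out, every other node succeeds."""
    if node == "production:primary":
        raise TimeoutError("timeout")
    return "done"


chain = ["production:primary", "production:secondary", "development:backup"]
print(run_with_failover(chain, fake_execute))
```

In this sketch the third node is never contacted, because the loop returns as soon as `production:secondary` succeeds.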

3. Load-Balanced Failover¶

Distribute across multiple nodes with automatic failover:

{
  "target": {
    "kind": "load_balanced",
    "min_nodes": 2,
    "failover": true
  }
}

If any node fails, the gateway redistributes its work to the remaining nodes.

Offline Operation¶

Store-and-Forward¶

When nodes are offline, requests are queued:

{
  "target": { "kind": "local" },
  "offline_behavior": {
    "mode": "queue",
    "ttl_seconds": 86400,  // 24 hours
    "retry_interval_seconds": 300  // 5 minutes
  }
}

Behavior:

1. A request arrives while all nodes are offline
2. The gateway stores it in the mailbox
3. A background worker retries delivery every 5 minutes
4. When a node comes online, the request is delivered automatically
5. If the request is still undelivered after 24 hours, it expires
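The queue-then-deliver behavior can be modeled in memory as follows. This is a sketch under stated assumptions: the `Mailbox` class and its methods are hypothetical, time is passed in explicitly so the example is deterministic, and the caller is assumed to invoke `tick` once per `retry_interval_seconds`.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Hypothetical in-memory model of a store-and-forward queue."""
    ttl_seconds: int = 86400
    retry_interval_seconds: int = 300  # how often the caller should run tick()
    pending: list = field(default_factory=list)  # (enqueued_at, request) pairs

    def enqueue(self, request, now):
        self.pending.append((now, request))

    def tick(self, now, node_online, deliver):
        """One background-worker pass; returns the requests that expired."""
        expired, kept = [], []
        for enqueued_at, request in self.pending:
            if now - enqueued_at >= self.ttl_seconds:
                expired.append(request)      # past its TTL: drop it
            elif node_online:
                deliver(request)             # a node is back: deliver now
            else:
                kept.append((enqueued_at, request))  # keep waiting
        self.pending = kept
        return expired


delivered = []
mailbox = Mailbox()
mailbox.enqueue("request-a", now=0)
mailbox.tick(now=300, node_online=False, deliver=delivered.append)   # still queued
mailbox.tick(now=600, node_online=True, deliver=delivered.append)    # delivered
print(delivered)  # ['request-a']
```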

Immediate Failure Mode¶

Fail fast instead of queuing:

{
  "offline_behavior": {
    "mode": "fail_fast"
  }
}

Use when:

- Real-time execution required
- Stale results unacceptable
- Client can handle retry logic

Execution Evidence¶

Degraded Execution¶

When failover occurs, the result is marked as degraded:

{
  "ok": true,
  "verified": true,
  "result": { /* ... */ },
  "execution_path": [
    "local:node_a (timeout)",
    "local:node_b (success)"
  ],
  "degraded": true,
  "degraded_reason": "Primary node unavailable, failed over to backup"
}

Execution Timeline¶

Detailed timing for audit:

{
  "timeline": [
    {
      "step": "redact",
      "node": "local:node_a",
      "timestamp": "2024-01-27T10:30:00Z",
      "duration_ms": 45,
      "status": "success"
    },
    {
      "step": "summarize",
      "node": "local:node_a",
      "timestamp": "2024-01-27T10:30:01Z",
      "duration_ms": 0,
      "status": "timeout",
      "error": "Node unreachable"
    },
    {
      "step": "summarize",
      "node": "local:node_b",
      "timestamp": "2024-01-27T10:30:06Z",
      "duration_ms": 123,
      "status": "success"
    }
  ]
}
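A timeline like this can be scanned to find the steps that needed more than one attempt. The sketch below assumes only the `step`, `node`, and `status` fields shown above; the function name `steps_with_failover` is hypothetical.

```python
from collections import defaultdict


def steps_with_failover(timeline):
    """Group timeline entries by step and keep the steps with multiple attempts."""
    attempts = defaultdict(list)
    for entry in timeline:
        attempts[entry["step"]].append((entry["node"], entry["status"]))
    return {step: tries for step, tries in attempts.items() if len(tries) > 1}


timeline = [
    {"step": "redact", "node": "local:node_a", "status": "success"},
    {"step": "summarize", "node": "local:node_a", "status": "timeout"},
    {"step": "summarize", "node": "local:node_b", "status": "success"},
]
print(steps_with_failover(timeline))
# {'summarize': [('local:node_a', 'timeout'), ('local:node_b', 'success')]}
```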

Error Handling¶

Retryable Errors¶

These errors trigger automatic retry:

- Network timeouts
- Node unreachable
- Temporary capacity issues
- HTTP 503 (Service Unavailable)
- HTTP 429 (Rate Limit)

Non-Retryable Errors¶

These errors fail immediately:

- Invalid tool arguments (HTTP 400)
- Missing capabilities (HTTP 403)
- Tool not found (HTTP 404)
- Schema validation failure

Custom Error Handling¶

{
  "retry": {
    "retryable_errors": [
      "NetworkTimeout",
      "CapacityExceeded"
    ],
    "non_retryable_errors": [
      "InvalidInput",
      "SchemaValidationFailed"
    ]
  }
}
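One way to read such a configuration is as a pair of lookup lists. The sketch below is illustrative only: `should_retry` is a hypothetical helper, and two of its rules are assumptions rather than documented behavior, namely that `non_retryable_errors` takes precedence and that errors in neither list are not retried.

```python
def should_retry(error_name, retry_config):
    """Decide whether an error name is retryable under a retry config (sketch).

    Assumptions: the non-retryable list wins on conflict, and an error
    that appears in neither list is treated as non-retryable.
    """
    if error_name in retry_config.get("non_retryable_errors", []):
        return False
    return error_name in retry_config.get("retryable_errors", [])


config = {
    "retryable_errors": ["NetworkTimeout", "CapacityExceeded"],
    "non_retryable_errors": ["InvalidInput", "SchemaValidationFailed"],
}
print(should_retry("NetworkTimeout", config))  # True
print(should_retry("InvalidInput", config))    # False
```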

Idempotency¶

Job IDs¶

Workflows use stable job IDs to prevent duplicate execution:

{
  "job_id": "job-2024-01-27-xyz",
  "workflow_id": "feedback_safe_summary",
  "input": { /* ... */ }
}

If the same job_id is submitted twice:

- First execution proceeds normally
- Second execution returns cached result
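The deduplication rule can be sketched with a dictionary keyed by job ID. The `JobCache` class is hypothetical and keeps results in memory only; how the gateway actually stores and expires cached results is not specified here.

```python
class JobCache:
    """Hypothetical in-memory cache that runs each job_id at most once."""

    def __init__(self):
        self._results = {}

    def submit(self, job_id, run):
        # A repeated job_id returns the stored result without running again.
        if job_id in self._results:
            return self._results[job_id]
        result = run()
        self._results[job_id] = result
        return result


calls = []

def summarize():
    calls.append("ran")  # track how many times the workflow really executes
    return {"ok": True}


cache = JobCache()
first = cache.submit("job-2024-01-27-xyz", summarize)
second = cache.submit("job-2024-01-27-xyz", summarize)
print(len(calls))  # 1
```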

Deterministic Execution¶

Tools marked as deterministic guarantee the same output for the same input:

{
  "tool_id": "tool:unit_convert",
  "deterministic": true
}

This enables safe retries, because repeating a call returns the same result and causes no additional side effects.

Circuit Breaker¶

Preventing Cascading Failures¶

A circuit breaker stops retries to a node after repeated failures:

{
  "circuit_breaker": {
    "failure_threshold": 5,
    "timeout_seconds": 60,
    "half_open_requests": 1
  }
}

States:

- Closed (normal operation)
  - Requests flow normally
  - Track failure rate
- Open (after 5 failures)
  - Fail fast for 60 seconds
  - Prevent overwhelming failing node
- Half-Open (after timeout)
  - Allow 1 test request
  - If succeeds → Close circuit
  - If fails → Re-open circuit
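The three states can be modeled as a small state machine. This is a sketch, not the gateway's implementation: the `CircuitBreaker` class is hypothetical, time is passed in explicitly to keep it deterministic, and it assumes that failures are counted consecutively (a success resets the count).

```python
class CircuitBreaker:
    """Hypothetical closed / open / half-open state machine."""

    def __init__(self, failure_threshold=5, timeout_seconds=60, half_open_requests=1):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_requests = half_open_requests
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.probes_left = 0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.state == "open" and now - self.opened_at >= self.timeout_seconds:
            self.state = "half_open"
            self.probes_left = self.half_open_requests
        if self.state == "closed":
            return True
        if self.state == "half_open" and self.probes_left > 0:
            self.probes_left -= 1  # spend one test request
            return True
        return False  # open, or half-open with no test requests left

    def record(self, success, now):
        """Report the outcome of a request that was allowed."""
        if success:
            self.state = "closed"
            self.failures = 0
            return
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now


breaker = CircuitBreaker()
for t in range(5):
    breaker.allow(t)
    breaker.record(success=False, now=t)
print(breaker.state)       # open
print(breaker.allow(10))   # False
```

Callers would check `allow` before each request and call `record` afterwards, so a failing node receives no traffic while the circuit is open.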

Monitoring & Alerts¶

Metrics to Track¶

# Node availability
curl http://127.0.0.1:8787/v1/metrics | jq '.node_availability'

# Failover rate
curl http://127.0.0.1:8787/v1/metrics | jq '.failover_rate'

# Degraded execution percentage
curl http://127.0.0.1:8787/v1/metrics | jq '.degraded_percentage'

Alerting on Degraded Execution¶

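# Monitor the audit log (one JSON record per line) for degraded workflows
import json

with open('audit/audit.log') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines instead of failing in json.loads
        record = json.loads(line)
        if record.get('degraded'):
            # Use .get() so a record with missing fields does not raise KeyError
            print(f"ALERT: Degraded execution in {record.get('workflow_id', 'unknown')}")
            print(f"Reason: {record.get('degraded_reason', 'not recorded')}")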

Best Practices¶

1. Configure Retries for All External Tools¶

{
  "tool_name": "external_api",
  "retry": {
    "max_attempts": 3,
    "backoff_ms": 1000
  }
}

2. Enable Failover for Critical Workflows¶

{
  "retry": {
    "failover": true
  }
}

3. Set Reasonable TTLs for Queued Requests¶

{
  "offline_behavior": {
    "ttl_seconds": 3600  // 1 hour, not 1 week
  }
}

4. Monitor Degraded Execution Rate¶

If more than 10% of workflows are degraded, investigate the root cause.
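The rate can be estimated from the audit log. This sketch assumes the log holds one JSON record per line with an optional `degraded` field; the function name `degraded_rate` and the sample records are hypothetical.

```python
import json


def degraded_rate(lines):
    """Return the fraction of audit records flagged as degraded (0.0 if empty)."""
    total = degraded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        total += 1
        if record.get("degraded"):
            degraded += 1
    return degraded / total if total else 0.0


sample = [
    '{"workflow_id": "a", "degraded": false}',
    '{"workflow_id": "b", "degraded": true}',
    '{"workflow_id": "c"}',
    '{"workflow_id": "d", "degraded": false}',
]
rate = degraded_rate(sample)
if rate > 0.10:
    print(f"Degraded rate {rate:.0%} exceeds 10%, investigate")
```

To run it against a real log, pass an open file object, for example `degraded_rate(open('audit/audit.log'))`.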

5. Use Circuit Breakers for Flaky Services¶

{
  "circuit_breaker": {
    "failure_threshold": 3,
    "timeout_seconds": 30
  }
}

Testing Resilience¶

Simulate Node Failures¶

# Stop edge node mid-workflow
kill -STOP $(pgrep -f "node.py")

# Resume after 10 seconds
sleep 10
kill -CONT $(pgrep -f "node.py")

Verify Failover¶

python demos/failover_demo.py

Expected output:

{
  "ok": true,
  "execution_path": [
    "local:node_a (timeout)",
    "local:node_b (success)"
  ],
  "degraded": true
}

Next Steps¶

Multi-Node Targeting - Node selection strategies
Offline Operation - Store-and-forward details
Execution Evidence - Audit trail
Deployment - Production deployment