Failover & Retry¶
Build resilient workflows that handle node failures, network outages, and transient errors gracefully.
Retry Configuration¶
Basic Retry¶
Retry failed tool executions:
{
"step_id": "risky_operation",
"tool_name": "external_api_call",
"retry": {
"max_attempts": 3,
"backoff_ms": 1000
}
}
Behavior:
- Attempt 1: Execute immediately
- Attempt 2: Wait 1 second, retry
- Attempt 3: Wait 2 seconds (exponential backoff), retry
- Failure: Return error after 3 attempts
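The behavior above can be sketched as a small retry loop. This is a minimal illustration, not the gateway's implementation: `call_with_retry` and its `sleep` parameter are hypothetical names, and the doubling of the delay assumes a default multiplier of 2, which matches the 1 second and 2 second waits listed above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    backoff_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to max_attempts times, doubling the wait between attempts."""
    delay_ms = backoff_ms
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            sleep(delay_ms / 1000)  # wait before the next attempt
            delay_ms *= 2  # assumption: default multiplier of 2
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. A production version would also catch only retryable errors rather than every `Exception`.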
Exponential Backoff¶
{
"retry": {
"max_attempts": 5,
"backoff_ms": 500,
"backoff_multiplier": 2.0,
"max_backoff_ms": 10000
}
}
Timing:
- Attempt 1: 0ms delay
- Attempt 2: 500ms delay
- Attempt 3: 1000ms delay (500 × 2)
- Attempt 4: 2000ms delay (1000 × 2)
- Attempt 5: 4000ms delay (2000 × 2)
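The timing above follows from the four configuration fields. The sketch below reproduces that schedule. `backoff_delays` is a hypothetical helper, and it assumes that `max_backoff_ms` acts as a cap on each individual delay.

```python
from typing import List, Optional


def backoff_delays(
    max_attempts: int,
    backoff_ms: int,
    backoff_multiplier: float = 2.0,
    max_backoff_ms: Optional[int] = None,
) -> List[int]:
    """Return the delay in milliseconds that precedes each attempt."""
    if max_attempts < 1:
        return []
    delays = [0]  # the first attempt runs immediately
    delay = float(backoff_ms)
    for _ in range(max_attempts - 1):
        # Assumption: max_backoff_ms caps each delay.
        capped = delay if max_backoff_ms is None else min(delay, max_backoff_ms)
        delays.append(int(capped))
        delay *= backoff_multiplier
    return delays


print(backoff_delays(5, 500, 2.0, 10000))  # [0, 500, 1000, 2000, 4000]
```

With these settings the 10000ms cap is never reached in five attempts. It would first apply on a seventh attempt, where the uncapped delay would be 16000ms.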
Failover Strategies¶
1. Automatic Failover¶
Try alternative nodes when primary fails:
Flow:
1. Gateway selects Node A (best match)
2. Node A offline → Gateway selects Node B
3. Node B fails → Gateway selects Node C
4. Node C succeeds → Return result
2. Explicit Failover Chain¶
Specify fallback nodes:
{
"target": {
"kind": "failover_chain",
"nodes": [
"production:primary",
"production:secondary",
"development:backup"
]
}
}
The gateway tries the nodes in order until one succeeds.
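Trying nodes in order can be sketched as follows. This is an illustration only: `run_with_failover`, the `execute` callback, and the `NodeUnavailable` exception are hypothetical names, and the returned fields simply mirror the `execution_path` and `degraded` fields that appear in the response examples on this page.

```python
from typing import Any, Callable, Dict, List


class NodeUnavailable(Exception):
    """Hypothetical error raised when a node cannot be reached."""


def run_with_failover(
    nodes: List[str], execute: Callable[[str], Any]
) -> Dict[str, Any]:
    """Try each node in order and stop at the first one that succeeds."""
    execution_path: List[str] = []
    for node in nodes:
        try:
            result = execute(node)
        except NodeUnavailable as exc:
            execution_path.append(f"{node} ({exc})")  # record why it failed
            continue
        execution_path.append(f"{node} (success)")
        return {
            "ok": True,
            "result": result,
            "execution_path": execution_path,
            # Any earlier failed node means the run was degraded.
            "degraded": len(execution_path) > 1,
        }
    return {"ok": False, "execution_path": execution_path}
```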
3. Load-Balanced Failover¶
Distribute across multiple nodes with automatic failover:
If any node fails, the gateway redistributes its work to the remaining nodes.
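One way to picture redistribution is a round-robin assignment over the nodes that are still healthy. The page does not state which balancing strategy the gateway uses, so round-robin is an assumption here, and `assign_work` is a hypothetical helper.

```python
from itertools import cycle
from typing import Dict, FrozenSet, List


def assign_work(
    items: List[str],
    nodes: List[str],
    offline: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Spread items round-robin across the nodes that are not offline."""
    healthy = [n for n in nodes if n not in offline]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    assignment: Dict[str, List[str]] = {n: [] for n in healthy}
    for item, node in zip(items, cycle(healthy)):
        assignment[node].append(item)
    return assignment


# All nodes healthy, then the same work with n2 offline.
print(assign_work(["a", "b", "c", "d"], ["n1", "n2", "n3"]))
print(assign_work(["a", "b", "c", "d"], ["n1", "n2", "n3"], frozenset({"n2"})))
```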
Offline Operation¶
Store-and-Forward¶
When nodes are offline, requests are queued:
{
"target": { "kind": "local" },
"offline_behavior": {
"mode": "queue",
"ttl_seconds": 86400,
"retry_interval_seconds": 300
}
}
Here `ttl_seconds` is 24 hours and `retry_interval_seconds` is 5 minutes.
Behavior:
1. Request arrives when all nodes offline
2. Gateway stores in mailbox
3. Background worker retries every 5 minutes
4. Request expires after 24 hours if undelivered
5. Node comes online → Request delivers automatically
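The queue-and-retry behavior can be sketched with an in-memory mailbox. This is a simplified model under stated assumptions: the `Mailbox` class and its methods are hypothetical, time is passed in explicitly as `now` (in seconds) so the logic stays deterministic, and a real gateway would persist the queue rather than hold it in memory.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Tuple


@dataclass
class QueuedRequest:
    payload: dict
    enqueued_at: float  # seconds


class Mailbox:
    """Holds requests while nodes are offline and drops them after a TTL."""

    def __init__(self, ttl_seconds: float = 86400):
        self.ttl_seconds = ttl_seconds
        self._queue: Deque[QueuedRequest] = deque()

    def enqueue(self, payload: dict, now: float) -> None:
        self._queue.append(QueuedRequest(payload, now))

    def flush(self, deliver: Callable[[dict], bool], now: float) -> Tuple[int, int]:
        """Run one retry pass. Returns (delivered, expired) counts."""
        delivered = expired = 0
        remaining: Deque[QueuedRequest] = deque()
        while self._queue:
            req = self._queue.popleft()
            if now - req.enqueued_at >= self.ttl_seconds:
                expired += 1  # undelivered for too long: drop it
            elif deliver(req.payload):
                delivered += 1  # a node accepted the request
            else:
                remaining.append(req)  # still offline: keep for the next pass
        self._queue = remaining
        return delivered, expired

    def __len__(self) -> int:
        return len(self._queue)
```

A background worker would call `flush` once per `retry_interval_seconds`, passing a `deliver` callback that returns `True` when a node accepts the request.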
Immediate Failure Mode¶
Fail fast instead of queuing:
Use when:
- Real-time execution required
- Stale results unacceptable
- Client can handle retry logic
Execution Evidence¶
Degraded Execution¶
When failover occurs, results are marked as degraded:
{
"ok": true,
"verified": true,
"result": { /* ... */ },
"execution_path": [
"local:node_a (timeout)",
"local:node_b (success)"
],
"degraded": true,
"degraded_reason": "Primary node unavailable, failed over to backup"
}
Execution Timeline¶
Detailed timing for audit:
{
"timeline": [
{
"step": "redact",
"node": "local:node_a",
"timestamp": "2024-01-27T10:30:00Z",
"duration_ms": 45,
"status": "success"
},
{
"step": "summarize",
"node": "local:node_a",
"timestamp": "2024-01-27T10:30:01Z",
"duration_ms": 0,
"status": "timeout",
"error": "Node unreachable"
},
{
"step": "summarize",
"node": "local:node_b",
"timestamp": "2024-01-27T10:30:06Z",
"duration_ms": 123,
"status": "success"
}
]
}
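A timeline like the one above can be summarized for audit with a few lines of code. The sketch below is illustrative: `summarize_timeline` is a hypothetical helper, and it relies only on the `step`, `node`, `duration_ms`, and `status` fields shown in the example.

```python
from typing import Any, Dict, List


def summarize_timeline(timeline: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collect total execution time and every non-successful attempt."""
    failed = [e for e in timeline if e["status"] != "success"]
    return {
        "total_duration_ms": sum(e["duration_ms"] for e in timeline),
        "failed_attempts": [(e["step"], e["node"]) for e in failed],
        "had_failover": bool(failed),
    }


# Trimmed copy of the timeline shown above.
timeline = [
    {"step": "redact", "node": "local:node_a", "duration_ms": 45, "status": "success"},
    {"step": "summarize", "node": "local:node_a", "duration_ms": 0, "status": "timeout"},
    {"step": "summarize", "node": "local:node_b", "duration_ms": 123, "status": "success"},
]
summary = summarize_timeline(timeline)
print(summary)
```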
Error Handling¶
Retryable Errors¶
These errors trigger automatic retry:
- Network timeouts
- Node unreachable
- Temporary capacity issues
- HTTP 503 (Service Unavailable)
- HTTP 429 (Rate Limit)
Non-Retryable Errors¶
These errors fail immediately:
- Invalid tool arguments (HTTP 400)
- Missing capabilities (HTTP 403)
- Tool not found (HTTP 404)
- Schema validation failure
Custom Error Handling¶
{
"retry": {
"retryable_errors": [
"NetworkTimeout",
"CapacityExceeded"
],
"non_retryable_errors": [
"InvalidInput",
"SchemaValidationFailed"
]
}
}
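The retry decision can be sketched as a small classifier that combines the named error lists with the HTTP status codes listed earlier. `should_retry` is a hypothetical function, and two choices in it are assumptions rather than documented behavior: the explicit lists take precedence over status codes, and unknown errors are not retried.

```python
from typing import Iterable, Optional

# Status codes taken from the retryable and non-retryable lists above.
RETRYABLE_STATUS = {429, 503}
NON_RETRYABLE_STATUS = {400, 403, 404}


def should_retry(
    error_name: Optional[str] = None,
    status: Optional[int] = None,
    retryable_errors: Iterable[str] = ("NetworkTimeout", "CapacityExceeded"),
    non_retryable_errors: Iterable[str] = ("InvalidInput", "SchemaValidationFailed"),
) -> bool:
    """Decide whether a failed call is worth retrying."""
    # Assumption: named errors take precedence over status codes.
    if error_name in set(non_retryable_errors):
        return False
    if error_name in set(retryable_errors):
        return True
    if status in NON_RETRYABLE_STATUS:
        return False
    # Assumption: anything unrecognized is treated as non-retryable.
    return status in RETRYABLE_STATUS
```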
Idempotency¶
Job IDs¶
Workflows use stable job IDs to prevent duplicate execution:
If the same job_id is submitted twice:
- First execution proceeds normally
- Second execution returns cached result
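The duplicate-submission behavior can be sketched with a result cache keyed by `job_id`. The `JobRunner` class below is hypothetical and keeps results in memory, whereas a real gateway would need durable storage to survive restarts.

```python
from typing import Any, Callable, Dict


class JobRunner:
    """Runs each job_id at most once and replays the cached result afterwards."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def submit(self, job_id: str, run: Callable[[], Any]) -> Any:
        if job_id in self._results:
            return self._results[job_id]  # second submission: cached result
        result = run()  # first submission: execute normally
        self._results[job_id] = result
        return result
```

Note that a job which raises is not cached in this sketch, so a failed job can be resubmitted under the same `job_id`.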
Deterministic Execution¶
Tools marked as deterministic guarantee the same output for the same input, which enables safe retries without side effects.
Circuit Breaker¶
Preventing Cascading Failures¶
A circuit breaker stops retrying after repeated failures:
States:
- Closed (normal operation)
  - Requests flow normally
  - Track failure rate
- Open (after 5 failures)
  - Fail fast for 60 seconds
  - Prevent overwhelming failing node
- Half-Open (after timeout)
  - Allow 1 test request
  - If succeeds → Close circuit
  - If fails → Re-open circuit
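The three states can be modeled as a small state machine. This is a sketch under assumptions, not the gateway's implementation: the `CircuitBreaker` class is hypothetical, time is passed in as `now` (in seconds), and it counts consecutive failures as a simplification of tracking a failure rate.

```python
class CircuitBreaker:
    """Closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        if self.state == "open" and now - self.opened_at >= self.open_seconds:
            self.state = "half_open"  # timeout elapsed: allow one test request
            return True
        # Open and half-open both reject further requests (fail fast).
        return self.state == "closed"

    def record_success(self) -> None:
        self.state = "closed"  # test request succeeded, or normal operation
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"  # open, or re-open after a failed test request
            self.opened_at = now
```

Callers check `allow_request` before each call, then report the outcome with `record_success` or `record_failure`.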
Monitoring & Alerts¶
Metrics to Track¶
# Node availability
curl http://127.0.0.1:8787/v1/metrics | jq '.node_availability'
# Failover rate
curl http://127.0.0.1:8787/v1/metrics | jq '.failover_rate'
# Degraded execution percentage
curl http://127.0.0.1:8787/v1/metrics | jq '.degraded_percentage'
Alerting on Degraded Execution¶
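# Monitor audit log for degraded workflows
import json

with open('audit/audit.log') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get('degraded'):
            # .get() avoids a KeyError when a record lacks these fields
            print(f"ALERT: Degraded execution in {record.get('workflow_id', 'unknown')}")
            print(f"Reason: {record.get('degraded_reason', 'not recorded')}")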
Best Practices¶
1. Configure Retries for All External Tools¶
2. Enable Failover for Critical Workflows¶
3. Set Reasonable TTLs for Queued Requests¶
4. Monitor Degraded Execution Rate¶
If more than 10% of workflows are degraded, investigate the root cause.
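The 10% threshold can be checked against audit records with a short helper. The function names below are hypothetical, and the sketch assumes each record is a dictionary that carries a `degraded` flag, as in the audit log example on this page.

```python
from typing import Any, Dict, Iterable


def degraded_rate(records: Iterable[Dict[str, Any]]) -> float:
    """Fraction of records that are flagged as degraded."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("degraded")) / len(records)


def needs_investigation(
    records: Iterable[Dict[str, Any]], threshold: float = 0.10
) -> bool:
    """True when the degraded fraction is strictly above the threshold."""
    return degraded_rate(records) > threshold
```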
5. Use Circuit Breakers for Flaky Services¶
Testing Resilience¶
Simulate Node Failures¶
# Pause the edge node mid-workflow (SIGSTOP suspends the process without killing it)
kill -STOP $(pgrep -f "node.py")
# Resume after 10 seconds
sleep 10
kill -CONT $(pgrep -f "node.py")
Verify Failover¶
Expected output:
{
"ok": true,
"execution_path": [
"local:node_a (timeout)",
"local:node_b (success)"
],
"degraded": true
}
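A test script can assert on this shape instead of comparing output by eye. The check below is a sketch: `looks_like_failover` is a hypothetical helper, and it assumes that the last `execution_path` entry ends with `(success)`, as in the expected output above.

```python
from typing import Any, Dict


def looks_like_failover(response: Dict[str, Any]) -> bool:
    """True when the response succeeded only after at least one failed node."""
    path = response.get("execution_path", [])
    return (
        bool(response.get("ok"))
        and bool(response.get("degraded"))
        and len(path) >= 2
        and path[-1].endswith("(success)")
    )


expected = {
    "ok": True,
    "execution_path": ["local:node_a (timeout)", "local:node_b (success)"],
    "degraded": True,
}
print(looks_like_failover(expected))  # True
```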
Next Steps¶
- Multi-Node Targeting - Node selection strategies
- Offline Operation - Store-and-forward details
- Execution Evidence - Audit trail
- Deployment - Production deployment