Retry Strategy Design
This guide covers how to design an effective retry strategy for your queue-based jobs.
Default retry behavior
Every queue starts with these defaults:
| Setting | Default |
|---|---|
| retries_enabled | true |
| retry_delays | [60, 1800, 10800] (1 min, 30 min, 3 hours) |
| dlq_retention_days | 14 |
This means a failed job is retried up to 3 times, with a longer delay before each retry, before being dead-lettered.
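To make the default timeline concrete, here is a minimal sketch that turns a retry_delays list into cumulative offsets from the first failure. The function name `retry_schedule` is hypothetical, and the sketch assumes each delay starts when the previous attempt fails and that the attempts themselves take no time.

```python
from datetime import timedelta

DEFAULT_RETRY_DELAYS = [60, 1800, 10800]  # seconds, from the defaults table


def retry_schedule(retry_delays):
    """Return the cumulative offset (seconds) of each retry from the first failure.

    Assumption: each delay starts when the previous attempt fails, and the
    attempts themselves take no time.
    """
    offsets = []
    elapsed = 0
    for delay in retry_delays:
        elapsed += delay
        offsets.append(elapsed)
    return offsets


for attempt, offset in enumerate(retry_schedule(DEFAULT_RETRY_DELAYS), start=1):
    print(f"retry {attempt}: {timedelta(seconds=offset)} after the first failure")
```

Under these assumptions the third and final retry runs roughly 3.5 hours after the first failure, which is the point at which a still-failing job would be dead-lettered.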
Choosing retry counts
| Scenario | Recommended retries | Reasoning |
|---|---|---|
| Idempotent endpoints (webhook delivery) | 3-5 retries | Safe to retry; allows time for transient outages |
| Non-idempotent endpoints (payment processing) | 0-1 retries | Retrying may cause duplicate actions |
| Fast, reliable endpoints (internal microservices) | 1-2 retries | Failures are likely transient; quick retry resolves them |
| Slow, unreliable endpoints (third-party APIs) | 3+ retries with long delays | Give the external service time to recover |
Choosing backoff delays
Use increasing delays (exponential-style backoff) to avoid hammering a recovering endpoint:
| Strategy | retry_delays | Total wait | Best for |
|---|---|---|---|
| Aggressive | [10, 60, 300] | ~6 min | Internal services with fast recovery |
| Standard (default) | [60, 1800, 10800] | ~3.5 hours | General purpose |
| Patient | [300, 3600, 21600] | ~7 hours | Third-party APIs with slow recovery |
| Cautious | [60, 600, 3600, 14400, 43200] | ~17 hours | Critical jobs where you want maximum retry window |
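If none of the presets fits, you can build your own list. The following is a rough sketch, not part of the product: `exponential_delays` and `total_wait_hours` are hypothetical helpers that generate an increasing delay list and check its total wait before you put it in retry_delays.

```python
def exponential_delays(base_seconds, factor, count, cap_seconds=None):
    """Build a list of `count` delays that grow by `factor` at each step."""
    delays = []
    for step in range(count):
        delay = base_seconds * factor ** step
        if cap_seconds is not None:
            # Keep any single delay from exceeding the cap.
            delay = min(delay, cap_seconds)
        delays.append(delay)
    return delays


def total_wait_hours(retry_delays):
    """Sum of all delays in hours (ignores time spent executing attempts)."""
    return sum(retry_delays) / 3600


candidate = exponential_delays(base_seconds=60, factor=10, count=4, cap_seconds=43200)
print(candidate, round(total_wait_hours(candidate), 1))
```

Checking the total wait this way helps confirm that the retry window matches how long the target endpoint typically takes to recover.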
Idempotency on your endpoints
Retries mean your endpoint may receive the same request multiple times. Make your endpoints idempotent:
- Use unique identifiers in the payload to detect duplicates
- Check whether the action has already been performed before executing it
- Use database transactions to prevent partial updates
- Return the same response for duplicate requests
If your endpoint is not idempotent, either disable retries or ensure that the retry chain is safe for your use case.
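The checklist above can be sketched as a small handler that records each request by a unique identifier and replays the stored response for duplicates. This is an illustration under stated assumptions: the `job_id` payload field, the `IdempotentHandler` class, and the in-memory dictionary are all hypothetical, and a real endpoint would use a durable store instead.

```python
class IdempotentHandler:
    """Runs an action at most once per job ID and replays the stored response."""

    def __init__(self, action):
        self._action = action
        self._responses = {}  # job_id -> response; stand-in for a durable table

    def handle(self, payload):
        job_id = payload["job_id"]  # hypothetical unique identifier field
        if job_id in self._responses:
            # Duplicate delivery: skip the action, return the original response.
            return self._responses[job_id]
        response = self._action(payload)
        self._responses[job_id] = response
        return response


charges = []  # records every time the side effect actually runs


def charge(payload):
    charges.append(payload["amount"])
    return {"status": "charged", "amount": payload["amount"]}


handler = IdempotentHandler(charge)
first = handler.handle({"job_id": "job-123", "amount": 50})
second = handler.handle({"job_id": "job-123", "amount": 50})  # simulated retry
```

In a real service, the duplicate check and the write would typically share one database transaction (or rely on a unique constraint on the identifier) so that two concurrent deliveries cannot both pass the check.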
DLQ vs. alerts
The dead-letter queue and alerts serve different purposes:
| Feature | DLQ | Alerts |
|---|---|---|
| Purpose | Hold failed jobs for inspection and replay | Notify you that something is wrong |
| Action required | Manual: replay or purge | Investigate and fix the root cause |
| Best for | Jobs that must eventually succeed (webhook delivery, data sync) | Monitoring health and detecting patterns |
Use both together: alerts tell you when jobs are failing, and the DLQ gives you a way to recover once you fix the issue.
When to disable retries
Disable retries (retries_enabled: false) when:
- The endpoint is not idempotent and retrying is unsafe
- Failures are expected and handled elsewhere (e.g., via the callback URL)
- You want immediate feedback without waiting for retries to complete
When retries are disabled, failed jobs are marked failed_no_retries and do not enter the DLQ.
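The difference between the two failure paths can be summarized in a short sketch. Only `failed_no_retries` comes from this guide. The function `outcome_after_failure` and the `retry_in_...` and `dead_lettered` labels are hypothetical names used for illustration, not actual job states.

```python
def outcome_after_failure(retries_enabled, retry_delays, failed_attempts):
    """Describe what happens after the attempt numbered `failed_attempts` fails.

    `failed_attempts` counts the initial attempt, so 1 means only the first
    attempt has failed so far.
    """
    if not retries_enabled:
        return "failed_no_retries"  # state named in this guide; job skips the DLQ
    retries_used = failed_attempts - 1
    if retries_used < len(retry_delays):
        return f"retry_in_{retry_delays[retries_used]}s"  # hypothetical label
    return "dead_lettered"  # hypothetical label for a job that enters the DLQ


print(outcome_after_failure(False, [60, 1800, 10800], 1))
print(outcome_after_failure(True, [60, 1800, 10800], 1))
print(outcome_after_failure(True, [60, 1800, 10800], 4))
```

The practical consequence is that with retries disabled there is nothing in the DLQ to replay, so any recovery has to happen in your own code.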
Next steps
- Queues — Queue configuration reference
- Jobs — Job lifecycle and retry flow
- Using the Dead Letter Queue — Replay and manage failed jobs
- HTTP Endpoint Design — Make your targets reliable