Retry Strategy Design

This guide covers how to design an effective retry strategy for your queue-based jobs.

Default retry behavior

Every queue starts with these defaults:
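The defaults can be sketched as a Python dictionary, reconstructed from the "Standard (default)" row of the delay table in this guide. The key names `retries_enabled` and `retry_delays` appear in this guide. The dictionary form, the name `DEFAULT_RETRY_CONFIG`, and the assumption that retries are enabled by default are illustrative only.

```python
# Assumed defaults, reconstructed from this guide's "Standard (default)" row.
# The key names come from the guide; the config format shown here is hypothetical.
DEFAULT_RETRY_CONFIG = {
    "retries_enabled": True,            # assumption: retries are on by default
    "retry_delays": [60, 1800, 10800],  # seconds: 1 minute, 30 minutes, 3 hours
}

# One retry is scheduled per entry in retry_delays.
max_retries = len(DEFAULT_RETRY_CONFIG["retry_delays"])
print(f"Retries before dead-lettering: {max_retries}")
```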

This means a failed job is retried 3 times with increasing delays (1 minute, 30 minutes, then 3 hours) before being dead-lettered.

Scenario	Recommended retries	Reasoning
Idempotent endpoints (webhook delivery)	3-5 retries	Safe to retry; allows time for transient outages
Non-idempotent endpoints (payment processing)	0-1 retries	Retrying may cause duplicate actions
Fast, reliable endpoints (internal microservices)	1-2 retries	Failures are likely transient; quick retry resolves them
Slow, unreliable endpoints (third-party APIs)	3+ retries with long delays	Give the external service time to recover

Use increasing delays (exponential-style backoff) to avoid hammering a recovering endpoint:

Strategy	`retry_delays`	Total wait	Best for
Aggressive	`[10, 60, 300]`	~6 min	Internal services with fast recovery
Standard (default)	`[60, 1800, 10800]`	~3.5 hours	General purpose
Patient	`[300, 3600, 21600]`	~7 hours	Third-party APIs with slow recovery
Cautious	`[60, 600, 3600, 14400, 43200]`	~17 hours	Critical jobs where you want the maximum retry window
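To check the total retry window for any `retry_delays` list, sum its entries. The following is a minimal sketch, assuming the delays are in seconds and that each delay is counted from the previous failed attempt. The helper names `total_wait` and `retry_offsets` are hypothetical and not part of any queue API.

```python
from datetime import timedelta

# Delay lists copied from the strategy table in this guide (seconds).
STRATEGIES = {
    "aggressive": [10, 60, 300],
    "standard": [60, 1800, 10800],
    "patient": [300, 3600, 21600],
    "cautious": [60, 600, 3600, 14400, 43200],
}


def total_wait(retry_delays):
    """Return the cumulative wait across all retries as a timedelta."""
    return timedelta(seconds=sum(retry_delays))


def retry_offsets(retry_delays):
    """Return the seconds after the first failure at which each retry fires.

    Assumes each delay starts when the previous attempt fails and that
    the attempts themselves take negligible time.
    """
    offsets = []
    elapsed = 0
    for delay in retry_delays:
        elapsed += delay
        offsets.append(elapsed)
    return offsets


for name, delays in STRATEGIES.items():
    print(f"{name:<10} retries={len(delays)} total={total_wait(delays)}")
```

Running the same calculation on a custom list before deploying it is a quick way to confirm that the retry window matches how long the target endpoint typically needs to recover.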

Retries mean your endpoint may receive the same request multiple times. Make your endpoints idempotent:
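One common way to do this is to deduplicate on an idempotency key. The sketch below is illustrative only: it assumes each job carries a unique identifier that stays the same across retries (called `idempotency_key` here, a hypothetical name), and it keeps results in an in-memory dictionary where a real endpoint would need a durable, shared store.

```python
class IdempotentHandler:
    """Run an action at most once per idempotency key (in-memory sketch)."""

    def __init__(self, action):
        self._action = action
        self._results = {}  # key -> result; use a durable shared store in production

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # A retry of a request that already completed: return the stored
            # result instead of repeating the side effect.
            return self._results[idempotency_key]
        # If the action raises, nothing is stored, so a later retry runs it again.
        result = self._action(payload)
        self._results[idempotency_key] = result
        return result


charges = []  # records every real side effect, for demonstration


def charge(payload):
    charges.append(payload["amount"])
    return {"status": "charged", "amount": payload["amount"]}


handler = IdempotentHandler(charge)
first = handler.handle("job-123", {"amount": 50})
retry = handler.handle("job-123", {"amount": 50})  # simulated retry of the same job
print(first == retry, charges)
```

This sketch does not guard against two copies of the same request arriving concurrently; handling that case typically requires an atomic insert or a lock in the backing store.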

If your endpoint is not idempotent, either disable retries or ensure that the retry chain is safe for your use case.

The dead-letter queue (DLQ) and alerts serve different purposes:

Feature	DLQ	Alerts
Purpose	Hold failed jobs for inspection and replay	Notify you that something is wrong
Action required	Manual: replay or purge	Investigate and fix the root cause
Best for	Jobs that must eventually succeed (webhook delivery, data sync)	Monitoring health and detecting patterns

Use both together: alerts tell you when jobs are failing, and the DLQ gives you a way to recover once you fix the issue.

Disable retries (`retries_enabled: false`) when:

When retries are disabled, failed jobs are marked `failed_no_retries` and do not enter the DLQ.
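The difference between the two outcomes can be illustrated with a small simulation. This is a sketch of the behavior described in this guide, not the queue's actual implementation: the function `run_job`, the `wait` hook, the list used as a DLQ, and the status strings `succeeded` and `dead_lettered` are all hypothetical. Only `failed_no_retries` is taken from the text above.

```python
def run_job(attempt, retries_enabled=True, retry_delays=(60, 1800, 10800),
            dlq=None, wait=lambda seconds: None):
    """Run a job once, then retry per retry_delays; return a final status string."""
    try:
        attempt()
        return "succeeded"
    except Exception:
        if not retries_enabled:
            # No retry chain and no DLQ entry when retries are disabled.
            return "failed_no_retries"

    for delay in retry_delays:
        wait(delay)  # a no-op by default so the simulation runs instantly
        try:
            attempt()
            return "succeeded"
        except Exception:
            continue

    # All retries exhausted: hand the job to the dead-letter queue.
    if dlq is not None:
        dlq.append(attempt)
    return "dead_lettered"


calls = {"n": 0}


def always_fails():
    calls["n"] += 1
    raise RuntimeError("endpoint down")


dlq = []
status_with = run_job(always_fails, dlq=dlq)
attempts_with = calls["n"]

calls["n"] = 0
status_without = run_job(always_fails, retries_enabled=False, dlq=dlq)
attempts_without = calls["n"]

print(status_with, attempts_with, status_without, attempts_without, len(dlq))
```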