Retry Strategy Design

This guide covers how to design an effective retry strategy for your queue-based jobs.

Default retry behavior

Every queue starts with these defaults:

| Setting | Default |
| --- | --- |
| retries_enabled | true |
| retry_delays | [60, 1800, 10800] (1 min, 30 min, 3 hours) |
| dlq_retention_days | 14 |

This means a failed job is retried up to 3 times, after delays of 1 minute, 30 minutes, and 3 hours, before it is moved to the dead-letter queue (DLQ).
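To make the default schedule concrete, here is a minimal sketch that turns a retry_delays list into wall-clock retry times. It assumes each delay is measured from the previous failed attempt and that each attempt fails immediately. The function name `retry_schedule` and the sample timestamp are illustrative, not part of the product.

```python
from datetime import datetime, timedelta

# Default delays from the settings above, interpreted as seconds.
DEFAULT_RETRY_DELAYS = [60, 1800, 10800]


def retry_schedule(first_failure, retry_delays=DEFAULT_RETRY_DELAYS):
    """Return the time of each retry attempt.

    Assumption: each delay starts counting when the previous attempt
    fails, and attempts themselves take no time.
    """
    times = []
    current = first_failure
    for delay in retry_delays:
        current = current + timedelta(seconds=delay)
        times.append(current)
    return times


if __name__ == "__main__":
    failed_at = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical first failure
    for n, when in enumerate(retry_schedule(failed_at), start=1):
        print(f"retry {n}: {when:%H:%M}")
```

With a first failure at 12:00, the three retries land at 12:01, 12:31, and 15:31 under these assumptions.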

Choosing retry counts

| Scenario | Recommended retries | Reasoning |
| --- | --- | --- |
| Idempotent endpoints (webhook delivery) | 3-5 retries | Safe to retry; allows time for transient outages |
| Non-idempotent endpoints (payment processing) | 0-1 retries | Retrying may cause duplicate actions |
| Fast, reliable endpoints (internal microservices) | 1-2 retries | Failures are likely transient; quick retry resolves them |
| Slow, unreliable endpoints (third-party APIs) | 3+ retries with long delays | Give the external service time to recover |

Choosing backoff delays

Use increasing delays (exponential-style backoff) to avoid hammering a recovering endpoint:

| Strategy | retry_delays (seconds) | Total wait | Best for |
| --- | --- | --- | --- |
| Aggressive | [10, 60, 300] | ~6 min | Internal services with fast recovery |
| Standard (default) | [60, 1800, 10800] | ~3.5 hours | General purpose |
| Patient | [300, 3600, 21600] | ~7 hours | Third-party APIs with slow recovery |
| Cautious | [60, 600, 3600, 14400, 43200] | ~17 hours | Critical jobs where you want maximum retry window |
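When you design your own schedule, it helps to check the total retry window and confirm that the delays really do increase. The sketch below sums each list and runs that check. It assumes the delays are in seconds and that the total wait is simply their sum (each attempt failing immediately). The helper names are illustrative.

```python
# Delay lists copied from the strategy table, in seconds.
STRATEGIES = {
    "Aggressive": [10, 60, 300],
    "Standard": [60, 1800, 10800],
    "Patient": [300, 3600, 21600],
    "Cautious": [60, 600, 3600, 14400, 43200],
}


def total_wait_seconds(retry_delays):
    """Total time between the first failure and the last retry."""
    return sum(retry_delays)


def is_increasing(retry_delays):
    """True if every delay is strictly longer than the one before it."""
    return all(a < b for a, b in zip(retry_delays, retry_delays[1:]))


if __name__ == "__main__":
    for name, delays in STRATEGIES.items():
        seconds = total_wait_seconds(delays)
        print(f"{name}: {seconds} s ({seconds / 3600:.1f} h), "
              f"increasing={is_increasing(delays)}")
```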

Idempotency on your endpoints

Retries mean your endpoint may receive the same request multiple times. Make your endpoints idempotent:

  • Use unique identifiers in the payload to detect duplicates
  • Check whether the action has already been performed before executing it
  • Use database transactions to prevent partial updates
  • Return the same response for duplicate requests

If your endpoint is not idempotent, either disable retries or confirm that receiving the same request more than once cannot cause harm in your use case.
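The checklist above can be sketched as a small handler that records each processed job and returns the stored response for duplicates. This is one possible approach, not a prescribed implementation. The `job_id` and `amount` payload fields, the `processed` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for your real datastore.

```python
import sqlite3


def make_store():
    """Create an in-memory table of processed jobs (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed (job_id TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return conn


def handle_job(conn, payload):
    """Process a job at most once, keyed on the hypothetical 'job_id' field."""
    job_id = payload["job_id"]
    # 'with conn' commits on success and rolls back if an exception is raised,
    # so a failure cannot leave a partial record behind.
    with conn:
        row = conn.execute(
            "SELECT response FROM processed WHERE job_id = ?", (job_id,)
        ).fetchone()
        if row is not None:
            # Duplicate delivery: return the original response unchanged.
            return row[0]
        # Stand-in for the real action (for example, creating an order).
        response = f"processed amount {payload['amount']}"
        conn.execute(
            "INSERT INTO processed (job_id, response) VALUES (?, ?)",
            (job_id, response),
        )
        return response


if __name__ == "__main__":
    store = make_store()
    job = {"job_id": "job-123", "amount": 5}
    print(handle_job(store, job))  # performs the action
    print(handle_job(store, job))  # retry: same response, no second action
```

The transaction only protects work done in the database. If the action has effects outside it, such as a call to a payment provider, that call needs its own duplicate protection.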

DLQ vs. alerts

The dead-letter queue (DLQ) and alerts serve different purposes:

| Feature | DLQ | Alerts |
| --- | --- | --- |
| Purpose | Hold failed jobs for inspection and replay | Notify you that something is wrong |
| Action required | Manual: replay or purge | Investigate and fix the root cause |
| Best for | Jobs that must eventually succeed (webhook delivery, data sync) | Monitoring health and detecting patterns |

Use both together: alerts tell you when jobs are failing, and the DLQ gives you a way to recover once you fix the issue.

When to disable retries

Disable retries (retries_enabled: false) when:

  • The endpoint is not idempotent and retrying is unsafe
  • Failures are expected and handled elsewhere (e.g., via the callback URL)
  • You want immediate feedback without waiting for retries to complete

When retries are disabled, failed jobs are marked failed_no_retries and do not enter the DLQ.
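The difference between the two outcomes can be sketched as a small decision function. This is a simplified model of the behavior described in this guide, not the service's actual implementation. Only failed_no_retries is a status named here; the "retry" and "dead_lettered" labels and the `on_failure` function are illustrative.

```python
def on_failure(failed_attempts, retries_enabled=True,
               retry_delays=(60, 1800, 10800)):
    """Decide what happens after a job's Nth failed attempt (1-based).

    Returns a (status, delay_seconds) tuple. The delay is None unless
    another retry is scheduled.
    """
    if not retries_enabled:
        # No retries and no DLQ entry: the job ends here.
        return ("failed_no_retries", None)
    retries_used = failed_attempts - 1  # the first attempt is not a retry
    if retries_used < len(retry_delays):
        return ("retry", retry_delays[retries_used])
    # All configured retries are exhausted: the job goes to the DLQ.
    return ("dead_lettered", None)


if __name__ == "__main__":
    for attempt in range(1, 5):
        print(attempt, on_failure(attempt))
    print(on_failure(1, retries_enabled=False))
```

With the default delays, the first three failures each schedule a retry and the fourth failure dead-letters the job. With retries disabled, the first failure is final.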

Next steps