Kafka producers can automatically retry failed requests to improve reliability and handle transient failures in distributed systems.

Why retries matter

In distributed systems, temporary failures are common:
  • Network connectivity issues
  • Broker leadership changes
  • Temporary resource constraints
  • Replication delays
Without retries, each of these transient issues surfaces as a failed send, and the message is lost unless the application resends it. Retries provide resilience against such failures.

Retry configuration

Basic retry settings

# Number of retry attempts (default varies by Kafka version)
retries=2147483647

# Time to wait between retries (default: 100ms)
retry.backoff.ms=100

# Maximum time to wait for acknowledgment (default: 30s)
request.timeout.ms=30000

# Maximum time to deliver a message including retries (default: 2 minutes)
delivery.timeout.ms=120000

Kafka version differences

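Kafka < 2.1:
  • retries=0 (no retries by default)
  • Must explicitly enable retries
Kafka >= 2.1:
  • retries=Integer.MAX_VALUE (effectively unlimited; bounded in practice by delivery.timeout.ms)
  • Retries enabled by default
Kafka >= 3.0:
  • enable.idempotence=true and acks=all by default, so retries no longer introduce duplicates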

Types of errors

Retriable errors

These errors can potentially be resolved by retrying:
  • TimeoutException: Request timed out
  • NotEnoughReplicasException: Not enough in-sync replicas
  • NotEnoughReplicasAfterAppendException: Replication issues
  • RetriableException: Generic retriable error
  • LeaderNotAvailableException: Leader election in progress
  • NetworkException: Network connectivity issues

Non-retriable errors

These errors indicate permanent failures that won’t be resolved by retrying:
  • RecordTooLargeException: Message exceeds size limits
  • SerializationException: Message serialization failed
  • OffsetMetadataTooLarge: Offset metadata too large
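  • InvalidTopicException: Topic name is invalid
  • AuthorizationException, AuthenticationException: Authorization or authentication failure
UnknownTopicOrPartitionException is deliberately absent from this list: the Java client treats it as retriable, because the topic's metadata may simply not have propagated to the broker yet.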
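
The distinction between the two error classes can be illustrated with a small retry loop. This is only a sketch of the idea, not the producer's internal logic: TransientSendException and PermanentSendException are hypothetical stand-ins for Kafka's retriable and non-retriable exceptions, and sendWithRetries is an illustrative helper, not a Kafka API.

```java
import java.util.function.Supplier;

/** Illustrative retry loop; every type here is a hypothetical stand-in, not a Kafka class. */
final class RetrySketch {
    /** Stand-in for errors that may succeed on retry (e.g. a leader election in progress). */
    static class TransientSendException extends RuntimeException {
        TransientSendException(String msg) { super(msg); }
    }

    /** Stand-in for errors that retrying cannot fix (e.g. a record that is too large). */
    static class PermanentSendException extends RuntimeException {
        PermanentSendException(String msg) { super(msg); }
    }

    /** Runs the action, retrying up to maxRetries times on transient errors only. */
    static <T> T sendWithRetries(Supplier<T> action, int maxRetries, long backoffMs) {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.get();
            } catch (TransientSendException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last transient error
                }
                try {
                    Thread.sleep(backoffMs); // fixed wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
            // PermanentSendException is not caught, so it propagates on the first attempt.
        }
    }
}
```

With maxRetries=5, a send that fails transiently twice succeeds on the third attempt, while a permanent error is reported after a single attempt.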

Retry backoff strategies

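Fixed backoff (default before Kafka 3.7)

Waits a fixed amount of time between retries:
retry.backoff.ms=100  # Wait 100ms before each retry
Pattern: Wait → Retry → Wait → Retry → Wait → Retry

Exponential backoff

Older producer clients do not support this natively, so it has to be implemented at the application level. Clients from Kafka 3.7 onward back off exponentially, starting at retry.backoff.ms and capped at retry.backoff.max.ms (default: 1000ms). The delays grow like this: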
Attempt 1: Wait 100ms
Attempt 2: Wait 200ms  
Attempt 3: Wait 400ms
Attempt 4: Wait 800ms
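
The schedule above can be computed with a few lines of code. The following is a minimal sketch for an application-level implementation; BackoffSketch and delayMs are hypothetical names, and the cap parameter is an assumption added so the delay cannot grow without bound. Real implementations usually also add random jitter, which is omitted here.

```java
/** Illustrative exponential backoff schedule; not part of the Kafka client API. */
final class BackoffSketch {
    /**
     * Delay before retry number `attempt` (1-based): initialMs * 2^(attempt - 1),
     * capped at maxMs.
     */
    static long delayMs(int attempt, long initialMs, long maxMs) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Limit the exponent so the multiplication cannot overflow for typical inputs.
        int exponent = Math.min(attempt - 1, 30);
        long delay = initialMs * (1L << exponent);
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("Attempt " + attempt + ": Wait " + delayMs(attempt, 100, 10_000) + "ms");
        }
    }
}
```

With an initial delay of 100ms and a cap of 10 seconds, attempts 1 to 4 wait 100, 200, 400, and 800ms, matching the schedule above.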

Impact on message ordering

With retries enabled

Without idempotence, retries can reorder messages within a partition whenever more than one request is in flight at a time:
Message A sent → Fails → Retry scheduled
Message B sent → Succeeds immediately
Message A retry → Succeeds

Result: Message B appears before Message A in partition
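
The reordering can be reproduced with a toy simulation. This sketch models a non-idempotent producer only in the loosest sense: OrderingSketch and simulate are hypothetical names, requests are processed in rounds of at most maxInFlight messages, and a failed message is put back at the front of the queue for the next round.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of retries with several in-flight requests; not Kafka's actual sender logic. */
final class OrderingSketch {
    /** Returns the order in which messages end up in the partition log. */
    static List<String> simulate(List<String> messages, Set<String> failOnce, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        Deque<String> queue = new ArrayDeque<>(messages);
        Set<String> alreadyFailed = new HashSet<>();
        List<String> log = new ArrayList<>();
        while (!queue.isEmpty()) {
            // Send up to maxInFlight messages without waiting for acknowledgments.
            List<String> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !queue.isEmpty()) {
                inFlight.add(queue.pollFirst());
            }
            List<String> retries = new ArrayList<>();
            for (String msg : inFlight) {
                if (failOnce.contains(msg) && alreadyFailed.add(msg)) {
                    retries.add(msg); // first attempt fails, schedule a retry
                } else {
                    log.add(msg);     // written to the partition
                }
            }
            // Retries go back to the front of the queue, keeping their relative order.
            for (int i = retries.size() - 1; i >= 0; i--) {
                queue.addFirst(retries.get(i));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList("A", "B");
        Set<String> failOnce = Collections.singleton("A");
        System.out.println(simulate(messages, failOnce, 2)); // [B, A]
        System.out.println(simulate(messages, failOnce, 1)); // [A, B]
    }
}
```

With two requests in flight, B is written while A is still waiting for its retry. With one request in flight, B is not sent until A has been acknowledged, so the order is preserved.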

Preserving order

To maintain strict ordering, choose one of the following:
# Option 1: allow only one in-flight request (lower throughput)
max.in.flight.requests.per.connection=1

# Option 2 (recommended): use the idempotent producer
enable.idempotence=true
max.in.flight.requests.per.connection=5  # Ordering is preserved with up to 5 in-flight requests

Delivery timeout vs request timeout

Request timeout

Time to wait for a single request attempt:
request.timeout.ms=30000  # 30 seconds per attempt

Delivery timeout

Total time limit for delivering a message (including all retries):
delivery.timeout.ms=120000  # 2 minutes total
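Relationship:
delivery.timeout.ms >= linger.ms + request.timeout.ms
The producer rejects a configuration that violates this constraint. retries and retry.backoff.ms are not part of it: the producer keeps retrying until either retries is exhausted or delivery.timeout.ms expires, whichever comes first.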
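
The relationship between the timeouts can be checked before a producer is created. The sketch below uses only java.util.Properties with string values; TimeoutCheckSketch and its methods are hypothetical helpers, and the fallback defaults (linger.ms=0 in particular, which differs in newer clients) are assumptions. The attempt estimate is a rough model in which every attempt runs into the full request timeout and the backoff is fixed.

```java
import java.util.Properties;

/** Illustrative sanity checks for producer timeout settings; not a Kafka API. */
final class TimeoutCheckSketch {
    /** Reads a numeric setting stored as a string, falling back to an assumed default. */
    static long getLong(Properties p, String key, long assumedDefault) {
        return Long.parseLong(p.getProperty(key, Long.toString(assumedDefault)));
    }

    /** True if delivery.timeout.ms >= linger.ms + request.timeout.ms. */
    static boolean isConsistent(Properties p) {
        long delivery = getLong(p, "delivery.timeout.ms", 120_000);
        long request = getLong(p, "request.timeout.ms", 30_000);
        long linger = getLong(p, "linger.ms", 0); // assumption: older default
        return delivery >= linger + request;
    }

    /**
     * Rough worst case: how many attempts fit into the delivery window if each one
     * times out after request.timeout.ms and is followed by a fixed backoff.
     * n attempts need n * request + (n - 1) * backoff milliseconds.
     */
    static long timedOutAttemptsThatFit(Properties p) {
        long delivery = getLong(p, "delivery.timeout.ms", 120_000);
        long request = getLong(p, "request.timeout.ms", 30_000);
        long backoff = getLong(p, "retry.backoff.ms", 100);
        long linger = getLong(p, "linger.ms", 0);
        return Math.max(0, (delivery - linger + backoff) / (request + backoff));
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        System.out.println(isConsistent(defaults));            // true
        System.out.println(timedOutAttemptsThatFit(defaults)); // 3
    }
}
```

Under these assumptions, the default 2-minute delivery window leaves room for only three attempts that each time out after 30 seconds, no matter how high retries is set.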

Configuration examples

# Unlimited retries with delivery timeout
retries=2147483647
delivery.timeout.ms=300000      # 5 minutes total
request.timeout.ms=30000        # 30 seconds per attempt
retry.backoff.ms=100            # 100ms between retries
enable.idempotence=true         # Preserve ordering and avoid duplicates

Fast failure

# Limited retries for quick feedback
retries=3
delivery.timeout.ms=10000       # 10 seconds total
request.timeout.ms=5000         # 5 seconds per attempt  
retry.backoff.ms=100            # 100ms between retries

No retries

# Fail on the first error
retries=0
request.timeout.ms=30000

Monitoring retry behavior

Key metrics to track

  • record-retry-rate: Average number of retried record sends per second
  • record-retry-total: Total number of retried record sends
  • record-error-rate: Average number of record sends per second that resulted in errors (after all retries)
  • request-latency-avg: Average request latency in milliseconds

JMX metrics

kafka.producer:type=producer-metrics,client-id=<client-id>
- record-retry-rate
- record-retry-total
- record-error-rate
- request-rate
- request-latency-avg
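
These attributes can be read through the standard JMX API. The following sketch queries the platform MBean server of the current JVM, so it only sees producers running in the same process; RetryMetricsSketch and readRetryMetrics are hypothetical names, and the attribute names are assumed to match the producer metrics listed above.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Illustrative in-process reader for producer retry metrics exposed over JMX. */
final class RetryMetricsSketch {
    private static final String[] ATTRIBUTES = {
        "record-retry-rate", "record-retry-total", "record-error-rate"
    };

    /** Returns "<client-id>/<metric>" -> value for every producer registered in this JVM. */
    static Map<String, Object> readRetryMetrics() {
        Map<String, Object> out = new LinkedHashMap<>();
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // The wildcard matches the producer-metrics MBean of every client id.
            ObjectName pattern = new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                for (String attribute : ATTRIBUTES) {
                    String key = name.getKeyProperty("client-id") + "/" + attribute;
                    out.put(key, server.getAttribute(name, attribute));
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("could not read producer metrics", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints an empty map when no producer is running in this JVM.
        System.out.println(readRetryMetrics());
    }
}
```

A steadily rising record-retry-rate usually points to broker or network trouble before errors become visible to the application.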

Error handling strategies

Synchronous error handling

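import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.*;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumption: adjust to your cluster
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("retries", 5);
props.put("retry.backoff.ms", 100);

Producer<String, String> producer = new KafkaProducer<>(props);

try {
    ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");
    // get() blocks until the record is acknowledged or delivery finally fails
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("Message sent to " + metadata.topic() + ":" + metadata.partition());
} catch (ExecutionException e) {
    // The cause is the final error: retries exhausted or a non-retriable failure
    System.err.println("Send failed: " + e.getCause().getMessage());
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt status
}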

Asynchronous error handling

ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        System.err.println("Failed after all retries: " + exception.getMessage());
    } else {
        System.out.println("Message sent successfully");
    }
});

Best practices

Production recommendations

  1. Enable unlimited retries: Set retries=Integer.MAX_VALUE
  2. Use delivery timeout: Set delivery.timeout.ms to control total time
  3. Enable idempotency: Prevents duplicates during retries
  4. Monitor retry metrics: Track retry rates and error patterns
  5. Handle non-retriable errors: Implement proper error handling for permanent failures

Configuration checklist

  • retries=Integer.MAX_VALUE (unlimited retries)
  • delivery.timeout.ms=120000 (reasonable total timeout)
  • request.timeout.ms=30000 (reasonable per-request timeout)
  • retry.backoff.ms=100 (reasonable delay between retries)
  • enable.idempotence=true (prevent duplicates)

Common mistakes to avoid

  • Setting retries=0 in production
  • Not handling non-retriable errors
  • Setting delivery timeout too low
  • Ignoring retry metrics and error rates

Default behavior in modern Kafka: Starting with Kafka 3.0, producers have sensible retry defaults:
  • Unlimited retries with idempotency enabled
  • 2-minute delivery timeout
  • Retry behavior that suits most use cases without further tuning

Message ordering: Retries can affect message ordering within partitions. Use enable.idempotence=true or max.in.flight.requests.per.connection=1 if strict ordering is required.