Kafka producer retries

Kafka producers can automatically retry failed requests to improve reliability and handle transient failures in distributed systems.

Why retries matter

In distributed systems, temporary failures are common:

Network connectivity issues
Broker leadership changes
Temporary resource constraints
Replication delays

Without retries, each of these transient issues would surface as a failed send, and the message would be lost unless the application resent it. Retries provide resilience against such failures.

Retry configuration

Basic retry settings

# Number of retry attempts (default: 2147483647 since Kafka 2.1, 0 in earlier versions)
retries=2147483647

# Time to wait between retries (default: 100ms)
retry.backoff.ms=100

# Maximum time to wait for acknowledgment (default: 30s)
request.timeout.ms=30000

# Maximum time to deliver a message including retries (default: 2 minutes)
delivery.timeout.ms=120000

Kafka version differences

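Kafka < 2.1:

retries=0 (no retries by default)
Must explicitly enable retries

Kafka >= 2.1:

retries=Integer.MAX_VALUE (effectively unlimited retries, bounded by delivery.timeout.ms)

Kafka >= 3.0:

enable.idempotence=true and acks=all by default
Retries enabled by default with idempotent producers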

Types of errors

Retriable errors

These errors can potentially be resolved by retrying:

TimeoutException: Request timed out
NotEnoughReplicasException: Not enough in-sync replicas
NotEnoughReplicasAfterAppendException: Replication issues
RetriableException: Generic retriable error
LeaderNotAvailableException: Leader election in progress
NetworkException: Network connectivity issues

Non-retriable errors

These errors indicate permanent failures that won’t be resolved by retrying:

RecordTooLargeException: Message exceeds size limits
SerializationException: Message serialization failed
OffsetMetadataTooLarge: Offset metadata too large
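InvalidTopicException: Topic name is invalid (for example too long or containing illegal characters)
AuthenticationException: Authentication failure
AuthorizationException: Client is not authorized for the operation

Note that UnknownTopicOrPartitionException is a retriable error in the Java client: it extends InvalidMetadataException, so the producer refreshes its metadata and retries until delivery.timeout.ms expires.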

Retry backoff strategies

Fixed backoff (default)

Waits a fixed amount of time between retries (the behavior of clients before Kafka 3.7):

# Always wait 100ms between retries
retry.backoff.ms=100

Pattern: Wait → Retry → Wait → Retry → Wait → Retry

Exponential backoff

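Before Kafka 3.7 the producer did not support exponential backoff natively, so it had to be implemented at the application level. Newer clients (KIP-580) add retry.backoff.max.ms (default: 1000ms): the wait starts at retry.backoff.ms and grows exponentially, with some random jitter, until it reaches that cap. Ignoring jitter and the cap, the pattern looks like: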

Attempt 1: Wait 100ms
Attempt 2: Wait 200ms  
Attempt 3: Wait 400ms
Attempt 4: Wait 800ms
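The doubling schedule above can be sketched as a small helper. This is a minimal illustration, not the Kafka client's own implementation: the class and method names (BackoffSchedule, delayMs, schedule) are hypothetical, the cap parameter is an assumption added so the delay cannot grow without bound, and jitter is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes a capped, doubling backoff schedule.
final class BackoffSchedule {

    /** Delay before the given attempt (1-based): initialMs doubled each attempt, capped at maxMs. */
    static long delayMs(int attempt, long initialMs, long maxMs) {
        if (attempt < 1 || initialMs < 0 || maxMs < 0) {
            throw new IllegalArgumentException("attempt must be >= 1 and delays must be >= 0");
        }
        long delay = initialMs;
        for (int i = 1; i < attempt && delay < maxMs; i++) {
            // Jump straight to the cap instead of doubling past it (also avoids overflow).
            delay = (delay > maxMs / 2) ? maxMs : delay * 2;
        }
        return Math.min(delay, maxMs);
    }

    /** Delays for attempts 1..attempts, in order. */
    static List<Long> schedule(int attempts, long initialMs, long maxMs) {
        List<Long> delays = new ArrayList<>();
        for (int attempt = 1; attempt <= attempts; attempt++) {
            delays.add(delayMs(attempt, initialMs, maxMs));
        }
        return delays;
    }
}
```

Calling BackoffSchedule.schedule(4, 100, 1000) reproduces the four waits listed above; with more attempts the delay stays at the cap.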

Impact on message ordering

With retries enabled

When idempotence is disabled and more than one request can be in flight per connection, retries can reorder messages within a partition:

Message A sent → Fails → Retry scheduled
Message B sent → Succeeds immediately
Message A retry → Succeeds

Result: Message B appears before Message A in partition
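The sequence above can be reproduced with a toy model. The sketch below is a deliberately simplified simulation under stated assumptions, not how the producer is implemented: OrderingSketch and deliver are hypothetical names, batches are plain strings, each "round" sends up to maxInFlight batches together, and a batch listed in failOnce fails exactly once before succeeding.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of in-flight requests and retries (no idempotence).
final class OrderingSketch {

    /** Returns the order in which batches end up in the partition log. */
    static List<String> deliver(List<String> batches, Set<String> failOnce, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        Deque<String> pending = new ArrayDeque<>(batches);
        Set<String> alreadyFailed = new HashSet<>();
        List<String> log = new ArrayList<>();

        while (!pending.isEmpty()) {
            // Up to maxInFlight batches are sent without waiting for each other.
            List<String> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !pending.isEmpty()) {
                inFlight.add(pending.pollFirst());
            }
            List<String> toRetry = new ArrayList<>();
            for (String batch : inFlight) {
                if (failOnce.contains(batch) && alreadyFailed.add(batch)) {
                    toRetry.add(batch);   // first attempt fails, retry scheduled
                } else {
                    log.add(batch);       // broker appends the batch
                }
            }
            // Retries go back to the head of the queue, keeping their relative order.
            for (int i = toRetry.size() - 1; i >= 0; i--) {
                pending.addFirst(toRetry.get(i));
            }
        }
        return log;
    }
}
```

With batches A and B, A failing once, and five requests allowed in flight, the model yields B before A; with a single in-flight request, B waits for A's retry and the original order is kept.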

Preserving order

To maintain strict ordering, configure:

# Limit in-flight requests to preserve order
max.in.flight.requests.per.connection=1

# Or use idempotent producer (recommended)
enable.idempotence=true
# Up to 5 with idempotency
max.in.flight.requests.per.connection=5

Delivery timeout vs request timeout

Request timeout

Time to wait for a single request attempt:

# 30 seconds per attempt
request.timeout.ms=30000

Delivery timeout

Total time limit for delivering a message (including all retries):

# 2 minutes total
delivery.timeout.ms=120000

Relationship:

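delivery.timeout.ms >= linger.ms + request.timeout.ms

The producer rejects an explicitly set delivery.timeout.ms that violates this. The retries setting and retry.backoff.ms are not part of the constraint: retrying simply stops once delivery.timeout.ms has elapsed, however many attempts remain.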
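To get a feel for how the two timeouts interact, the arithmetic can be worked through in code. This is a rough back-of-the-envelope sketch, assuming the worst case in which every attempt runs to its full request timeout and the backoff is fixed; DeliveryBudget and its methods are hypothetical names, and the minimum shown is the documented lower bound of linger.ms + request.timeout.ms.

```java
// Hypothetical helper for reasoning about the delivery timeout budget.
final class DeliveryBudget {

    /** Lower bound for delivery.timeout.ms: linger.ms + request.timeout.ms. */
    static long minDeliveryTimeoutMs(long lingerMs, long requestTimeoutMs) {
        return lingerMs + requestTimeoutMs;
    }

    /**
     * Number of attempts that can each run to a full request timeout before the
     * delivery timeout expires, assuming a fixed backoff between attempts.
     * n attempts cost n * requestTimeoutMs + (n - 1) * retryBackoffMs.
     */
    static long worstCaseAttempts(long deliveryTimeoutMs, long lingerMs,
                                  long requestTimeoutMs, long retryBackoffMs) {
        if (requestTimeoutMs <= 0 || retryBackoffMs < 0) {
            throw new IllegalArgumentException("requestTimeoutMs must be > 0 and retryBackoffMs >= 0");
        }
        long remaining = deliveryTimeoutMs - lingerMs;
        if (remaining <= 0) {
            return 0;
        }
        return (remaining + retryBackoffMs) / (requestTimeoutMs + retryBackoffMs);
    }
}
```

With the defaults shown earlier (120000 ms delivery timeout, 30000 ms request timeout, 100 ms backoff, no linger), only three fully timed-out attempts fit in the budget, even though retries is effectively unlimited. Attempts that fail quickly, such as a leader change, use far less of the budget.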

Configuration examples

High reliability (recommended)

# Unlimited retries, bounded by the delivery timeout
retries=2147483647
# 5 minutes total
delivery.timeout.ms=300000
# 30 seconds per attempt
request.timeout.ms=30000
# 100ms between retries
retry.backoff.ms=100
# Preserve ordering and avoid duplicates
enable.idempotence=true

Fast failure

# Limited retries for quick feedback
retries=3
# 10 seconds total
delivery.timeout.ms=10000
# 5 seconds per attempt
request.timeout.ms=5000
# 100ms between retries
retry.backoff.ms=100

No retries (not recommended for production)

retries=0
request.timeout.ms=30000

Monitoring retry behavior

Key metrics to track

record-retry-rate: Average per-second rate of retried record sends
record-retry-total: Total number of retried record sends
record-error-rate: Average per-second rate of record sends that resulted in errors (after all retries)
request-latency-avg: Average time taken for requests

JMX metrics

kafka.producer:type=producer-metrics,client-id=<client-id>
- record-retry-rate
- record-retry-total
- record-error-rate
- request-rate
- request-latency-avg
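These MBeans can also be read programmatically from inside the JVM that runs the producer. The sketch below uses only the standard javax.management API; the class name ProducerRetryMetrics is hypothetical, and the attribute names are assumed to be the record-level metric names the Java client registers under producer-metrics (record-retry-rate, record-retry-total, record-error-rate). Attributes a given client version does not expose are skipped.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: reads retry-related producer metrics from the local MBean server.
final class ProducerRetryMetrics {

    // Assumed attribute names; adjust to what your client version exposes.
    private static final List<String> ATTRIBUTES =
            List.of("record-retry-rate", "record-retry-total", "record-error-rate");

    /** Returns "<client-id>/<attribute>" -> value for every producer in this JVM. */
    static Map<String, Object> read() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            ObjectName pattern =
                    new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                for (String attribute : ATTRIBUTES) {
                    try {
                        values.put(name.getKeyProperty("client-id") + "/" + attribute,
                                   server.getAttribute(name, attribute));
                    } catch (JMException e) {
                        // Attribute missing or MBean gone: skip it.
                    }
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("Invalid MBean name pattern", e);
        }
        return values;
    }
}
```

In a JVM with no running producer the map is simply empty; a remote JVM would need a JMX connector instead of the platform MBean server.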

Error handling strategies

Synchronous error handling

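import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.*;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("retries", 5);
props.put("retry.backoff.ms", 100);

Producer<String, String> producer = new KafkaProducer<>(props);

try {
    ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");
    // get() blocks until the broker acknowledges the record or the send fails for good
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("Message sent to " + metadata.topic() + ":" + metadata.partition());
} catch (ExecutionException e) {
    // The cause is the Kafka error: non-retriable, or retries/delivery timeout exhausted
    System.err.println("Send failed: " + e.getCause());
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}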

Asynchronous error handling

ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        System.err.println("Failed after all retries: " + exception.getMessage());
    } else {
        System.out.println("Message sent successfully");
    }
});

Best practices

Production recommendations

Enable unlimited retries: Set retries=Integer.MAX_VALUE
Use delivery timeout: Set delivery.timeout.ms to control total time
Enable idempotency: Prevents duplicates during retries
Monitor retry metrics: Track retry rates and error patterns
Handle non-retriable errors: Implement proper error handling for permanent failures

Configuration checklist

✅ retries=Integer.MAX_VALUE (unlimited retries)
✅ delivery.timeout.ms=120000 (reasonable total timeout)
✅ request.timeout.ms=30000 (reasonable per-request timeout)
✅ retry.backoff.ms=100 (reasonable delay between retries)
✅ enable.idempotence=true (prevent duplicates)

Common mistakes to avoid

Setting retries=0 in production
Not handling non-retriable errors
Setting delivery timeout too low
Ignoring retry metrics and error rates

Default behavior in modern Kafka

Starting with Kafka 3.0, producers have sensible retry defaults:

Unlimited retries with idempotency enabled
2-minute delivery timeout
Proper error handling for most use cases

Message ordering

Retries can affect message ordering within partitions. Use enable.idempotence=true or max.in.flight.requests.per.connection=1 if strict ordering is required.
