Kafka producers can automatically retry failed requests to improve reliability and handle transient failures in distributed systems.

Why retries matter

In distributed systems, temporary failures are common:
  • Network connectivity issues
  • Broker leadership changes
  • Temporary resource constraints
  • Replication delays

Without retries, each of these transient failures surfaces as a failed send, and the message is lost unless the application resends it. Retries let the producer recover from such failures automatically.

Retry configuration

Basic retry settings

# Number of retry attempts (default varies by Kafka version)
retries=2147483647

# Time to wait between retries (default: 100ms)
retry.backoff.ms=100

# Maximum time to wait for acknowledgment (default: 30s)
request.timeout.ms=30000

# Maximum time to deliver a message including retries (default: 2 minutes)
delivery.timeout.ms=120000
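
The same settings can also be built in code. The following is a minimal sketch using only java.util.Properties; the class and method names (RetryProps, basicRetryProps) are illustrative and not part of the Kafka API. The resulting object would still need bootstrap.servers and serializer settings before being passed to a KafkaProducer.

```java
import java.util.Properties;

// Hypothetical helper (not part of Kafka): collects the basic retry settings.
final class RetryProps {
    static Properties basicRetryProps() {
        Properties props = new Properties();
        // Effectively unlimited retries; delivery.timeout.ms is the real bound
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("retry.backoff.ms", "100");
        props.setProperty("request.timeout.ms", "30000");
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings as key=value pairs
        basicRetryProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

String values are used here because the producer parses numeric settings from strings, so the same map can be loaded from a file or built in code.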

Kafka version differences

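Kafka <= 2.0:
  • retries=0 (no retries by default)
  • Must explicitly enable retries
Kafka >= 2.1:
  • retries=2147483647 (Integer.MAX_VALUE, effectively unlimited; bounded in practice by delivery.timeout.ms)
Kafka >= 3.0:
  • enable.idempotence=true and acks=all by default, so retries no longer introduce duplicates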

Types of errors

Retriable errors

These errors can potentially be resolved by retrying:
  • TimeoutException: Request timed out
  • NotEnoughReplicasException: Not enough in-sync replicas
  • NotEnoughReplicasAfterAppendException: Replication issues
  • RetriableException: Base class of all retriable errors
  • LeaderNotAvailableException: Leader election in progress
  • NetworkException: Network connectivity issues

Non-retriable errors

These errors indicate permanent failures that won’t be resolved by retrying:
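  • RecordTooLargeException: Message exceeds size limits
  • SerializationException: Message serialization failed
  • OffsetMetadataTooLargeException: Offset metadata too large
  • InvalidTopicException: Topic name is invalid
  • AuthorizationException (for example TopicAuthorizationException): Client lacks permission for the operation
  • AuthenticationException: Client credentials were rejected

Note that UnknownTopicOrPartitionException does not belong in this list. In the Java client it is a RetriableException, because it often clears once metadata is refreshed (for example, shortly after a topic is created).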

Retry backoff strategies

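Fixed backoff (default before Kafka 3.7)

Waits a fixed amount of time between retries:
# Always wait 100ms between retries
retry.backoff.ms=100
Pattern: Wait → Retry → Wait → Retry → Wait → Retry

Exponential backoff

Recent Java clients (Kafka 3.7 and later, via KIP-580) add retry.backoff.max.ms and grow the wait exponentially from retry.backoff.ms up to that cap. Older producers do not support this natively, but it can be implemented at the application level: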
Attempt 1: Wait 100ms
Attempt 2: Wait 200ms  
Attempt 3: Wait 400ms
Attempt 4: Wait 800ms
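
The schedule above doubles the wait on each attempt. A minimal application-level sketch follows; the Backoff class, the delayMs method, and the cap parameter are illustrative assumptions, not Kafka APIs. How the delay is applied (for example, sleeping before re-sending a record whose send failed) is left to the application, and jitter is omitted for brevity.

```java
// Hypothetical helper (not part of Kafka): computes an exponential backoff schedule.
final class Backoff {
    // Delay before retry number `attempt` (1-based): baseMs * 2^(attempt - 1), capped at maxMs.
    static long delayMs(int attempt, long baseMs, long maxMs) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Bound the exponent so the multiplication stays in range for realistic millisecond values
        int exponent = Math.min(attempt - 1, 30);
        return Math.min(baseMs * (1L << exponent), maxMs);
    }

    public static void main(String[] args) {
        // Prints the delays for attempts 1 to 4 with a 100ms base and a 1000ms cap
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("Attempt " + attempt + ": Wait " + delayMs(attempt, 100, 1000) + "ms");
        }
    }
}
```

The cap keeps a long outage from producing arbitrarily long waits; with a 100ms base and a 1000ms cap, every attempt from the fifth onward waits 1000ms.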

Impact on message ordering

With retries enabled

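Retries can affect message ordering within a partition when more than one request is in flight (max.in.flight.requests.per.connection > 1) and idempotence is disabled: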
Message A sent → Fails → Retry scheduled
Message B sent → Succeeds immediately
Message A retry → Succeeds

Result: Message B appears before Message A in partition

Preserving order

To maintain strict ordering, use one of the following:
# Option 1: limit in-flight requests to one (reduces throughput)
max.in.flight.requests.per.connection=1

# Option 2 (recommended): use the idempotent producer,
# which preserves order with up to 5 in-flight requests
enable.idempotence=true
max.in.flight.requests.per.connection=5
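
The two ordering-safe configurations can be expressed as a simple check. This is a sketch under stated assumptions: OrderingCheck and preservesOrder are hypothetical names, and a missing enable.idempotence key is treated as false, which is deliberately conservative (Kafka 3.0 and later default it to true).

```java
import java.util.Properties;

// Hypothetical helper (not part of Kafka): checks producer settings against the
// two ordering-safe configurations described in the text.
final class OrderingCheck {
    static boolean preservesOrder(Properties props) {
        // Assumption: an absent key means idempotence is off (conservative)
        boolean idempotent =
            Boolean.parseBoolean(props.getProperty("enable.idempotence", "false"));
        // 5 is the producer's default for this setting
        int inFlight = Integer.parseInt(
            props.getProperty("max.in.flight.requests.per.connection", "5"));
        // Either one request at a time, or idempotence with at most 5 in flight
        return inFlight == 1 || (idempotent && inFlight <= 5);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        props.setProperty("max.in.flight.requests.per.connection", "5");
        System.out.println("Ordering preserved: " + preservesOrder(props));
    }
}
```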

Delivery timeout vs request timeout

Request timeout

Time to wait for a single request attempt:
# 30 seconds per attempt
request.timeout.ms=30000

Delivery timeout

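Total time limit for delivering a message, measured from the moment send() returns and covering batching delay, the wait for acknowledgment, and all retries:
# 2 minutes total
delivery.timeout.ms=120000

Relationship:
delivery.timeout.ms >= linger.ms + request.timeout.ms
An explicitly configured delivery.timeout.ms that violates this constraint is rejected with a ConfigException. Retrying stops when either the retries count or delivery.timeout.ms is exhausted, whichever comes first.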
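
To see how these timeouts interact, it can help to work out how many attempts fit into the delivery window. The sketch below checks the constraint from Kafka's configuration documentation (delivery.timeout.ms >= linger.ms + request.timeout.ms) and gives a rough worst-case estimate. TimeoutBudget and its methods are hypothetical names, and the estimate assumes a fixed backoff and that every attempt runs to its full request timeout.

```java
// Hypothetical helper (not part of Kafka): reasons about the producer's timeout budget.
final class TimeoutBudget {
    // Documented constraint: delivery.timeout.ms >= linger.ms + request.timeout.ms
    static boolean isValid(long deliveryTimeoutMs, long lingerMs, long requestTimeoutMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    // Rough count of complete attempts that fit in the delivery window when every
    // attempt times out. Solves n * request + (n - 1) * backoff <= delivery - linger for n.
    static long worstCaseAttempts(long deliveryTimeoutMs, long lingerMs,
                                  long requestTimeoutMs, long retryBackoffMs) {
        long budget = deliveryTimeoutMs - lingerMs;
        if (budget <= 0) {
            return 0;
        }
        return (budget + retryBackoffMs) / (requestTimeoutMs + retryBackoffMs);
    }

    public static void main(String[] args) {
        // Defaults: 2 minutes delivery, no linger, 30s per request, 100ms backoff
        System.out.println("valid: " + isValid(120_000, 0, 30_000));
        System.out.println("worst-case attempts: " + worstCaseAttempts(120_000, 0, 30_000, 100));
    }
}
```

With the default values, only three full 30-second attempts fit into the 2-minute window, which is why the delivery timeout rather than the retries count is usually the effective limit.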

Configuration examples

# Unlimited retries with delivery timeout
retries=2147483647
# 5 minutes total
delivery.timeout.ms=300000
# 30 seconds per attempt
request.timeout.ms=30000
# 100ms between retries
retry.backoff.ms=100
# Preserve ordering and avoid duplicates
enable.idempotence=true

Fast failure

# Limited retries for quick feedback
retries=3
# 10 seconds total
delivery.timeout.ms=10000
# 5 seconds per attempt
request.timeout.ms=5000
# 100ms between retries
retry.backoff.ms=100

No retries

# Each message gets a single attempt
retries=0
request.timeout.ms=30000

Monitoring retry behavior

Key metrics to track

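  • record-retry-rate: Average number of record sends retried per second
  • record-retry-total: Total number of retried record sends
  • record-error-rate: Average number of record sends per second that resulted in errors (after retries)
  • request-latency-avg: Average latency of individual requests in ms (each retry is a separate request)

JMX metrics

kafka.producer:type=producer-metrics,client-id=<client-id>
- record-retry-rate
- record-retry-total
- record-error-rate
- request-rate
- request-latency-avg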
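
These MBeans can be read in-process with the standard JMX API. The sketch below uses only JDK classes; RetryMetricsReader is a hypothetical name, and the attribute name record-retry-rate is what the Java client is assumed to expose, so it is worth verifying against your client version. In a JVM with no running producer it simply returns an empty map.

```java
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper (not part of Kafka): reads a producer metric via JMX.
final class RetryMetricsReader {
    // Returns client-id -> attribute value for every producer-metrics MBean in this JVM.
    static Map<String, Object> read(String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new HashMap<>();
        try {
            // The wildcard matches the MBean of every producer client in the process
            ObjectName pattern =
                new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                values.put(name.getKeyProperty("client-id"),
                           server.getAttribute(name, attribute));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + attribute, e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(read("record-retry-rate"));
    }
}
```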

Error handling strategies

Synchronous error handling

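import java.util.Properties;
import org.apache.kafka.clients.producer.*;

Properties props = new Properties();
// Placeholder broker address; replace with your cluster's bootstrap servers
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("retries", 5);
props.put("retry.backoff.ms", 100);

Producer<String, String> producer = new KafkaProducer<>(props);

try {
    ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");
    // get() blocks until the send is acknowledged or has finally failed
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("Message sent to " + metadata.topic() + ":" + metadata.partition());
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
} catch (Exception e) {
    // Reached once retries are exhausted or on a non-retriable error
    System.err.println("Send failed: " + e.getMessage());
}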

Asynchronous error handling

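import org.apache.kafka.common.errors.RetriableException;

ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");

// The callback runs on the producer's I/O thread once the send has finally
// succeeded or failed, so keep it short and non-blocking
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.println("Message sent successfully");
    } else if (exception instanceof RetriableException) {
        // Transient error that outlasted retries or delivery.timeout.ms
        System.err.println("Failed after all retries: " + exception.getMessage());
    } else {
        // Permanent error: the producer does not retry these
        System.err.println("Non-retriable failure: " + exception.getMessage());
    }
});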

Best practices

Production recommendations

  1. Enable unlimited retries: Set retries=Integer.MAX_VALUE
  2. Use delivery timeout: Set delivery.timeout.ms to control total time
  3. Enable idempotency: Prevents duplicates during retries
  4. Monitor retry metrics: Track retry rates and error patterns
  5. Handle non-retriable errors: Implement proper error handling for permanent failures

Configuration checklist

  • retries=Integer.MAX_VALUE (unlimited retries)
  • delivery.timeout.ms=120000 (reasonable total timeout)
  • request.timeout.ms=30000 (reasonable per-request timeout)
  • retry.backoff.ms=100 (reasonable delay between retries)
  • enable.idempotence=true (prevent duplicates)

Common mistakes to avoid

  • Setting retries=0 in production
  • Not handling non-retriable errors
  • Setting delivery timeout too low
  • Ignoring retry metrics and error rates

Default behavior in modern Kafka

Starting with Kafka 3.0, producers have sensible retry defaults:
  • Unlimited retries with idempotency enabled
  • 2-minute delivery timeout
  • Proper error handling for most use cases

Message ordering

Retries can affect message ordering within partitions. Use enable.idempotence=true or max.in.flight.requests.per.connection=1 if strict ordering is required.