Why retries matter
In distributed systems, temporary failures are common:
- Network connectivity issues
- Broker leadership changes
- Temporary resource constraints
- Replication delays
Retry configuration
Basic retry settings
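The core retry settings can be collected in a plain `java.util.Properties` object before it is passed to a producer. The following is a minimal sketch: the class name `BasicRetrySettings` is hypothetical, and the values shown are the defaults of recent producer versions, not tuning advice.

```java
import java.util.Properties;

final class BasicRetrySettings {
    /** Builds the three settings that govern producer retry behavior. */
    static Properties build() {
        Properties props = new Properties();
        // Maximum number of times a failed send is retried.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Pause between retry attempts, in milliseconds.
        props.setProperty("retry.backoff.ms", "100");
        // Upper bound on the total time to deliver a record, retries included.
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }
}
```

In practice the retry count is rarely the limiting factor: with `retries` left at its maximum, `delivery.timeout.ms` decides when the producer gives up.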
Kafka version differences
Kafka ≤ 2.0:
- retries=0 (no retries by default)
- Must explicitly enable retries
Kafka ≥ 2.1:
- retries=Integer.MAX_VALUE (effectively unlimited retries, bounded by delivery.timeout.ms)
- Retries enabled by default; from Kafka 3.0 the producer is also idempotent by default
Types of errors
Retriable errors
These errors can potentially be resolved by retrying:
- TimeoutException: Request timed out
- NotEnoughReplicasException: Not enough in-sync replicas
- NotEnoughReplicasAfterAppendException: Replication issues
- RetriableException: Generic retriable error
- LeaderNotAvailableException: Leader election in progress
- NetworkException: Network connectivity issues
Non-retriable errors
These errors indicate permanent failures that won’t be resolved by retrying:
- RecordTooLargeException: Message exceeds size limits
- SerializationException: Message serialization failed
- OffsetMetadataTooLarge: Offset metadata too large
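- InvalidTopicException: Topic name is invalid
- UnknownTopicOrPartitionException: Topic or partition does not exist (Kafka classifies this as retriable because client metadata may be stale, but retrying will not help if the topic is genuinely missing)
- AuthorizationException: Client is not authorized for the operation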
Retry backoff strategies
Fixed backoff (default)
Waits a fixed amount of time between retries, set by retry.backoff.ms (100 ms by default).
Exponential backoff
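One way to implement exponential backoff in application code is to double the delay after each failed attempt, up to a cap. The sketch below is illustrative only: `ExponentialBackoff`, `delayMs`, and `callWithRetries` are hypothetical names, and a real implementation would retry only on retriable errors and would usually add random jitter.

```java
import java.util.concurrent.Callable;

final class ExponentialBackoff {
    /** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
    static long delayMs(int attempt, long baseMs, long maxMs) {
        // Bound the shift so the multiplication cannot overflow for typical base values.
        int shift = Math.min(Math.max(attempt, 0), 30);
        return Math.min(maxMs, baseMs * (1L << shift));
    }

    /** Runs the task, sleeping for a growing delay after each failed attempt. */
    static <T> T callWithRetries(Callable<T> task, int maxAttempts, long baseMs, long maxMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(delayMs(attempt, baseMs, maxMs));
                }
            }
        }
        throw last;
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the delays are 100, 200, 400, 800, and then 1000 ms for every later attempt.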
Not natively supported by older Kafka producers, but can be implemented at the application level. Newer clients (Kafka 3.7 and later) can back off exponentially up to retry.backoff.max.ms.
Impact on message ordering
With retries enabled
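Retries can affect message ordering within a partition. If two batches are in flight and the first fails while the second succeeds, the retried first batch is written after the second, so the records end up out of order.
Preserving order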
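The two order-preserving options can be expressed as a small configuration check. This is a sketch under stated assumptions: `OrderingSafeSettings` and `preservesOrder` are hypothetical names, and the check encodes the documented rule that an idempotent producer keeps per-partition order with up to 5 in-flight requests, while a non-idempotent producer needs exactly 1.

```java
import java.util.Properties;

final class OrderingSafeSettings {
    /** Settings that keep per-partition order while still allowing retries. */
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        // Idempotence requires acks=all.
        props.setProperty("acks", "all");
        // Idempotence allows at most 5 in-flight requests per connection.
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }

    /** Returns true if the given settings cannot reorder records on retry. */
    static boolean preservesOrder(Properties props) {
        // Missing values fall back to conservative assumptions, not to version-specific defaults.
        boolean idempotent =
                Boolean.parseBoolean(props.getProperty("enable.idempotence", "false"));
        int inFlight =
                Integer.parseInt(props.getProperty("max.in.flight.requests.per.connection", "5"));
        return inFlight == 1 || (idempotent && inFlight <= 5);
    }
}
```

Limiting in-flight requests to 1 also preserves order but reduces throughput, which is why idempotence is usually the preferred option.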
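To maintain strict ordering, enable idempotence (enable.idempotence=true) or limit the producer to one in-flight request (max.in.flight.requests.per.connection=1).
Delivery timeout vs request timeout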
Request timeout
Time to wait for a single request attempt, set by request.timeout.ms (30 seconds by default).
Delivery timeout
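The two timeouts are related: Kafka's documentation states that `delivery.timeout.ms` should be at least `linger.ms` plus `request.timeout.ms`. The sketch below sets both timeouts and checks that relationship. The names `TimeoutSettings` and `isConsistent` are hypothetical, and the `linger.ms` value of 5 is an assumed example.

```java
import java.util.Properties;

final class TimeoutSettings {
    static Properties build() {
        Properties props = new Properties();
        // Per-attempt limit: how long to wait for a broker response.
        props.setProperty("request.timeout.ms", "30000");
        // Overall limit: covers batching, every attempt, and every backoff pause.
        props.setProperty("delivery.timeout.ms", "120000");
        // Time a record may wait in the batch before being sent (assumed value).
        props.setProperty("linger.ms", "5");
        return props;
    }

    /** Checks delivery.timeout.ms >= linger.ms + request.timeout.ms. */
    static boolean isConsistent(Properties props) {
        long request = Long.parseLong(props.getProperty("request.timeout.ms"));
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms"));
        long linger = Long.parseLong(props.getProperty("linger.ms", "0"));
        return delivery >= linger + request;
    }
}
```

With these values, a record has room for roughly four full request attempts before the delivery timeout expires.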
Total time limit for delivering a message, including all retries, set by delivery.timeout.ms (2 minutes by default).
Configuration examples
High reliability (recommended)
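A high-reliability setup can be sketched with the values from this article's configuration checklist. The class name `HighReliabilityConfig` is hypothetical, and `acks=all` is an addition here because idempotence requires it.

```java
import java.util.Properties;

final class HighReliabilityConfig {
    static Properties build() {
        Properties props = new Properties();
        // Retry until the delivery timeout expires.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("delivery.timeout.ms", "120000");
        props.setProperty("request.timeout.ms", "30000");
        props.setProperty("retry.backoff.ms", "100");
        // Prevent duplicates when a retry follows a lost acknowledgement.
        props.setProperty("enable.idempotence", "true");
        // Required by idempotence: wait for all in-sync replicas.
        props.setProperty("acks", "all");
        return props;
    }
}
```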
Fast failure
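When the caller would rather see an error quickly than wait for recovery, the retry count and both timeouts can be tightened. The values below are assumptions chosen for illustration, not recommendations from the source, and `FastFailureConfig` is a hypothetical class name.

```java
import java.util.Properties;

final class FastFailureConfig {
    static Properties build() {
        Properties props = new Properties();
        // A small, bounded number of retries (assumed value).
        props.setProperty("retries", "3");
        props.setProperty("retry.backoff.ms", "100");
        // Short per-attempt timeout (assumed value).
        props.setProperty("request.timeout.ms", "5000");
        // Give up entirely after 10 seconds (assumed value).
        props.setProperty("delivery.timeout.ms", "10000");
        return props;
    }
}
```

The delivery timeout still has to cover at least one full request attempt, which is why it stays above `request.timeout.ms`.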
No retries (not recommended for production)
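Disabling retries is a one-line change, with one caveat: idempotence requires `retries` to be greater than zero, so it has to be switched off as well. A minimal sketch, with the hypothetical class name `NoRetriesConfig`:

```java
import java.util.Properties;

final class NoRetriesConfig {
    static Properties build() {
        Properties props = new Properties();
        // Every transient failure is surfaced to the application immediately.
        props.setProperty("retries", "0");
        // Idempotence cannot be combined with retries=0, so disable it explicitly.
        props.setProperty("enable.idempotence", "false");
        return props;
    }
}
```

With this configuration any leader change or brief network issue becomes an application-level error, so the application must handle every failure itself.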
Monitoring retry behavior
Key metrics to track
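- record-retry-rate: Average number of retried record sends per second
- record-retry-total: Total number of retried record sends
- record-error-rate: Average number of record sends per second that resulted in errors (after all retries)
- request-latency-avg: Average request latency in milliseconds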
JMX metrics
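Producer metrics are exposed over JMX under the MBean `kafka.producer:type=producer-metrics,client-id=<client-id>`. The sketch below reads one attribute from the platform MBean server of the JVM that runs the producer. `ProducerRetryMetrics` and its methods are hypothetical names, and the code assumes a client id without characters that need quoting in a JMX object name.

```java
import java.lang.management.ManagementFactory;
import java.util.Optional;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class ProducerRetryMetrics {
    /** Object name of the producer-level metrics MBean for a given client id. */
    static String mbeanName(String clientId) {
        return "kafka.producer:type=producer-metrics,client-id=" + clientId;
    }

    /** Reads one metric attribute, or returns empty if the MBean or attribute is missing. */
    static Optional<Object> read(String clientId, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(mbeanName(clientId));
            return Optional.ofNullable(server.getAttribute(name, attribute));
        } catch (JMException e) {
            // No producer with this client id is registered, or the attribute is unknown.
            return Optional.empty();
        }
    }
}
```

For example, `ProducerRetryMetrics.read("my-producer", "record-retry-rate")` would return the current retry rate when a producer with client id `my-producer` is running in the same JVM.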
Error handling strategies
Synchronous error handling
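With the Java client, `producer.send(record)` returns a `Future<RecordMetadata>`, and calling `get()` on it blocks until the record is acknowledged or has failed for good. The sketch below shows the handling pattern using only `java.util.concurrent` types so that it runs without the Kafka library. `SyncSendHandling`, `Outcome`, and `await` are hypothetical names, and the `isRetriable` predicate stands in for a check such as `cause instanceof RetriableException`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class SyncSendHandling {
    enum Outcome { DELIVERED, TRANSIENT_EXHAUSTED, PERMANENT, INTERRUPTED }

    /** Blocks on the send result and classifies the outcome. */
    static Outcome await(Future<?> sendResult, Predicate<Throwable> isRetriable) {
        try {
            sendResult.get();
            return Outcome.DELIVERED;
        } catch (ExecutionException e) {
            // The real failure is wrapped; by now the producer has stopped retrying.
            Throwable cause = e.getCause();
            return isRetriable.test(cause) ? Outcome.TRANSIENT_EXHAUSTED : Outcome.PERMANENT;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can react to it.
            Thread.currentThread().interrupt();
            return Outcome.INTERRUPTED;
        }
    }
}
```

A transient error that reaches this point has already outlasted the producer's own retries, so the application decides whether to resend, park the record, or alert.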
Asynchronous error handling
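With the Java client, a callback passed to `producer.send(record, callback)` is invoked with the record metadata and an exception, exactly one of which is null. The sketch below models that callback as a `BiConsumer` so that it runs without the Kafka library. `AsyncSendHandling` and `callbackFor` are hypothetical names, and the failure queue is an assumed stand-in for whatever dead-letter mechanism the application uses.

```java
import java.util.Queue;
import java.util.function.BiConsumer;

final class AsyncSendHandling {
    /**
     * Builds a completion callback for one record. On failure it records the key
     * and error message in the supplied queue; on success it does nothing.
     */
    static <M> BiConsumer<M, Exception> callbackFor(String recordKey, Queue<String> failures) {
        return (metadata, exception) -> {
            if (exception != null) {
                // Reaching here means retries are exhausted or the error is permanent.
                failures.add(recordKey + ": " + exception.getMessage());
            }
        };
    }
}
```

Callbacks run on the producer's I/O thread, so they should stay short and hand any slow work, such as writing to a dead-letter topic, to another thread.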
Best practices
Production recommendations
- Enable unlimited retries: Set retries=Integer.MAX_VALUE
- Use delivery timeout: Set delivery.timeout.ms to control total time
- Enable idempotency: Prevents duplicates during retries
- Monitor retry metrics: Track retry rates and error patterns
- Handle non-retriable errors: Implement proper error handling for permanent failures
Configuration checklist
- ✅ retries=Integer.MAX_VALUE (unlimited retries)
- ✅ delivery.timeout.ms=120000 (reasonable total timeout)
- ✅ request.timeout.ms=30000 (reasonable per-request timeout)
- ✅ retry.backoff.ms=100 (reasonable delay between retries)
- ✅ enable.idempotence=true (prevent duplicates)
Common mistakes to avoid
- Setting retries=0 in production
- Not handling non-retriable errors
- Setting delivery timeout too low
- Ignoring retry metrics and error rates
Default behavior in modern Kafka
Starting with Kafka 3.0, producers have sensible retry defaults:
- Unlimited retries with idempotency enabled
- 2-minute delivery timeout
- Proper error handling for most use cases
Message ordering
Retries can affect message ordering within partitions. Use enable.idempotence=true or max.in.flight.requests.per.connection=1 if strict ordering is required.