- How producer retries work and when they’re triggered
- The difference between retriable and non-retriable errors
- How to configure retry behavior for idempotent producers
- Best practices for retry timeout and backoff settings
Why retries matter
In distributed systems, temporary failures are common:
- Network connectivity issues
- Broker leadership changes
- Temporary resource constraints
- Replication delays
Retry configuration
Basic retry settings
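A minimal sketch of the three settings this page revolves around, built as a plain java.util.Properties object so that it runs without the Kafka client on the classpath. The class and method names are hypothetical; only the property keys are real producer settings, and the values are the defaults of recent clients written out explicitly. In a real application the same properties would be passed to the KafkaProducer constructor together with bootstrap servers and serializers.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys are real Kafka producer settings.
final class BasicRetrySettings {

    static Properties create() {
        Properties props = new Properties();
        // Maximum number of retry attempts for a failed send.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Pause between a failed attempt and the next retry.
        props.setProperty("retry.backoff.ms", "100");
        // Upper bound on the total time a send may take, retries included.
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        create().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Setting the values explicitly keeps behavior the same across client versions with different defaults.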
Kafka version differences
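Kafka before 2.1:
- retries=0 (no retries by default)
- Must explicitly enable retries
Kafka 2.1 and later:
- retries=Integer.MAX_VALUE (effectively unlimited retries, bounded by delivery.timeout.ms)
- Retries enabled by default; since Kafka 3.0 idempotence is also enabled by default, so retries do not introduce duplicates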
Types of errors
Retriable errors
These errors can potentially be resolved by retrying:
- TimeoutException: Request timed out
- NotEnoughReplicasException: Not enough in-sync replicas
- NotEnoughReplicasAfterAppendException: Replication issues
- RetriableException: Generic retriable error
- LeaderNotAvailableException: Leader election in progress
- NetworkException: Network connectivity issues
Non-retriable errors
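These errors indicate permanent failures that won’t be resolved by retrying:
- RecordTooLargeException: Message exceeds size limits
- SerializationException: Message serialization failed
- OffsetMetadataTooLarge: Offset metadata too large
- InvalidTopicException: Topic name is invalid
- AuthorizationException: Authorization failure (authentication failures surface as AuthenticationException)
UnknownTopicOrPartitionException is a special case. The Java client classes it as retriable, because the topic may not have propagated to the broker's metadata yet, but it behaves like a permanent failure if the topic or partition never appears.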
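In the Java client, retriable errors extend org.apache.kafka.common.errors.RetriableException, so an instanceof check is the usual test. To stay free of the Kafka dependency, the sketch below classifies by simple class name instead. ProducerErrorClassifier and its methods are hypothetical, and anything not in the retriable set is treated as permanent.

```java
import java.util.Set;

// Hypothetical classifier keyed on simple class names. With the Kafka client on the
// classpath, `error instanceof RetriableException` would replace this lookup.
final class ProducerErrorClassifier {

    private static final Set<String> RETRIABLE = Set.of(
            "TimeoutException",
            "NotEnoughReplicasException",
            "NotEnoughReplicasAfterAppendException",
            "RetriableException",
            "LeaderNotAvailableException",
            "NetworkException",
            // Also a RetriableException subclass in the Java client.
            "UnknownTopicOrPartitionException");

    static boolean isRetriable(String simpleClassName) {
        return RETRIABLE.contains(simpleClassName);
    }

    // Caveat: name matching cannot tell Kafka's TimeoutException apart from
    // java.util.concurrent.TimeoutException; an instanceof check can.
    static boolean isRetriable(Throwable error) {
        return isRetriable(error.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        System.out.println(isRetriable("NetworkException"));
        System.out.println(isRetriable("RecordTooLargeException"));
    }
}
```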
Error handling decision tree
This decision tree helps you understand how to handle different types of producer errors:
Retry decision flowchart
This flowchart shows how the producer decides whether to retry a failed request:
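The decision described here can be approximated in a few lines. This is a simplified model under stated assumptions, not the client's internal logic: a failed request is retried only if the error is retriable, the delivery timeout has not expired, and the retry budget is not used up. The class, enum, and parameter names are hypothetical.

```java
// Simplified, hypothetical model of the retry decision; the real producer also
// accounts for batching, metadata refreshes, and in-flight request limits.
final class RetryDecision {

    enum Outcome { RETRY, FAIL_NON_RETRIABLE, FAIL_DELIVERY_TIMEOUT, FAIL_RETRIES_EXHAUSTED }

    static Outcome decide(boolean retriable, int retriesUsed, int maxRetries,
                          long elapsedMs, long deliveryTimeoutMs) {
        if (!retriable) {
            return Outcome.FAIL_NON_RETRIABLE;      // permanent error: give up at once
        }
        if (elapsedMs >= deliveryTimeoutMs) {
            return Outcome.FAIL_DELIVERY_TIMEOUT;   // total time budget is spent
        }
        if (retriesUsed >= maxRetries) {
            return Outcome.FAIL_RETRIES_EXHAUSTED;  // retry count budget is spent
        }
        return Outcome.RETRY;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, 0, Integer.MAX_VALUE, 500, 120_000));
    }
}
```

With retries=Integer.MAX_VALUE the count check never triggers in practice, which is why the delivery timeout becomes the effective limit.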
Idempotent producers (Kafka 2.4+): With enable.idempotence=true and acks=all, you get unlimited retries by default without risk of duplicates, making retry configuration much simpler.
Retry backoff strategies
Fixed backoff (default)
Waits a fixed amount of time between retries, set by retry.backoff.ms (default 100 ms).
Exponential backoff
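For application-level retries (for example, re-submitting a record after the producer has reported a final failure), an exponential backoff schedule can be sketched as follows. The class name, the multiplier, the cap, and the jitter range are assumptions chosen for illustration; none of them are Kafka settings.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical application-level backoff schedule: delay = initial * multiplier^attempt,
// capped at a maximum, with optional random jitter to avoid synchronized retries.
final class ExponentialBackoff {

    private final long initialMs;
    private final long maxMs;
    private final double multiplier;

    ExponentialBackoff(long initialMs, long maxMs, double multiplier) {
        if (initialMs <= 0 || maxMs < initialMs || multiplier < 1.0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.multiplier = multiplier;
    }

    // Delay before retry number `attempt` (0-based), without jitter.
    long delayMs(int attempt) {
        double raw = initialMs * Math.pow(multiplier, attempt);
        return (long) Math.min(raw, (double) maxMs);
    }

    // Same delay scaled by a random factor in [0.8, 1.2).
    long delayWithJitterMs(int attempt) {
        double factor = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
        return Math.round(delayMs(attempt) * factor);
    }

    public static void main(String[] args) {
        ExponentialBackoff backoff = new ExponentialBackoff(100, 1000, 2.0);
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + ": " + backoff.delayMs(attempt) + " ms");
        }
    }
}
```

The cap matters as much as the growth: without it, a long outage would push the delay far beyond any useful delivery window.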
Older Kafka producers do not support exponential backoff natively (clients from Kafka 3.7 onward add retry.backoff.max.ms for this), but it can be implemented at the application level.
Impact on message ordering
With retries enabled
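Retries can affect message ordering within a partition: with several requests in flight and idempotence disabled, a failed batch that is retried can land in the log after a later batch that succeeded on its first attempt.
Preserve order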
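Two ways of keeping per-partition order while retries are enabled can be sketched as plain Properties objects. The class and method names are hypothetical; the property keys are real producer settings. The first option relies on the idempotent producer, which preserves order with up to five in-flight requests per connection; the second limits the producer to one in-flight request at the cost of throughput.

```java
import java.util.Properties;

// Hypothetical helper showing two ordering-safe producer configurations.
final class OrderingSafeConfig {

    // Option A: idempotent producer, ordering kept with up to 5 in-flight requests.
    static Properties idempotent() {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all");
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }

    // Option B: a single in-flight request, so a retry can never overtake a later batch.
    static Properties singleInFlight() {
        Properties props = new Properties();
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(idempotent());
        System.out.println(singleInFlight());
    }
}
```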
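To maintain strict ordering, configure either enable.idempotence=true (which preserves order with up to five in-flight requests per connection) or max.in.flight.requests.per.connection=1.
Delivery timeout vs request timeout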
Request timeout
Time to wait for a single request attempt, set by request.timeout.ms (default 30000 ms).
Delivery timeout
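The two timeouts interact: the delivery timeout has to be at least request.timeout.ms plus linger.ms, and it bounds how many attempts can happen at all. The sketch below checks that constraint and gives a rough worst-case count of attempts, assuming every attempt runs into the request timeout and is followed by one backoff pause. The class and method names are hypothetical, and the model ignores batching and queueing time.

```java
// Hypothetical arithmetic helper for reasoning about the producer's time budget.
final class TimeoutBudget {

    // The client rejects an explicit delivery timeout smaller than request timeout + linger.
    static boolean isConsistent(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        return deliveryTimeoutMs >= requestTimeoutMs + lingerMs;
    }

    // Number of attempts that can start before the delivery timeout expires when each
    // attempt times out and is followed by one backoff pause (ceiling division).
    static long worstCaseAttempts(long deliveryTimeoutMs, long requestTimeoutMs, long retryBackoffMs) {
        long cycleMs = requestTimeoutMs + retryBackoffMs;
        if (cycleMs <= 0) {
            throw new IllegalArgumentException("request timeout plus backoff must be positive");
        }
        return (deliveryTimeoutMs + cycleMs - 1) / cycleMs;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(120_000, 30_000, 0));
        System.out.println(worstCaseAttempts(120_000, 30_000, 100));
    }
}
```

With the values used elsewhere on this page (120000, 30000, and 100 ms), only about four attempts fit into the budget when the broker never answers, even though retries is nominally unlimited.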
Total time limit for delivering a message (including all retries), set by delivery.timeout.ms (default 120000 ms).
Configuration examples
High reliability (recommended)
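A sketch of a high-reliability profile, using the values from the configuration checklist on this page. The class name is hypothetical, and bootstrap servers and serializers are left out to keep the focus on retry behavior.

```java
import java.util.Properties;

// Hypothetical profile: retry until the delivery timeout, without duplicates.
final class HighReliabilityConfig {

    static Properties create() {
        Properties props = new Properties();
        props.setProperty("acks", "all");                  // wait for all in-sync replicas
        props.setProperty("enable.idempotence", "true");   // retries cannot create duplicates
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("delivery.timeout.ms", "120000"); // total time budget
        props.setProperty("request.timeout.ms", "30000");   // per-attempt budget
        props.setProperty("retry.backoff.ms", "100");       // pause between attempts
        return props;
    }

    public static void main(String[] args) {
        create().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```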
Fast failure
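A sketch of a fast-failure profile for callers that prefer a quick error over a long wait. The specific numbers are illustrative assumptions, not recommendations from this page; the only hard constraint respected here is that the delivery timeout stays above the request timeout. The class name is hypothetical.

```java
import java.util.Properties;

// Hypothetical profile: give up quickly so the caller can react to the failure.
final class FastFailureConfig {

    static Properties create() {
        Properties props = new Properties();
        props.setProperty("retries", "3");                 // small, bounded retry count
        props.setProperty("delivery.timeout.ms", "30000"); // fail within 30 seconds overall
        props.setProperty("request.timeout.ms", "10000");  // fail a single attempt after 10 seconds
        props.setProperty("retry.backoff.ms", "100");
        return props;
    }

    public static void main(String[] args) {
        create().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```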
No retries (not recommended for production)
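A sketch of a profile with retries switched off. Idempotence is disabled explicitly because the idempotent producer requires retries to be greater than zero. The class name is hypothetical.

```java
import java.util.Properties;

// Hypothetical profile: every transient failure surfaces to the application at once.
final class NoRetriesConfig {

    static Properties create() {
        Properties props = new Properties();
        props.setProperty("retries", "0");
        // enable.idempotence=true is incompatible with retries=0.
        props.setProperty("enable.idempotence", "false");
        return props;
    }

    public static void main(String[] args) {
        create().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With this profile a single leader election is enough to lose a message unless the application resends it itself.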
Monitor retry behavior
Key metrics to track
- record-retry-rate: Rate of retried record sends per second
- record-retry-total: Total number of retried record sends
- record-error-rate: Rate of record sends that failed (after all retries)
- request-latency-avg: Average time taken per request
JMX metrics
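The Java producer registers its metrics under the MBean kafka.producer:type=producer-metrics,client-id=(your client id). The sketch below reads one attribute from the platform MBean server of the current JVM, using only javax.management. The class name and the client id my-producer are hypothetical, and the attribute names record-retry-rate and record-error-rate should be checked against the client version in use. When no producer with that client id runs in the JVM, the lookup returns an empty result.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical in-process reader for producer metrics exposed over JMX.
final class ProducerRetryMetrics {

    // Assumes a simple client id; ids with special characters are quoted by the client.
    static ObjectName producerMetricsName(String clientId) {
        try {
            return new ObjectName("kafka.producer:type=producer-metrics,client-id=" + clientId);
        } catch (JMException e) {
            throw new IllegalArgumentException("Invalid client id: " + clientId, e);
        }
    }

    // Returns the metric value, or empty if the MBean or attribute is not available.
    static OptionalDouble read(String clientId, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = producerMetricsName(clientId);
        if (!server.isRegistered(name)) {
            return OptionalDouble.empty();
        }
        try {
            Object value = server.getAttribute(name, attribute);
            return value instanceof Number
                    ? OptionalDouble.of(((Number) value).doubleValue())
                    : OptionalDouble.empty();
        } catch (JMException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        for (String metric : new String[] {"record-retry-rate", "record-error-rate"}) {
            System.out.println(metric + " = " + read("my-producer", metric));
        }
    }
}
```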
Error handling strategies
Synchronous error handling
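In the Java client, producer.send(record) returns a Future, and calling get() on it blocks until the send has succeeded or failed for good; a failure arrives as an ExecutionException whose cause is the underlying Kafka error. The sketch below shows that handling pattern against a plain java.util.concurrent.Future so it runs without the Kafka client. The class, enum, and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper around the blocking wait on a send result.
final class SyncSendHandling {

    enum Result { DELIVERED, FAILED, TIMED_OUT, INTERRUPTED }

    static Result await(Future<?> sendResult, long timeoutMs) {
        try {
            sendResult.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Result.DELIVERED;
        } catch (ExecutionException e) {
            // The producer has already given up: retries are exhausted or the error is permanent.
            System.err.println("Send failed: " + e.getCause());
            return Result.FAILED;
        } catch (TimeoutException e) {
            // The caller stopped waiting; the send itself may still complete later.
            return Result.TIMED_OUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return Result.INTERRUPTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(await(CompletableFuture.completedFuture("ok"), 100));
    }
}
```

Blocking on every send serializes the producer, so this pattern trades throughput for simple, immediate error handling.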
Asynchronous error handling
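With asynchronous sends, the producer invokes a callback with the record metadata and an exception, and the exception is null on success. By the time the callback sees an exception, the producer has already exhausted its retries or hit a non-retriable error. The sketch below mirrors that shape with a BiConsumer so it runs without the Kafka client; the class, its methods, and the idea of collecting failed keys for later handling are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical failure collector; the callback shape mirrors (metadata, exception).
final class AsyncSendHandling {

    // Thread-safe because producer callbacks run on the producer's I/O thread.
    private final List<String> failedKeys = new CopyOnWriteArrayList<>();

    // Builds the callback for the record identified by `key`.
    <M> BiConsumer<M, Exception> callbackFor(String key) {
        return (metadata, exception) -> {
            if (exception == null) {
                return;              // delivered, nothing to do
            }
            failedKeys.add(key);     // final failure: remember it for alerting or a dead-letter path
        };
    }

    List<String> failedKeys() {
        return List.copyOf(failedKeys);
    }

    public static void main(String[] args) {
        AsyncSendHandling handler = new AsyncSendHandling();
        handler.<String>callbackFor("order-1").accept("metadata", null);
        handler.<String>callbackFor("order-2").accept(null, new RuntimeException("send failed"));
        System.out.println(handler.failedKeys());
    }
}
```

Callbacks should stay short and non-blocking, since slow work in them delays the delivery of other records.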
Best practices
Production recommendations
- Enable unlimited retries: Set retries=Integer.MAX_VALUE
- Use delivery timeout: Set delivery.timeout.ms to control total time
- Enable idempotency: Prevents duplicates during retries
- Monitor retry metrics: Track retry rates and error patterns
- Handle non-retriable errors: Implement proper error handling for permanent failures
Configuration checklist
- ✅ retries=Integer.MAX_VALUE (unlimited retries)
- ✅ delivery.timeout.ms=120000 (reasonable total timeout)
- ✅ request.timeout.ms=30000 (reasonable per-request timeout)
- ✅ retry.backoff.ms=100 (reasonable delay between retries)
- ✅ enable.idempotence=true (prevent duplicates)
Common mistakes to avoid
- Setting retries=0 in production
- Not handling non-retriable errors
- Setting delivery timeout too low
- Ignoring retry metrics and error rates
Retries can reorder messages. Use enable.idempotence=true or max.in.flight.requests.per.connection=1 if strict ordering is required.
See it in practice with Conduktor
Conduktor Console displays producer retry metrics and error rates in real time. Monitor retry attempts, successful retries, and failed messages to validate your retry configuration and identify patterns in transient versus permanent failures.
Next steps
- Configure idempotent producers for exactly-once semantics
- Understand acknowledgment settings for delivery guarantees
- Optimize producer batching for throughput
- Monitor producer metrics using CLI tools