Replication factor
The replication factor determines how many copies of your data are maintained across the Kafka cluster.
How replication works
- Each topic partition has one leader and zero or more followers
- All reads and writes go through the leader
- Followers replicate data from the leader
- If a leader fails, one of the followers becomes the new leader
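The leader/follower layout described above can be inspected directly. The sketch below uses the Java AdminClient to print the leader, replica set, and in-sync replicas (ISR) for each partition; the broker address and the topic name "orders" are placeholder assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch metadata for a single topic ("orders" is a placeholder name).
            TopicDescription topic = admin
                    .describeTopics(Collections.singletonList("orders"))
                    .all()
                    .get()
                    .get("orders");

            // Each partition reports its leader, full replica set, and in-sync replicas.
            topic.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```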
Choosing replication factor
For production environments:
- Minimum: 3 (recommended)
- Common: 3-5 depending on cluster size and requirements
- Maximum: Generally not more than 5-7
For development and testing environments:
- Minimum: 1 (acceptable for non-critical data)
- Recommended: 2-3 for realistic testing
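As a rough illustration of these recommendations, the following Java AdminClient sketch creates one production-style and one development-style topic. The topic names, partition counts, and broker address are assumptions for the example, not prescriptions.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Production-style topic: replication factor 3 (needs at least 3 brokers).
            NewTopic production = new NewTopic("payments", 12, (short) 3);
            // Development-style topic: a single replica, acceptable only for non-critical data.
            NewTopic development = new NewTopic("payments-dev", 3, (short) 1);

            admin.createTopics(List.of(production, development)).all().get();
        }
    }
}
```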
Replication factor considerations
Fault tolerance formula
With replication factor N, you can tolerate up to N-1 broker failures while maintaining data availability.
- Replication factor 1: No fault tolerance (data loss if broker fails)
- Replication factor 3: Tolerates 2 broker failures
- Replication factor 5: Tolerates 4 broker failures
Higher replication factor:
- ✅ Better fault tolerance and availability
- ✅ Higher data durability
- ❌ Increased storage requirements
- ❌ Higher network overhead
- ❌ Increased replication lag
Lower replication factor:
- ✅ Lower storage costs
- ✅ Reduced network overhead
- ❌ Reduced fault tolerance
- ❌ Higher risk of data loss
Partition count
Partitions enable Kafka to scale and parallelize data processing across multiple brokers and consumers.
How partitions work
- Topics are divided into partitions
- Each partition is an ordered, immutable sequence of messages
- Partitions are distributed across brokers
- Consumers can process partitions in parallel
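The producer sketch below illustrates how messages end up in partitions: with Kafka's default partitioner, records that share a key are hashed to the same partition, which preserves per-key ordering. The topic name, key, and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionedProduce {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are hashed to the same partition,
            // so per-key ordering is preserved within that partition.
            for (int i = 0; i < 5; i++) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("orders", "customer-42", "event-" + i);
                RecordMetadata metadata = producer.send(record).get();
                System.out.printf("key=%s -> partition %d, offset %d%n",
                        record.key(), metadata.partition(), metadata.offset());
            }
        }
    }
}
```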
Choosing partition count
Starting recommendations:
- Small topics (< 1GB/day): 1-3 partitions
- Medium topics (1-10GB/day): 6-12 partitions
- Large topics (> 10GB/day): 20+ partitions
Key factors to consider:
1. Throughput requirements
- Estimate your target throughput and the throughput a single partition can sustain for both producers and consumers; you need enough partitions that neither side becomes the bottleneck
2. Consumer parallelism (see the consumer group sketch after this list)
- Maximum consumers in a consumer group = Number of partitions
- More partitions = More potential parallelism
- Fewer partitions = Less parallelism but simpler management
3. Broker distribution
- Partitions should be evenly distributed across brokers
- Each broker should handle a reasonable number of partitions
- Avoid having too many partitions per broker (recommended: < 4000)
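To make the consumer parallelism point concrete, the sketch below shows a consumer joining a group. Starting several copies of this process with the same group.id spreads the topic's partitions across them, and any members beyond the partition count sit idle. The group id, topic, and broker address are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Run several copies of this process with the same group.id: Kafka assigns
        // each partition to exactly one member, so useful parallelism is capped at
        // the topic's partition count; extra members stay idle.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r ->
                        System.out.printf("partition %d offset %d value %s%n",
                                r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```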
Partition count considerations
Trade-offs:
More partitions:
- ✅ Higher potential throughput
- ✅ Better parallelism for consumers
- ✅ Better load distribution
- ❌ More overhead (file handles, memory)
- ❌ Longer leader election times
- ❌ More complex partition management
Fewer partitions:
- ✅ Lower resource overhead
- ✅ Simpler management
- ✅ Faster leader elections
- ❌ Limited throughput potential
- ❌ Less consumer parallelism
Best practices
Planning for growth
Partition planning
Start with enough partitions to handle 2-3x your expected peak throughput. It’s easier to start with more partitions than to add them later.
- Estimate peak throughput needs for the next 2-3 years
- Add 50-100% buffer for unexpected growth
- Consider data retention and storage requirements
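One way to turn these guidelines into an initial number is the widely used rule of thumb of dividing target throughput by measured per-partition throughput on both the producer and consumer side, then adding growth headroom. Every figure in this sketch is an illustrative assumption, not a measurement.

```java
public class PartitionSizing {
    public static void main(String[] args) {
        // All figures below are illustrative assumptions, not measurements.
        double targetMBps = 50.0;               // expected peak write throughput
        double producerMBpsPerPartition = 10.0; // measured per-partition producer throughput
        double consumerMBpsPerPartition = 5.0;  // measured per-partition consumer throughput
        double growthBuffer = 2.0;              // 2-3x headroom for future growth

        // Rule of thumb: enough partitions that neither the producer nor the
        // consumer side becomes the bottleneck, plus growth headroom.
        int partitions = (int) Math.ceil(growthBuffer * Math.max(
                targetMBps / producerMBpsPerPartition,
                targetMBps / consumerMBpsPerPartition));

        System.out.println("Suggested partition count: " + partitions); // 20 with these numbers
    }
}
```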
Production recommendations
Replication factor:
- Use replication factor 3 for most production workloads
- Use replication factor 5 for critical data that cannot tolerate loss
- Never use replication factor 1 in production
Partition count:
- Start with 6-12 partitions for most new topics
- Scale up based on observed throughput requirements
- Aim for 10-100MB per partition per day as a rough guideline
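If observed throughput does outgrow the initial choice, the partition count can be raised (never lowered) with the AdminClient, as in the hypothetical sketch below. Keep in mind that adding partitions changes which partition a given key hashes to from that point on, which matters if consumers rely on per-key ordering.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the "orders" topic (placeholder name) to 24 partitions in total.
            // Partition counts can only be increased, never decreased.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(24)))
                 .all()
                 .get();
        }
    }
}
```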
Performance testing
Before finalizing partition and replication settings:
- Test with realistic data volumes
- Measure single partition throughput
- Test consumer group scaling
- Monitor resource usage (CPU, memory, disk I/O, network)
- Test failure scenarios (broker failures, network partitions)
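Kafka ships dedicated perf-test tooling (for example kafka-producer-perf-test), but a rough single-partition producer measurement can also be sketched by hand. The record size, record count, topic name, and broker address below are arbitrary assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class SinglePartitionThroughput {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        byte[] payload = new byte[1024]; // 1 KiB records (arbitrary size)
        int count = 100_000;

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            for (int i = 0; i < count; i++) {
                // Pin every record to partition 0 to measure one partition in isolation.
                producer.send(new ProducerRecord<>("perf-test", 0, null, payload));
            }
            producer.flush();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.0f records/s, %.1f MB/s%n",
                    count / seconds, count * payload.length / seconds / 1e6);
        }
    }
}
```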
Monitoring and adjustment
Key metrics to monitor:
- Throughput per partition
- Consumer lag by partition
- Broker resource utilization
- Replication lag
- Leader election frequency
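Several of these metrics are exposed by brokers and clients over JMX; consumer lag per partition can also be computed with the AdminClient by comparing a group's committed offsets against the partitions' end offsets, as in the sketch below. The group name is a placeholder.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group ("orders-processors" is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("orders-processors")
                    .partitionsToOffsetAndMetadata()
                    .get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResultInfo> latest = admin
                    .listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all()
                    .get();

            // Lag per partition = end offset - committed offset.
            committed.forEach((tp, offset) ->
                    System.out.printf("%s lag=%d%n", tp,
                            latest.get(tp).offset() - offset.offset()));
        }
    }
}
```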
Partition limits
Be aware of cluster-wide partition limits:
- Each broker has limits on total partitions (typically 2000-4000)
- ZooKeeper has metadata overhead for each partition
- Too many partitions can impact cluster stability