- How incremental cooperative rebalancing reduces processing downtime
- The multi-phase rebalancing process and how it differs from eager rebalancing
- How to configure static group membership for stable consumer identities
- Best practices for production deployments
Traditional rebalancing problems
Eager rebalancing (pre-Kafka 2.4)
- Stop-the-world: All consumers stop processing during rebalance
- Complete reassignment: All partitions are revoked and reassigned
- Processing downtime: No messages processed during rebalance period
- Cascading rebalances: One consumer failure affects entire group
Performance impact
Eager versus incremental rebalancing comparison
This diagram compares the impact of eager rebalancing versus incremental cooperative rebalancing:
Incremental cooperative rebalancing
How it works (Kafka 2.4+)
- Minimal disruption: Only affected partitions are reassigned
- Continued processing: Unaffected partitions continue processing
- Gradual transition: Rebalance happens in multiple phases
- Reduced downtime: Significantly shorter processing interruptions
Rebalancing phases
The incremental rebalance happens in two distinct phases, minimizing disruption:
- ✅ Only ONE consumer stops ONE partition (P2)
- ✅ Eight out of nine partitions never stop processing
- ✅ Total downtime: ~100ms instead of several seconds
- ✅ Cascading failures prevented
Configuration
Static group membership
Concept
Static group membership allows consumers to maintain stable identities across restarts, preventing unnecessary rebalances during planned maintenance or brief outages.Benefits
- Fewer rebalances: Consumer restarts don’t trigger rebalances
- Stable assignments: Partitions stay with the same consumer instance
- Faster recovery: Consumers can resume processing from where they left off
- Operational efficiency: Planned maintenance doesn’t disrupt other consumers
Configuration
Consumer lifecycle
Use cases and benefits
High-availability applications
Containerized environments
Stream processing applications
- State preservation: Local state stores remain associated with specific consumers
- Reduced reprocessing: Avoid recomputing state after rebalances
- Consistent partitioning: Same consumer always processes same partitions
Monitor and observe
Key metrics
- Rebalance frequency: Number of rebalances per time period
- Rebalance duration: Time taken for rebalance completion
- Partition assignment stability: How often partitions change owners
- Consumer lag during rebalance: Processing delay during rebalances
JMX metrics
Configuration best practices
For incremental rebalancing
For static group membership
Combined configuration
Operational considerations
Deployment strategies
- Rolling updates: Use static group membership for zero-downtime deployments
- Blue-green: Static IDs help maintain partition assignments
- Canary releases: Incremental rebalancing minimizes impact on stable consumers
Maintenance windows
Troubleshooting
Common issues and solutions:- Duplicate static IDs: Ensure unique
group.instance.idper consumer - Long session timeouts: Balance between stability and failure detection
- Assignment strategy conflicts: Ensure all consumers use compatible assignors
Performance impact
Before (eager rebalancing)
After (incremental + static)
Measurable improvements
- 90% reduction in processing downtime during rebalances
- 50% fewer unnecessary rebalances with static membership
- Improved throughput due to reduced processing interruptions
- Better consumer utilization with sticky partition assignments
See it in practice with ConduktorConduktor Console visualizes consumer group rebalances in real-time, showing which partitions are being reassigned and the rebalance duration. Monitor consumer lag during rebalances to verify that incremental rebalancing is minimizing disruption as expected.
Next steps
- Configure consumer group settings for optimal performance
- Understand delivery semantics for reliable processing
- Monitor consumer groups using CLI tools
- Troubleshoot consumer lag effectively