Risk analysis is one of the sections in the Insights dashboard. It identifies topics with configuration issues that could impact your Kafka cluster’s reliability and performance.

Overview

Risk analysis monitors three critical aspects of topic configuration:
  • Data loss risk: topics with insufficient data redundancy based on replication factor (RF) and min in-sync replicas
  • Cluster efficiency: topics with sub-optimal partition allocation across brokers
  • Load imbalance risk: topics with uneven data distribution across partitions
Each graph uses color coding to indicate severity level:
  • Red: high risk requiring immediate attention
  • Orange/Yellow: medium risk that should be addressed
  • Green: low risk or healthy configuration
Below the graphs, a Topic health overview table shows topics with recommendations, displaying key metrics and warning indicators to help you prioritize remediation.

Topic health overview

The topic health overview table provides a detailed view of topics that need attention, displaying key configuration metrics and warning indicators.
  • Topic: the topic name, with its labels displayed for identification
  • Min in-sync replicas: minimum number of replicas that have to acknowledge writes
  • Partitions: total partition count for the topic
  • Avg. msg size: average message size in bytes
  • Replication factor: number of replicas for each partition
  • Skew: partition imbalance percentage showing data distribution

Warning indicators

Topics display warning indicators when they meet the following thresholds:
  • Average message size > 600KB - Large messages that may cause performance issues
  • Replication factor < 3 - Insufficient redundancy for production environments
  • Skew > 75% with at least 1000 messages per partition - Severe partition imbalance requiring attention
Use this table to identify which topics need immediate remediation and understand the specific configuration issues affecting each topic.

Data loss risk

The data loss risk graph displays topics categorized by their vulnerability to data loss based on a combination of the replication factor (RF) and min in-sync replicas configuration. Topics are classified into three risk levels:
  • High risk (red): topics with RF = 1 (no data redundancy). If the broker hosting the topic fails, all data becomes unavailable permanently.
  • Medium risk (orange): topics with RF = 2, or topics with RF > 2 but min.insync.replicas < RF-1. Limited fault tolerance or insufficient write durability guarantees.
  • Low risk (green): topics with RF ≥ 3 and min.insync.replicas = RF-1. Provides adequate fault tolerance and write durability for production environments.
RF = 3 with min.insync.replicas = 2 is recommended for production environments. This provides the right balance between data safety and storage overhead, tolerating one broker failure without data loss while ensuring writes are acknowledged by multiple replicas.
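For reference, the same recommendation expressed with the Kafka Admin API. This is a minimal sketch, assuming a hypothetical topic named orders and a broker reachable at localhost:9092; adjust names and connection details to your environment.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address

        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3, writes acknowledged by at least 2 replicas
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```

The equivalent at the command line is kafka-topics.sh --create with --replication-factor 3 and --config min.insync.replicas=2.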

Resolve data loss risk

Kafka does not allow changing the replication factor of an existing topic through configuration updates. You must use partition reassignment to add replicas, or recreate the topic with the required replication factor.
Identify the topic with a low replication factor:
  1. Go to the topic - go to Topics from the main menu and select the topic shown in the graph.
  2. Review current configuration - click the Configuration tab and note the replication factor shown at the top of the page to confirm the current value. Replication factor is a topic-level setting that can’t be changed after topic creation through normal configuration updates.
Partition reassignment allows you to add replicas to existing topics without recreating them. This operation requires Kafka administrative tools external to Console:
  1. Document current partition assignment - in Console, go to the topic and click the Partitions tab. Document the current replica assignments for all partitions.
  2. Perform partition reassignment - use Kafka administrative tools (such as kafka-reassign-partitions) to add additional replicas to the topic. This process replicates data across additional brokers in the background.
    Partition reassignment requires creating a JSON file specifying new replica assignments and executing the reassignment using Kafka CLI tools. Set throttling limits to avoid impacting cluster performance during the operation (an equivalent sketch using the Kafka Admin API follows these steps).
  3. Verify completion in Console - return to the Partitions tab in Console and verify all partitions now show the increased replication factor.
This approach works without downtime and preserves existing data, making it the best option for production environments and topics with significant data.
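The steps above assume the standard CLI workflow (kafka-reassign-partitions with a JSON plan). The same operation is also exposed through the Kafka Admin API; the following is a minimal sketch, not a drop-in procedure, assuming broker IDs 1-3 and a hypothetical topic named orders with two partitions.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class AddReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address

        try (Admin admin = Admin.create(props)) {
            // Target replica sets per partition; broker IDs 1, 2, 3 are assumptions.
            // The first broker in each list becomes the preferred leader.
            Map<TopicPartition, Optional<NewPartitionReassignment>> plan = Map.of(
                    new TopicPartition("orders", 0),
                    Optional.of(new NewPartitionReassignment(List.of(1, 2, 3))),
                    new TopicPartition("orders", 1),
                    Optional.of(new NewPartitionReassignment(List.of(2, 3, 1))));

            admin.alterPartitionReassignments(plan).all().get();

            // The data copy runs in the background; poll until no reassignment is in flight.
            while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                Thread.sleep(5_000);
            }
            System.out.println("Reassignment complete");
        }
    }
}
```

Because the first broker ID in each replica list becomes the preferred leader, vary the ordering across partitions so leadership stays balanced.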
To prevent the issue from recurring, set default.replication.factor=3 in the broker configuration and configure min.insync.replicas=2 to ensure writes are acknowledged by at least two replicas. Use RBAC permissions to prevent users from creating topics with RF < 3.
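Unlike the replication factor, min.insync.replicas can be changed on an existing topic with a dynamic configuration update. A minimal sketch using the Admin API, again assuming a hypothetical topic named orders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address

        try (Admin admin = Admin.create(props)) {
            // Raise the topic's min.insync.replicas to 2 without recreating it.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp setMinIsr = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setMinIsr))).all().get();
        }
    }
}
```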

Cluster efficiency

The cluster efficiency graph displays topics categorized by partition allocation patterns that affect broker load distribution and resource utilization. Topics are classified into two categories:
  • Ideal (green): topics with 4 or fewer partitions, or topics with partition counts divisible by 6 (6, 12, 18, 24, etc.). These partition counts distribute evenly across typical broker configurations.
  • Poor (orange): topics with more than 4 partitions where the partition count is not divisible by 6. These configurations may create uneven distribution across brokers, leading to hotspots and suboptimal resource utilization.
Optimal partition distribution spreads partitions evenly across all brokers with balanced leadership. Ideal partition counts facilitate even distribution in clusters with 3, 6, or 12 brokers (common deployment patterns). For example, 12 partitions spread evenly across 6 brokers (two per broker), while 10 partitions leave four brokers with two partitions and two brokers with only one.

Resolve cluster efficiency issues

Analyze current distribution:
  1. Go to the topic - go to Topics and select the affected topic.
  2. Review partition distribution - click the Partitions tab and examine the distribution across brokers.
  3. Switch views - toggle between Per partition and Per broker views to understand the distribution pattern. Per broker view shows:
    • Which brokers lead which partitions
    • Which brokers hold follower replicas
    • Imbalances in partition leadership
  4. Identify rebalancing needs - look for:
    • Brokers with significantly more leader partitions than others
    • Brokers with no partitions for critical topics
    • Uneven distribution patterns that could cause hotspots
Use Kafka administrative tools to trigger preferred leader election, which reassigns leadership to each partition’s preferred leader without moving data. This lightweight operation is safe for production and should be run regularly.
Preferred leader election only changes which broker is the leader for each partition. It does not move data or change replica assignments.
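If you prefer to script this rather than run the CLI tool (kafka-leader-election), the Admin API exposes the same operation. A minimal sketch, assuming a hypothetical topic named orders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

import java.util.Properties;
import java.util.Set;

public class PreferredLeaderElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address

        try (Admin admin = Admin.create(props)) {
            // Limit the election to specific partitions; passing null instead of a set
            // runs the preferred election for every partition in the cluster.
            Set<TopicPartition> partitions = Set.of(
                    new TopicPartition("orders", 0),
                    new TopicPartition("orders", 1));
            admin.electLeaders(ElectionType.PREFERRED, partitions).partitions().get();
        }
    }
}
```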

Load imbalance risk

The load imbalance risk graph displays topics with uneven data distribution across partitions, measured by partition skew percentage. High partition skew causes performance problems (hot partitions, consumer lag) and resource inefficiency (wasted parallelism, uneven disk usage), and may indicate poor partition key selection or producer misconfiguration. Topics are categorized by skew percentage:
  • < 25% (green): slight imbalance with acceptable variation in partition sizes
  • 25% - 75% (orange): moderate imbalance that should be monitored and investigated
  • > 75% (red): severe imbalance requiring immediate attention to prevent performance issues
The skew percentage represents the degree of imbalance in data distribution across partitions. Higher percentages indicate some partitions are significantly larger than others, which can create processing bottlenecks and uneven consumer load.

Possible causes

  • Poor partition key selection - keys with uneven distribution, too few unique keys, or clustering around certain values
  • Producer configuration issues - manual partition assignment, custom partitioner with flawed logic, or null keys
  • Data model problems - business logic creating natural hotspots, temporal patterns, or geographic clustering

Resolve load imbalance

Diagnose the skew:
  1. Go to the topic - go to Topics and select the affected topic.
  2. Review partition details - click the Partitions tab and select the Per partition view.
  3. Identify imbalanced partitions - compare the following columns across all partitions:
    • Total number of records - Shows message count per partition
    • Partition size - Shows disk space consumed
    • Begin offset and End offset - Shows the range of messages
    Look for partitions with significantly higher values than others.
  4. Document the pattern - note which partitions are oversized and by how much. This will help identify the root cause. (A scripted check using consumer offsets is sketched after these steps.)
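The manual comparison in step 3 can also be scripted: begin and end offsets per partition give a rough record count, which is enough to spot heavily skewed partitions. A minimal sketch with the Java consumer, assuming a hypothetical topic named orders; note that the percentage printed here is only a rough indicator and is not necessarily computed the same way as the Skew column in Console.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class PartitionSkewCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("orders").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);

            // Retained records per partition, approximated as end offset minus begin offset.
            Map<TopicPartition, Long> counts = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> end.get(tp) - begin.get(tp)));

            long max = counts.values().stream().mapToLong(Long::longValue).max().orElse(0);
            double avg = counts.values().stream().mapToLong(Long::longValue).average().orElse(0);
            counts.forEach((tp, n) -> System.out.printf("%s -> %d records%n", tp, n));
            System.out.printf("largest partition holds %.0f%% more records than the average%n",
                    avg == 0 ? 0 : (max - avg) / avg * 100);
        }
    }
}
```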
Analyze message keys:
  1. Go to the Consume tab - click the Consume tab for the topic.
  2. Configure consumer settings - configure the consumer to read from All partitions to see the full data distribution.
  3. Review message keys - examine the keys in the consumed data. Look for patterns:
    • Are certain keys appearing far more frequently than others?
    • Are many messages using null keys?
    • Is there visible clustering in key values?
  4. Filter by partition - use the partition filter to consume from specific partitions (especially the largest and smallest) to compare key distributions. (A key-sampling sketch follows these steps.)
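Key analysis can likewise be scripted by sampling messages from all partitions and counting key frequencies. A minimal sketch, assuming string-serialized keys and a hypothetical topic named orders; swap the deserializers to match your data.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class KeyDistributionSample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign all partitions and start from the earliest retained offsets.
            List<TopicPartition> partitions = consumer.partitionsFor("orders").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            Map<String, Long> keyCounts = new HashMap<>();
            int sampled = 0;
            while (sampled < 10_000) {   // sample size is arbitrary; adjust to your traffic
                var records = consumer.poll(Duration.ofSeconds(2));
                if (records.isEmpty()) break;
                for (ConsumerRecord<String, String> r : records) {
                    keyCounts.merge(r.key() == null ? "<null>" : r.key(), 1L, Long::sum);
                    sampled++;
                }
            }

            // Print the ten most frequent keys; a handful of dominant keys points to key clustering.
            keyCounts.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(10)
                    .forEach(e -> System.out.printf("%-30s %d%n", e.getKey(), e.getValue()));
        }
    }
}
```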
Kafka can’t rebalance existing data in partitions. Focus on preventing future skew:
Set up alerts for partition size differences and review the Risk Analysis dashboard regularly. Monitor consumer lag by partition to identify performance impacts.

Troubleshoot

Why does a topic show imbalance even with well-designed partition keys?
Several factors can cause imbalance even with well-designed partition keys:
  • Time-based patterns: Temporal clustering (business hours vs. night) creates natural imbalance based on when data was produced
  • Compaction: Log compacted topics retain more messages in partitions with higher key diversity
  • Retention: Uneven produce rates over time mean partitions contain data from different periods
  • Producer failures: Restarts or errors may temporarily cluster messages on specific partitions
  • Natural data distribution: Some business scenarios naturally create imbalance (one customer generating 80% of orders)
Analyze imbalance trends over time (days or weeks) rather than point-in-time snapshots. If imbalance is transient and self-correcting, monitor but don’t take action. Use retention-based cleanup to eventually age out historical imbalanced data.
Can I increase the replication factor of an existing topic?
Yes, use partition reassignment to add replicas to existing topics through Kafka administrative tools:
  1. View current replica assignments in Console’s Partitions tab
  2. Create a reassignment plan specifying new replica assignments with additional broker IDs
  3. Execute the reassignment using Kafka CLI tools (data replicates in the background)
  4. Monitor progress and verify completion in Console
This approach increases replication without downtime or data loss.
How many partitions should a topic have?
Key considerations:
  • Throughput: More partitions = more parallelism and higher potential throughput
  • Consumer count: You need at least as many partitions as consumers for full parallelism
  • Broker count: Choose a partition count that’s a multiple of broker count for even distribution
  • Message ordering: Ordering is only guaranteed within a single partition
  • Overhead: Each partition adds metadata overhead. Tens of thousands of partitions can cause performance issues
General guideline: Use 4 or fewer partitions for low-volume topics. For topics requiring more partitions, use multiples of 6 (6, 12, 18, 24, etc.). Examples:
  • Low-volume topic: 1-4 partitions
  • Standard topic: 6 or 12 partitions
  • High-throughput topic: 18, 24, 30 or more partitions (multiples of 6)
Why does a newly created topic show high load imbalance?
This is normal and expected for newly created topics. Initial messages create imbalance as partitions receive different amounts of data before distribution stabilizes. Expected behavior:
  • First 100-1000 messages: High imbalance is normal
  • After 1000+ messages: Imbalance should normalize if partition keys are well-distributed
  • After 24-48 hours: Imbalance percentages should stabilize
Monitor load imbalance over 24-48 hours or after at least 10,000 messages before taking corrective action. If imbalance remains high (> 75%) after this period, investigate partition key selection.
What is the impact of partition reassignment, and how long does it take?
Partition reassignment impacts network traffic, disk I/O, client latency, and broker CPU. Duration depends on data volume: small topics (< 1 GB) complete in minutes; large topics (> 1 TB) can take hours or days. Best practices to minimize impact:
  • Schedule during low-traffic periods
  • Use throttling to prevent saturating network bandwidth
  • Monitor cluster metrics (CPU, disk I/O, network throughput, client latency) using Console
  • For very large topics, reassign partitions in batches
  • Adjust throttle dynamically based on traffic patterns
  • Remove throttle after completion
Always set a throttle value when performing partition reassignment in production; un-throttled reassignment can impact client operations and cause outages. (A sketch of setting and removing the throttle follows.)
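The --throttle flag of kafka-reassign-partitions sets the relevant broker and topic configurations for you. If you manage the throttle yourself, the broker-level rate limits can be applied (and later removed) through dynamic configuration. A minimal sketch with the Admin API, assuming broker IDs 1-3 and a limit of roughly 50 MB/s:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ReassignmentThrottle {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address

        try (Admin admin = Admin.create(props)) {
            // Cap replication traffic at roughly 50 MB/s on each broker involved in the move.
            Collection<AlterConfigOp> setThrottle = List.of(
                    new AlterConfigOp(new ConfigEntry("leader.replication.throttled.rate", "50000000"),
                            AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("follower.replication.throttled.rate", "50000000"),
                            AlterConfigOp.OpType.SET));

            // Broker IDs 1-3 are assumptions; apply the limit to every broker that sends or receives replicas.
            admin.incrementalAlterConfigs(Map.of(
                    new ConfigResource(ConfigResource.Type.BROKER, "1"), setThrottle,
                    new ConfigResource(ConfigResource.Type.BROKER, "2"), setThrottle,
                    new ConfigResource(ConfigResource.Type.BROKER, "3"), setThrottle)).all().get();

            // Note: the CLI tool additionally sets (leader|follower).replication.throttled.replicas on the
            // topic to scope the throttle to the moving replicas; this sketch shows only the broker-side rate.
            // After the reassignment completes, remove the limits with AlterConfigOp.OpType.DELETE
            // on the same configuration names.
        }
    }
}
```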
What is the difference between under-replicated partitions and under min ISR partitions?
These are related but different conditions.
Under-replicated partitions:
  • A partition has fewer in-sync replicas than its configured replication factor
  • Caused by broker failures, network issues or replicas falling behind
  • Reduces fault tolerance but doesn’t immediately block writes
  • Example: Topic with RF=3 but only 2 replicas are in-sync
Under min ISR partitions:
  • The number of in-sync replicas falls below the min.insync.replicas setting
  • Blocks producers configured with acks=all from writing
  • More severe condition indicating immediate data durability risk
  • Example: min.insync.replicas=2 but only 1 replica is in-sync
A partition can be under-replicated without being under min ISR (if enough replicas remain in-sync to meet the minimum). However, under min ISR partitions are always also under-replicated.
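Both conditions can be checked programmatically by comparing each partition’s ISR size to its replica count and to the topic’s min.insync.replicas. A minimal sketch with the Admin API, assuming a hypothetical topic named orders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Properties;

public class IsrHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical bootstrap address
        String topicName = "orders";

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of(topicName))
                    .all().get().get(topicName);

            // Read the effective min.insync.replicas for the topic (falls back to the broker default).
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
            Config config = admin.describeConfigs(List.of(resource)).all().get().get(resource);
            int minIsr = Integer.parseInt(config.get("min.insync.replicas").value());

            for (TopicPartitionInfo p : desc.partitions()) {
                boolean underReplicated = p.isr().size() < p.replicas().size();
                boolean underMinIsr = p.isr().size() < minIsr;
                System.out.printf("partition %d: replicas=%d isr=%d underReplicated=%b underMinIsr=%b%n",
                        p.partition(), p.replicas().size(), p.isr().size(), underReplicated, underMinIsr);
            }
        }
    }
}
```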