Preview functionality: Insights is currently a preview feature and is subject to change as we continue working on it.
Risk analysis is one of the sections in the Insights dashboard. It identifies topics with configuration issues that could impact your Kafka cluster’s reliability and performance.

Overview

Risk analysis monitors three critical aspects of topic configuration:
  • Replication factor - topics with insufficient data redundancy
  • Partition distribution - topics with sub-optimal partition allocation across brokers
  • Partition skew - topics with uneven data distribution across partitions
Each graph uses color coding to indicate severity level:
  • Red: critical risk requiring immediate attention
  • Orange: moderate risk that should be addressed
  • Green: healthy configuration

Replication factor

The replication factor graph displays topics organized by their replica count (1, 2, 3, etc.). Individual topics are listed below with their current RF settings. Low replication factors increase the risk of data loss if a broker fails. RF = 1 means no data redundancy: if the broker hosting that topic fails, its data becomes unavailable and may be lost permanently. RF = 3 is recommended for production environments. It provides the right balance between data safety and storage overhead, tolerating up to two broker failures without data loss.
  • Red (RF = 1): critical risk - no fault tolerance
  • Orange (RF = 2): moderate risk - can tolerate only one broker failure
  • Green (RF = 3+): adequate fault tolerance
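If you also have command-line access to the cluster, you can confirm a topic's replication factor outside Console. A minimal sketch using the standard Kafka CLI (the topic name and bootstrap address are placeholders):

```bash
# Describe a topic to confirm its replication factor.
# "orders" and localhost:9092 are placeholders for your topic and cluster.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders
# The summary line reports PartitionCount and ReplicationFactor,
# and each partition line lists its Leader, Replicas, and Isr.
```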

How to resolve this

Kafka does not allow changing the replication factor of an existing topic through configuration updates. You must either use partition reassignment to add replicas or recreate the topic with the desired replication factor.
First, identify the topic with the low replication factor:
1. Navigate to the topic

Navigate to Topics from the main menu and select the topic shown in the Risk Analysis dashboard.
2. Review current configuration

Click the Configuration tab and note the replication factor shown at the top of the page to confirm the current value. Replication factor is a topic-level setting that cannot be changed after topic creation through normal configuration updates.

There are two ways to fix it: partition reassignment or recreating the topic. Partition reassignment allows you to add replicas to existing topics without recreating them. This operation requires Kafka administrative tools external to Console:
1. Document current partition assignment

In Console, navigate to the topic and click the Partitions tab. Document the current replica assignments for all partitions.
2. Perform partition reassignment

Use Kafka administrative tools (such as kafka-reassign-partitions) to add additional replicas to the topic. This process replicates data across additional brokers in the background.

Partition reassignment requires creating a JSON file specifying new replica assignments and executing the reassignment using Kafka CLI tools. Set throttling limits to avoid impacting cluster performance during the operation.
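The exact commands live outside Console, but as a rough sketch of what this step looks like with the standard Kafka CLI (the topic name, broker IDs, bootstrap address, and throttle rate below are placeholders, and the plan should reflect the assignments you documented in step 1):

```bash
# Hypothetical reassignment plan: add a third replica for each partition of
# the topic "orders". Topic name, partition count, and broker IDs are
# placeholders.
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [1, 2, 3] },
    { "topic": "orders", "partition": 1, "replicas": [2, 3, 1] },
    { "topic": "orders", "partition": 2, "replicas": [3, 1, 2] }
  ]
}
EOF

# Execute the reassignment with a replication throttle (bytes/sec) so the
# background copy does not saturate the cluster.
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute --throttle 50000000
```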
3. Verify completion in Console

Return to the Partitions tab in Console and verify all partitions now show the increased replication factor.
This approach works without downtime and preserves existing data, making it the best option for production environments and topics with significant data.
To prevent the issue for future topics, set default.replication.factor=3 in the broker configuration and configure min.insync.replicas=2 to ensure writes are acknowledged by at least two replicas. Use RBAC permissions to prevent users from creating topics with RF < 3.
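As a sketch, the two broker settings mentioned above would typically live in each broker's server.properties (the values shown are the recommendations, not Kafka's defaults):

```properties
# Broker configuration (server.properties). These defaults apply to topics
# created without an explicit replication factor; existing topics are not
# affected.
default.replication.factor=3

# Require at least two in-sync replicas to acknowledge a write
# (effective for producers using acks=all).
min.insync.replicas=2
```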

Partition distribution

What it shows

The partition distribution graph displays topics grouped by partition count (1, 3, 4, 5, 6, 8, 10, 12, etc.). Topics are listed below with their total partition count. Viewing the full list will also show the replication factor and partition skew.

Why it matters

Uneven partition distribution creates hotspots where some brokers handle disproportionate load, leading to performance bottlenecks and reduced fault tolerance.

How to interpret

Optimal distribution spreads partitions evenly across all brokers with balanced leadership. Warning signs include partitions concentrated on specific brokers or partition counts that aren’t multiples of broker count (for example, 7 partitions on a 3-broker cluster).
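Outside Console, one rough way to check leadership balance is to count how many partitions each broker leads in the output of the standard describe command (topic name and bootstrap address are placeholders):

```bash
# Count how many partitions each broker leads for a topic.
# "orders" and localhost:9092 are placeholders.
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders |
  awk '{ for (i = 1; i < NF; i++) if ($i == "Leader:") print $(i + 1) }' |
  sort | uniq -c
# Output: a count per broker ID; large differences suggest a leadership imbalance.
```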

How to resolve this

Analyze current distribution:
1. Navigate to the topic

Go to Topics and select the affected topic from the Risk Analysis dashboard.
2. Review partition distribution

Click the Partitions tab and examine the distribution across brokers.
3. Switch views

Toggle between Per partition and Per broker views to understand the distribution pattern. Per broker view shows:
  • Which brokers lead which partitions
  • Which brokers hold follower replicas
  • Imbalances in partition leadership
4. Identify rebalancing needs

Look for:
  • Brokers with significantly more leader partitions than others
  • Brokers with no partitions for critical topics
  • Uneven distribution patterns that could cause hotspots

You can then rebalance leadership, reassign replicas, or add partitions.
To rebalance leadership, use Kafka administrative tools to trigger preferred leader election, which reassigns leadership to each partition's preferred leader without moving data. This lightweight operation is safe for production and should be run regularly.
Preferred leader election only changes which broker is the leader for each partition. It does not move data or change replica assignments.
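As a sketch with a recent Kafka distribution (2.4+), the election can be triggered from the CLI; the bootstrap address is a placeholder, and you can scope it to specific partitions with --topic/--partition or a JSON file instead of running it cluster-wide:

```bash
# Trigger preferred leader election for all topic partitions in the cluster.
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions
```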

Partition skew

What it shows

The partition skew graph displays topics grouped by skew ratio ranges:
  • < 0.25: Slight imbalance (green)
  • 0.25 - 0.75: Moderate imbalance (orange)
  • > 0.75: Severe imbalance (red)
Topics are listed below with their specific skew ratio values.

Why it matters

High partition skew causes performance problems (hot partitions, consumer lag), resource inefficiency (wasted parallelism, uneven disk usage), and may indicate poor partition key selection or producer misconfiguration.

How to interpret

The skew ratio compares the largest partition to the smallest:
Skew Ratio = (Largest Partition Size) / (Smallest Partition Size)
  • < 0.25 (green): Acceptable variation
  • 0.25 - 0.75 (orange): Monitor and investigate
  • > 0.75 (red): Immediate attention required

Root causes

  • Poor partition key selection - keys with uneven distribution, too few unique keys, or clustering around certain values
  • Producer configuration issues - manual partition assignment, a custom partitioner with flawed logic, or null keys
  • Data model problems - business logic creating natural hotspots, temporal patterns, or geographic clustering
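To make the first cause concrete: with Kafka's default partitioner, every message that shares a key lands on the same partition, so a few dominant keys translate directly into a few dominant partitions. A hedged sketch with the console producer (topic name, keys, and separator are illustrative):

```bash
# All "customer-42" messages hash to the same partition under the default
# partitioner, so a single hot key produces a single hot partition.
# Topic name, keys, and separator are placeholders.
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic orders \
  --property parse.key=true --property key.separator=: <<'EOF'
customer-42:order-1
customer-42:order-2
customer-42:order-3
customer-7:order-4
EOF
```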

How to resolve this

Diagnose the skew:
1. Navigate to the topic

Go to Topics and select the affected topic from the Risk Analysis dashboard.
2. Review partition details

Click the Partitions tab and select the Per partition view.
3. Identify imbalanced partitions

Compare the following columns across all partitions:
  • Total number of records - Shows message count per partition
  • Partition size - Shows disk space consumed
  • Begin offset and End offset - Shows the range of messages
Look for partitions with significantly higher values than others.
4. Document the pattern

Note which partitions are oversized and by how much. This will help identify the root cause.
Analyze message keys:
1. Navigate to the Consume tab

Click the Consume tab for the topic.
2. Configure consumer settings

Configure the consumer to read from All partitions to see the full data distribution.
3. Review message keys

Examine the keys in the consumed data. Look for patterns:
  • Are certain keys appearing far more frequently than others?
  • Are many messages using null keys?
  • Is there visible clustering in key values?
4. Filter by partition

Use the partition filter to consume from specific partitions (especially the largest and smallest) to compare key distributions.
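If you also want to check key distribution from the command line, a hedged sketch with the console consumer (topic name, partition number, and message counts are placeholders; print.partition requires a reasonably recent Kafka version):

```bash
# Sample messages with their key and partition to eyeball key distribution.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders \
  --from-beginning --max-messages 1000 \
  --property print.key=true --property print.partition=true

# Read only one partition (for example, the largest) to compare key patterns.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders \
  --partition 3 --offset earliest --max-messages 1000 --property print.key=true
```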
You cannot rebalance existing data across partitions, so focus on preventing future skew: set up alerts for partition size differences, review the Risk Analysis dashboard regularly, and monitor consumer lag by partition to identify performance impacts.

Troubleshooting

Several factors can cause skew even with well-designed partition keys:
  • Time-based patterns: Temporal clustering (business hours vs. night) creates natural skew based on when data was produced
  • Compaction: Log compacted topics retain more messages in partitions with higher key diversity
  • Retention: Uneven produce rates over time mean partitions contain data from different periods
  • Producer failures: Restarts or errors may temporarily cluster messages on specific partitions
  • Natural data distribution: Some business scenarios naturally create skew (one customer generating 80% of orders)
Analyze skew trends over time (days or weeks) rather than point-in-time snapshots. If skew is transient and self-correcting, monitor but don’t take action. Use retention-based cleanup to eventually age out historical skewed data.
Can I increase the replication factor of an existing topic?

Yes, use partition reassignment to add replicas to existing topics through Kafka administrative tools:
  1. View current replica assignments in Console’s Partitions tab
  2. Create a reassignment plan specifying new replica assignments with additional broker IDs
  3. Execute the reassignment using Kafka CLI tools (data replicates in the background)
  4. Monitor progress and verify completion in Console
This approach increases replication without downtime or data loss.
How many partitions should a topic have?

Key considerations:
  • Throughput: More partitions = more parallelism and higher potential throughput
  • Consumer count: You need at least as many partitions as consumers for full parallelism
  • Broker count: Choose a partition count that’s a multiple of broker count for even distribution
  • Message ordering: Ordering is only guaranteed within a single partition
  • Overhead: Each partition adds metadata overhead. Tens of thousands of partitions can cause performance issues
General guideline: start with max(# of consumers × 2, # of brokers × 2) and adjust based on monitoring. Examples:
  • 6-broker cluster with 10 consumers: 20 partitions
  • 6-broker cluster with 3 consumers: 12 partitions
  • 12-broker cluster with 50 consumers: 100 partitions
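For instance, applying the guideline to the first example above (max(10 × 2, 6 × 2) = 20) when creating a topic; the topic name and bootstrap address are placeholders:

```bash
# 6-broker cluster with 10 consumers: max(10 x 2, 6 x 2) = 20 partitions.
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic orders \
  --partitions 20 --replication-factor 3
```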
Why does a newly created topic show high skew?

This is normal and expected for newly created topics. Initial messages create skew as partitions receive different amounts of data before distribution stabilizes. Expected behavior:
  • First 100-1000 messages: High skew is normal
  • After 1000+ messages: Skew should normalize if partition keys are well-distributed
  • After 24-48 hours: Skew ratios should stabilize
Monitor skew over 24-48 hours or after at least 10,000 messages before taking corrective action. If skew remains in the red range after this period, investigate partition key selection.
What is the impact of partition reassignment?

Partition reassignment impacts network traffic, disk I/O, client latency, and broker CPU. Duration depends on data volume: small topics (< 1 GB) complete in minutes; large topics (> 1 TB) can take hours or days. Best practices to minimize impact:
  • Schedule during low-traffic periods
  • Use throttling to prevent saturating network bandwidth
  • Monitor cluster metrics (CPU, disk I/O, network throughput, client latency) using Console
  • For very large topics, reassign partitions in batches
  • Adjust throttle dynamically based on traffic patterns
  • Remove throttle after completion
Always set a throttle value when performing partition reassignment in production. Un-throttled reassignment can impact client operations and cause outages.
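As a sketch of throttle management with the standard CLI (plan file name, bootstrap address, and rates are placeholders):

```bash
# Change the throttle while a reassignment is in flight by re-running
# --execute with the same plan file and a new --throttle value (bytes/sec).
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute --throttle 100000000

# Once the reassignment has completed, --verify confirms the result and
# removes the throttle configuration.
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --verify
```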