Learn how to monitor Kafka clusters and master essential operations in 15 minutes

Effective monitoring is critical for running Kafka reliably. This guide covers how Kafka exposes metrics, which metrics to prioritize, and the operational procedures you need to master.

What you’ll learn:
  • How Kafka exposes metrics via JMX
  • Key metrics to monitor for cluster health
  • Common monitoring tools and integrations
  • Essential operational procedures

Kafka monitoring

Kafka runs on the JVM and exposes all metrics via Java Management Extensions (JMX). You can collect these metrics using agents that attach to the Kafka process.
| Tool | Type | Notes |
| --- | --- | --- |
| Prometheus | Open source | Popular with Grafana dashboards |
| Datadog | SaaS | Built-in Kafka integration |
| New Relic | SaaS | APM with Kafka support |
| ELK Stack | Open source | Log aggregation + metrics |
| Confluent Control Center | Commercial | Kafka-specific tooling |

Kafka metrics to monitor

| Metric | Description | Alert threshold |
| --- | --- | --- |
| UnderReplicatedPartitions | Partitions where followers are behind the leader | > 0 for extended periods |
| OfflinePartitionsCount | Partitions with no available leader | > 0 (critical) |
| ActiveControllerCount | Number of active controllers in the cluster | != 1 (critical) |
| RequestHandlerAvgIdlePercent | Share of time request handler threads are idle (low values mean the thread pool is saturated) | < 20% |
| RequestQueueSize | Pending requests | Growing over time |
| NetworkProcessorAvgIdlePercent | Share of time network threads are idle (low values mean the network threads are saturated) | < 30% |
| LogFlushLatency | Time to flush to disk | > baseline |
| FetchConsumerTotalTimeMs | Consumer request latency | > baseline |
| ProduceTotalTimeMs | Producer request latency | > baseline |
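The fixed thresholds in the table above can be turned into a small rule check. The following is a minimal sketch, not a production alerting setup. It assumes that some collector has already produced a cluster-level snapshot as a plain dictionary, that `ActiveControllerCount` has been summed across brokers, and that the idle metrics are reported as fractions between 0 and 1. The `Rule` class and `evaluate` function are hypothetical names chosen for illustration. The "for extended periods" and "> baseline" conditions need history and are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str
    severity: str
    breached: Callable[[float], bool]


# Thresholds taken from the table above. Idle ratios are assumed to be
# fractions between 0 and 1, so "< 20%" becomes "< 0.20".
RULES: List[Rule] = [
    Rule("OfflinePartitionsCount", "critical", lambda v: v > 0),
    Rule("ActiveControllerCount", "critical", lambda v: v != 1),
    Rule("UnderReplicatedPartitions", "warning", lambda v: v > 0),
    Rule("RequestHandlerAvgIdlePercent", "warning", lambda v: v < 0.20),
    Rule("NetworkProcessorAvgIdlePercent", "warning", lambda v: v < 0.30),
]


def evaluate(snapshot: Dict[str, float]) -> List[str]:
    """Return one 'severity: metric=value' string per breached rule.

    Metrics missing from the snapshot are skipped rather than treated
    as breaches.
    """
    alerts = []
    for rule in RULES:
        value = snapshot.get(rule.metric)
        if value is not None and rule.breached(value):
            alerts.append(f"{rule.severity}: {rule.metric}={value}")
    return alerts
```

A real deployment would usually express these rules in the alerting system that comes with the chosen monitoring tool and add a duration condition, so that a brief spike in under-replicated partitions during a restart does not page anyone.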
Kafka exposes its metrics through JMX, but Java agents and vendor integrations (for example, Prometheus) can collect them or re-expose them on a separate port. Beyond broker metrics, client metrics (producers, consumers, Kafka Streams, Kafka Connect) are just as important to collect and monitor. This page is an introduction; more content on metrics and monitoring in Apache Kafka is planned.
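When an agent re-exposes broker metrics over HTTP for Prometheus, the result is a plain-text page that is easy to inspect by hand. The sketch below parses that text format with the standard library only. It handles the common `name{labels} value` lines and ignores comments and timestamps; it is not a complete parser for the format. The endpoint URL and the metric name in the usage note are assumptions, since both depend on how the agent is configured.

```python
import re
from typing import Dict
from urllib.request import urlopen

# Matches sample lines such as:
#   metric_name 12.0
#   metric_name{label="x"} 12.0 1700000000000
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text: str) -> Dict[str, float]:
    """Map 'name{labels}' to its value, skipping comments and bad lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, labels, value = match.groups()
        try:
            samples[name + (labels or "")] = float(value)
        except ValueError:
            continue  # value was not a number; ignore the line
    return samples


def scrape(url: str, timeout: float = 5.0) -> Dict[str, float]:
    """Fetch a metrics page over HTTP and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

For example, `scrape("http://broker-1:7071/metrics")` (a hypothetical host and port) would return a dictionary in which a key such as `kafka_server_replicamanager_underreplicatedpartitions` might appear, depending on the naming rules the agent applies to JMX MBeans.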

References

There are many metrics exposed by Kafka providing information about nearly every function. To learn more about them, these references are very helpful:

Kafka cluster operations

Running Kafka day to day involves a number of procedures that you need to learn and practice before you can perform them safely. These include:
  • Rolling restart of brokers
  • Updating configurations
  • Rebalancing partitions
  • Increasing the replication factor
  • Adding a broker
  • Replacing a broker
  • Removing a broker
  • Upgrading a Kafka cluster with zero downtime
Managing your own cluster comes with all of these responsibilities and more. Don’t forget to monitor producer and consumer metrics: client-side metrics often reveal problems before broker metrics do.
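To give a feel for one of the operations listed above, increasing the replication factor is typically done by handing a reassignment plan to the `kafka-reassign-partitions` tool. The sketch below only builds that JSON plan. The function name, the input shape (a mapping of partition number to current replica broker IDs), and the least-loaded placement strategy are assumptions made for illustration; a real plan should also account for racks, disk usage, and throttling.

```python
import json
from collections import Counter
from typing import Dict, List


def plan_replication_increase(
    topic: str,
    current: Dict[int, List[int]],
    brokers: List[int],
    target_rf: int,
) -> str:
    """Build a reassignment plan that raises every partition to target_rf.

    Existing replicas keep their position, so the preferred leader of
    each partition does not change. New replicas go to the broker that
    currently holds the fewest replicas of this topic.
    """
    if target_rf > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")

    # Count how many replicas of this topic each broker already holds.
    load = Counter({broker: 0 for broker in brokers})
    for replicas in current.values():
        load.update(replicas)

    partitions = []
    for partition in sorted(current):
        replicas = list(current[partition])
        while len(replicas) < target_rf:
            candidates = [b for b in brokers if b not in replicas]
            chosen = min(candidates, key=lambda b: (load[b], b))
            replicas.append(chosen)
            load[chosen] += 1
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )

    return json.dumps({"version": 1, "partitions": partitions}, indent=2)
```

The resulting JSON can be saved to a file and passed to `kafka-reassign-partitions` with `--reassignment-json-file` and `--execute`, then checked with `--verify`. Reviewing the plan by hand before executing it is a sensible precaution.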
See it in practice with Conduktor

Conduktor Console provides built-in monitoring for broker health, partition status, consumer lag, and throughput metrics. Set up alerts without configuring JMX agents or external monitoring systems.

The Insights dashboard analyzes your cluster and identifies topics at risk of data loss, poor cluster efficiency, or load imbalance. Monitor business-critical VIP topics and track governance metrics like schema adoption across your infrastructure.
