This page provides best practices and recommendations for monitoring and alerting on Conduktor Gateway deployments. Use these as starting points and tune thresholds to match your environment’s baseline. Before you start, make sure you have set up monitoring and can access the Gateway Prometheus endpoint. Example Grafana dashboards are available on GitHub, in the Conduktor Helm package at charts/gateway/grafana-dashboards.

Request flow overview

The diagram below shows how traffic flows through Gateway, and where key metrics are captured. Use it as a reference when choosing what to monitor. The direction label on byte metrics refers to the direction of data flow: upstream means bytes flowing from clients toward Kafka, and downstream means bytes flowing from Kafka toward clients. Where metrics are captured:
| Boundary | Key metrics |
|---|---|
| Client to Gateway | gateway_active_connections_vcluster, gateway_bytes_exchanged{direction="upstream"} |
| Gateway to Kafka | gateway_upstream_connections_upstream_connected, gateway_bytes_exchanged_vcluster{direction="upstream"} |
| Kafka to Gateway | gateway_bytes_exchanged_vcluster{direction="downstream"} |
| Gateway to Client | gateway_bytes_exchanged{direction="downstream"} |
| Round-trip | gateway_latency_request_response |
| Inside Gateway | gateway_current_inflight_apiKeys, gateway_thread_tasks |
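As a sanity check on the boundaries above, you can graph byte rates grouped by the direction label. This is a sketch; adjust the range window to your scrape interval:

```promql
# Client-side traffic through Gateway, split by direction.
# "upstream" = bytes from clients toward Kafka, "downstream" = bytes from Kafka toward clients.
sum by (direction) (rate(gateway_bytes_exchanged[5m]))

# The same view per Virtual Cluster, on the Gateway-to-Kafka boundary.
sum by (vcluster, direction) (rate(gateway_bytes_exchanged_vcluster[5m]))
```

If the upstream and downstream rates diverge sharply from their usual ratio, check whether producers or consumers changed behavior before suspecting Gateway itself.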

Availability and license

Gateway down

Alert when the Gateway instance is unreachable.
alert: GatewayDown
expr: up{job="conduktor-gateway"} == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Gateway instance {{ $labels.instance }} is down"

License expiring

Track gateway_license_remaining_days and alert at two thresholds to give time for renewal.
alert: GatewayLicenseExpiringSoon
expr: gateway_license_remaining_days < 14
for: 1h
labels:
  severity: warning
annotations:
  summary: "Gateway license expires in {{ $value }} days"
alert: GatewayLicenseCritical
expr: gateway_license_remaining_days < 3
for: 5m
labels:
  severity: critical
annotations:
  summary: "Gateway license expires in {{ $value }} days - renew immediately"

Kafka node loss

Track gateway_upstream_io_nodes to detect when Gateway loses visibility of Kafka brokers.
alert: GatewayKafkaNodeLoss
expr: gateway_upstream_io_nodes < 3
for: 2m
labels:
  severity: critical
annotations:
  summary: "Gateway sees only {{ $value }} Kafka nodes (expected 3+)"
Adjust the threshold to match your Kafka cluster size.

Connections

Kafka connections are long-lived. In a stable environment, the active connection count should be relatively constant. Establish a baseline for your deployment and alert on unusual deviations.
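One way to alert on "unusual deviations" without hardcoding a count is a simple z-score against a rolling baseline. A minimal sketch, assuming a one-day window is representative of your traffic pattern; tune the window and multiplier to your environment:

```yaml
# Fires when the current connection count is more than 3 standard
# deviations away from its 1-day rolling average.
alert: GatewayConnectionsAnomalous
expr: >
  abs(
    gateway_active_connections_vcluster
    - avg_over_time(gateway_active_connections_vcluster[1d])
  ) > 3 * stddev_over_time(gateway_active_connections_vcluster[1d])
for: 10m
labels:
  severity: warning
annotations:
  summary: "Connection count on vcluster {{ $labels.vcluster }} deviates from its 1-day baseline"
```

This complements, rather than replaces, the absolute-threshold alerts below.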

Client connections

Monitor gateway_active_connections_vcluster for sudden drops (client disconnects) or spikes (connection storms).
alert: GatewayConnectionsDrop
expr: delta(gateway_active_connections_vcluster[5m]) < -50
for: 2m
labels:
  severity: warning
annotations:
  summary: "Sudden drop in client connections on vcluster {{ $labels.vcluster }}"
alert: GatewayConnectionsHigh
expr: gateway_active_connections_vcluster > 1000
for: 5m
labels:
  severity: warning
annotations:
  summary: "High client connection count ({{ $value }}) on vcluster {{ $labels.vcluster }}"

Upstream connections

gateway_upstream_connections_upstream_connected tracks the number of connections from Gateway to the backing Kafka cluster. This should be stable in an established environment. A sudden change could indicate Kafka broker issues or Gateway restarts.
alert: GatewayUpstreamConnectionsDrop
expr: delta(gateway_upstream_connections_upstream_connected[5m]) < -5
for: 2m
labels:
  severity: warning
annotations:
  summary: "Upstream connections dropped by {{ $value }} in 5 minutes"
gateway_upstream_connection_close_rate tracks how frequently upstream connections are closed. A high close rate relative to creation rate signals connection churn, which degrades performance.
alert: GatewayUpstreamConnectionChurn
expr: gateway_upstream_connection_close_rate > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High upstream connection churn ({{ $value }} closes/s)"
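To judge churn relative to pool size rather than as an absolute rate, you can divide the close rate by the current upstream connection count. A sketch using only the metrics described above (no separate creation-rate metric is assumed):

```promql
# Closes per second relative to the current upstream pool size.
# A ratio near 1 means each connection is recycled roughly once per second,
# i.e. severe churn; a healthy pool should sit close to 0.
gateway_upstream_connection_close_rate
  / gateway_upstream_connections_upstream_connected
```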

Authentication failures

Track gateway_failed_authentications to detect clients stuck in authentication loops or brute-force attempts.
alert: GatewayAuthFailuresHigh
expr: rate(gateway_failed_authentications_total[5m]) > 10
for: 2m
labels:
  severity: warning
annotations:
  summary: "High authentication failure rate ({{ $value }}/s) for user {{ $labels.user }}"

Throughput

Overall data flow

gateway_bytes_exchanged tracks the total bytes exchanged between clients and Gateway, tagged by direction (upstream from clients toward Kafka, downstream from Kafka toward clients). Use this to monitor overall traffic volume and detect anomalies.
alert: GatewayTrafficDrop
expr: rate(gateway_bytes_exchanged[10m]) < 0.5 * rate(gateway_bytes_exchanged[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway traffic dropped by more than 50% compared to 1 hour ago"

Per-Virtual Cluster throughput

gateway_bytes_exchanged_vcluster breaks down bytes exchanged per Virtual Cluster. This is useful for:
  • Identifying which tenants generate the most traffic
  • Capacity planning per Virtual Cluster
  • Detecting unexpected traffic spikes from specific tenants
alert: GatewayVClusterTrafficSpike
expr: rate(gateway_bytes_exchanged_vcluster[5m]) > 2 * rate(gateway_bytes_exchanged_vcluster[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Traffic spike on vcluster {{ $labels.vcluster }} — {{ $value | humanize }}B/s (2x above baseline)"
For per-topic granularity, use gateway_bytes_exchanged_topic_total to identify hot topics that may need partitioning or throttling.
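A quick way to surface those hot topics on a dashboard is a topk query over the per-topic counter. Sketch, assuming the counter carries a topic label as described above:

```promql
# Top 10 topics by byte rate over the last 5 minutes; candidates
# for additional partitions or throttling.
topk(10, sum by (topic) (rate(gateway_bytes_exchanged_topic_total[5m])))
```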

Latency and performance

Round-trip latency

gateway_latency_request_response measures the round-trip time from Gateway to Kafka and back. This includes the time for Gateway to send a request to the Kafka broker, receive the response, and process it. Monitor percentiles (p50, p95, p99) rather than averages. A rising p99 often signals the need to scale before the average shows any degradation.
alert: GatewayLatencyHigh
expr: histogram_quantile(0.99, rate(gateway_latency_request_response_bucket[5m])) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway p99 latency is {{ $value }}s"
For per-operation granularity, use gateway_apiKeys_latency_request_response to break down latency by API key (Produce, Fetch, Metadata, etc.) and isolate which operations are slow.
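A per-operation percentile query might look like the following. This sketch assumes the metric is exposed as a Prometheus histogram (i.e. has a _bucket series with an le label) and is labeled by apiKey:

```promql
# p99 round-trip latency broken down by Kafka API key
# (Produce, Fetch, Metadata, ...).
histogram_quantile(
  0.99,
  sum by (apiKey, le) (rate(gateway_apiKeys_latency_request_response_bucket[5m]))
)
```

If only Fetch latency is elevated, look at consumer patterns and broker disk I/O; if Produce is elevated too, suspect the network path or the brokers themselves.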

Throttling

gateway_apiKeys_throttle_ms reports the throttleTimeMs value from Kafka broker responses, broken down by API key. This is the throttle time imposed by Kafka itself (for example, due to quota violations), not by Gateway. Non-zero values mean Kafka is asking clients to back off.
alert: GatewayKafkaThrottling
expr: rate(gateway_apiKeys_throttle_ms_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Kafka is throttling {{ $labels.apiKey }} requests on vcluster {{ $labels.vcluster }}"

Inflight requests

gateway_current_inflight_apiKeys tracks the number of requests currently in-flight for each Virtual Cluster, user, and API key combination. It increments when Gateway forwards a request to Kafka and decrements when the response is sent back to the client. The request pipeline between clients and Kafka is decoupled through an internal buffer. If a request stays in the buffer too long, Gateway expires it and sends a timeout error to the client (tracked by gateway_request_expired below). The buffer capacity is controlled by the gateway_max_pending_requests configuration parameter.

Watch this metric alongside latency: a rising inflight count with rising latency suggests Gateway is becoming a bottleneck and may need scaling.
alert: GatewayInflightRequestsHigh
expr: gateway_current_inflight_apiKeys > 500
for: 5m
labels:
  severity: warning
annotations:
  summary: "Sustained high inflight requests ({{ $value }}) for {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"

Processing backlog

gateway_thread_tasks tracks pending tasks on the Gateway thread where request/response rebuilding happens. A sustained high value indicates a processing bottleneck.
alert: GatewayProcessingBacklog
expr: gateway_thread_tasks > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Processing backlog of {{ $value }} pending tasks on thread {{ $labels.thread }}"
To check load distribution across threads, use gateway_thread_request_received to verify requests are spread evenly.
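A distribution check could be sketched as follows, assuming gateway_thread_request_received is a counter carrying the same thread label shown in the alert above:

```promql
# Requests received per thread over the last 5 minutes. A heavily
# skewed distribution suggests load is not spread evenly across
# Gateway threads.
sum by (thread) (rate(gateway_thread_request_received[5m]))
```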

Errors and timeouts

Error rate

gateway_error_per_apiKeys counts processing exceptions per API key for a given Virtual Cluster and user. Alert when the error rate exceeds a percentage of total traffic.
alert: GatewayHighErrorRate
expr: rate(gateway_error_per_apiKeys_total[5m]) / rate(gateway_current_inflight_apiKeys_total[5m]) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate above 5% for API key {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"

Request timeouts

gateway_request_expired counts client requests that timed out waiting for a response from Kafka. Non-zero values usually indicate connectivity problems between Gateway and the Kafka cluster.
alert: GatewayRequestTimeouts
expr: rate(gateway_request_expired_total[5m]) > 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "Client requests are timing out waiting for Kafka responses"

Consumer lag

These metrics track offsets as seen through Gateway’s Virtual Cluster abstraction — they reflect logical topics and Gateway-managed consumer groups, not the underlying Kafka offsets. Use them together to calculate consumer lag per topic and consumer group:
  • gateway_topic_log_end_offset — the latest offset in each partition of a logical topic, representing the most recent message written. Labeled by vcluster, topic, and partition.
  • gateway_topic_current_offset — the last committed offset for a consumer group on a logical topic, representing how far it has read. Labeled by vcluster, topic, partition, and group.
The difference between them is the consumer lag: how many messages a group has yet to process. Because the two metrics have different label sets, use explicit label matching in PromQL:
alert: GatewayConsumerLagHigh
expr: >
  gateway_topic_log_end_offset
  - on(vcluster, topic, partition) group_right(group)
  gateway_topic_current_offset > 10000
for: 5m
labels:
  severity: warning
annotations:
  summary: "Consumer lag of {{ $value }} on topic {{ $labels.topic }} for group {{ $labels.group }}"
Tune the threshold based on your expected throughput and processing speed.
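For a dashboard panel, the same label matching can be reused to show total lag per group and topic, summed across partitions:

```promql
# Total consumer lag per Virtual Cluster, topic, and group.
# group_right carries the group label from the committed-offset
# series onto the result before summing over partitions.
sum by (vcluster, topic, group) (
  gateway_topic_log_end_offset
  - on(vcluster, topic, partition) group_right(group)
  gateway_topic_current_offset
)
```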

Cache health

gateway_kcache_size reflects the number of key-value pairs in Gateway’s internal cache, broken down by the type label:
| Type | What it tracks |
|---|---|
| topic | Topic mappings |
| topicConcentrationRule | Topic concentration rules |
| offsetTracking | Offset tracking entries |
| aclRules | ACL rules |
| encryptionCache | Encryption cache entries |
| testTokenization | Test tokenization entries |
Monitor for unexpected growth in any of these types, which could indicate memory pressure or a configuration issue.
alert: GatewayCacheGrowth
expr: delta(gateway_kcache_size[1h]) > 1000
for: 10m
labels:
  severity: warning
annotations:
  summary: "Cache type {{ $labels.type }} grew by {{ $value }} entries in the last hour"

Quick reference

| Metric | Category | What to watch for | Suggested alert |
|---|---|---|---|
| up{job="conduktor-gateway"} | Availability | Instance unreachable | == 0 for 1m |
| gateway_license_remaining_days | Availability | License approaching expiry | < 14 warning, < 3 critical |
| gateway_upstream_io_nodes | Availability | Kafka node loss | Below expected cluster size |
| gateway_active_connections_vcluster | Connections | Sudden drops or spikes | delta < -50 in 5m, or > 1000 |
| gateway_upstream_connections_upstream_connected | Connections | Connection instability | delta < -5 in 5m |
| gateway_upstream_connection_close_rate | Connections | Connection churn | > 5 closes/s |
| gateway_failed_authentications | Connections | Auth loops or brute force | rate > 10/s |
| gateway_bytes_exchanged | Throughput | Traffic anomalies | > 50% drop vs. baseline |
| gateway_bytes_exchanged_vcluster | Throughput | Per-tenant traffic spikes | > 2x above baseline |
| gateway_latency_request_response | Latency | Rising response times | p99 > 500ms |
| gateway_apiKeys_throttle_ms | Latency | Kafka-imposed throttling | rate > 0 |
| gateway_current_inflight_apiKeys | Latency | Request backlog | > 500 sustained |
| gateway_thread_tasks | Latency | Processing bottleneck | > 100 sustained |
| gateway_error_per_apiKeys | Errors | Processing failures | > 5% of total traffic |
| gateway_request_expired | Errors | Kafka connectivity | rate > 0 |
| gateway_topic_log_end_offset - gateway_topic_current_offset | Consumer lag | Growing lag | > 10000 |
| gateway_kcache_size | Cache | Unexpected growth | delta > 1000 in 1h |