This page provides best practices and recommendations for monitoring and alerting on Conduktor Gateway deployments. Use these as starting points and tune thresholds to match your environment’s baseline. Before you start, make sure you have set up monitoring and can access the Gateway Prometheus endpoint. Example Grafana dashboards are available on GitHub, in the Conduktor Helm package at charts/gateway/grafana-dashboards.

Request flow overview

The diagram below shows how traffic flows through Gateway, and where key metrics are captured. Use it as a reference when choosing what to monitor. The direction label on byte metrics refers to the direction of data flow: upstream means bytes flowing from clients toward Kafka, and downstream means bytes flowing from Kafka toward clients. Where metrics are captured:
| Boundary | Key metrics |
|---|---|
| Client to Gateway | gateway_active_connections_vcluster, gateway_bytes_exchanged{direction="upstream"} |
| Gateway to Kafka | gateway_upstream_connections_upstream_connected, gateway_bytes_exchanged_vcluster{direction="upstream"} |
| Kafka to Gateway | gateway_bytes_exchanged_vcluster{direction="downstream"} |
| Gateway to Client | gateway_bytes_exchanged{direction="downstream"} |
| Round-trip | gateway_latency_request_response |
| Inside Gateway | gateway_current_inflight_apiKeys, gateway_thread_tasks |
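As a sanity check on the boundaries above, you can graph byte rates grouped by the direction label. This is a sketch; adjust the range window to your scrape interval:

```promql
# Client-side traffic through Gateway, split by direction.
# "upstream" = bytes from clients toward Kafka, "downstream" = bytes from Kafka toward clients.
sum by (direction) (rate(gateway_bytes_exchanged[5m]))

# The same view per Virtual Cluster, on the Gateway-to-Kafka boundary.
sum by (vcluster, direction) (rate(gateway_bytes_exchanged_vcluster[5m]))
```

If the upstream and downstream rates diverge sharply from their usual ratio, check whether producers or consumers changed behavior before suspecting Gateway itself.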

Availability and license

Gateway down

Alert when the Gateway instance is unreachable.
alert: GatewayDown
expr: up{job="conduktor-gateway"} == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Gateway instance {{ $labels.instance }} is down"

License expiring

Track gateway_license_remaining_days and alert at two thresholds to give time for renewal.
alert: GatewayLicenseExpiringSoon
expr: gateway_license_remaining_days < 14
for: 1h
labels:
  severity: warning
annotations:
  summary: "Gateway license expires in {{ $value }} days"
alert: GatewayLicenseCritical
expr: gateway_license_remaining_days < 3
for: 5m
labels:
  severity: critical
annotations:
  summary: "Gateway license expires in {{ $value }} days - renew immediately"

Kafka node loss

Track gateway_upstream_io_nodes to detect when Gateway loses visibility of Kafka brokers.
alert: GatewayKafkaNodeLoss
expr: gateway_upstream_io_nodes < 3
for: 2m
labels:
  severity: critical
annotations:
  summary: "Gateway sees only {{ $value }} Kafka nodes (expected 3+)"
Adjust the threshold to match your Kafka cluster size.

Connections

Kafka connections are long-lived. In a stable environment, the active connection count should be relatively constant. Establish a baseline for your deployment and alert on unusual deviations.
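One way to alert on "unusual deviations" without hardcoding a count is a simple z-score against a rolling baseline. A minimal sketch, assuming a one-day window is representative of your traffic pattern; tune the window and multiplier to your environment:

```yaml
# Fires when the current connection count is more than 3 standard
# deviations away from its 1-day rolling average.
alert: GatewayConnectionsAnomalous
expr: >
  abs(
    gateway_active_connections_vcluster
    - avg_over_time(gateway_active_connections_vcluster[1d])
  ) > 3 * stddev_over_time(gateway_active_connections_vcluster[1d])
for: 10m
labels:
  severity: warning
annotations:
  summary: "Connection count on vcluster {{ $labels.vcluster }} deviates from its 1-day baseline"
```

This complements, rather than replaces, the absolute-threshold alerts below.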

Client connections

Monitor gateway_active_connections_vcluster for sudden drops (client disconnects) or spikes (connection storms).
alert: GatewayConnectionsDrop
expr: delta(gateway_active_connections_vcluster[5m]) < -50
for: 2m
labels:
  severity: warning
annotations:
  summary: "Sudden drop in client connections on vcluster {{ $labels.vcluster }}"
alert: GatewayConnectionsHigh
expr: gateway_active_connections_vcluster > 1000
for: 5m
labels:
  severity: warning
annotations:
  summary: "High client connection count ({{ $value }}) on vcluster {{ $labels.vcluster }}"

Upstream connections

gateway_upstream_connections_upstream_connected tracks the number of connections from Gateway to the backing Kafka cluster. This should be stable in an established environment. A sudden change could indicate Kafka broker issues or Gateway restarts.
alert: GatewayUpstreamConnectionsDrop
expr: delta(gateway_upstream_connections_upstream_connected[5m]) < -5
for: 2m
labels:
  severity: warning
annotations:
  summary: "Upstream connections dropped by {{ $value }} in 5 minutes"
gateway_upstream_connection_close_rate tracks how frequently upstream connections are closed. A high close rate relative to creation rate signals connection churn, which degrades performance.
alert: GatewayUpstreamConnectionChurn
expr: gateway_upstream_connection_close_rate > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High upstream connection churn ({{ $value }} closes/s)"
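To judge churn relative to pool size rather than as an absolute rate, you can divide the close rate by the current upstream connection count. A sketch using only the metrics described above (no separate creation-rate metric is assumed):

```promql
# Closes per second relative to the current upstream pool size.
# A ratio near 1 means each connection is recycled roughly once per second,
# i.e. severe churn; a healthy pool should sit close to 0.
gateway_upstream_connection_close_rate
  / gateway_upstream_connections_upstream_connected
```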

Authentication failures

Track gateway_failed_authentications to detect clients stuck in authentication loops or brute-force attempts.
alert: GatewayAuthFailuresHigh
expr: rate(gateway_failed_authentications_total[5m]) > 10
for: 2m
labels:
  severity: warning
annotations:
  summary: "High authentication failure rate ({{ $value }}/s) for user {{ $labels.user }}"

Throughput

Overall data flow

gateway_bytes_exchanged tracks the total bytes exchanged between clients and Gateway, tagged by direction (upstream from clients toward Kafka, downstream from Kafka toward clients). Use this to monitor overall traffic volume and detect anomalies.
alert: GatewayTrafficDrop
expr: rate(gateway_bytes_exchanged[10m]) < 0.5 * rate(gateway_bytes_exchanged[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway traffic dropped by more than 50% compared to 1 hour ago"

Per-Virtual Cluster throughput

gateway_bytes_exchanged_vcluster breaks down bytes exchanged per Virtual Cluster. This is useful for:
  • Identifying which tenants generate the most traffic
  • Capacity planning per Virtual Cluster
  • Detecting unexpected traffic spikes from specific tenants
alert: GatewayVClusterTrafficSpike
expr: rate(gateway_bytes_exchanged_vcluster[5m]) > 2 * rate(gateway_bytes_exchanged_vcluster[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Traffic spike on vcluster {{ $labels.vcluster }} — {{ $value | humanize }}B/s (2x above baseline)"
For per-topic granularity, use gateway_bytes_exchanged_topic_total to identify hot topics that may need partitioning or throttling.
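A quick way to surface those hot topics on a dashboard is a topk query over the per-topic counter. Sketch, assuming the counter carries a topic label as described above:

```promql
# Top 10 topics by byte rate over the last 5 minutes; candidates
# for additional partitions or throttling.
topk(10, sum by (topic) (rate(gateway_bytes_exchanged_topic_total[5m])))
```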

Latency and performance

Round-trip latency

gateway_latency_request_response measures the round-trip time from Gateway to Kafka and back. This includes the time for Gateway to send a request to the Kafka broker, receive the response, and process it. Monitor percentiles (p50, p95, p99) rather than averages. A rising p99 often signals the need to scale before the average shows any degradation.
alert: GatewayLatencyHigh
expr: histogram_quantile(0.99, rate(gateway_latency_request_response_bucket[5m])) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway p99 latency is {{ $value }}s"
For per-operation granularity, use gateway_apiKeys_latency_request_response to break down latency by API key (Produce, Fetch, Metadata, etc.) and isolate which operations are slow.
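A per-operation percentile query might look like the following. This sketch assumes the metric is exposed as a Prometheus histogram (i.e. has a _bucket series with an le label) and is labeled by apiKey:

```promql
# p99 round-trip latency broken down by Kafka API key
# (Produce, Fetch, Metadata, ...).
histogram_quantile(
  0.99,
  sum by (apiKey, le) (rate(gateway_apiKeys_latency_request_response_bucket[5m]))
)
```

If only Fetch latency is elevated, look at consumer patterns and broker disk I/O; if Produce is elevated too, suspect the network path or the brokers themselves.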

Throttling

gateway_apiKeys_throttle_ms reports the throttleTimeMs value from Kafka broker responses, broken down by API key. This is the throttle time imposed by Kafka itself (for example, due to quota violations), not by Gateway. Non-zero values mean Kafka is asking clients to back off.
alert: GatewayKafkaThrottling
expr: rate(gateway_apiKeys_throttle_ms_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Kafka is throttling {{ $labels.apiKey }} requests on vcluster {{ $labels.vcluster }}"

Inflight requests

gateway_current_inflight_apiKeys tracks the number of requests currently in-flight for each Virtual Cluster, user, and API key combination. It increments when Gateway forwards a request to Kafka and decrements when the response is sent back to the client. The request pipeline between clients and Kafka is decoupled through an internal buffer. If a request stays in the buffer too long, Gateway expires it and sends a timeout error to the client (tracked by gateway_request_expired below). The buffer capacity is controlled by the gateway_max_pending_requests configuration parameter.

Watch this metric alongside latency: a rising inflight count with rising latency suggests Gateway is becoming a bottleneck and may need scaling.
alert: GatewayInflightRequestsHigh
expr: gateway_current_inflight_apiKeys > 500
for: 5m
labels:
  severity: warning
annotations:
  summary: "Sustained high inflight requests ({{ $value }}) for {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"

Processing backlog

gateway_thread_tasks tracks pending tasks on the Gateway thread where request/response rebuilding happens. A sustained high value indicates a processing bottleneck.
alert: GatewayProcessingBacklog
expr: gateway_thread_tasks > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Processing backlog of {{ $value }} pending tasks on thread {{ $labels.thread }}"
To check load distribution across threads, use gateway_thread_request_received to verify requests are spread evenly.
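A distribution check could be sketched as follows, assuming gateway_thread_request_received is a counter carrying the same thread label shown in the alert above:

```promql
# Requests received per thread over the last 5 minutes. A heavily
# skewed distribution suggests load is not spread evenly across
# Gateway threads.
sum by (thread) (rate(gateway_thread_request_received[5m]))
```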

Errors and timeouts

Error rate

gateway_error_per_apiKeys counts processing exceptions per API key for a given Virtual Cluster and user. Alert when the error rate exceeds a percentage of total traffic.
alert: GatewayHighErrorRate
expr: rate(gateway_error_per_apiKeys_total[5m]) / rate(gateway_current_inflight_apiKeys_total[5m]) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate above 5% for API key {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"

Request timeouts

gateway_request_expired counts client requests that timed out waiting for a response from Kafka. Non-zero values usually indicate connectivity problems between Gateway and the Kafka cluster.
alert: GatewayRequestTimeouts
expr: rate(gateway_request_expired_total[5m]) > 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "Client requests are timing out waiting for Kafka responses"

Consumer lag

These metrics track offsets as seen through Gateway’s Virtual Cluster abstraction — they reflect logical topics and Gateway-managed consumer groups, not the underlying Kafka offsets. Use them together to calculate consumer lag per topic and consumer group:
  • gateway_topic_log_end_offset — the latest offset in each partition of a logical topic, representing the most recent message written. Labeled by vcluster, topic, and partition.
  • gateway_topic_current_offset — the last committed offset for a consumer group on a logical topic, representing how far it has read. Labeled by vcluster, topic, partition, and group.
The difference between them is the consumer lag: how many messages a group has yet to process. Because the two metrics have different label sets, use explicit label matching in PromQL:
alert: GatewayConsumerLagHigh
expr: >
  gateway_topic_log_end_offset
  - on(vcluster, topic, partition) group_right(group)
  gateway_topic_current_offset > 10000
for: 5m
labels:
  severity: warning
annotations:
  summary: "Consumer lag of {{ $value }} on topic {{ $labels.topic }} for group {{ $labels.group }}"
Tune the threshold based on your expected throughput and processing speed.
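For a dashboard panel, the same label matching can be reused to show total lag per group and topic, summed across partitions:

```promql
# Total consumer lag per Virtual Cluster, topic, and group.
# group_right carries the group label from the committed-offset
# series onto the result before summing over partitions.
sum by (vcluster, topic, group) (
  gateway_topic_log_end_offset
  - on(vcluster, topic, partition) group_right(group)
  gateway_topic_current_offset
)
```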

Cache health

gateway_kcache_size reflects the number of key-value pairs in Gateway’s internal cache, broken down by the type label:
| Type | What it tracks |
|---|---|
| topic | Topic mappings |
| topicConcentrationRule | Topic concentration rules |
| offsetTracking | Offset tracking entries |
| aclRules | ACL rules |
| encryptionCache | Encryption cache entries |
| testTokenization | Test tokenization entries |
Monitor for unexpected growth in any of these types, which could indicate memory pressure or a configuration issue.
alert: GatewayCacheGrowth
expr: delta(gateway_kcache_size[1h]) > 1000
for: 10m
labels:
  severity: warning
annotations:
  summary: "Cache type {{ $labels.type }} grew by {{ $value }} entries in the last hour"

Quick reference

| Metric | Category | What to watch for | Suggested alert |
|---|---|---|---|
| up{job="conduktor-gateway"} | Availability | Instance unreachable | == 0 for 1m |
| gateway_license_remaining_days | Availability | License approaching expiry | < 14 warning, < 3 critical |
| gateway_upstream_io_nodes | Availability | Kafka node loss | Below expected cluster size |
| gateway_active_connections_vcluster | Connections | Sudden drops or spikes | delta < -50 in 5m, or > 1000 |
| gateway_upstream_connections_upstream_connected | Connections | Connection instability | delta < -5 in 5m |
| gateway_upstream_connection_close_rate | Connections | Connection churn | > 5 closes/s |
| gateway_failed_authentications | Connections | Auth loops or brute force | rate > 10/s |
| gateway_bytes_exchanged | Throughput | Traffic anomalies | > 50% drop vs. baseline |
| gateway_bytes_exchanged_vcluster | Throughput | Per-tenant traffic spikes | > 2x above baseline |
| gateway_latency_request_response | Latency | Rising response times | p99 > 500ms |
| gateway_apiKeys_throttle_ms | Latency | Kafka-imposed throttling | rate > 0 |
| gateway_current_inflight_apiKeys | Latency | Request backlog | > 500 sustained |
| gateway_thread_tasks | Latency | Processing bottleneck | > 100 sustained |
| gateway_error_per_apiKeys | Errors | Processing failures | > 5% of total traffic |
| gateway_request_expired | Errors | Kafka connectivity | rate > 0 |
| gateway_topic_log_end_offset - gateway_topic_current_offset | Consumer lag | Growing lag | > 10000 |
| gateway_kcache_size | Cache | Unexpected growth | delta > 1000 in 1h |