This page provides best practices and recommendations for monitoring and alerting on Conduktor Gateway deployments. Use these as starting points and tune thresholds to match your environment’s baseline.
Before you start, make sure you have set up monitoring and can access the Gateway Prometheus endpoint.
Example Grafana dashboards are available on GitHub, in the Conduktor Helm package at charts/gateway/grafana-dashboards.
Metrics endpoint
Gateway exposes a Prometheus-compatible /metrics endpoint on its HTTP API port. By default, this endpoint is unauthenticated, meaning anyone with network access to the port can scrape metrics.
For production deployments, we recommend configuring credentials in GATEWAY_ADMIN_API_USERS and enabling authentication on the metrics endpoint by setting the GATEWAY_SECURED_METRICS environment variable to true. When enabled, requests to the /metrics endpoint require the same credentials as other Gateway HTTP API calls.
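As a sketch, a Prometheus scrape job for a secured metrics endpoint might look like the following. The job name, host, port, and credentials are placeholders, not values shipped with Gateway; match them to your own Gateway HTTP API settings and the entry you configured in GATEWAY_ADMIN_API_USERS.

```yaml
scrape_configs:
  - job_name: conduktor-gateway
    metrics_path: /metrics
    basic_auth:
      username: admin        # must match an entry in GATEWAY_ADMIN_API_USERS
      password: changeme     # placeholder credential
    static_configs:
      - targets:
          - gateway.example.com:8888   # Gateway HTTP API host:port (placeholder)
```

Keeping the job name aligned with the up{job="conduktor-gateway"} selector used in the availability alerts below means those rules work unchanged.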
Request flow overview
The diagram below shows how traffic flows through Gateway, and where key metrics are captured. Use it as a reference when choosing what to monitor.
The direction label on byte metrics refers to the direction of data flow: upstream means bytes flowing from clients toward Kafka, and downstream means bytes flowing from Kafka toward clients.
Where metrics are captured:
| Boundary | Key metrics |
|---|---|
| Client to Gateway | gateway_active_connections_vcluster, gateway_bytes_exchanged{direction="upstream"} |
| Gateway to Kafka | gateway_upstream_connections_upstream_connected, gateway_bytes_exchanged_vcluster{direction="upstream"} |
| Kafka to Gateway | gateway_bytes_exchanged_vcluster{direction="downstream"} |
| Gateway to Client | gateway_bytes_exchanged{direction="downstream"} |
| Round-trip | gateway_latency_request_response |
| Inside Gateway | gateway_current_inflight_apiKeys, gateway_thread_tasks |
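To make these boundaries easy to graph, the byte metrics can be pre-aggregated with recording rules so dashboards query one series per direction. This is a sketch; the rule names are illustrative, not standard.

```yaml
groups:
  - name: gateway-request-flow
    rules:
      # Client <-> Gateway byte rate, split by direction
      - record: gateway:client_bytes:rate5m
        expr: sum by (direction) (rate(gateway_bytes_exchanged[5m]))
      # Gateway <-> Kafka byte rate, per Virtual Cluster and direction
      - record: gateway:vcluster_bytes:rate5m
        expr: sum by (vcluster, direction) (rate(gateway_bytes_exchanged_vcluster[5m]))
```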
Availability and license
Gateway down
Alert when the Gateway instance is unreachable.
```yaml
alert: GatewayDown
expr: up{job="conduktor-gateway"} == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Gateway instance {{ $labels.instance }} is down"
```
License expiring
Track gateway_license_remaining_days and alert at two thresholds to give time for renewal.
```yaml
alert: GatewayLicenseExpiringSoon
expr: gateway_license_remaining_days < 14
for: 1h
labels:
  severity: warning
annotations:
  summary: "Gateway license expires in {{ $value }} days"
```
```yaml
alert: GatewayLicenseCritical
expr: gateway_license_remaining_days < 3
for: 5m
labels:
  severity: critical
annotations:
  summary: "Gateway license expires in {{ $value }} days - renew immediately"
```
Kafka node loss
Track gateway_upstream_io_nodes to detect when Gateway loses visibility of Kafka brokers.
```yaml
alert: GatewayKafkaNodeLoss
expr: gateway_upstream_io_nodes < 3
for: 2m
labels:
  severity: critical
annotations:
  summary: "Gateway sees only {{ $value }} Kafka nodes (expected 3+)"
```
Adjust the threshold to match your Kafka cluster size.
Connections
Kafka connections are long-lived. In a stable environment, the active connection count should be relatively constant. Establish a baseline for your deployment and alert on unusual deviations.
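One way to encode "unusual deviation from a baseline" is a z-score against a rolling average. This is a sketch; the rule names are illustrative and the window sizes are starting points to tune.

```yaml
groups:
  - name: gateway-connection-baseline
    rules:
      # Rolling 1h baseline of active client connections per Virtual Cluster
      - record: gateway:active_connections:avg1h
        expr: avg_over_time(gateway_active_connections_vcluster[1h])
      - record: gateway:active_connections:stddev1h
        expr: stddev_over_time(gateway_active_connections_vcluster[1h])
      # Z-score: how many standard deviations the current value sits from baseline
      - record: gateway:active_connections:zscore
        expr: >
          (gateway_active_connections_vcluster - gateway:active_connections:avg1h)
          / gateway:active_connections:stddev1h
```

An absolute z-score above 3 sustained for a few minutes is a reasonable first alerting threshold.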
Client connections
Monitor gateway_active_connections_vcluster for sudden drops (client disconnects) or spikes (connection storms).
```yaml
alert: GatewayConnectionsDrop
expr: delta(gateway_active_connections_vcluster[5m]) < -50
for: 2m
labels:
  severity: warning
annotations:
  summary: "Sudden drop in client connections on vcluster {{ $labels.vcluster }}"
```
```yaml
alert: GatewayConnectionsHigh
expr: gateway_active_connections_vcluster > 1000
for: 5m
labels:
  severity: warning
annotations:
  summary: "High client connection count ({{ $value }}) on vcluster {{ $labels.vcluster }}"
```
Upstream connections
gateway_upstream_connections_upstream_connected tracks the number of connections from Gateway to the backing Kafka cluster. This should be stable in an established environment. A sudden change could indicate Kafka broker issues or Gateway restarts.
```yaml
alert: GatewayUpstreamConnectionsDrop
expr: delta(gateway_upstream_connections_upstream_connected[5m]) < -5
for: 2m
labels:
  severity: warning
annotations:
  summary: "Upstream connections dropped by {{ $value }} in 5 minutes"
```
gateway_upstream_connection_close_rate tracks how frequently upstream connections are closed. A high close rate relative to creation rate signals connection churn, which degrades performance.
```yaml
alert: GatewayUpstreamConnectionChurn
expr: gateway_upstream_connection_close_rate > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High upstream connection churn ({{ $value }} closes/s)"
```
Authentication failures
Track gateway_failed_authentications to detect clients stuck in authentication loops or brute-force attempts.
```yaml
alert: GatewayAuthFailuresHigh
expr: rate(gateway_failed_authentications_total[5m]) > 10
for: 2m
labels:
  severity: warning
annotations:
  summary: "High authentication failure rate ({{ $value }}/s) for user {{ $labels.user }}"
```
Throughput
Overall data flow
gateway_bytes_exchanged tracks the total bytes exchanged between clients and Gateway, tagged by direction (upstream from clients toward Kafka, downstream from Kafka toward clients). Use this to monitor overall traffic volume and detect anomalies.
```yaml
alert: GatewayTrafficDrop
expr: rate(gateway_bytes_exchanged[10m]) < 0.5 * rate(gateway_bytes_exchanged[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway traffic dropped by more than 50% compared to 1 hour ago"
```
Per-Virtual Cluster throughput
gateway_bytes_exchanged_vcluster breaks down bytes exchanged per Virtual Cluster. This is useful for:
- Identifying which tenants generate the most traffic
- Capacity planning per Virtual Cluster
- Detecting unexpected traffic spikes from specific tenants
```yaml
alert: GatewayVClusterTrafficSpike
expr: rate(gateway_bytes_exchanged_vcluster[5m]) > 2 * rate(gateway_bytes_exchanged_vcluster[1h] offset 1h)
for: 5m
labels:
  severity: warning
annotations:
  summary: "Traffic spike on vcluster {{ $labels.vcluster }}: {{ $value | humanize }}B/s (2x above baseline)"
```
For per-topic granularity, use gateway_bytes_exchanged_topic_total to identify hot topics that may need partitioning or throttling.
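For example, a dashboard panel or ad-hoc query can surface the busiest topics. This sketch assumes gateway_bytes_exchanged_topic_total carries a topic label:

```promql
# Top 10 topics by byte rate over the last 5 minutes
topk(10, sum by (topic) (rate(gateway_bytes_exchanged_topic_total[5m])))
```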
Round-trip latency
gateway_latency_request_response measures the round-trip time from Gateway to Kafka and back. This includes the time for Gateway to send a request to the Kafka broker, receive the response, and process it.
Monitor percentiles (p50, p95, p99) rather than averages. A rising p99 often signals the need to scale before the average shows any degradation.
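Assuming gateway_latency_request_response is exposed as a Prometheus histogram (that is, with _bucket series), the percentiles can be precomputed with recording rules. The rule names here are illustrative:

```yaml
groups:
  - name: gateway-latency
    rules:
      # p50 / p95 / p99 round-trip latency, computed from histogram buckets
      - record: gateway:latency:p50
        expr: histogram_quantile(0.50, sum by (le) (rate(gateway_latency_request_response_bucket[5m])))
      - record: gateway:latency:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(gateway_latency_request_response_bucket[5m])))
      - record: gateway:latency:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(gateway_latency_request_response_bucket[5m])))
```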
```yaml
alert: GatewayLatencyHigh
expr: histogram_quantile(0.99, rate(gateway_latency_request_response_bucket[5m])) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Gateway p99 latency is {{ $value }}s"
```
For per-operation granularity, use gateway_apiKeys_latency_request_response to break down latency by API key (Produce, Fetch, Metadata, etc.) and isolate which operations are slow.
Throttling
gateway_apiKeys_throttle_ms reports the throttleTimeMs value from Kafka broker responses, broken down by API key. This is the throttle time imposed by Kafka itself (for example, due to quota violations), not by Gateway. Non-zero values mean Kafka is asking clients to back off.
```yaml
alert: GatewayKafkaThrottling
expr: rate(gateway_apiKeys_throttle_ms_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Kafka is throttling {{ $labels.apiKey }} requests on vcluster {{ $labels.vcluster }}"
```
Inflight requests
gateway_current_inflight_apiKeys tracks the number of requests currently in-flight for each Virtual Cluster, user, and API key combination. It increments when Gateway forwards a request to Kafka and decrements when the response is sent back to the client.
The request pipeline between clients and Kafka is decoupled through an internal buffer. If a request stays in the buffer too long, Gateway expires it and sends a timeout error to the client (tracked by gateway_request_expired below). The buffer capacity is controlled by the GATEWAY_NETWORK_MAX_PENDING_REQUESTS configuration parameter.
Watch this metric alongside latency: a rising inflight count combined with rising latency suggests Gateway is becoming a bottleneck and may need scaling.
```yaml
alert: GatewayInflightRequestsHigh
expr: gateway_current_inflight_apiKeys > 500
for: 5m
labels:
  severity: warning
annotations:
  summary: "Sustained high inflight requests ({{ $value }}) for {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"
```
Processing backlog
gateway_thread_tasks tracks pending tasks on the Gateway thread where request/response rebuilding happens. A sustained high value indicates a processing bottleneck.
```yaml
alert: GatewayProcessingBacklog
expr: gateway_thread_tasks > 100
for: 5m
labels:
  severity: warning
annotations:
  summary: "Processing backlog of {{ $value }} pending tasks on thread {{ $labels.thread }}"
```
To check load distribution across threads, use gateway_thread_request_received to verify requests are spread evenly.
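A quick way to check that spread, assuming gateway_thread_request_received is a counter with a thread label, is to compare each thread's share of the total request rate:

```promql
# Fraction of requests handled by each thread over the last 5 minutes;
# in an evenly loaded Gateway these values should be roughly equal
sum by (thread) (rate(gateway_thread_request_received[5m]))
  / ignoring(thread) group_left
    sum(rate(gateway_thread_request_received[5m]))
```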
Errors and timeouts
Error rate
gateway_error_per_apiKeys counts processing exceptions per API key for a given Virtual Cluster and user. Alert when the error rate exceeds a percentage of total traffic.
```yaml
alert: GatewayHighErrorRate
expr: rate(gateway_error_per_apiKeys_total[5m]) / rate(gateway_current_inflight_apiKeys_total[5m]) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate above 5% for API key {{ $labels.apiKey }} on vcluster {{ $labels.vcluster }}"
```
Request timeouts
gateway_request_expired counts client requests that timed out waiting for a response from Kafka. Non-zero values usually indicate connectivity problems between Gateway and the Kafka cluster.
```yaml
alert: GatewayRequestTimeouts
expr: rate(gateway_request_expired_total[5m]) > 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "Client requests are timing out waiting for Kafka responses"
```
Consumer lag
These metrics are populated for topics using topic concentration, where multiple logical topics share a single physical Kafka topic. They reflect Gateway-managed consumer groups, not the underlying Kafka offsets. Standard Kafka consumer lag tools cannot disaggregate lag per logical topic in this scenario, so Gateway provides these metrics as the only way to monitor consumer lag per concentrated topic.
gateway_topic_log_end_offset — the latest offset in each partition of a logical topic, representing the most recent message written. Labeled by vcluster, topic, and partition.
gateway_topic_current_offset — the last committed offset for a consumer group on a logical topic, representing how far it has read. Labeled by vcluster, topic, partition, and group.
The difference between them is the consumer lag: how many messages a group has yet to process. Because the two metrics have different label sets, use explicit label matching in PromQL:
```yaml
alert: GatewayConsumerLagHigh
expr: >
  gateway_topic_log_end_offset
  - on(vcluster, topic, partition) group_right
  gateway_topic_current_offset > 10000
for: 5m
labels:
  severity: warning
annotations:
  summary: "Consumer lag of {{ $value }} on topic {{ $labels.topic }} for group {{ $labels.group }}"
```
Tune the threshold based on your expected throughput and processing speed.
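For dashboards, the same join can be captured as a recording rule (the rule name is illustrative) so lag is queryable as a single series:

```yaml
groups:
  - name: gateway-consumer-lag
    rules:
      # Per-group, per-partition lag on concentrated topics
      - record: gateway:consumer_lag
        expr: >
          gateway_topic_log_end_offset
          - on(vcluster, topic, partition) group_right
          gateway_topic_current_offset
```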
Cache health
gateway_kcache_size reflects the number of key-value pairs in Gateway’s internal cache, broken down by the type label:
| Type | What it tracks |
|---|---|
| topic | Topic mappings |
| topicConcentrationRule | Topic concentration rules |
| offsetTracking | Offset tracking entries |
| aclRules | ACL rules |
| encryptionCache | Encryption cache entries |
| testTokenization | Test tokenization entries |
Monitor for unexpected growth in any of these types, which could indicate memory pressure or a configuration issue.
```yaml
alert: GatewayCacheGrowth
expr: delta(gateway_kcache_size[1h]) > 1000
for: 10m
labels:
  severity: warning
annotations:
  summary: "Cache type {{ $labels.type }} grew by {{ $value }} entries in the last hour"
```
Quick reference
| Metric | Category | What to watch for | Suggested alert |
|---|---|---|---|
| up{job="conduktor-gateway"} | Availability | Instance unreachable | == 0 for 1m |
| gateway_license_remaining_days | Availability | License approaching expiry | < 14 warning, < 3 critical |
| gateway_upstream_io_nodes | Availability | Kafka node loss | Below expected cluster size |
| gateway_active_connections_vcluster | Connections | Sudden drops or spikes | delta < -50 in 5m, or > 1000 |
| gateway_upstream_connections_upstream_connected | Connections | Connection instability | delta < -5 in 5m |
| gateway_upstream_connection_close_rate | Connections | Connection churn | > 5 closes/s |
| gateway_failed_authentications | Connections | Auth loops or brute force | rate > 10/s |
| gateway_bytes_exchanged | Throughput | Traffic anomalies | > 50% drop vs. baseline |
| gateway_bytes_exchanged_vcluster | Throughput | Per-tenant traffic spikes | > 2x above baseline |
| gateway_latency_request_response | Latency | Rising response times | p99 > 500ms |
| gateway_apiKeys_throttle_ms | Latency | Kafka-imposed throttling | rate > 0 |
| gateway_current_inflight_apiKeys | Latency | Request backlog | > 500 sustained |
| gateway_thread_tasks | Latency | Processing bottleneck | > 100 sustained |
| gateway_error_per_apiKeys | Errors | Processing failures | > 5% of total traffic |
| gateway_request_expired | Errors | Kafka connectivity | rate > 0 |
| gateway_topic_log_end_offset - gateway_topic_current_offset | Consumer lag | Growing lag | > 10000 |
| gateway_kcache_size | Cache | Unexpected growth | delta > 1000 in 1h |