Request flow overview
The diagram below shows how traffic flows through Gateway and where key metrics are captured. Use it as a reference when choosing what to monitor. The direction label on byte metrics refers to the direction of data flow: upstream means bytes flowing from clients toward Kafka, and downstream means bytes flowing from Kafka toward clients.
Where metrics are captured:
| Boundary | Key metrics |
|---|---|
| Client to Gateway | gateway_active_connections_vcluster, gateway_bytes_exchanged{direction="upstream"} |
| Gateway to Kafka | gateway_upstream_connections_upstream_connected, gateway_bytes_exchanged_vcluster{direction="upstream"} |
| Kafka to Gateway | gateway_bytes_exchanged_vcluster{direction="downstream"} |
| Gateway to Client | gateway_bytes_exchanged{direction="downstream"} |
| Round-trip | gateway_latency_request_response |
| Inside Gateway | gateway_current_inflight_apiKeys, gateway_thread_tasks |
Availability and license
Gateway down
Alert when the Gateway instance is unreachable.
License expiring
Track gateway_license_remaining_days and alert at two thresholds to give time for renewal.
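As a minimal sketch of the two-threshold approach, the check below classifies the remaining license days. It assumes the 14-day warning and 3-day critical values suggested in the quick reference table. The function and constant names are illustrative and not part of Gateway.

```python
from typing import Optional

# Assumed thresholds, taken from the suggested alerts in the quick reference.
WARNING_DAYS = 14
CRITICAL_DAYS = 3


def license_severity(remaining_days: float) -> Optional[str]:
    """Map a gateway_license_remaining_days value to an alert severity."""
    if remaining_days < CRITICAL_DAYS:
        return "critical"
    if remaining_days < WARNING_DAYS:
        return "warning"
    return None  # enough time left, no alert
```

Checking the critical threshold first ensures that a license with two days left is reported once, at the higher severity.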
Kafka node loss
Track gateway_upstream_io_nodes to detect when Gateway loses visibility of Kafka brokers.
Connections
Kafka connections are long-lived. In a stable environment, the active connection count should be relatively constant. Establish a baseline for your deployment and alert on unusual deviations.
Client connections
Monitor gateway_active_connections_vcluster for sudden drops (client disconnects) or spikes (connection storms).
Upstream connections
gateway_upstream_connections_upstream_connected tracks the number of connections from Gateway to the backing Kafka cluster. This should be stable in an established environment. A sudden change could indicate Kafka broker issues or Gateway restarts.
gateway_upstream_connection_close_rate tracks how frequently upstream connections are closed. A high close rate relative to creation rate signals connection churn, which degrades performance.
Authentication failures
Track gateway_failed_authentications to detect clients stuck in authentication loops or brute-force attempts.
Throughput
Overall data flow
gateway_bytes_exchanged tracks the total bytes exchanged between clients and Gateway, tagged by direction (upstream from clients toward Kafka, downstream from Kafka toward clients). Use this to monitor overall traffic volume and detect anomalies.
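One way to turn the byte totals into an anomaly signal is to derive a rate from two samples and compare it with a baseline. The sketch below assumes gateway_bytes_exchanged behaves like a monotonically increasing counter that may reset on restart, and it uses the 50% drop suggested in the quick reference. The function names are hypothetical.

```python
def bytes_per_second(prev_total: float, curr_total: float, interval_s: float) -> float:
    """Rate between two samples of an (assumed) cumulative byte counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        # Counter went backwards: assume a restart and count from zero.
        delta = curr_total
    return delta / interval_s


def dropped_below_baseline(current_rate: float, baseline_rate: float,
                           max_drop: float = 0.5) -> bool:
    """True when the current rate is more than max_drop below the baseline."""
    if baseline_rate <= 0:
        return False  # no meaningful baseline yet
    return current_rate < baseline_rate * (1 - max_drop)
```

For example, a baseline of 1000 bytes/s and a current rate of 400 bytes/s would be flagged, while 600 bytes/s would not.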
Per-Virtual Cluster throughput
gateway_bytes_exchanged_vcluster breaks down bytes exchanged per Virtual Cluster. This is useful for:
- Identifying which tenants generate the most traffic
- Capacity planning per Virtual Cluster
- Detecting unexpected traffic spikes from specific tenants
Use gateway_bytes_exchanged_topic_total to identify hot topics that may need partitioning or throttling.
Latency and performance
Round-trip latency
gateway_latency_request_response measures the round-trip time from Gateway to Kafka and back. This includes the time for Gateway to send a request to the Kafka broker, receive the response, and process it.
Monitor percentiles (p50, p95, p99) rather than averages. A rising p99 often signals the need to scale before the average shows any degradation.
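To illustrate why percentiles reveal problems that averages hide, the sketch below summarizes a set of raw latency samples. The sample values are invented for illustration. In practice the percentiles would usually come from your monitoring system rather than from raw samples.

```python
import statistics


def latency_summary(samples_ms):
    """Return the mean and the p50/p95/p99 of a list of latency samples."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Illustrative data: 98 fast requests and 2 very slow ones.
samples = [10.0] * 98 + [900.0] * 2
summary = latency_summary(samples)
```

With this data the mean stays under 30 ms while the p99 is 900 ms, so an alert on the average would miss the slow tail entirely.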
Use gateway_apiKeys_latency_request_response to break down latency by API key (Produce, Fetch, Metadata, etc.) and isolate which operations are slow.
Throttling
gateway_apiKeys_throttle_ms reports the throttleTimeMs value from Kafka broker responses, broken down by API key. This is the throttle time imposed by Kafka itself (for example, due to quota violations), not by Gateway. Non-zero values mean Kafka is asking clients to back off.
Inflight requests
gateway_current_inflight_apiKeys tracks the number of requests currently in-flight for each Virtual Cluster, user, and API key combination. It increments when Gateway forwards a request to Kafka and decrements when the response is sent back to the client.
The request pipeline between clients and Kafka is decoupled through an internal buffer. If a request stays in the buffer too long, Gateway expires it and sends a timeout error to the client (tracked by gateway_request_expired below). The buffer capacity is controlled by the gateway_max_pending_requests configuration parameter.
Watch gateway_current_inflight_apiKeys alongside latency: a rising inflight count together with rising latency suggests Gateway is becoming a bottleneck and may need scaling.
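The combined signal can be sketched as a check that both series rose over the same window. This is a deliberately simple heuristic that compares only the first and last value of each series. The 20% growth threshold and the function names are assumptions, not Gateway recommendations.

```python
def is_rising(series, min_increase=0.2):
    """True if the last value exceeds the first by at least min_increase (fraction)."""
    if len(series) < 2 or series[0] <= 0:
        return False
    return (series[-1] - series[0]) / series[0] >= min_increase


def gateway_bottleneck_suspected(inflight, p99_latency_ms):
    """Flag a possible bottleneck only when inflight count and latency rise together."""
    return is_rising(inflight) and is_rising(p99_latency_ms)
```

Requiring both conditions avoids flagging a plain traffic increase, where the inflight count grows but latency stays flat.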
Processing backlog
gateway_thread_tasks tracks pending tasks on the Gateway thread where request/response rebuilding happens. A sustained high value indicates a processing bottleneck.
Use gateway_thread_request_received to verify that requests are spread evenly across threads.
Errors and timeouts
Error rate
gateway_error_per_apiKeys counts processing exceptions per API key for a given Virtual Cluster and user. Alert when the error rate exceeds a percentage of total traffic.
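Expressing errors as a share of traffic needs an error rate and a total request rate over the same window. The sketch below assumes both rates are already available as numbers and uses the 5% threshold suggested in the quick reference. The source of the request rate is left open, and the function names are hypothetical.

```python
def error_percentage(error_rate: float, request_rate: float) -> float:
    """Errors as a percentage of total requests over the same window."""
    if request_rate <= 0:
        return 0.0  # no traffic, so no meaningful ratio
    return 100.0 * error_rate / request_rate


def should_alert(error_rate: float, request_rate: float,
                 threshold_pct: float = 5.0) -> bool:
    """True when the error share exceeds the (assumed) 5% threshold."""
    return error_percentage(error_rate, request_rate) > threshold_pct
```

A relative threshold scales with traffic: 10 errors/s is alarming at 100 requests/s but may be acceptable at 10,000 requests/s.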
Request timeouts
gateway_request_expired counts client requests that timed out waiting for a response from Kafka. Non-zero values usually indicate connectivity problems between Gateway and the Kafka cluster.
Consumer lag
These metrics track offsets as seen through Gateway’s Virtual Cluster abstraction: they reflect logical topics and Gateway-managed consumer groups, not the underlying Kafka offsets. Use them together to calculate consumer lag per topic and consumer group:
- gateway_topic_log_end_offset: the latest offset in each partition of a logical topic, representing the most recent message written. Labeled by vcluster, topic, and partition.
- gateway_topic_current_offset: the last committed offset for a consumer group on a logical topic, representing how far it has read. Labeled by vcluster, topic, partition, and group.
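The lag calculation amounts to subtracting each group's committed offset from the log end offset of the same partition. The sketch below assumes the two metrics have already been scraped into dictionaries keyed by their labels. The data structures and sample values are illustrative only.

```python
from typing import Dict, Tuple

PartitionKey = Tuple[str, str, int]       # (vcluster, topic, partition)
GroupKey = Tuple[str, str, int, str]      # (vcluster, topic, partition, group)


def consumer_lag(log_end: Dict[PartitionKey, int],
                 committed: Dict[GroupKey, int]) -> Dict[GroupKey, int]:
    """Lag per group and partition: log end offset minus committed offset."""
    lag = {}
    for (vcluster, topic, partition, group), offset in committed.items():
        end = log_end.get((vcluster, topic, partition))
        if end is None:
            continue  # no log end offset sample for this partition
        # Clamp at zero in case the two samples were taken at different times.
        lag[(vcluster, topic, partition, group)] = max(end - offset, 0)
    return lag


# Illustrative samples for one topic with two partitions.
log_end = {("vc1", "orders", 0): 1500, ("vc1", "orders", 1): 900}
committed = {
    ("vc1", "orders", 0, "billing"): 1200,
    ("vc1", "orders", 1, "billing"): 900,
}
lag = consumer_lag(log_end, committed)
```

Summing the per-partition values for a given topic and group gives the total lag to compare against an alert threshold.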
Cache health
gateway_kcache_size reflects the number of key-value pairs in Gateway’s internal cache, broken down by the type label:
| Type | What it tracks |
|---|---|
| topic | Topic mappings |
| topicConcentrationRule | Topic concentration rules |
| offsetTracking | Offset tracking entries |
| aclRules | ACL rules |
| encryptionCache | Encryption cache entries |
| testTokenization | Test tokenization entries |
Quick reference
| Metric | Category | What to watch for | Suggested alert |
|---|---|---|---|
| up{job="conduktor-gateway"} | Availability | Instance unreachable | == 0 for 1m |
| gateway_license_remaining_days | Availability | License approaching expiry | < 14 warning, < 3 critical |
| gateway_upstream_io_nodes | Availability | Kafka node loss | Below expected cluster size |
| gateway_active_connections_vcluster | Connections | Sudden drops or spikes | delta < -50 in 5m, or > 1000 |
| gateway_upstream_connections_upstream_connected | Connections | Connection instability | delta < -5 in 5m |
| gateway_upstream_connection_close_rate | Connections | Connection churn | > 5 closes/s |
| gateway_failed_authentications | Connections | Auth loops or brute force | rate > 10/s |
| gateway_bytes_exchanged | Throughput | Traffic anomalies | > 50% drop vs. baseline |
| gateway_bytes_exchanged_vcluster | Throughput | Per-tenant traffic spikes | > 2x above baseline |
| gateway_latency_request_response | Latency | Rising response times | p99 > 500ms |
| gateway_apiKeys_throttle_ms | Latency | Kafka-imposed throttling | rate > 0 |
| gateway_current_inflight_apiKeys | Latency | Request backlog | > 500 sustained |
| gateway_thread_tasks | Latency | Processing bottleneck | > 100 sustained |
| gateway_error_per_apiKeys | Errors | Processing failures | > 5% of total traffic |
| gateway_request_expired | Errors | Kafka connectivity | rate > 0 |
| gateway_topic_log_end_offset - gateway_topic_current_offset | Consumer lag | Growing lag | > 10000 |
| gateway_kcache_size | Cache | Unexpected growth | delta > 1000 in 1h |