This feature is available with Conduktor Shield only.

Overview

Conduktor Chaos testing helps you validate your Kafka applications’ resilience by simulating real-world failure scenarios. Instead of discovering issues in production, you can proactively test how your applications handle problems like network failures, slow brokers, and corrupted or duplicate messages. Testing these scenarios before they happen helps you:
  • Validate configurations - Are your timeouts long enough for rolling upgrades?
  • Test error handling - Does your application recover from corrupted messages?
  • Verify idempotency - Can your system handle duplicate payments correctly?
  • Understand behavior - How does your app react when brokers are slow?
Conduktor provides eight chaos testing Interceptors that cover the most common failure scenarios:

Essential Interceptors

Start with these - they cover the most common production scenarios and help validate critical resilience patterns.
Interceptor | What it tests | Common use case
Duplicate messages | Application idempotency | Payment processing, order fulfillment
Message corruption | Deserialization error handling | Consumer resilience, dead letter queues
Leader election errors | Rolling upgrade survival | Production deployments, Kafka upgrades
Broken brokers | Retry and timeout configuration | Broker failures, network issues
Invalid schema ID | Schema registry error handling | Schema evolution, registry outages

Additional Interceptors

Interceptor | What it tests | Common use case
Slow brokers | Timeout configuration | Performance degradation
Slow producers/consumers | Application-level timeouts | Topic-specific slowness
Latency on all interactions | General network delay tolerance | Network issues, geographic distance

Duplicate messages

Essential for payment processing, order fulfillment and any business-critical flow.

What it simulates

Applications receive the same message multiple times with identical content. This happens in production due to:
  • Network retries from producers
  • Consumer re-balances before offset commits
  • Exactly-once processing failures
Gateway duplicates messages in two modes:
  • CONSUME mode (recommended): Client sees duplicates, but Kafka doesn’t store them
  • PRODUCE mode: Duplicates are written to Kafka

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.DuplicateMessagesPlugin
Key | Type | Default | Description
topic | string | .* | Topics that match this regex will have the Interceptor applied.
rateInPercent | integer | 100 | Percentage of records that will be duplicated. Adjust this value to simulate occasional duplicates.
target | enum | CONSUME | Record is duplicated when the client produces or consumes. Values: PRODUCE or CONSUME.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "duplicateMessagesTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.DuplicateMessagesPlugin",
  "priority": 100,
  "config": {
    "topic": "payment-events",
    "rateInPercent": 20,
    "target": "CONSUME"
  }
}'

How to make your client survive this

The key is idempotency - processing the same message multiple times should produce the same result.

Choose your deduplication strategy

Idempotent producer (enable.idempotence=true):
  • Prevents duplicates from network retries at the Kafka level
  • Does not prevent duplicates caused by application restarts, so the application must still tolerate them
  • Combine with business-level deduplication (database unique constraints, in-memory caching)
  • Low-latency writes
  • Manage offsets manually (enable.auto.commit=false) to commit only after successful processing (see the consumer sketch below)
Transactional producer (transactional.id=...):
  • Exactly-once semantics across application restarts
  • Stream processing (read-process-write operations)
  • Atomic multi-partition writes
  • Higher latency than idempotence alone
Business-level deduplication only:
  • When neither idempotence nor transactions fit your use case
  • Consuming from multiple sources (not just Kafka)
  • Deduplication across different systems
  • Design idempotent operations where repeating the same action produces the same result
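
For illustration, here is a minimal consumer sketch combining manual offset commits with business-level deduplication. The bootstrap address and topic name are assumptions, and the in-memory set stands in for a durable store such as a database unique constraint:

import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IdempotentPaymentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-processor");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after successful processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Stand-in for a durable dedup store (e.g. a unique constraint on payment ID).
        Set<String> processedKeys = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payment-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Business-level deduplication: skip keys we have already processed.
                    if (!processedKeys.add(record.key())) {
                        continue; // duplicate detected: count it in a metric, do not reprocess
                    }
                    processPayment(record.value()); // must itself produce the same result if repeated
                }
                // Commit only after the whole batch was processed successfully.
                consumer.commitSync();
            }
        }
    }

    private static void processPayment(String payload) {
        // Hypothetical business logic.
    }
}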

Validation checklist

  • Send 100 unique messages
  • Consumer receives ~120 messages (20% duplicated)
  • Database shows exactly 100 records
  • Application metric duplicates_detected shows ~20
  • No duplicate transactions processed

Message corruption

Essential for consumer resilience and dead letter queue validation.

What it simulates

Messages arrive with corrupted data that fails deserialization. This happens in production due to:
  • Disk corruption on brokers
  • Network transmission errors
  • Producer bugs writing invalid data
  • Schema registry mismatches
Gateway appends random bytes to message values, causing a SerializationException on the consumer side.

Configure the Interceptor

You can simulate corruption on:
  • Consume side (recommended as it doesn’t affect the data stored in Kafka): io.conduktor.gateway.interceptor.chaos.FetchSimulateMessageCorruptionPlugin
  • Produce side: io.conduktor.gateway.interceptor.chaos.ProduceSimulateMessageCorruptionPlugin
Key | Type | Default | Description
topic | string | .* | Topics that match this regex will have the Interceptor applied.
sizeInBytes | integer | 50 | Number of random bytes to append to message data.
rateInPercent | integer | 100 | Percentage of records that will be corrupted.
When messages use Confluent Schema Registry format (magic byte + schema ID), the corruption is injected after the schema ID to preserve the registry lookup mechanism. This allows testing deserialization failures specifically, not schema registry connectivity issues.
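
For reference, the wire format mentioned here is a single magic byte (0x0) followed by a 4-byte big-endian schema ID, then the serialized payload. A minimal sketch of inspecting a record value (class name is illustrative):

import java.nio.ByteBuffer;

public class WireFormatInspector {
    /** Returns the schema ID if the bytes look like Confluent wire format, or -1 otherwise. */
    static int schemaId(byte[] value) {
        if (value == null || value.length < 5 || value[0] != 0x0) {
            return -1; // not Confluent wire format
        }
        // Bytes 1-4 hold the schema ID as a big-endian int; the (possibly corrupted) payload follows.
        return ByteBuffer.wrap(value, 1, 4).getInt();
    }
}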
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "messageCorruptionTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.FetchSimulateMessageCorruptionPlugin",
  "priority": 100,
  "config": {
    "topic": "orders.*",
    "sizeInBytes": 50,
    "rateInPercent": 5
  }
}'

How to make your client survive this

The key is graceful error handling - don’t let one bad message stop all processing.

Client configuration

Consumer properties:
# Error handling
max.poll.records=100              # Smaller batches = faster recovery
session.timeout.ms=45000          # Time for error handling + DLQ writes
max.poll.interval.ms=300000       # Processing + error handling buffer

# Deserialization
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://schema-registry:8081
DLQ producer (separate producer for dead letter queue):
acks=1                    # Favor speed over durability for DLQ
retries=3                 # Limited retries
request.timeout.ms=5000   # Fail fast if DLQ unavailable
enable.idempotence=false  # DLQ order doesn't matter
buffer.memory=16777216    # 16MB
  • Implement per-record error handling - Catch SerializationException per message, send to DLQ, continue processing (see the sketch below)
  • Set up a dead letter queue - Capture corrupted messages for investigation
  • Keep batch sizes small - Limit impact of poison pill messages
  • Add observability - Monitor corruption rates, alert on thresholds
  • Test with compression - Production uses compression (lz4, gzip). Test with compression.type=lz4 on your topics
Compression impact: When using compression (production default), a corrupted batch fails decompression for ALL messages in that batch, not just the corrupted one. Test with low rateInPercent=1 to simulate realistic scenarios.
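
A minimal sketch of the per-record error handling described above, using RecordDeserializationException (available in Kafka clients 2.7+) to skip past a poison record and note it in a dead letter topic. The topic names and DLQ payload are assumptions:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RecordDeserializationException;

public class CorruptionTolerantLoop {
    // dlqProducer is a separate KafkaProducer<byte[], byte[]> configured as shown above.
    static void pollLoop(KafkaConsumer<String, Object> consumer, KafkaProducer<byte[], byte[]> dlqProducer) {
        while (true) {
            try {
                ConsumerRecords<String, Object> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(CorruptionTolerantLoop::process);
                consumer.commitSync();
            } catch (RecordDeserializationException e) {
                // One corrupted record: report it to the DLQ, then skip past it and continue.
                TopicPartition tp = e.topicPartition();
                String info = tp + "@" + e.offset();
                dlqProducer.send(new ProducerRecord<>("orders-dlq", info.getBytes(StandardCharsets.UTF_8)));
                consumer.seek(tp, e.offset() + 1); // jump over the poison record
            }
        }
    }

    static void process(ConsumerRecord<String, Object> record) {
        // Hypothetical business logic.
    }
}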

Validation checklist

  • Consumer receives SerializationException for corrupted messages
  • Corrupted messages sent to DLQ (not lost)
  • Consumer continues processing subsequent messages
  • Consumer lag doesn’t increase
  • Alert triggered if corruption rate exceeds 1%
  • DLQ write failures are logged and monitored
  • Test with compression enabled (matches production)

Leader election errors

Essential for surviving rolling upgrades and broker restarts.

What it simulates

Broker leadership changes during rolling upgrades, planned maintenance or failures. This is the most common production scenario - every Kafka upgrade triggers this. Gateway returns errors that occur during leader election:
  • LEADER_NOT_AVAILABLE - Election in progress
  • NOT_LEADER_OR_FOLLOWER - This broker is not the leader
  • BROKER_NOT_AVAILABLE - Broker temporarily unavailable
Real-world timing:
  • Leader election: 10-30 seconds typically
  • Rolling upgrade (3 brokers): ~5 minutes total (30s per broker × 3 + stabilization time)

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateLeaderElectionsErrorsPlugin
Key | Type | Description
rateInPercent | integer | Percentage of requests that will return leader election errors. Adapt this value to simulate a partial failure scenario.
This Interceptor also affects transactional producers by simulating leader election errors during transaction marker writes, ensuring your exactly-once semantics can survive broker failures.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "leaderElectionTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateLeaderElectionsErrorsPlugin",
  "priority": 100,
  "config": {
    "rateInPercent": 50
  }
}'

How to make your client survive this

The key is timeouts longer than leader election duration - don’t give up before the new leader is elected.

Configuration depends on your scenario

Scenario 1: Single broker failure (most common - one broker down, leader election)
# Producer
request.timeout.ms=45000      # 30s election + 15s safety margin
delivery.timeout.ms=120000    # 2 minutes total delivery window
metadata.max.age.ms=30000     # Refresh metadata to find new leader
retries=2147483647            # Retry until timeout

# Consumer
session.timeout.ms=45000      # 45s for leader election
max.poll.interval.ms=300000   # 5 minutes for processing + error handling
heartbeat.interval.ms=3000    # Keep consumer alive
Scenario 2: Rolling upgrade (multiple successive elections)
# Producer
request.timeout.ms=60000      # Longer per-request timeout
delivery.timeout.ms=360000    # 6 minutes (covers 5min upgrade cycle)
metadata.max.age.ms=20000     # More aggressive metadata refresh
retries=2147483647            # Retry until timeout

# Consumer
session.timeout.ms=60000      # 60s (covers longer disruption)
max.poll.interval.ms=600000   # 10 minutes (> 5m upgrade)
heartbeat.interval.ms=3000    # Keep consumer alive
Why the difference?
  • Rolling upgrades cause multiple successive leader elections (one per broker)
  • Each broker restart: ~30s election + 1-2 min stabilization
  • 3 brokers = 5+ minutes total
  • Single failure = just one 30s election
Key practices:
  • Set request timeouts longer than election time - leader elections typically take 10-30 seconds, so timeouts should exceed this (a producer sketch applying these settings follows this list)
  • Configure infinite retries within a reasonable delivery timeout to handle transient failures automatically
  • Implement exponential backoff between retries to avoid overwhelming the cluster during elections
  • Keep consumer session timeouts generous to prevent unnecessary re-balances during leader changes
  • Refresh metadata proactively to discover new leaders faster after elections complete
  • Plan for rolling upgrades, which can trigger multiple successive elections (budget 5+ minutes for a 3-broker cluster)
  • Monitor retry metrics to understand how often your applications encounter these scenarios
  • Avoid premature application restarts that would only worsen the situation during cluster maintenance
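
A minimal sketch of building a producer with the Scenario 1 settings above, plus a send callback that surfaces the final outcome. The bootstrap address, topic and backoff value are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class ElectionTolerantProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969"); // Gateway address (assumed)
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Scenario 1: single broker failure
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 45_000);   // > 30s election
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // total delivery window
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 30_000);     // refresh metadata to find the new leader
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // retry until the delivery timeout
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);        // pause between retries
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> producer = create()) {
            producer.send(new ProducerRecord<>("payment-events", "key", "value"), (metadata, exception) -> {
                // Fires once, after internal retries have succeeded or the delivery timeout expired.
                if (exception != null) {
                    System.err.println("Delivery failed after retries: " + exception);
                }
            });
            producer.flush();
        }
    }
}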

How partition count affects testing

Single-partition topics: leader election is fast (~5-10s).
Multi-partition topics: leader election scales with partition count.
Example: a topic with 30 partitions across 3 brokers
  • Broker 1 fails → 10 partition leaders need election
  • Each election: ~5-10s
  • Some elections run in parallel (depends on controller load)
  • Total: 15-30s
Configuration adjustment for high partition count (50+):
request.timeout.ms=60000      # Longer timeout
metadata.max.age.ms=15000     # More frequent metadata refresh

Validation checklist

Test 1: Single broker failure (30s)
  • Producer retries 5-10 times over 30s
  • No TimeoutException thrown
  • All messages delivered after election completes
  • No application restarts needed
  • Metric record-retry-total increases
Test 2: Rolling upgrade simulation (5 minutes)
  • Enable Interceptor for 5 minutes with rateInPercent=33 (simulates 1 of 3 brokers down)
  • Producer survives all elections
  • Consumer doesn’t rebalance unnecessarily
  • Zero message loss
  • Application runs continuously
Common failure: request.timeout.ms=5000 (too short)
  • Symptom: TimeoutException: Topic not present in metadata after 5000ms
  • Fix: Increase to 45000ms (> 30s election time)

Broken brokers

Essential for validating retry logic and broker failure handling.

What it simulates

Complete or partial broker failures that return errors to clients. This happens in production due to:
  • Broker crashes or restarts
  • Network partitions
  • Disk failures on brokers
  • Out of memory on brokers
Confluent Schema Registry users: treat “Invalid schema ID” as essential. Schema Registry downtime is common during upgrades and can completely stop consumption if not handled properly.
AWS Glue Schema Registry: the Invalid schema ID Interceptor does NOT work with AWS Glue (different wire format). To test AWS Glue scenarios, use the message corruption Interceptor instead.
Gateway returns specific Kafka error codes immediately, simulating broker-side failures. Common errors:
  • NOT_ENOUGH_REPLICAS - Not enough replicas available (with acks=all)
  • RECORD_LIST_TOO_LARGE - Batch exceeds broker limit
  • UNKNOWN_SERVER_ERROR - Generic broker failure
  • OFFSET_OUT_OF_RANGE - Consumer offset no longer valid

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin
Key | Type | Default | Description
rateInPercent | integer | - | Required. Percentage of requests that will return errors. Adapt this value to simulate a partial failure scenario.
errorMap | map | {"FETCH": "UNKNOWN_SERVER_ERROR", "PRODUCE": "CORRUPT_MESSAGE"} | Errors returned on consume and produce.
Available errors for PRODUCE:
  • NOT_ENOUGH_REPLICAS - Most common with acks=all
  • RECORD_LIST_TOO_LARGE - Batch size issue
  • CORRUPT_MESSAGE - Message validation failed
  • UNKNOWN_SERVER_ERROR - Generic failure
  • See all PRODUCE errors in Kafka source code
Available errors for FETCH:
  • UNKNOWN_SERVER_ERROR - Generic failure (default)
  • OFFSET_OUT_OF_RANGE - Consumer offset no longer valid
  • See all FETCH errors in Kafka source code
Gateway validates error configurations at deployment time. If you specify an error that can’t occur for a given ApiKey (for example, NOT_ENOUGH_REPLICAS for FETCH), the Interceptor deployment will fail with a clear error message.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "brokenBrokersTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin",
  "priority": 100,
  "config": {
    "rateInPercent": 30,
    "errorMap": {
      "PRODUCE": "NOT_ENOUGH_REPLICAS"
    }
  }
}'

How to make your client survive this

The key is infinite retries with appropriate timeouts - most broker errors are temporary.
  • Enable aggressive retry behavior to handle temporary broker failures automatically
  • Set generous delivery timeouts that give the cluster time to recover (typically 2+ minutes)
  • Use exponential backoff to avoid flooding struggling brokers with retry attempts
  • Request acknowledgment from all replicas for maximum durability, which exposes replica availability issues
  • Enable producer idempotence to safely retry without creating duplicates
  • Allocate sufficient buffer memory to queue messages during temporary broker outages
  • Configure consumer session timeouts to tolerate broker issues without triggering re-balances
  • Balance request timeouts between being patient and not consuming the entire delivery window
  • Handle permanent failures gracefully (like oversized messages) with alternative processing paths
  • Distinguish between retriable and non-retriable errors in your application logic (see the callback sketch below)
acks=all (aka acks=-1) provides:
  1. Maximum durability (waits for all in-sync replicas)
  2. Exposes replica availability issues that acks=1 would hide
  3. Required for exactly-once semantics (when combined with idempotence)
Relationship with NOT_ENOUGH_REPLICAS:
Topic config: min.insync.replicas=2
Scenario: Only 1 replica available
Producer config: acks=all

Result: Producer gets NOT_ENOUGH_REPLICAS error
Why this matters:
  • acks=1 would succeed (writes to leader only)
  • acks=all fails fast, exposing the replica issue
  • Your application can retry until replicas recover
  • No silent data loss from under-replicated topics
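
A minimal sketch of distinguishing retriable from non-retriable errors in a producer send callback, as referenced above. The handler methods and fallback path are hypothetical:

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.common.errors.RetriableException;

public class ErrorAwareSender {
    static void send(KafkaProducer<String, String> producer, ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                return; // delivered (possibly after internal retries)
            }
            if (exception instanceof RecordTooLargeException) {
                // Permanent failure: the message will never fit. Route to an alternative path.
                handleOversized(record);
            } else if (exception instanceof RetriableException) {
                // The client already retried until delivery.timeout.ms expired; alert and decide whether to re-enqueue.
                alertRetriesExhausted(record, exception);
            } else {
                // Non-retriable (authorization, serialization, fenced producer, ...): fail loudly.
                alertFatal(record, exception);
            }
        });
    }

    static void handleOversized(ProducerRecord<String, String> record) { /* e.g. store a pointer, not the payload */ }
    static void alertRetriesExhausted(ProducerRecord<String, String> record, Exception e) { /* metrics + logging */ }
    static void alertFatal(ProducerRecord<String, String> record, Exception e) { /* metrics + logging */ }
}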

Validation checklist

Test NOT_ENOUGH_REPLICAS (most common):
  • Producer retries automatically
  • No messages lost after 30-60s recovery
  • Metric record-retry-total increases
  • No application restarts
  • Producer eventually succeeds
Test RECORD_LIST_TOO_LARGE (batch sizing):
  • Producer logs RecordTooLargeException
  • Application handles gracefully (alternative topic or rejection)
  • Alert triggered for oversized messages

Invalid schema ID

Essential for Schema Registry failure scenarios.

What it simulates

Schema registry is unavailable or returns errors, causing deserialization failures. This happens in production due to:
  • Schema registry downtime
  • Wrong schema registry URL configured
  • Schema deleted from registry
  • Network issues between consumer and schema registry
Gateway overwrites the schema ID in messages with an invalid value, causing schema registry lookup failures.

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateInvalidSchemaIdPlugin
Key | Type | Default | Description
topic | string | .* | Topics that match this regex will have the Interceptor applied.
invalidSchemaId | integer | Random (per message) | Invalid schema ID to use. If not specified, generates different random IDs for each message, simulating a misconfigured schema mapping.
target | enum | CONSUME | When to inject the error. Values: PRODUCE or CONSUME.
This Interceptor only affects messages using Confluent Schema Registry wire format (magic byte 0x0 + 4-byte schema ID).
Not supported:
  • AWS Glue Schema Registry (different wire format with UUID)
  • Plain messages without schema registry encoding
When invalidSchemaId is not specified, the Interceptor generates a different random schema ID for each message, simulating misconfigured schema mappings.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "invalidSchemaTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateInvalidSchemaIdPlugin",
  "priority": 100,
  "config": {
    "topic": "user-events",
    "invalidSchemaId": 9999,
    "target": "CONSUME"
  }
}'

How to make your client survive this

The key is treating schema errors like message corruption - handle gracefully and continue processing.

Client configuration

Consumer properties:
# Deserialization
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://schema-registry:8081

# Schema caching (helps with registry outages, not with invalid IDs)
schema.registry.cache.capacity=1000     # Default, increase for many schemas (100+ topics)

# Error handling
max.poll.records=100                    # Limit impact of schema errors
session.timeout.ms=45000                # Time for error handling + DLQ
max.poll.interval.ms=300000             # Processing + error handling buffer
  • Handle deserialization exceptions gracefully - Catch SerializationException, send to DLQ, continue
  • Set up dead letter queue - Capture messages with schema errors for investigation
  • Monitor schema registry health - Periodic health checks and alerts (see the health-check sketch below)
  • Enable schema caching - Helps with registry outages (but not this specific test)
  • Test both scenarios - Invalid schema IDs (this test) AND registry outages (stop registry service)
What this Interceptor tests: handling of messages with invalid schema IDs (deserialization errors from schema registry lookup failures).
What this doesn’t test: Schema Registry outages. To test registry outages, stop the Schema Registry service separately and verify your consumer continues with cached schemas.
Schema caching helps survive registry outages but doesn’t help with invalid schema IDs.
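
A minimal health-check sketch for the registry-monitoring point above, polling the Schema Registry REST endpoint (/subjects). The URL, timeouts and alerting hook are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SchemaRegistryHealthCheck {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        // GET /subjects is a cheap call that exercises the registry end to end.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://schema-registry:8081/subjects"))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                alert("Schema Registry returned HTTP " + response.statusCode());
            }
        } catch (Exception e) {
            alert("Schema Registry unreachable: " + e.getMessage());
        }
    }

    static void alert(String message) {
        System.err.println(message); // wire this to your alerting system
    }
}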

Validation checklist

  • Consumer receives schema registry error: Schema 9999 not found; error code: 40403
  • Messages with invalid schemas sent to DLQ
  • Consumer continues processing other messages
  • Alert triggered for schema registry issues
  • No consumer lag buildup
  • Test with invalidSchemaId null (random IDs per message)
  • Verify schema cache is working (disconnect registry, cached schemas still work)
Read more about Schema Registry

Slow brokers

What it simulates

Brokers respond slowly to produce and fetch requests, simulating:
  • Broker under heavy load
  • Disk I/O issues
  • Network congestion
  • GC pauses on brokers
Gateway adds random latency (between min and max) to responses returned to clients.

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateSlowBrokerPlugin
Key | Type | Default | Description
rateInPercent | integer | - | Required. Percentage of requests that will be slowed down. Adapt this value to simulate a partially slow environment.
minLatencyMs | integer | - | Required. Minimum latency in milliseconds. Must be ≤ maxLatencyMs.
maxLatencyMs | integer | - | Required. Maximum latency in milliseconds.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "slowBrokersTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateSlowBrokerPlugin",
  "priority": 100,
  "config": {
    "rateInPercent": 50,
    "minLatencyMs": 1000,
    "maxLatencyMs": 5000
  }
}'

How to make your client survive this

The key is timeouts longer than expected slowness - don’t time out while the broker is still processing.
  • Set request timeouts longer than expected broker response times to avoid premature failures
  • Configure generous delivery timeouts that tolerate multiple slow responses with retries
  • Batch messages when possible to reduce the number of round trips to slow brokers
  • Balance fetch wait times between responsiveness and broker load
  • Monitor broker performance metrics to understand actual latency patterns (see the sketch after this list)
  • Consider circuit breaker patterns if slowness persists beyond acceptable thresholds
  • Test timeout values under realistic load conditions before production deployment
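
A minimal sketch of the monitoring point above, reading the producer’s own request-latency-avg / request-latency-max metrics from its metrics registry. The 3-second alert threshold is an assumption:

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class BrokerLatencyWatcher {
    static void report(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            // The producer-metrics group exposes request-latency-avg and request-latency-max.
            if ("producer-metrics".equals(name.group()) && name.name().startsWith("request-latency")) {
                double value = (Double) entry.getValue().metricValue();
                if (!Double.isNaN(value) && value > 3_000) { // assumed threshold: 3s
                    System.err.printf("Slow broker responses: %s = %.0f ms%n", name.name(), value);
                }
            }
        }
    }
}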

Validation checklist

  • Producer handles slow responses without timeouts
  • Consumer continues fetching despite latency
  • Application throughput degrades gracefully (not crashes)
  • Timeout values accommodate the maximum latency (5000ms in example)
  • Monitor latency metrics (p50, p99, max)

Slow producers and consumers

What it simulates

Specific topics experience slowness while others remain fast. This simulates:
  • Hot partitions with high load
  • Topic-specific issues
  • Consumer group lag on certain topics
Gateway adds latency to requests for specific topics only.

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateSlowProducersConsumersPlugin
Key | Type | Default | Description
topic | string | .* | Topics that match this regex will have latency applied.
rateInPercent | integer | - | Required. Percentage of requests that will be slowed down. Adapt this value to simulate a partially slow environment.
minLatencyMs | integer | - | Required. Minimum latency in milliseconds. Must be ≤ maxLatencyMs.
maxLatencyMs | integer | - | Required. Maximum latency in milliseconds.
When a single request contains multiple topics, the Interceptor calculates latency for each matching topic independently, then applies the maximum latency to the entire request. This means one slow topic can impact the entire batch.
Example: if a FetchRequest fetches from three topics and one matches your regex with 5000ms latency, the entire fetch (all three topics) will be delayed by 5000ms.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "slowTopicTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateSlowProducersConsumersPlugin",
  "priority": 100,
  "config": {
    "topic": "high-volume-topic",
    "rateInPercent": 30,
    "minLatencyMs": 500,
    "maxLatencyMs": 2000
  }
}'

How to make your client survive this

Use similar approaches to those for slow brokers, with additional focus on topic isolation.
  • Apply the same timeout strategies as for slow brokers
  • Verify topic isolation - ensure slowness on one topic doesn’t cascade to others
  • Monitor per-topic metrics to identify which topics are experiencing issues (see the sketch after this list)
  • Consider separate consumer groups for critical vs non-critical topics
  • Implement backpressure mechanisms for topics under heavy load
  • Test cross-topic independence to ensure your architecture handles partial degradation
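
A minimal sketch of per-topic monitoring using in-process statistics only; how you report or alert on the numbers is up to you:

import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class PerTopicLatencyTracker {
    private final Map<String, LongSummaryStatistics> statsByTopic = new ConcurrentHashMap<>();

    /** Wrap the processing of each record to attribute slowness to a topic. */
    public void timed(ConsumerRecord<?, ?> record, Runnable processing) {
        long start = System.nanoTime();
        try {
            processing.run();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            statsByTopic
                .computeIfAbsent(record.topic(), t -> new LongSummaryStatistics())
                .accept(elapsedMs);
        }
    }

    public void logSummary() {
        statsByTopic.forEach((topic, stats) ->
            System.out.printf("%s: avg=%.1f ms max=%d ms count=%d%n",
                topic, stats.getAverage(), stats.getMax(), stats.getCount()));
    }
}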

Validation checklist

  • Only matching topics experience latency
  • Non-matching topics remain fast
  • Application handles per-topic slowness gracefully
  • Consumer groups isolated properly (one slow topic doesn’t block others)
  • Per-topic metrics show latency differences

Latency on all interactions

What it simulates

General network latency affecting all Kafka operations, simulating:
  • Geographic distance to Kafka cluster
  • Network congestion
  • VPN overhead
  • Cloud region latency
Gateway adds latency to a percentage of all requests and responses.

Configure the Interceptor

Class: io.conduktor.gateway.interceptor.chaos.SimulateLatencyPlugin
Key | Type | Default | Description
appliedPercentage | integer | - | Required. Percentage of requests affected (0-100).
latencyMs | long | - | Required. Milliseconds of latency to add (0 to 2,147,483,647).
This Interceptor applies globally to ALL request types (produce, fetch, metadata, etc.). Unlike topic-specific Interceptors, there’s no filtering. Use appliedPercentage carefully in production-like testing.
curl \
  --request PUT \
  --url 'http://localhost:8888/gateway/v2/interceptor' \
  --header 'Authorization: Basic YWRtaW46Y29uZHVrdG9y' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "name": "networkLatencyTest",
  "pluginClass": "io.conduktor.gateway.interceptor.chaos.SimulateLatencyPlugin",
  "priority": 100,
  "config": {
    "appliedPercentage": 50,
    "latencyMs": 1000
  }
}'

How to make your client survive this

Use the same approaches as for slow brokers - ensure timeouts accommodate the expected latency.
  • Configure timeouts based on your network environment (geographic distance, VPN overhead, cloud regions)
  • Test with realistic latency values that match your production deployment
  • Consider the impact on end-to-end latency for your application’s SLAs
  • Monitor network metrics to understand actual latency distributions
  • Design for degraded performance rather than complete failure under high latency

Testing geographic distance

Scenario: Application in a different AWS region than the Kafka cluster.
Use Latency on all interactions to simulate cross-region latency:
spec:
  pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateLatencyPlugin
  config:
    appliedPercentage: 100    # All requests
    latencyMs: 150            # Typical us-east-1 to eu-west-1
Recommended client configuration for cross-region:
# Producer
request.timeout.ms=60000           # 60s (accommodate 150ms RTT)
delivery.timeout.ms=180000         # 3 minutes
linger.ms=100                      # Batch more to reduce round trips
batch.size=32768                   # 32KB batches
compression.type=lz4               # Compress to reduce cross-region bandwidth

# Consumer
session.timeout.ms=60000           # 60s (higher than single-region)
heartbeat.interval.ms=3000         # 3s (default)
fetch.min.bytes=1024               # Wait for more data before responding
fetch.max.wait.ms=500              # Max wait for fetch.min.bytes

Validation checklist

  • Producer throughput doesn’t drop below acceptable threshold
  • Consumer lag remains stable despite latency
  • No timeouts during normal operation
  • Application handles occasional latency spikes (>500ms)
  • End-to-end latency meets SLA requirements
  • Test with realistic geographic latency values

Advanced scenarios

While there’s no dedicated chaos Interceptor for re-balances, you can trigger them by combining Interceptors.
Rebalance trigger 1: Session timeout via slow consumer
# Simulate slow consumer that exceeds max.poll.interval.ms
spec:
  pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateSlowProducersConsumersPlugin
  config:
    topic: your-topic
    rateInPercent: 100
    minLatencyMs: 310000    # Exceeds default max.poll.interval.ms (300s)
    maxLatencyMs: 310000
Rebalance trigger 2: Leader election (indirect)
# Leader election can cause slow responses → session timeout → rebalance
spec:
  pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateLeaderElectionsErrorsPlugin
  config:
    rateInPercent: 50
What to validate:
  • Consumer resumes processing after rebalance completes
  • No duplicate processing (offsets committed correctly before rebalance; see the listener sketch after this list)
  • Application doesn’t crash during rebalance
  • Metrics show rebalance duration (consumer group lag spike)
  • Other consumers in group pick up partitions
  • Rebalance completes within expected time (typically 5-30s)
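
To satisfy the offset-commit check above, a common pattern is a rebalance listener that commits synchronously when partitions are revoked. A minimal sketch, where the offset map is maintained by your processing loop:

import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitOnRevokeListener implements ConsumerRebalanceListener {
    private final KafkaConsumer<?, ?> consumer;
    // Updated by the processing loop after each successfully processed record: offset + 1.
    private final Map<TopicPartition, OffsetAndMetadata> pendingOffsets = new ConcurrentHashMap<>();

    public CommitOnRevokeListener(KafkaConsumer<?, ?> consumer) {
        this.consumer = consumer;
    }

    public void markProcessed(TopicPartition tp, long offset) {
        pendingOffsets.put(tp, new OffsetAndMetadata(offset + 1));
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit what we have processed so the next owner does not reprocess it.
        if (!pendingOffsets.isEmpty()) {
            consumer.commitSync(pendingOffsets);
            pendingOffsets.clear();
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Nothing to do: positions are restored from the committed offsets.
    }
}

Register the listener with consumer.subscribe(topics, listener) and call markProcessed(...) from your processing loop.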
Scenario: Transaction coordinator failure during commit.
Use the broken brokers Interceptor targeting PRODUCE with transaction-specific errors:
spec:
  pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin
  config:
    rateInPercent: 20
    errorMap:
      PRODUCE: INVALID_PRODUCER_EPOCH
What happens:
  • Producer gets INVALID_PRODUCER_EPOCH during commitTransaction()
  • This is a FATAL error - the producer throws ProducerFencedException (not retriable)
  • Producer cannot continue - must close and re-initialize with same transactional.id
  • Application must retry the entire read-process-write transaction
Producer configuration (already configured for transactions):
transactional.id=my-app-instance-1    # Same ID on re-init gets new epoch from coordinator
enable.idempotence=true               # Required for transactions (automatically set)
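A minimal recovery sketch for the fenced-producer flow described above. The createTransactionalProducer and processAndSend helpers are hypothetical placeholders:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.errors.ProducerFencedException;

public class TransactionalRetryLoop {
    static void runOnce(KafkaProducer<String, String> producer) {
        try {
            producer.beginTransaction();
            processAndSend(producer);          // read-process-write body (hypothetical helper)
            producer.commitTransaction();
        } catch (ProducerFencedException fatal) {
            // Fatal: this producer instance can never be used again. Do not call abortTransaction().
            producer.close();
            // Re-initialize with the SAME transactional.id to obtain a new epoch,
            // then retry the whole read-process-write cycle.
            KafkaProducer<String, String> fresh = createTransactionalProducer("my-app-instance-1");
            fresh.initTransactions();
            runOnce(fresh);
        } catch (Exception other) {
            // Other failures: abort and retry the cycle (production code should
            // distinguish abortable from fatal errors here).
            producer.abortTransaction();
        }
    }

    static void processAndSend(KafkaProducer<String, String> producer) { /* hypothetical */ }

    static KafkaProducer<String, String> createTransactionalProducer(String transactionalId) {
        throw new UnsupportedOperationException("hypothetical factory");
    }
}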
What to validate:
  • Application catches ProducerFencedException
  • Application closes fenced producer
  • Application creates new producer with same transactional.id (gets new epoch)
  • Application retries the entire read-process-write cycle
  • Consumer sees no partial transactions (isolation.level=read_committed)
Additional transaction errors to test:
  • INVALID_TXN_STATE - Transaction state machine violation
  • TRANSACTIONAL_ID_AUTHORIZATION_FAILED - ACL issue
  • CONCURRENT_TRANSACTIONS - Transactional ID reused

Quick reference: resilience strategies by scenario

Use this table to identify key resilience strategies for each chaos testing scenario:
Scenario | Key producer strategies | Key consumer strategies
Duplicate messages | Enable idempotence; use transactions for exactly-once | Manual offset management; business-level deduplication
Message corruption | N/A | Per-record error handling; dead letter queue setup; test with compression enabled
Leader election | Long request timeouts (>30s); aggressive retries; metadata refresh | Long session timeouts (>30s); allow time for elections
Broken brokers | Infinite retries with backoff; long delivery timeouts; idempotence | Generous session timeouts; balanced request timeouts
Schema errors | N/A | Schema caching; graceful error handling; DLQ for failed messages
Slow brokers | Timeouts > expected latency; message batching | Timeouts > broker response time; optimized fetch settings
Cross-region | Higher timeouts; compression; larger batches | Higher session timeouts; optimized fetch settings