This feature is available with Conduktor Shield only.
Overview
Conduktor Chaos testing helps you validate your Kafka applications’ resilience by simulating real-world failure scenarios. Instead of discovering issues in production, you can proactively test how your applications handle problems like network failures, slow brokers, and corrupted or duplicate messages. Testing these scenarios before they happen helps you:
- Validate configurations - Are your timeouts long enough for rolling upgrades?
- Test error handling - Does your application recover from corrupted messages?
- Verify idempotency - Can your system handle duplicate payments correctly?
- Understand behavior - How does your app react when brokers are slow?
Essential Interceptors
Start with these - they cover the most common production scenarios and help validate critical resilience patterns.

| Interceptor | What it tests | Common use case |
|---|---|---|
| Duplicate messages | Application idempotency | Payment processing, order fulfillment |
| Message corruption | Deserialization error handling | Consumer resilience, dead letter queues |
| Leader election errors | Rolling upgrade survival | Production deployments, Kafka upgrades |
| Broken brokers | Retry and timeout configuration | Broker failures, network issues |
| Invalid schema ID | Schema registry error handling | Schema evolution, registry outages |
Additional Interceptors
| Interceptor | What it tests | Common use case |
|---|---|---|
| Slow brokers | Timeout configuration | Performance degradation |
| Slow producers/consumers | Application-level timeouts | Topic-specific slowness |
| Latency on all interactions | General network delay tolerance | Network issues, geographic distance |
Duplicate messages
What it simulates
Applications receive the same message multiple times with identical content. This happens in production due to:
- Network retries from producers
- Consumer re-balances before offset commits
- Exactly-once processing failures
The Interceptor can inject duplicates in two modes:
- CONSUME mode (recommended): Client sees duplicates, but Kafka doesn’t store them
- PRODUCE mode: Duplicates are written to Kafka
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.DuplicateMessagesPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| topic | string | .* | Topics that match this regex will have the Interceptor applied. |
| rateInPercent | integer | 100 | Percentage of records that will be duplicated. Adjust this value to simulate occasional duplicates. |
| target | enum | CONSUME | Record is duplicated when client produces or consumes, values: PRODUCE or CONSUME |
How to make your client survive this
The key is idempotency - processing the same message multiple times should produce the same result.

Choose your deduplication strategy

Idempotent producer (enable.idempotence=true):
- Prevents duplicates from network retries at the Kafka level
- You still need to tolerate duplicates from application restarts
- Combine with business-level deduplication (database unique constraints, in-memory caching)
- Low-latency writes
- Manage offsets manually (enable.auto.commit=false) to commit only after successful processing

Transactional producer (transactional.id=...):
- Exactly-once semantics across application restarts
- Stream processing (read-process-write operations)
- Atomic multi-partition writes
- Higher latency than idempotence alone

Application-level deduplication:
- When neither idempotence nor transactions fit your use case
- Consuming from multiple sources (not just Kafka)
- Deduplication across different systems
- Design idempotent operations where repeating the same action produces the same result
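To make the first two strategies concrete, here is a minimal, hedged sketch of an idempotent producer plus a manually committing consumer with business-level deduplication. The bootstrap address, topic names and the in-memory processedIds store are illustrative assumptions, not part of the Interceptor configuration.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.*;

import java.time.Duration;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class DeduplicationSketch {
    public static void main(String[] args) {
        // Producer side: idempotence removes duplicates caused by network retries.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969"); // assumption: Gateway address
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
        producer.send(new ProducerRecord<>("payments", "order-42", "{\"amount\": 10}"));
        producer.close();

        // Consumer side: manual commits plus business-level deduplication on the record key.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Set<String> processedIds = ConcurrentHashMap.newKeySet(); // stand-in for a DB unique constraint
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (!processedIds.add(record.key())) {
                        continue; // duplicate injected by the chaos Interceptor - skip it
                    }
                    // process(record) would go here
                }
                consumer.commitSync(); // commit only after the batch was processed successfully
            }
        }
    }
}
```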
Validation checklist
- Send 100 unique messages
- Consumer receives ~120 messages (20% duplicated)
- Database shows exactly 100 records
- Application metric duplicates_detected shows ~20
- No duplicate transactions processed
Message corruption
What it simulates
Messages arrive with corrupted data that fails deserialization. This happens in production due to:
- Disk corruption on brokers
- Network transmission errors
- Producer bugs writing invalid data
- Schema registry mismatches
The corruption typically surfaces as a SerializationException on the consumer side.
Configure the Interceptor
You can simulate corruption on:
- Consume side (recommended as it doesn’t affect the data stored in Kafka): io.conduktor.gateway.interceptor.chaos.FetchSimulateMessageCorruptionPlugin
- Produce side: io.conduktor.gateway.interceptor.chaos.ProduceSimulateMessageCorruptionPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| topic | string | .* | Topics that match this regex will have the Interceptor applied. |
| sizeInBytes | integer | 50 | Number of random bytes to append to message data. |
| rateInPercent | integer | 100 | Percentage of records that will be corrupted. |
When messages use Confluent Schema Registry format (magic byte + schema ID), the corruption is injected after the schema ID to preserve the registry lookup mechanism. This allows testing deserialization failures specifically, not schema registry connectivity issues.
How to make your client survive this
The key is graceful error handling - don’t let one bad message stop all processing.

Client configuration

Consumer properties and per-record error handling are sketched after the list below.

Recommended approaches
- Implement per-record error handling - Catch SerializationException per message, send to DLQ, continue processing
- Set up a dead letter queue - Capture corrupted messages for investigation
- Keep batch sizes small - Limit impact of poison pill messages
- Add observability - Monitor corruption rates, alert on thresholds
- Test with compression - Production uses compression (lz4, gzip). Test with compression.type=lz4 on your topics
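As a hedged illustration of per-record error handling with a dead letter queue (the topic names and the DLQ producer are assumptions, not prescribed by the Interceptor), a consumer loop might look like this:

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RecordDeserializationException;

import java.time.Duration;
import java.util.List;

public class CorruptionTolerantConsumer {
    // 'consumer', 'dlqProducer' and the topic names are illustrative assumptions.
    static void pollLoop(KafkaConsumer<String, String> consumer, KafkaProducer<String, byte[]> dlqProducer) {
        consumer.subscribe(List.of("orders"));
        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) would go here
                }
                consumer.commitSync();
            } catch (RecordDeserializationException e) {
                // One poison pill: capture its coordinates, park a marker in the DLQ, then skip past it
                TopicPartition tp = e.topicPartition();
                dlqProducer.send(new ProducerRecord<>("orders-dlq",
                        tp.toString(), ("corrupted record at offset " + e.offset()).getBytes()));
                consumer.seek(tp, e.offset() + 1); // continue with the next record
            }
        }
    }
}
```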
Compression impact: When using compression (production default), a corrupted batch fails decompression for ALL messages in that batch, not just the corrupted one. Test with a low rateInPercent=1 to simulate realistic scenarios.

Validation checklist
- Consumer receives SerializationException for corrupted messages
- Corrupted messages sent to DLQ (not lost)
- Consumer continues processing subsequent messages
- Consumer lag doesn’t increase
- Alert triggered if corruption rate exceeds 1%
- DLQ write failures are logged and monitored
- Test with compression enabled (matches production)
Leader election errors
What it simulates
Broker leadership changes during rolling upgrades, planned maintenance or failures. This is the most common production scenario - every Kafka upgrade triggers this. Gateway returns errors that occur during leader election:
- LEADER_NOT_AVAILABLE - Election in progress
- NOT_LEADER_OR_FOLLOWER - This broker is not the leader
- BROKER_NOT_AVAILABLE - Broker temporarily unavailable

Typical durations:
- Leader election: 10-30 seconds typically
- Rolling upgrade (3 brokers): ~5 minutes total (30s per broker × 3 + stabilization time)
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateLeaderElectionsErrorsPlugin
| Key | Type | Description |
|---|---|---|
| rateInPercent | integer | Percentage of requests that will return leader election errors. Adapt this value to simulate a partial failure scenario. |
This Interceptor also affects transactional producers by simulating leader election errors during transaction marker writes, ensuring your exactly-once semantics can survive broker failures.
How to make your client survive this
The key is timeouts longer than leader election duration - don’t give up before the new leader is elected.

Configuration depends on your scenario
Scenario 1: Single broker failure (most common - one broker down, leader election)

Scenario 2: Rolling upgrade
- Rolling upgrades cause multiple successive leader elections (one per broker)
- Each broker restart: ~30s election + 1-2 min stabilization
- 3 brokers = 5+ minutes total
- Single failure = just one 30s election
Recommended approaches
- Set request timeouts longer than election time - Leader elections typically take 10-30 seconds, so timeouts should exceed this
- Configure infinite retries within a reasonable delivery timeout to handle transient failures automatically
- Implement exponential backoff between retries to avoid overwhelming the cluster during elections
- Keep consumer session timeouts generous to prevent unnecessary re-balances during leader changes
- Refresh metadata proactively to discover new leaders faster after elections complete
- Plan for rolling upgrades which can trigger multiple successive elections (budget 5+ minutes for a 3-broker cluster)
- Monitor retry metrics to understand how often your applications encounter these scenarios
- Avoid premature application restarts that would only worsen the situation during cluster maintenance
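As a hedged sketch of these settings (the exact values are assumptions, sized for ~30s elections and 5+ minute rolling upgrades, to adapt to your own timings), producer and consumer properties might look like this:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class LeaderElectionTolerantConfig {
    // Values below are illustrative assumptions, not prescribed defaults.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "45000");       // > 30s election time
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "300000");     // budget for a full rolling upgrade
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE)); // retry until delivery timeout
        p.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "500");           // back off between retries
        p.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "30000");         // refresh metadata to find new leaders
        return p;
    }

    static Properties consumerProps() {
        Properties c = new Properties();
        c.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");       // avoid re-balances during elections
        c.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "15000");    // roughly a third of the session timeout
        c.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "45000");
        return c;
    }
}
```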
How partition count affects testing
Single-partition topics: Leader election is fast (~5-10s)

Multi-partition topics: Leader election scales with partition count

Example: Topic with 30 partitions across 3 brokers
- Broker 1 fails → 10 partition leaders need election
- Each election: ~5-10s
- Some elections run in parallel (depends on controller load)
- Total: 15-30s
Validation checklist
Test 1: Single broker failure (30s)
- Producer retries 5-10 times over 30s
- No TimeoutException thrown
- All messages delivered after election completes
- No application restarts needed
- Metric record-retry-total increases
Test 2: Rolling upgrade (5 minutes)
- Enable Interceptor for 5 minutes with rateInPercent=33 (simulates 1 of 3 brokers down)
- Producer survives all elections
- Consumer doesn’t rebalance unnecessarily
- Zero message loss
- Application runs continuously
Common misconfiguration: request.timeout.ms=5000 (too short)
- Symptom: TimeoutException: Topic not present in metadata after 5000ms
- Fix: Increase to 45000ms (> 30s election time)
Broken brokers
What it simulates
Complete or partial broker failures that return errors to clients. This happens in production due to:
- Broker crashes or restarts
- Network partitions
- Disk failures on brokers
- Out of memory on brokers
Confluent Schema Registry users: treat “Invalid schema ID” as essential. Schema Registry downtime is common during upgrades and can completely stop consumption if not handled properly.

AWS Glue Schema Registry: this Interceptor does NOT work with AWS Glue (different wire format). To test AWS Glue scenarios, use the message corruption Interceptor instead.
Gateway returns errors such as:
- NOT_ENOUGH_REPLICAS - Not enough replicas available (with acks=all)
- RECORD_LIST_TOO_LARGE - Batch exceeds broker limit
- UNKNOWN_SERVER_ERROR - Generic broker failure
- OFFSET_OUT_OF_RANGE - Consumer offset no longer valid
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| rateInPercent | integer | - | Required. Percentage of requests that will return errors. Adapt this value to simulate a partial failure scenario. |
| errorMap | map | {"FETCH": "UNKNOWN_SERVER_ERROR", "PRODUCE": "CORRUPT_MESSAGE"} | Errors returned on consume and produce. |
Common PRODUCE errors:
- NOT_ENOUGH_REPLICAS - Most common with acks=all
- RECORD_LIST_TOO_LARGE - Batch size issue
- CORRUPT_MESSAGE - Message validation failed
- UNKNOWN_SERVER_ERROR - Generic failure
- See all PRODUCE errors in Kafka source code

Common FETCH errors:
- OFFSET_OUT_OF_RANGE - Consumer offset expired
- UNKNOWN_SERVER_ERROR - Generic failure
- NOT_LEADER_OR_FOLLOWER - Broker not leader
- See all FETCH errors in Kafka source code
Gateway validates error configurations at deployment time. If you specify an error that can’t occur for a given ApiKey (for example, NOT_ENOUGH_REPLICAS for FETCH), the Interceptor deployment will fail with a clear error message.
How to make your client survive this
The key is infinite retries with appropriate timeouts - most broker errors are temporary.

Recommended approaches
- Enable aggressive retry behavior to handle temporary broker failures automatically
- Set generous delivery timeouts that give the cluster time to recover (typically 2+ minutes)
- Use exponential backoff to avoid flooding struggling brokers with retry attempts
- Request acknowledgment from all replicas for maximum durability, which exposes replica availability issues
- Enable producer idempotence to safely retry without creating duplicates
- Allocate sufficient buffer memory to queue messages during temporary broker outages
- Configure consumer session timeouts to tolerate broker issues without triggering re-balances
- Balance request timeouts between being patient and not consuming the entire delivery window
- Handle permanent failures gracefully (like oversized messages) with alternative processing paths
- Distinguish between re-triable and non re-triable errors in your application logic
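A minimal, hedged sketch of a producer configured along these lines (the values, bootstrap address and error-handling callback are assumptions to adapt):

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.common.errors.RetriableException;

import java.util.Properties;

public class BrokenBrokerTolerantProducer {
    static KafkaProducer<String, String> create() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969"); // assumption: Gateway address
        p.put(ProducerConfig.ACKS_CONFIG, "all");                          // expose replica availability issues
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");           // safe retries, no duplicates
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");        // give the cluster 2 minutes to recover
        p.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "500");
        p.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");            // 64 MB to queue during outages
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(p);
    }

    static void send(KafkaProducer<String, String> producer, String topic, String value) {
        producer.send(new ProducerRecord<>(topic, value), (metadata, exception) -> {
            if (exception == null) {
                return;                                   // delivered (possibly after several retries)
            }
            if (exception instanceof RecordTooLargeException) {
                // Permanent failure: route to an alternative path instead of retrying
                System.err.println("Oversized message rejected: " + exception.getMessage());
            } else if (exception instanceof RetriableException) {
                // Retriable errors were already retried internally until delivery.timeout.ms expired
                System.err.println("Delivery timeout exceeded after retries: " + exception.getMessage());
            } else {
                System.err.println("Non-retriable failure: " + exception.getMessage());
            }
        });
    }
}
```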
Why acks=all is recommended
acks=all (aka acks=-1) provides:
- Maximum durability (waits for all in-sync replicas)
- Exposes replica availability issues that acks=1 would hide
- Required for exactly-once semantics (when combined with idempotence)

When replicas are unavailable, acks=1 would succeed (writes to leader only) while acks=all fails fast, exposing the replica issue:
- Your application can retry until replicas recover
- No silent data loss from under-replicated topics
Validation checklist
Test NOT_ENOUGH_REPLICAS (most common):
- Producer retries automatically
- No messages lost after 30-60s recovery
- Metric record-retry-total increases
- No application restarts
- Producer eventually succeeds
Test RECORD_LIST_TOO_LARGE (permanent error):
- Producer logs RecordTooLargeException
- Application handles gracefully (alternative topic or rejection)
- Alert triggered for oversized messages
Invalid schema ID
What it simulates
Schema registry is unavailable or returns errors, causing deserialization failures. This happens in production due to:
- Schema registry downtime
- Wrong schema registry URL configured
- Schema deleted from registry
- Network issues between consumer and schema registry
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateInvalidSchemaIdPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| topic | string | .* | Topics that match this regex will have the Interceptor applied. |
| invalidSchemaId | integer | Random (per message) | Invalid schema ID to use. If not specified, generates different random IDs for each message, simulating a misconfigured schema mapping. |
| target | enum | CONSUME | When to inject error, values: PRODUCE or CONSUME. |
This Interceptor only affects messages using the Confluent Schema Registry wire format (magic byte 0x0 + 4-byte schema ID).

Not supported:
- AWS Glue Schema Registry (different wire format with UUID)
- Plain messages without schema registry encoding
If invalidSchemaId is not specified, the Interceptor generates a different random schema ID for each message, simulating misconfigured schema mappings.
How to make your client survive this
The key is treating schema errors like message corruption - handle gracefully and continue processing.

Client configuration

Consumer properties and error handling are sketched after the list below.

Recommended approaches
- Handle deserialization exceptions gracefully - Catch SerializationException, send to DLQ, continue
- Set up dead letter queue - Capture messages with schema errors for investigation
- Monitor schema registry health - Periodic health checks and alerts
- Enable schema caching - Helps with registry outages (but not this specific test)
- Test both scenarios - Invalid schema IDs (this test) AND registry outages (stop registry service)
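One hedged way to implement the first two points (class names, topics and the DLQ producer here are illustrative assumptions) is to consume raw bytes and deserialize in application code, so a registry lookup failure affects only the record it belongs to:

```java
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.SerializationException;

public class SchemaErrorTolerantHandler {
    private final KafkaAvroDeserializer deserializer;        // configured with the registry URL elsewhere
    private final KafkaProducer<byte[], byte[]> dlqProducer; // assumption: a DLQ producer exists

    SchemaErrorTolerantHandler(KafkaAvroDeserializer deserializer, KafkaProducer<byte[], byte[]> dlqProducer) {
        this.deserializer = deserializer;
        this.dlqProducer = dlqProducer;
    }

    // The consumer is configured with ByteArrayDeserializer, so poll() never throws on bad schema IDs;
    // deserialization happens here, per record.
    void handle(ConsumerRecord<byte[], byte[]> record) {
        try {
            Object value = deserializer.deserialize(record.topic(), record.value());
            // process(value) would go here
        } catch (SerializationException e) {
            // Invalid or unknown schema ID: park the raw bytes for investigation and keep going
            dlqProducer.send(new ProducerRecord<>("schema-errors-dlq", record.key(), record.value()));
        }
    }
}
```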
What this Interceptor tests: handling of messages with invalid schema IDs (deserialization errors from schema registry lookup failures).

What this doesn’t test: Schema Registry outages. To test registry outages, stop the Schema Registry service separately and verify your consumer continues with cached schemas.

Schema caching helps survive registry outages but doesn’t help with invalid schema IDs.
Validation checklist
- Consumer receives schema registry error: Schema 9999 not found; error code: 40403
- Messages with invalid schemas sent to DLQ
- Consumer continues processing other messages
- Alert triggered for schema registry issues
- No consumer lag buildup
- Test with invalidSchemaId null (random IDs per message)
- Verify schema cache is working (disconnect registry, cached schemas still work)
Slow brokers
What it simulates
Brokers respond slowly to produce and fetch requests, simulating:
- Broker under heavy load
- Disk I/O issues
- Network congestion
- GC pauses on brokers
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateSlowBrokerPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| rateInPercent | integer | - | Required. Percentage of requests that will be slowed down. Adapt this value to simulate a partially slow environment. |
| minLatencyMs | integer | - | Required. Minimum latency in milliseconds. Must be ≤ maxLatencyMs. |
| maxLatencyMs | integer | - | Required. Maximum latency in milliseconds. |
How to make your client survive this
The key is timeouts longer than expected slowness - don’t time out while the broker is processing.

Recommended approaches
- Set request timeouts longer than expected broker response times to avoid premature failures
- Configure generous delivery timeouts that tolerate multiple slow responses with retries
- Batch messages when possible to reduce the number of round trips to slow brokers
- Balance fetch wait times between responsiveness and broker load
- Monitor broker performance metrics to understand actual latency patterns
- Consider circuit breaker patterns if slowness persists beyond acceptable thresholds
- Test timeout values under realistic load conditions before production deployment
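For example, if the Interceptor is configured with a maximum injected latency of 5000ms (an assumption matching the checklist below), client timeouts might be sized like this - a hedged sketch, not definitive values:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class SlowBrokerTolerantConfig {
    static final int MAX_INJECTED_LATENCY_MS = 5000; // assumption: matches maxLatencyMs in the test

    static Properties producerProps() {
        Properties p = new Properties();
        // Leave headroom above the worst-case broker response time
        p.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, Integer.toString(MAX_INJECTED_LATENCY_MS * 4));
        // Allow several slow attempts before giving up on a record
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
        // Batch more per request to reduce round trips to the slow broker
        p.put(ProducerConfig.LINGER_MS_CONFIG, "20");
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");
        return p;
    }

    static Properties consumerProps() {
        Properties c = new Properties();
        c.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, Integer.toString(MAX_INJECTED_LATENCY_MS * 4));
        c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500"); // balance responsiveness vs broker load
        return c;
    }
}
```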
Validation checklist
- Producer handles slow responses without timeouts
- Consumer continues fetching despite latency
- Application throughput degrades gracefully (not crashes)
- Timeout values accommodate the maximum latency (5000ms in example)
- Monitor latency metrics (p50, p99, max)
Slow producers and consumers
What it simulates
Specific topics experience slowness while others remain fast. This simulates:
- Hot partitions with high load
- Topic-specific issues
- Consumer group lag on certain topics
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateSlowProducersConsumersPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| topic | string | .* | Topics that match this regex will have latency applied. |
| rateInPercent | integer | - | Required. Percentage of requests that will be slowed down. Adapt this value to simulate a partially slow environment. |
| minLatencyMs | integer | - | Required. Minimum latency in milliseconds. Must be ≤ maxLatencyMs. |
| maxLatencyMs | integer | - | Required. Maximum latency in milliseconds. |
How to make your client survive this
Use similar approaches as for slow brokers, with additional focus on topic isolation.

Recommended approaches
- Apply the same timeout strategies as for slow brokers
- Verify topic isolation - ensure slowness on one topic doesn’t cascade to others
- Monitor per-topic metrics to identify which topics are experiencing issues
- Consider separate consumer groups for critical vs non-critical topics
- Implement backpressure mechanisms for topics under heavy load
- Test cross-topic independence to ensure your architecture handles partial degradation
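As a hedged sketch of topic isolation (group IDs, topic names, the bootstrap address and thread handling are assumptions), running a dedicated consumer per criticality tier keeps a slow topic from stalling the others:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TopicIsolationSketch {
    public static void main(String[] args) {
        // Each tier gets its own consumer, group and thread, so a slow poll loop on
        // "analytics-events" never delays "payments".
        startConsumer("payments-critical", List.of("payments"));
        startConsumer("analytics-batch", List.of("analytics-events"));
    }

    static void startConsumer(String groupId, List<String> topics) {
        Thread t = new Thread(() -> {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969"); // assumption: Gateway address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100"); // cap per-batch work as simple backpressure
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(topics);
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(record -> { /* process(record) */ });
                }
            }
        }, "consumer-" + groupId);
        t.start();
    }
}
```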
Validation checklist
- Only matching topics experience latency
- Non-matching topics remain fast
- Application handles per-topic slowness gracefully
- Consumer groups isolated properly (one slow topic doesn’t block others)
- Per-topic metrics show latency differences
Latency on all interactions
What it simulates
General network latency affecting all Kafka operations, simulating:
- Geographic distance to Kafka cluster
- Network congestion
- VPN overhead
- Cloud region latency
Configure the Interceptor
Class: io.conduktor.gateway.interceptor.chaos.SimulateLatencyPlugin
| Key | Type | Default | Description |
|---|---|---|---|
| appliedPercentage | integer | - | Required. Percentage of requests affected (0-100). |
| latencyMs | long | - | Required. Milliseconds of latency to add (0 to 2,147,483,647). |
How to make your client survive this
Use the same approaches as for slow brokers - ensure timeouts accommodate the expected latency.

Recommended approaches
- Configure timeouts based on your network environment (geographic distance, VPN overhead, cloud regions)
- Test with realistic latency values that match your production deployment
- Consider the impact on end-to-end latency for your application’s SLAs
- Monitor network metrics to understand actual latency distributions
- Design for degraded performance rather than complete failure under high latency
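As a hedged example of sizing timeouts and batching from an assumed round-trip time (the 80ms cross-region RTT and all values below are illustrative assumptions, not measured figures):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class CrossRegionLatencyConfig {
    static final int ASSUMED_RTT_MS = 80;    // illustrative cross-region round trip
    static final int LATENCY_SPIKE_MS = 500; // occasional spike the checklist below asks you to tolerate

    static Properties producerProps() {
        Properties p = new Properties();
        // Give every request room for the RTT, a spike and broker processing time
        p.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG,
                Integer.toString(30000 + ASSUMED_RTT_MS + LATENCY_SPIKE_MS));
        // Fewer, larger, compressed requests amortize the per-round-trip cost
        p.put(ProducerConfig.LINGER_MS_CONFIG, "50");
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072");
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }

    static Properties consumerProps() {
        Properties c = new Properties();
        // Fetch bigger chunks per round trip instead of many small fetches
        c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "65536");
        c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
        return c;
    }
}
```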
Testing geographic distance
Scenario: Application in a different AWS region than the Kafka cluster. Use Latency on all interactions to simulate cross-region latency.

Validation checklist
- Producer throughput doesn’t drop below acceptable threshold
- Consumer lag remains stable despite latency
- No timeouts during normal operation
- Application handles occasional latency spikes (>500ms)
- End-to-end latency meets SLA requirements
- Test with realistic geographic latency values
Advanced scenarios
Testing consumer re-balances
While there’s no dedicated chaos Interceptor for re-balances, you can trigger them by combining Interceptors.

Rebalance trigger 1: Session timeout via slow consumer

Rebalance trigger 2: Leader election (indirect)

What to validate:
- Consumer resumes processing after rebalance completes
- No duplicate processing (offsets committed correctly before rebalance)
- Application doesn’t crash during rebalance
- Metrics show rebalance duration (consumer group lag spike)
- Other consumers in group pick up partitions
- Rebalance completes within expected time (typically 5-30s)
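To help with the "no duplicate processing" check above, here is a hedged sketch of committing offsets before partitions are revoked (the listener wiring and topic name are assumptions about your consumer code, not something the Interceptors require):

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.List;

public class CommitOnRebalanceSketch {
    static void subscribe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Commit what has been processed so far, so the next owner does not re-process it
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Nothing to do: committed offsets are picked up automatically
            }
        });
    }
}
```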
Testing transactional producers
Scenario: Transaction coordinator failure during commit

Use the broken brokers Interceptor targeting PRODUCE with transaction-specific errors.

What happens:
- Producer gets INVALID_PRODUCER_EPOCH during commitTransaction()
- This is a FATAL error - producer throws ProducerFencedException (not re-triable)
- Producer cannot continue - must close and re-initialize with the same transactional.id
- Application must retry the entire read-process-write transaction

What to validate:
- Application catches ProducerFencedException
- Application closes the fenced producer
- Application creates a new producer with the same transactional.id (gets a new epoch)
- Application retries the entire read-process-write cycle
- Consumer sees no partial transactions (isolation.level=read_committed)
Other transaction errors to test:
- INVALID_TXN_STATE - Transaction state machine violation
- TRANSACTIONAL_ID_AUTHORIZATION_FAILED - ACL issue
- CONCURRENT_TRANSACTIONS - Transactional ID reused
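A hedged sketch of the recover-and-retry loop described above (the bootstrap address, topic names and commit cadence are illustrative assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;

import java.util.Properties;

public class FencedProducerRecovery {
    static final String TRANSACTIONAL_ID = "orders-processor-1"; // must stay stable across restarts

    static KafkaProducer<String, String> newTransactionalProducer() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6969"); // assumption: Gateway address
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, TRANSACTIONAL_ID);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(p);
        producer.initTransactions(); // gets a fresh epoch, fencing any zombie with the same id
        return producer;
    }

    static void writeWithRecovery(String key, String value) {
        KafkaProducer<String, String> producer = newTransactionalProducer();
        while (true) {
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("orders-enriched", key, value));
                producer.commitTransaction();
                break; // transaction committed
            } catch (ProducerFencedException fenced) {
                // Fatal for this producer instance: close it and re-initialize with the same id,
                // then retry the whole read-process-write cycle
                producer.close();
                producer = newTransactionalProducer();
            }
        }
        producer.close();
    }
}
```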
Quick reference: resilience strategies by scenario
Use this table to identify key resilience strategies for each chaos testing scenario:

| Scenario | Key producer strategies | Key consumer strategies |
|---|---|---|
| Duplicate messages | Enable idempotence; use transactions for exactly-once | Manual offset management; business-level deduplication |
| Message corruption | N/A | Per-record error handling; dead letter queue setup; test with compression enabled |
| Leader election | Long request timeouts (>30s); aggressive retries; metadata refresh | Long session timeouts (>30s); allow time for elections |
| Broken brokers | Infinite retries with backoff; long delivery timeouts; idempotence | Generous session timeouts; balanced request timeouts |
| Schema errors | N/A | Schema caching; graceful error handling; DLQ for failed messages |
| Slow brokers | Timeouts > expected latency; message batching | Timeouts > broker response time; optimized fetch settings |
| Cross-region | Higher timeouts; compression; larger batches | Higher session timeouts; optimized fetch settings |