Skip to main content
Learn how log compaction retains only the latest value per key in 20 minutes Log compaction is Kafka’s alternative cleanup policy that keeps only the most recent value for each key, making it ideal for changelog-style topics. This guide covers both the theory and hands-on practice of log compaction. What you’ll learn:
  • The difference between delete and compact cleanup policies
  • How log compaction works internally
  • Log compaction guarantees and common misconceptions
  • How to configure and test log-compacted topics

Kafka log cleanup policies

Kafka stores messages for a set amount of time and purges messages older than the retention period. This expiration happens due to a policy called log.cleanup.policy. There are two cleanup policies:
PolicyDefault forBehavior
deleteUser topicsDeletes events older than retention time
compact__consumer_offsetsKeeps only the most recent value per key

Purpose of log cleanup

Kafka was not initially meant to keep data forever (although now some people are going in that direction with large disks or Tiered Storage), nor does it wait for all consumers to read a message before deleting it. By configuring a retention policy for each topic, it allows administrators to:
  • Control the size of the data on the disk and delete obsolete data
  • Limit maintenance work on the Kafka Cluster
  • Limit the amount of historical data a consumer may have to consume to catch up on the topic

Kafka log compaction theory

Kafka supports use cases by allowing the retention policy on a topic to be set to compact, with the property to only retain at least the most recent value for each key in the partition. It is very useful if we just require a SNAPSHOT instead of full history.

Log compaction example

We want to keep the most recent salary for our employees. We create a topic named employee-salary for the purpose. We don’t want to know about the old salaries of the employees. Kafka Compaction diagram showing how log compaction works in a real world example. Applications producing these events should contain both a key and a value. The key, in this case, will be employee id, and the value will be their salary. As the data comes in, it will be appended into segments of a partition. After compaction, a new segment is created with only the latest events for a key being retained. The older events for that key are deleted, the offset of the messages are kept intact.

Log compaction guarantees

There are some important guarantees that Kafka provides for messages produced on the log-compacted topics:
  • Tail consumers see all messages: Any consumer that is reading from the tail of a log, i.e., the most current data, will still see all the messages sent to the topic
  • Ordering preserved: Ordering of messages at the key level and partition level is kept, log compaction only removes some messages, but does not re-order them
  • Offsets are immutable: The offset of a message never changes. Offsets are just skipped if a message is missing
  • Deleted records visible briefly: Deleted records can still be seen by consumers for a period of log.cleaner.delete.retention.ms (default is 24 hours)

Log compaction myth busting

Let us look at some of the misconceptions around log compaction and clear them. What log compaction does NOT do:
  • It doesn’t prevent duplicate data: De-duplication is done after a segment is committed. Your consumers will still read from the tail as soon as data arrives
  • It doesn’t prevent reading duplicates: If a consumer re-starts, it may see duplicate data based on at-least-once semantics
You also can’t trigger log compaction using an API call—it happens in the background automatically if enabled.

How log compaction works

If compaction is enabled when Kafka starts, each broker will start a compaction manager thread and a number of compaction threads. These are responsible for performing the compaction tasks. Diagram showing the way Kafka Log Compaction is achieved using Cleaner Threads created by brokers when Log Compaction is enabled.
  1. Cleaner threads start with the oldest segment and check their contents. The active segments are left untouched
  2. If the message it has just read is still the latest for a key, it copies over the message to a replacement segment. Otherwise it omits the message
  3. Once the cleaner thread has copied over all the messages that still contain the latest value for their key, we swap the replacement segment for the original
  4. At the end of the process, we are left with one message per key - the one with the latest value
Ordering intactNew segments are created/combined from old segments after the cleaner thread has done its work. The offsets are left untouched and the key-based ordering as well.

Log compaction configurations

ConfigurationDefaultDescription
log.cleaner.enabletrueEnable/disable log compaction
log.cleaner.threads1Background threads for log cleaning
log.segment.ms7 daysMax time before closing active segment
log.segment.bytes1GBMax size of a segment
log.cleaner.delete.retention.ms24 hoursHow long tombstones are visible
log.cleaner.backoff.ms15 secondsSleep time when no logs to clean
min.cleanable.dirty.ratio0.5Minimum dirty ratio to trigger cleaning

Log compaction tombstones

Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. Such a record is sometimes referred to as a tombstone. This delete marker will cause any prior message with that key to be removed. Tombstones are themselves cleaned out of the log after delete.retention.ms to free up space.

Log compaction practice

Windows Users: you can’t use log compaction if you’re not using WSL2Windows has a long-standing bug (KAFKA-1194) that makes Kafka crash if you use log cleaning. The only way to recover from the crash is to manually delete the folders in log.dirs directory.

Create a log-compacted topic

Create a log-compacted topic named employee-salary with a single partition and a replication factor of 1:
kafka-topics --bootstrap-server localhost:9092 --create --topic employee-salary \
    --partitions 1 --replication-factor 1 \
    --config cleanup.policy=compact \
    --config min.cleanable.dirty.ratio=0.001 \
    --config segment.ms=5000
Configuration explanation:
  • cleanup.policy=compact: Enables log compaction
  • min.cleanable.dirty.ratio=0.001: Ensures log cleanup is triggered frequently (for testing)
  • segment.ms=5000: New segment every 5 seconds (compaction only happens on closed segments)
Describe the topic to verify:
kafka-topics --bootstrap-server localhost:9092 --describe --topic employee-salary

Produce messages with keys

Start a Kafka console producer with key parsing enabled:
kafka-console-producer --bootstrap-server localhost:9092 \
    --topic employee-salary \
    --property parse.key=true \
    --property key.separator=,
Produce messages with duplicate keys:
Patrick,salary: 10000
Lucy,salary: 20000
Bob,salary: 20000
Patrick,salary: 25000
Lucy,salary: 30000
Patrick,salary: 30000
Wait a minute, and produce a few more messages:
John,salary: 0

Consume and verify compaction

Start a consumer to read all messages:
kafka-console-consumer --bootstrap-server localhost:9092 \
    --topic employee-salary \
    --from-beginning \
    --property print.key=true \
    --property key.separator=,
After compaction completes, you’ll see only the unique keys with their latest values:
Bob,salary: 20000
Lucy,salary: 30000
Patrick,salary: 30000
John,salary: 0
Log compaction will take place in the background automatically. We cannot trigger it explicitly. However, we can control how often it is triggered with the log compaction properties.
See it in practice with ConduktorConduktor Console lets you browse compacted topics and see the latest value for each key. Monitor compaction progress and verify your cleanup policy configuration is working as expected.

Next steps