The basic storage unit of Kafka is the partition. We can define where Kafka stores these partitions by setting the `log.dirs` parameter. Let's see how partitions are allocated between brokers.

Suppose we have 5 partitions under a topic, a replication factor of 3, and 5 brokers. In this scenario a good allocation will look like:

- Replicas should be allocated evenly among all brokers. With 15 replicas across 5 brokers, each broker should get 3 replicas.
- Each replica of a partition should be on a different broker. So in our example, if the leader replica of partition 1 is on broker 1, the remaining replicas should not be on the same broker (say, on brokers 2 and 3 instead). Otherwise we can't leverage the benefits of replication, and it leads to an unreliable system.
- If we have rack information, all replicas of a partition should not be stored on brokers that sit on the same rack. Otherwise, if that particular rack goes down, we can face an unavailability situation. Check out the allocation in the diagram:

![Partition allocation on different brokers with awareness of racks]()

Now that we know about partition allocation, let's look at how directories are decided for new partitions. It is quite simple: we look at all the directories, find the directory with the least number of partitions, and place the new partition there. This keeps the number of partitions balanced across all directories.

Kafka is not a permanent data store. You can set the maximum time data is retained (`retention.ms`) and you can define the maximum amount of data stored (`retention.bytes`); after one of these limits is exceeded, Kafka starts purging old messages systematically. Kafka stores a partition as a sequence of segments so that finding particular messages and deleting them is easy.

Once a segment is full, new messages produced by producers are written to a new segment. Because Kafka deletes whole segments at a time, it never deletes the active segment (the one data is currently being written to). Kafka stores messages, compressed if compression is applied, in segment files along with their offsets. If a producer sends all its messages in a compressed batch, Kafka adds a wrapper message over that batch, which gives better performance over the network and on disk. Keeping messages in the same format on disk also lets Kafka use the zero-copy optimisation: the CPU does not copy the message from one buffer to another on its way to the network.

There are times when we only want to store the current state of an application in Kafka; in that case we don't care about old states, and they can be deleted by compaction. For this, Kafka lets us set the cleanup policy of a topic to `compact`. Compaction requires key–value pairs, since we care about the latest value for a particular key; messages with a null key will not work.

Once compaction is enabled, the broker starts a compaction manager thread and some worker (cleaner) threads. Each worker thread chooses the partition with the highest dirty-to-clean ratio at that time; by default, compaction starts once 50% of a log's records are dirty. The cleaner starts compaction by building an in-memory offset map, which records the latest offset seen for each key. Once this offset map is ready, it reads all the messages. If it reads a message whose key is not in the map, this is the latest message for that key, so it is kept. If the key is already in the offset map and the message's offset doesn't match the map's offset, a newer message exists for that key in the partition, so the message is removed. After all of this, only the latest message per key remains, and the segment is swapped with the compacted one. Notice in the diagram that the 0–6 part is clean, and after compaction messages 4 and 5 are deleted. Like deletion, compaction never takes place on the active segment. We also don't want compaction to run too often, as it impacts performance.
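To illustrate the even-spread rule described above, here is a minimal Python sketch of round-robin replica assignment. This is a simplification (real Kafka also randomizes the starting broker and shifts follower placement), and `assign_replicas` is a hypothetical helper, not a Kafka API:

```python
def assign_replicas(num_partitions, num_brokers, replication_factor):
    """Round-robin sketch: leader of partition p on broker p mod N,
    followers on the next brokers in order."""
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [(p + r) % num_brokers
                         for r in range(replication_factor)]
    return assignment

# 5 partitions, 5 brokers, replication factor 3:
# every broker ends up holding exactly 3 replicas, and each
# replica of a partition lands on a different broker.
print(assign_replicas(5, 5, 3))
```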
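The rack rule can be sketched the same way: order the brokers by alternating racks, so that consecutive replicas drawn from that order land on different racks. Again a hedged sketch, not Kafka's actual implementation; `rack_alternating_order` is a hypothetical helper:

```python
def rack_alternating_order(brokers_by_rack):
    """Interleave brokers across racks: take one broker from each
    rack in turn until all are consumed."""
    order = []
    iters = [iter(bs) for bs in brokers_by_rack.values()]
    while iters:
        for it in list(iters):
            try:
                order.append(next(it))
            except StopIteration:
                iters.remove(it)
    return order

racks = {"rack1": [0, 1], "rack2": [2, 3], "rack3": [4, 5]}
print(rack_alternating_order(racks))  # -> [0, 2, 4, 1, 3, 5]
```

Assigning consecutive replicas from this list guarantees that the replicas of a partition span different racks whenever the replication factor is at most the number of racks.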
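The directory decision described above reduces to picking the entry of `log.dirs` with the fewest partitions. A minimal sketch, with illustrative paths and counts:

```python
def pick_log_dir(partition_counts):
    """partition_counts maps a log directory to the number of
    partitions it currently holds; pick the least-loaded one."""
    return min(partition_counts, key=partition_counts.get)

dirs = {"/data/kafka-1": 12, "/data/kafka-2": 9, "/data/kafka-3": 11}
print(pick_log_dir(dirs))  # -> /data/kafka-2
```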
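Segment rolling as described above can be sketched as: append to the active segment until it reaches a size limit, then start a new one. The tiny `max_segment_size` here stands in for Kafka's `segment.bytes` setting; the helper itself is hypothetical:

```python
def append(segments, record, max_segment_size=4):
    """Write to the last (active) segment; roll a new segment
    once the active one is full. Older, closed segments are the
    ones eligible for deletion or compaction."""
    if not segments or len(segments[-1]) >= max_segment_size:
        segments.append([])  # roll: previous segment is now closed
    segments[-1].append(record)
    return segments

segs = []
for r in range(10):
    append(segs, r)
print(segs)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```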
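The "highest dirty-to-clean ratio" selection the cleaner threads perform can be sketched as below; the byte counts are made up for illustration:

```python
def pick_dirtiest(logs):
    """logs maps a partition name to (clean_bytes, dirty_bytes);
    pick the partition whose dirty fraction is largest."""
    def dirty_ratio(p):
        clean, dirty = logs[p]
        return dirty / (clean + dirty)
    return max(logs, key=dirty_ratio)

logs = {"topic-0": (800, 200),   # 20% dirty
        "topic-1": (300, 700),   # 70% dirty
        "topic-2": (500, 500)}   # 50% dirty
print(pick_dirtiest(logs))  # -> topic-1
```

The 50% threshold mentioned above corresponds to Kafka's `min.cleanable.dirty.ratio` setting, whose default is 0.5.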
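The two passes of compaction described above — build an offset map of the newest offset per key, then keep only each key's newest record — can be sketched in a few lines. This is a toy model, not Kafka's cleaner:

```python
def compact(records):
    """records: list of (offset, key, value); higher offsets are newer."""
    latest = {}
    for offset, key, _ in records:
        latest[key] = offset  # in-memory offset map: key -> newest offset
    # Second pass: a record survives only if it is the newest for its key.
    return [(o, k, v) for o, k, v in records if latest[k] == o]

log = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (3, "c", 4), (4, "b", 5)]
print(compact(log))  # -> [(2, 'a', 3), (3, 'c', 4), (4, 'b', 5)]
```

Note that compaction preserves offsets: the surviving records keep their original offsets, so the log simply has gaps where older duplicates were removed.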