A deep understanding of kafka (V) log storage
5.1 File directory layout

The log root directory contains four checkpoint files (cleaner-offset-checkpoint, log-start-offset-checkpoint, recovery-point-offset-checkpoint and replication-offset-checkpoint) as well as a meta.properties file.

A partition directory contains the following files: 0000xxx.log, 0000xxx.index and 0000xxx.timeindex. The base offset in each file name is a 64-bit long, zero-padded to a fixed width of 20 digits.

It may also contain temporary files with suffixes such as .deleted, .cleaned and .swap, as well as .snapshot files, .txnindex files and a leader-epoch-checkpoint file.

5.2 Evolution of Log Format

5.2.1 v0 version

Used before Kafka 0.10.0.

RECORD_OVERHEAD includes an offset (8B) and a message size (4B).

A record consists of:

crc32 (4B): CRC32 checksum covering the fields from magic onward.

magic (1B): message format version number, 0 for v0.

attributes (1B): message attributes. The lower 3 bits indicate the compression type: 0-NONE, 1-GZIP, 2-Snappy, 3-LZ4 (LZ4 was introduced in 0.9.x).

key length (4B): the length of the message key; -1 means the key is null.

key: optional.

value length (4B): the length of the actual message body; -1 means null.

value: the message body. It can be empty, e.g. for a tombstone message.
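As a rough illustration, the v0 layout above can be packed byte for byte with a short sketch. The helper names are hypothetical (the real broker code is Scala/Java); only the field layout follows the description above:

```python
import struct
import zlib
from typing import Optional

def pack_v0_record(key: Optional[bytes], value: Optional[bytes]) -> bytes:
    """Pack a v0 record: crc32(4B) magic(1B) attributes(1B)
    key length(4B) key, value length(4B) value; length -1 means null."""
    body = struct.pack(">bb", 0, 0)  # magic=0, attributes=0 (no compression)
    body += struct.pack(">i", -1 if key is None else len(key)) + (key or b"")
    body += struct.pack(">i", -1 if value is None else len(value)) + (value or b"")
    crc = zlib.crc32(body) & 0xFFFFFFFF  # CRC covers everything after the crc field
    return struct.pack(">I", crc) + body

def pack_v0_log_entry(offset: int, key, value) -> bytes:
    """Prepend the record overhead: offset(8B) + message size(4B)."""
    record = pack_v0_record(key, value)
    return struct.pack(">qi", offset, len(record)) + record
```

The minimum v0 record is therefore 4+1+1+4+4 = 14 bytes, and each entry in the log carries another 12 bytes of overhead on top of that.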

5.2.2 v1 version

Used from Kafka 0.10.0 up to, but not including, 0.11.0.

Compared with v0, v1 adds a timestamp (8B) field indicating the timestamp of the message.

The attributes byte gains a new use as well: bit 4 indicates the timestamp type, 0 for CreateTime and 1 for LogAppendTime.

The timestamp type is configured by the broker-side parameter log.message.timestamp.type; the default is CreateTime, meaning the timestamp set by the producer when the message was created is used.

5.2.3 Message compression

To preserve end-to-end compression, the broker-side parameter compression.type defaults to "producer", which means the compression codec used by the producer is retained. It can also be set to "gzip", "snappy" or "lz4".

Multiple inner messages are compressed together into the value field of an outer wrapper message, which improves the compression ratio.

5.2.4 Variable-length fields

Varints: the most significant bit (msb) of each byte is a continuation flag. It is 1 in every byte except the last, where it is 0; the remaining 7 bits carry the value, with the least significant group first.

To encode negative numbers efficiently, Varints uses ZigZag encoding: a sint32 value n is mapped to (n << 1) ^ (n >> 31), and a sint64 value to (n << 1) ^ (n >> 63).
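A minimal sketch of Varints plus ZigZag as described above (the function names are made up for illustration):

```python
def zigzag32(n: int) -> int:
    # ZigZag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3 ... small magnitudes stay small
    return (n << 1) ^ (n >> 31)

def encode_varint(u: int) -> bytes:
    # 7 value bits per byte, least significant group first;
    # msb is 1 on every byte except the last
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    u = shift = 0
    for b in data:
        u |= (b & 0x7F) << shift
        if not b & 0x80:
            return u
        shift += 7
    raise ValueError("truncated varint")
```

For example, 300 encodes as the two bytes 0xAC 0x02, and -1 zigzags to 1 so it needs only a single byte instead of a full-width negative encoding.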

5.2.5 v2 version

RecordBatch

A v2 RecordBatch header contains:

first offset:

length:

partition leader epoch:

magic: fixed at 2

attributes: two bytes. The lower 3 bits indicate the compression format, bit 4 the timestamp type, bit 5 whether the batch is transactional (0 = non-transactional, 1 = transactional), and bit 6 whether it is a control batch (0 = non-control, 1 = control).

first timestamp:

max timestamp:

producer id:

producer epoch:

first sequence:

records count:
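The attributes bit layout described above can be decoded with a small sketch (the dict keys are mine, not Kafka's):

```python
def parse_batch_attributes(attrs: int) -> dict:
    """Decode the 2-byte attributes field of a v2 RecordBatch."""
    compression_codecs = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}
    return {
        "compression": compression_codecs.get(attrs & 0x07, "unknown"),  # lower 3 bits
        "timestamp_type": "LogAppendTime" if attrs & 0x08 else "CreateTime",  # bit 4
        "transactional": bool(attrs & 0x10),  # bit 5
        "control": bool(attrs & 0x20),        # bit 6
    }
```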

A v2 record removes the per-record crc field, adds length (the total record length), timestamp delta, offset delta and headers, and deprecates the attributes field.

Record

A v2 record contains:

length:

attributes: deprecated, but still occupies 1B.

timestamp delta:

offset delta:

headers:

5.3 Log Index

Sparse index: every time a certain amount of data is written (specified by the broker parameter log.index.interval.bytes, default 4096B), one entry is added to the offset index file and to the time index file.
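A simplified model of this sparse-index rule (the real logic lives in the broker's log segment code; this only shows the accumulation idea):

```python
LOG_INDEX_INTERVAL_BYTES = 4096  # broker default per the text

def indexed_offsets(message_sizes, interval=LOG_INDEX_INTERVAL_BYTES):
    """Return the offsets that would receive an index entry: one entry is
    written once more than `interval` bytes have accumulated since the last."""
    entries, acc = [], 0
    for offset, size in enumerate(message_sizes):
        if acc > interval:
            entries.append(offset)
            acc = 0  # reset the byte counter after writing an entry
        acc += size
    return entries
```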

Log rolling (segmentation) conditions:

1. The current segment size exceeds the broker parameter log.segment.bytes; the default is 1073741824 (1GB).

2. The difference between the current system timestamp and the maximum message timestamp in the current segment exceeds log.roll.ms or log.roll.hours; ms takes priority over hours. The default is log.roll.hours=168 (7 days).

3. The size of the offset index file or timestamp index file exceeds the value configured by log.index.size.max.bytes; the default is 10485760 (10MB).

4. offset - baseOffset > Integer.MAX_VALUE, i.e. the relative offset no longer fits in 4 bytes.
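Under the stated defaults, the four conditions above can be sketched as a single predicate (the constants mirror the defaults in the text; the function name is hypothetical):

```python
LOG_SEGMENT_BYTES = 1073741824            # log.segment.bytes default, 1 GB
LOG_ROLL_MS = 168 * 3600 * 1000           # log.roll.hours = 168 (7 days), in ms
LOG_INDEX_SIZE_MAX_BYTES = 10485760       # log.index.size.max.bytes default, 10 MB
INT32_MAX = 2**31 - 1                     # Integer.MAX_VALUE

def should_roll(segment_bytes: int, max_msg_timestamp_ms: int, now_ms: int,
                index_bytes: int, next_offset: int, base_offset: int) -> bool:
    """Simplified check of the four segment-roll conditions."""
    return (segment_bytes > LOG_SEGMENT_BYTES
            or now_ms - max_msg_timestamp_ms > LOG_ROLL_MS
            or index_bytes > LOG_INDEX_SIZE_MAX_BYTES
            or next_offset - base_offset > INT32_MAX)
```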

5.3.1 Offset index

Each index entry occupies 8 bytes and has two parts: 1. relativeOffset (4B), the offset relative to the segment's base offset; 2. position (4B), the physical position in the log file.

Index files (including .timeindex, .snapshot, .txnindex, etc.) can be parsed with the kafka-dump-log.sh script, for example:

bin/kafka-dump-log.sh --files /tmp/kafka-logs/topic-id-0/00……00.index

If the broker parameter log.index.size.max.bytes is not a multiple of 8, it is automatically adjusted to a multiple of 8 internally.
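Given the fixed 8-byte entry layout, an offset index can be decoded and searched as sketched below. This is only an illustration: Kafka itself memory-maps the file and binary-searches it in place.

```python
import struct
from bisect import bisect_right

def decode_offset_index(data: bytes, base_offset: int):
    """Decode 8-byte entries: relativeOffset(4B) + position(4B), big-endian."""
    entries = []
    for i in range(0, len(data) - len(data) % 8, 8):
        rel, pos = struct.unpack(">II", data[i:i + 8])
        entries.append((base_offset + rel, pos))
    return entries

def lookup(entries, target_offset):
    """Largest indexed offset <= target; the log is then scanned from there."""
    i = bisect_right([o for o, _ in entries], target_offset) - 1
    return entries[i] if i >= 0 else None
```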

5.3.2 Timestamp index

Each index entry occupies 12 bytes and has two parts: 1. timestamp (8B), the maximum timestamp of the current log segment so far; 2. relativeOffset (4B), the relative offset of the message with that timestamp.

If the broker-side parameter log.index.size.max.bytes is not a multiple of 12, it is automatically adjusted to a multiple of 12 internally.

5.4 Log cleaning

The log cleanup policy can be configured down to the topic level.

5.4.1 Log deletion

The broker parameter log.cleanup.policy is set to delete (the default is delete).

The check period is set by the broker parameter log.retention.check.interval.ms = 300000 (the default, 5 minutes).

1. Based on time

Broker parameters log.retention.hours, log.retention.minutes and log.retention.ms; priority: ms > minutes > hours.

When a segment is deleted, its files are first renamed with a .deleted suffix; the actual removal is delayed according to file.delete.delay.ms (default 60000).

2. Based on the log size

The total log size limit is the broker parameter log.retention.bytes (the default is -1, meaning unlimited).

The log segment size is the broker parameter log.segment.bytes (the default is 1073741824, i.e. 1GB).

3. Based on the log start offset

Triggered by a DeleteRecordsRequest, which can be issued via:

1. deleteRecords() of KafkaAdminClient

2. the kafka-delete-records.sh script

5.4.2 Log compaction

The broker parameter log.cleanup.policy is set to compact, and log.cleaner.enable is set to true (the default is true).

5.5 disk storage

A related benchmark: on a RAID-5 array built from six ordinary 7200 r/min disks, linear (sequential) disk writes reach about 600MB/s while random writes manage only about 100KB/s; by comparison, random memory writes reach about 400MB/s and linear memory writes about 3.6GB/s.

5.5.1 Page cache

The Linux parameter vm.dirty_background_ratio specifies the percentage of dirty pages at which the background flush processes (pdflush/flush/kdmflush) are triggered; it is generally set below 10, and 0 is not recommended.

vm.dirty_ratio specifies the percentage of dirty pages at which the kernel forcibly flushes them to disk; while this happens, new I/O requests are blocked.

Kafka also provides synchronous flushing and periodic forced flushing (fsync), controlled by parameters such as log.flush.interval.messages and log.flush.interval.ms.

Kafka does not recommend using swap. The vm.swappiness parameter ranges from 0 (lower limit) to 100 (upper limit); the recommended setting is 1.

5.5.2 Disk I/O flow

In general there are four disk I/O paths:

1. The application uses the standard C library for I/O. Data flow: application buffer -> C library standard I/O buffer -> file system page cache -> disk, via the specific file system.

2. The application uses file I/O (system calls) directly. Data flow: application buffer -> file system page cache -> disk.

3. The application opens the file with O_DIRECT, bypassing the page cache and reading/writing the disk directly.

4. The application uses a tool like dd with the direct flag, bypassing the system cache and reading/writing the disk directly through the file system.

There are four I/O scheduling policies in Linux:

1. NOOP

2. CFQ

3. DEADLINE

4. ANTICIPATORY

Zero copy

Zero copy means data is copied directly from the disk file to the network card device, without passing through the application.

On Linux it relies on the underlying sendfile() system call.

In Java, FileChannel.transferTo() is implemented on top of sendfile().
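The same zero-copy path is available from Python through os.sendfile(). A minimal sketch (the function name is mine), using a socket pair to stand in for a real network connection:

```python
import os
import socket

def zero_copy_send(path: str, sock: socket.socket) -> int:
    """Ship a file into a socket with sendfile(2): the kernel moves the data
    from the page cache straight to the socket buffer, never copying it
    into user space (FileChannel.transferTo() does the same on Linux)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # sendfile may transfer fewer bytes than requested; loop until done
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent
```

On platforms without sendfile(2), the higher-level socket.socket.sendfile() method transparently falls back to an ordinary read/send loop.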