Kafka index

The index file stores the offsets and physical positions of messages in the log file.

If you need to read the message at offset 1, you first search for it in the index file and figure out that the message is in position 79. Then you directly go to position 79 in the log file and start reading. This makes it quite effective as you can use binary search to quickly get to the correct offset in the already sorted index file.

Internally, Kafka uses a sparse index. Roughly speaking, each index entry covers a block of about 512 KB of log data, so an offset lookup first narrows the search to a 512 KB block, and the next step of the search happens inside that block. A dense index would be even simpler, pointing directly at the target record. (Because messages within a segment are written sequentially, once the sparse index has located the 512 KB block for an offset, the message actually wanted can still be found inside it.)

(figure: sparse index)

In the sparse index, the offsets on the left correspond to the messages on the right; here one index entry covers three messages (in a real setup the covered block might be 512 KB). How is the message at offset 5 located? Binary-search the index to find the entry for offset 3, then follow it to the concrete block on the right, i.e. offsets 3 to 5. That block must contain the record with offset 5, and finding it within this limited range of data is very fast.
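To make the lookup concrete, here is a minimal sketch (not Kafka's actual code), assuming the index is an in-memory, offset-sorted list of (offset, file position) pairs; the IndexEntry type and method names are made up for illustration only:

import java.util.List;

class SparseIndexLookup {
    // Illustrative index entry: message offset -> byte position in the log file.
    record IndexEntry(long offset, long filePosition) {}

    // Binary-search the offset-sorted sparse index for the last entry whose
    // offset is <= targetOffset; the log file is then scanned forward from
    // the returned position until the wanted offset is reached.
    static long floorPosition(List<IndexEntry> index, long targetOffset) {
        int lo = 0, hi = index.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).offset() <= targetOffset) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return index.get(best).filePosition();
    }

    public static void main(String[] args) {
        List<IndexEntry> index = List.of(new IndexEntry(0, 0), new IndexEntry(3, 79), new IndexEntry(6, 161));
        System.out.println(floorPosition(index, 5)); // prints 79: start scanning the log from there
    }
}

The scan after the binary search stays cheap because it is bounded by the index interval (the ~512 KB block above).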

https://medium.com/r/?url=https%3A%2F%2Fwww2.cs.sfu.ca%2FCourseCentral%2F354%2Fzaiane%2Fmaterial%2Fnotes%2FChapter11%2Fnode5.html
https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f
https://stackoverflow.com/questions/19394669/why-do-index-files-exist-in-the-kafka-log-directory
https://tech.meituan.com/2015/01/13/kafka-fs-design-theory.html
https://blog.csdn.net/idber/article/details/8096941

Kafka notes to self

  1. Kafka uses the OS page cache to improve performance. The page cache is risky, though: if the machine crashes, the data in it is lost. Kafka relies on replicas to keep the data available.

  2. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
    For a topic, the number of replicas per partition cannot exceed the number of brokers in the cluster.
    The number of partitions may exceed the number of machines, but having more replicas than machines is pointless, since replication exists purely for fault tolerance.

  3. The partition count and replication factor of the built-in __consumer_offsets topic are configured in server.properties.
    For other topics, the partition count and replication factor can be set on the command line when the topic is created (see the sketch after this list).
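Besides the CLI, the same thing can be done programmatically. A hedged sketch with the Java AdminClient follows; the broker address, topic name, partition count and replication factor are made-up example values:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Creating a topic with an explicit partition count and replication factor.
public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2); // 3 partitions, replication factor 2
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}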

zero-copy

“Zero-copy” describes computer operations in which the CPU does not perform the task of copying data from one memory area to another. This is frequently used to save CPU cycles and memory bandwidth when transmitting a file over a network.[1]

https://en.wikipedia.org/wiki/Zero-copy

https://blog.csdn.net/u013256816/article/details/52589524

Kafka uses this technique when messages are consumed.
When a consumer fetches file data from the broker, the data is transferred directly channel-to-channel via the method below.

java.nio.channels.FileChannel.transferTo(
    long position,
    long count,
    WritableByteChannel target)

In other words, when the data source is a channel (a FileChannel) and the receiving end is also a channel (a SocketChannel), transferring data this way happens directly in kernel space, avoiding the extra copies and the repeated switches between kernel mode and user mode that copying through user space would incur.
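A minimal usage sketch of this channel-to-channel transfer (not Kafka's code): the file path, host and port below are placeholders, and a real sender would handle errors and backpressure more carefully.

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Send a file over a socket with transferTo(), looping because the call
// may transfer fewer bytes than requested.
public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("/tmp/segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0, size = file.size();
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}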

https://cloud.tencent.com/developer/article/1195251

Java input streams can support zero-copy through the java.nio.channels.FileChannel’s transferTo() method if the underlying operating system also supports zero copy.[5]

https://www.jianshu.com/p/7e5e558bf93e
http://tech.dianwoda.com/2019/01/02/linux-iohe-zero-copyzong-jie/

User space vs. kernel space

RAM is divided into two distinct regions: user space and kernel space.

User space: the set of memory locations where normal user processes run. These processes cannot access kernel space directly; some parts of kernel space can only be reached via system calls, which act as software interrupts into kernel space.

Kernel space: the kernel runs in a dedicated part of memory. Its role is to manage the applications/processes running in user space, and it can access all of memory. When a process performs a system call, a software interrupt is sent to the kernel, which then dispatches the appropriate interrupt handler.

https://www.quora.com/What-is-the-difference-between-user-space-and-the-kernel-space

User mode vs. kernel mode

Kernel Mode
In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.

User Mode
In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.

https://blog.codinghorror.com/understanding-user-and-kernel-mode/

Memory management

Logical address
Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80 × 86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement) that denotes the distance from the start of the segment to the actual address.

Linear address (also known as virtual address)
A single 32-bit unsigned integer that can be used to address up to 4 GB - that is, up to 4,294,967,296 memory cells. Linear addresses are usually represented in hexadecimal notation; their values range from 0x00000000 to 0xffffffff.

Physical address
Used to address memory cells in memory chips. They correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus. Physical addresses are represented as 32-bit or 36-bit unsigned integers.
The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit; subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address.
Modern operating systems generally use virtual memory management, which requires MMU support. The MMU is usually part of the CPU. If the processor has no MMU, or the MMU is disabled, the memory address issued by the CPU's execution unit goes straight out on the chip's pins and is received by the memory chips (physical memory); this is the physical address. If the MMU is enabled, the address issued by the execution unit is intercepted by the MMU: the address from CPU to MMU is the virtual address, and the MMU translates it into another address that is placed on the CPU's external address pins, i.e. it maps the virtual address to a physical address.
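As a toy illustration of the two-step translation above (segmentation unit, then paging unit), here is a deliberately simplified model; the segment base, page-table contents and 4 KB page size are made-up example values, not real hardware state:

import java.util.Map;

// Simplified model: logical (segment, offset) -> linear -> physical.
public class AddressTranslation {
    static final int PAGE_SIZE = 4096;

    // Segmentation unit: logical address (segment base + offset) -> linear address.
    static long toLinear(long segmentBase, long offset) {
        return segmentBase + offset;
    }

    // Paging unit: linear address -> physical address via a page table.
    static long toPhysical(long linear, Map<Long, Long> pageTable) {
        long virtualPage = linear / PAGE_SIZE;
        long pageOffset = linear % PAGE_SIZE;
        long physicalFrame = pageTable.get(virtualPage); // a real MMU would page-fault if absent
        return physicalFrame * PAGE_SIZE + pageOffset;
    }

    public static void main(String[] args) {
        long linear = toLinear(0x10000, 0x42);                    // logical -> linear
        long physical = toPhysical(linear, Map.of(0x10L, 0x80L)); // linear -> physical
        System.out.printf("linear=0x%x physical=0x%x%n", linear, physical);
    }
}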

I/O reads and writes

Random vs. sequential reads and writes

This question has to be split into cases: writing to a mechanical hard disk versus writing to a solid-state drive. The conclusion is the same in both cases - sequential writes are faster than random writes - but the reasons differ.
Start with the mechanical hard disk and how it stores data. Picture it as a record player: a spinning platter and a head that can move along the radius. To serve a read or write request, the drive first computes from the request's start address where the data lives on the disk, and then goes through the following steps:
1. The head moves along the radius until it reaches the cylinder (the set of tracks at the same radius) containing the data.
2. The platter spins at high speed until the head reaches the start of the data.
3. The head reads or writes the data along the track.
When each request touches only a little data, the overhead of steps 1 and 2 cannot be ignored, which puts random writes at a huge disadvantage against sequential writes: for sequential writes, steps 1 and 2 happen only once and the rest is the intrinsic cost of transferring data, whereas every random write repeats the first two steps and pays a large extra cost.
For example, if Kafka wrote each of its many messages at a random location, it would incur seek after seek; sequential writing is much simpler - say, accumulate about 1 MB of messages and append them to the commit log in one go.
Ordinary Java file access can be either sequential or random depending on which API you use; RandomAccessFile and the stream classes FileInputStream/FileOutputStream represent the two different styles.
As for Kafka specifically, it uses a log-structured data format: each partition log can only be appended to at the tail and never "jumps" to some position to start writing, which is how the writes stay sequential. Kafka has no need to scatter N messages across random positions in the file.
Now for solid-state drives. In theory there should be no obvious difference between random and sequential write speed, because an SSD is a randomly addressable storage chip with no seek or rotational overhead, yet in practice random writes are still slower than sequential writes. This comes from properties of the underlying flash; in short:
1. Flash does not support in-place updates: to update a piece of data you cannot modify it where it sits; you must write it to a fresh blank location and mark the old copy invalid.
2. Isn't the invalidated data wasted space? It can be erased, but the smallest erase unit on flash is a large block of roughly 128 KB-256 KB. An erase also hits still-valid data in the block, which must be moved elsewhere first.
It feels like writing an essay on grid paper, cell by cell, only into blank cells - yet when you want to erase something written earlier, you can only wipe whole rows at a time. Annoying and wasteful, right? So SSDs implement garbage-collection algorithms to use the space better while reducing data migration and protecting the flash's lifetime.
Random writes clearly cause more fragmentation than sequential writes, hence more garbage-collection and data-migration overhead, and so they end up slower.
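A small sketch of the two Java access patterns mentioned above (FileOutputStream in append mode versus RandomAccessFile with seeks); the file names, record size and loop counts are placeholders, and this only illustrates the API difference rather than serving as a benchmark:

import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.util.Random;

public class SequentialVsRandomWrite {
    public static void main(String[] args) throws Exception {
        byte[] record = new byte[1024];

        // Sequential: open in append mode and keep writing at the tail,
        // which is essentially what a log-structured store does.
        try (FileOutputStream out = new FileOutputStream("seq.log", true)) {
            for (int i = 0; i < 1000; i++) {
                out.write(record);
            }
        }

        // Random: seek to an arbitrary position before each write.
        Random rnd = new Random();
        try (RandomAccessFile raf = new RandomAccessFile("random.dat", "rw")) {
            raf.setLength(1024L * 1024);
            for (int i = 0; i < 1000; i++) {
                raf.seek(rnd.nextInt(1024 * 1023)); // jump somewhere in the file
                raf.write(record);
            }
        }
    }
}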

Disks

Physical concepts

Physical connectors
M.2, U.2, AIC and NGFF are physical connectors (form factors).
High-speed signal protocols
SAS, SATA and PCIe sit at the same layer: analog high-speed serial interfaces.
Transport-layer protocols
SCSI, ATA and NVMe belong to this layer; they mainly define command sets at the digital/logical level.
NVMe, AHCI and IDE are transfer protocols (the "language"); they run on top of transport interfaces such as PCIe or SATA (the "spoken/written medium").
The dominant interface for consumer SSDs today is SATA. Its command protocol, AHCI (which also supports IDE), was designed for seeking, rotating disks rather than flash. SATA III transfers at roughly 150 MB/s to 600 MB/s, which is plenty for most consumer SSD workloads.
PCIe (PCI Express) is replacing SATA as the newer high-bandwidth interface.
NVMe is the newest high-performance, optimized protocol; it replaces AHCI and runs over PCIe.
RAID 0: with n disks you would normally write one disk at a time, moving to the next only when it is full; with RAID 0 all n disks can be written concurrently, so throughput improves a lot, but with no redundancy the reliability is poor. n is at least 2.
RAID 1: precisely because RAID 0 is so unreliable, RAID 1 was derived from it. With n disks, n/2 of them serve as mirrors: whenever data is written to a disk, it is also written to its mirror. If a disk fails, the mirror takes over automatically; reliability is the best, but space efficiency is low. n is at least 2.
RAID 0 gives the highest performance; RAID 1 gives the highest reliability.
SSDs can also be used to build a RAID array: choose RAID 0 for performance, RAID 1 for reliability.
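As a toy illustration of how RAID 0 spreads data across disks, the following sketch maps a logical byte offset to a (disk, offset) pair using round-robin striping; the stripe size and disk count are made-up values, and a real controller does far more than this:

public class Raid0Striping {
    static final int STRIPE_SIZE = 64 * 1024; // 64 KB stripes (example value)
    static final int NUM_DISKS = 4;           // example disk count

    // Map a logical byte offset to {disk index, offset within that disk}.
    static long[] locate(long logicalOffset) {
        long stripe = logicalOffset / STRIPE_SIZE;
        long disk = stripe % NUM_DISKS;
        long offsetOnDisk = (stripe / NUM_DISKS) * STRIPE_SIZE + logicalOffset % STRIPE_SIZE;
        return new long[] { disk, offsetOnDisk };
    }

    public static void main(String[] args) {
        long[] loc = locate(3L * 64 * 1024 + 100); // byte 100 of the 4th stripe
        System.out.println("disk=" + loc[0] + " offset=" + loc[1]); // disk=3 offset=100
    }
}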

File system notes

Once a file system is established the block size remains the same. Some partitioning tools can change this after the fact, but not while the OS is running.

A sector has traditionally been a fixed 512-byte size, but a few drives have 4096-byte sectors.

A sector is the smallest individually reference-able region on a disk.

The block size refers to the allocation size the file system uses. The common options are 512, 1024, 2048, 4096, 8192, 16384, or 32768. Generally anything larger would be so inefficient nobody would use it, and you can't go smaller than 1 disk sector.

Sure you can write 10 bytes to a file, but behind the scenes it is allocated 1 block whether you use it all or not.

The operating system records a file's block numbers; when reading, a block number is mapped to the specific sectors. Blocks are allocated by the OS, and the OS (the file system) keeps the mapping between blocks and sectors.

So the OS reads in units of blocks. For example, with a 4 KB sector size the block size might be 4 KB, 8 KB or more; even if you only need a single byte, the read still pulls the whole 8 KB block into the OS cache.
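To see the file system's allocation block size from Java, java.nio.file.FileStore exposes getBlockSize() (Java 10+); the path below is a placeholder:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Prints the allocation block size of the file system backing a path.
public class BlockSizeProbe {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/tmp"); // placeholder path
        FileStore store = Files.getFileStore(path);
        System.out.println(store.type() + " block size = " + store.getBlockSize() + " bytes");
    }
}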

Formatting is an operating-system function. The OS can format and partition the whole drive as NTFS, but after partitioning you can also use Windows' built-in shrink-volume feature to carve out a new free region for installing Linux, and Linux then formats and partitions that free region as ext3.

References

https://medium.com/r/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FDisk_sector
https://medium.com/r/?url=http%3A%2F%2Fwww.mathcs.emory.edu%2F~cheung%2FCourses%2F377%2FSyllabus%2F1-files%2Fintro-disk.html
https://docstore.mik.ua/orelly/other/LRH/chp-2-sect-4.html
https://www.zhihu.com/question/48972075
Kafka storage: https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f