Before Moore's Law began to falter, processor performance could be improved through techniques such as frequency scaling and hardware hyper-threading, which was enough to meet application demands. As clock frequencies approached physical limits, Moore's Law gradually broke down, and multi-core integration became the mainstream way to improve processor performance. Single-core processors are now rare on the market, which is evidence of this trend. To fully exploit the rich computing resources of multiple cores, multi-core parallel programming is inevitable, and the Linux kernel is a typical multi-core parallel programming scenario. But multi-core parallel programming brings many challenges.
Challenges of Multi-core Parallel Programming
Mainstream computers today all use the von Neumann architecture, that is, a shared-memory computing model, which is not inherently friendly to parallel computing. The following figure shows a typical computer hardware architecture.
In this architecture, there are the following design features:
Multiple CPU cores increase the processor's computing power;
Multi-level caches improve the efficiency of CPU access to main memory;
Each CPU has local memory (NUMA), which further improves the efficiency of access to main memory;
The store buffer hides the write stalls caused by waiting for cache write responses;
The invalidate queue reduces invalidation-response latency: the CPU sends the acknowledgement immediately after queuing the invalidate message;
Peripheral DMA supports direct access to main memory, improving CPU efficiency;
These hardware design features also introduce many problems, the biggest of which are cache coherence and out-of-order execution.
Cache coherence is solved by the MESI cache-coherence protocol, which is implemented in hardware and transparent to software. MESI guarantees that all CPUs observe modifications to a single variable in a single cache line in the same order, but it cannot guarantee that modifications to different variables appear in the same order on all CPUs. This is one source of reordering, and there are many others:
Processing delays in the store buffer cause reordering;
Processing delays in the invalidate queue cause reordering;
Compiler optimizations cause reordering;
CPU hardware optimizations such as branch prediction and deep pipelining cause reordering;
Peripheral DMA can cause data reordering;
As a result, even a simple ++ operation cannot be guaranteed to be atomic. These problems must be solved with dedicated multi-core parallel programming techniques.
Key Technologies of Multi-core Parallel Programming
Locking technology
The Linux kernel provides a variety of locking mechanisms: spinlocks, semaphores, mutexes, read-write locks, sequence locks and so on. A brief comparison of these locks follows; implementation and usage details are not expanded here, and can be found in the relevant chapters of the book Linux Kernel Development.
Spinlock: never sleeps, so there is no process context-switch overhead; suitable for interrupt context and short critical sections;
Semaphore: may sleep, and supports multiple concurrent entities inside the critical section at once; suitable where sleeping is possible or the critical section is long;
Mutex: similar to a semaphore, but allows only one concurrent entity inside the critical section at a time;
Read-write lock: supports concurrent reads, with write/write and read/write mutually exclusive; reads delay writes, so it favors readers and suits read-mostly workloads;
Sequence lock: supports concurrent reads, with write/write mutual exclusion and readers retrying when a write intervenes; writes delay reads, so it favors writers and suits write-mostly workloads;
Although locking can effectively protect against races under parallel execution, locks scale poorly and cannot fully exploit the performance advantages of multiple cores. Lock granularity that is too coarse limits scalability, while granularity that is too fine incurs heavy overhead, complicates the design, and makes deadlock easy. Besides poor parallel scalability and deadlock, locks introduce many other problems, such as the thundering herd, livelock, starvation, unfair locking, and priority inversion. Fortunately, there are techniques and guiding principles that solve or mitigate these risks:
Acquire locks in a single global order (a lock hierarchy) to prevent deadlock;
Use exponential backoff to mitigate livelock and starvation;
Use range locks (tree locks) to mitigate the thundering-herd problem;
Use priority inheritance to solve priority inversion;
Atomic technology
Atomic techniques mainly address the inconsistency between cache and memory, and the way out-of-order execution breaks atomic access. The main atomic primitives in the Linux kernel are:
ACCESS_ONCE()/READ_ONCE()/WRITE_ONCE(): forbid the compiler from optimizing the access (caching it in a register, merging, or tearing it), forcing a real load or store each time;
barrier(): compiler memory barrier; restricts out-of-order optimization by the compiler only;
smp_wmb(): write memory barrier; flushes the store buffer and restricts write reordering by both compiler and CPU;
smp_rmb(): read memory barrier; processes the invalidate queue and restricts read reordering by both compiler and CPU;
smp_mb(): full read-write memory barrier; flushes both the store buffer and the invalidate queue, restricting all reordering by compiler and CPU;
atomic_inc()/atomic_read(), etc.: integer atomic operations;
Strictly speaking, as system software the Linux kernel's implementation is strongly shaped by the hardware, and different hardware has different memory models. Therefore, unlike in high-level languages, there is no single unified semantic model behind the kernel's atomic primitives. For example, the barriers compile differently per architecture: on x86, smp_rmb() and smp_wmb() reduce to mere compiler barriers thanks to the strongly ordered (TSO) memory model, while on ARM64 each emits a distinct dmb instruction.
In addition, it is worth mentioning that atomic_inc() must gain exclusive ownership of the cache line to guarantee atomicity, and cache-line propagation across cores is quite time-consuming, so its parallel scalability on multi-core systems is poor.
Lock-free technology
The atomic techniques above are one form of lock-free technology; others include RCU, hazard pointers, and so on. It is worth noting that all of these lock-free techniques are built on memory barriers.
Hazard pointers are mainly used for object lifetime management; they resemble reference counting but have better parallel scalability.
RCU suits many scenarios: it can replace read-write locks, reference counting, garbage collection, and wait-for-completion mechanisms, with better parallel scalability. But RCU also has scenarios it does not fit, such as write-mostly workloads, readers that cannot tolerate stale data, and code that sleeps inside the read-side critical section.
However, all lock-free primitives only solve parallel scalability on the read side; write-side scalability can only be solved by data partitioning.
Data partitioning technology
Partitioning data structures and reducing shared data is the fundamental way to achieve parallel scalability. Partition-friendly (i.e. parallel-friendly) data structures include:
Array
Hash table
Radix tree / sparse array
Skip list
Using these easily partitioned data structures helps us improve parallel scalability through data partitioning.
Besides choosing appropriate data structures, sound partitioning criteria also matter:
Read/write partitioning: separate read-mostly data from write-mostly data;
Path partitioning: partition data along independent code execution paths;
Dedicated partitioning: bind frequently updated data to a specific CPU/thread;
Ownership partitioning: split the data structure per CPU/thread into per-CPU/per-thread data;
Of the four partitioning rules, ownership partitioning is the most thorough.
These multi-core parallel programming topics basically cover all the key techniques of concurrent programming in the Linux kernel. Of course, many other parallel programming techniques have not been applied to the Linux kernel, such as side-effect-free parallel functional programming (Erlang/Go, etc.), message passing, MapReduce, and so on.