ARM7 uses a von Neumann architecture with a three-stage pipeline. ARM9 and ARM11 use Harvard architectures with longer pipelines (five stages on ARM9), so their performance is higher. ARM9 and ARM11 cores usually include a memory management unit, which suits them to running operating systems, while ARM7 is better suited to bare-metal applications.
The ARM9 family we usually encounter includes ARM9 and ARM9E. ARM9 belongs to the ARMv4T architecture, with ARM9TDMI and ARM922T as typical processors; ARM9E belongs to the ARMv5TE architecture, with typical processors such as ARM926EJ and ARM946E. Because the latter's chips and applications are far more widespread, when we say "ARM9" we usually mean the ARM9E series (mainly ARM926EJ and ARM946E), and the introduction below likewise focuses on ARM9E.
Pipeline differences between the ARM7 and ARM9E processors
For embedded system designers, hardware is usually the first consideration. In a processor, the pipeline is the most visible hardware difference, and different pipeline designs produce a whole series of further hardware differences. Let us compare the pipelines of ARM7 and ARM9E.
ARM9E extends ARM7's 3-stage pipeline to 5 stages. The longer pipeline accommodates more logic overall, but the work done at each stage becomes simpler. For example, a single stage of ARM7's three-stage pipeline must read the registers, perform the logic or arithmetic operation, and then write back the result, which makes that stage very complex. In ARM9E's five-stage pipeline, register read, ALU operation, and result write-back are spread across different stages, so each stage does very little work. This greatly raises the achievable clock frequency: each pipeline stage corresponds to one CPU clock cycle, so if the logic in any one stage is too complex, its propagation delay stays high, the clock period must be lengthened, and the CPU's frequency cannot rise. Lengthening the pipeline therefore helps raise the clock frequency. On comparable fabrication processes, ARM7 typically runs at around 100 MHz, while ARM9E runs at 200 MHz or more.
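The frequency argument above can be sketched numerically. In this toy model the clock period must cover the slowest pipeline stage, so splitting one long stage into several short ones raises the attainable frequency; all the stage delays below are invented purely for illustration, not measured ARM figures.

```c
#include <stdio.h>

/* Toy model: the clock period must be at least the delay of the
 * slowest pipeline stage, so f_max = 1 / max(stage_delay).
 * Stage delays (in ns) are hypothetical illustration values. */
static double max_freq_mhz(const double *stage_delay_ns, int stages) {
    double worst = 0.0;
    for (int i = 0; i < stages; i++)
        if (stage_delay_ns[i] > worst)
            worst = stage_delay_ns[i];
    return 1000.0 / worst;   /* period in ns -> frequency in MHz */
}
```

With a 10 ns worst stage, a 3-stage design caps at 100 MHz; splitting the work into five stages of at most 5 ns doubles that to 200 MHz, which mirrors the ~100 MHz vs ~200 MHz figures quoted above.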
Memory subsystem of the ARM9E processors
ARM926EJ and ARM946E are the two most common ARM9E processors; both include a memory subsystem that improves system performance and supports large operating systems. As shown in Fig. 2, the memory subsystem includes an MMU (memory management unit) or MPU (memory protection unit), caches, and a write buffer; the CPU connects to the system's memory through this subsystem.
Caches and the write buffer were introduced because the processor is much faster than memory access. If memory access becomes the system's bottleneck, a fast processor is wasted no matter how fast it is, because it spends most of its time waiting on memory. The cache addresses this problem: it holds recently used code and data and supplies them to the CPU as quickly as possible (the CPU accesses the cache without waiting).
Figure 2: The memory subsystem in a complex processor.
The MMU is a hardware unit that supports memory management and meets the requirements modern platform operating systems place on memory. It provides two main functions: virtual-to-physical address mapping, and protection of different memory address spaces. A simple example helps illustrate what the MMU does.
Under an operating system, developers write programs against the API and programming model the OS provides, and the OS usually exposes only a fixed memory address space to user programs. This raises a direct problem: all applications use the same memory address space, so if several of them run at the same time (which is very common in today's multitasking systems), their memory accesses would conflict. How does the operating system avoid this?
The operating system uses the MMU hardware to translate the virtual address of each memory access into a physical address. A virtual address is the logical address the programmer uses in a program; a physical address locates a real memory cell. The MMU can map the same virtual address to different physical addresses according to configurable rules. Even if several processes using the same virtual addresses run at once, the MMU maps them to different physical addresses, so no conflict occurs.
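The translation just described can be sketched as a toy, single-level page-table lookup. The layout below is hypothetical and chosen only for illustration (the real ARM926EJ MMU uses ARM's two-level translation-table format); the protection bit anticipates the access-attribute function discussed next.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical single-level page table with 4 KB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)

typedef struct {
    uint32_t phys_base;  /* physical base address of the page */
    int      valid;      /* is this mapping present? */
    int      user_ok;    /* may user mode access it? (protection bit) */
} pte_t;

/* Translate a virtual address to a physical one; sets *fault on a
 * missing mapping or a protection violation. */
static uint32_t translate(const pte_t *table, size_t nents,
                          uint32_t vaddr, int user_mode, int *fault) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;      /* virtual page number */
    *fault = 1;
    if (vpn >= nents || !table[vpn].valid)   /* no mapping installed */
        return 0;
    if (user_mode && !table[vpn].user_ok)    /* kernel-only page */
        return 0;
    *fault = 0;
    return table[vpn].phys_base | (vaddr & (PAGE_SIZE - 1));
}
```

Two processes can each be given a table mapping the same virtual page 0 to different `phys_base` values, which is exactly how identical virtual addresses coexist without clashing.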
Figure 3: Functions of the MMU.
In addition to address mapping, the MMU can assign different access attributes to different address spaces. For example, the operating system marks its own kernel address space as inaccessible in user mode, so user applications cannot touch it, which protects the kernel. The MPU differs from the MMU in that it only sets access attributes on address ranges; it has no address-mapping function.
Hardware units such as the cache and MMU bring many changes to the system programmer's working model. Beyond mastering the basic concepts and usage, the following points are both interesting and important for system optimization:
1. Real-time considerations
Because the page tables holding the address-mapping rules are very large, only a small, frequently used portion is cached inside the MMU, while most of the page-table contents stay in main memory. When a new mapping rule is needed, the MMU may have to read main memory to update its cached entries, and in some cases this costs the system its real-time behavior. For example, when a time-critical piece of code must run, if its address space is not covered by the MMU's currently cached entries, the MMU must first update its entries from the page table, then complete the address mapping, and only then access the target memory. The whole address-translation sequence is long, which hurts real-time response badly. In general, then, a system with an MMU and caches has worse real-time behavior than some simple processors; however, there are ways to improve it.
A simple approach is to switch off the MMU and cache when necessary, effectively reducing the chip to a simple processor, which immediately improves real-time behavior. Of course, this is often not feasible. ARM's MMU and cache designs therefore provide a locking feature: you can pin specified entries in the MMU so they are never replaced, and lock a region of code or data into the cache so it is never evicted. Programmers can use this feature for the code with the strictest real-time requirements, guaranteeing that it always gets the fastest response.
2. System software optimization
Many software optimization techniques in embedded development are generic and shared across platforms, and most of them also apply to the ARM9E architecture. If you are already an expert in ARM7 optimization, congratulations: the methods you have mastered carry over to the new ARM9E platform. But there are new characteristics to watch for, the most important being the role of the cache. The cache itself changes neither the programming model nor the interface, but examining its behavior shows that it has a large influence on software optimization.
Physically, the cache is a block of fast SRAM, and the ARM9E cache line is 8 words (i.e. 32 bytes). The cache's behavior is controlled by the system, not the programmer: the cache controller copies the contents near recently accessed memory addresses into the cache. When the CPU accesses the next memory location (whether an instruction fetch or a data access), its contents may already be in the cache, so the CPU does not need to read them from main memory at all; it reads them directly from the cache, speeding up the access. From this working principle it is clear that cache scheduling is probabilistic: a given access may hit in the cache, or it may not. On a miss, the access is actually slower than with no cache at all, because besides fetching the data from memory the CPU must also resolve the hit/miss check and refill the cache line. Only when the benefit of hits outweighs the penalty of misses does overall system performance improve, so the cache hit rate is a critical optimization metric.
From this caching behavior we can derive some intuitive ways to improve the hit rate. Keep a function's code and data together as much as possible to reduce the number of jumps, since jumps usually cause cache misses. Keep function sizes moderate, and avoid writing many tiny function bodies, because linear execution is the most cache-friendly. Finally, place loop bodies at cache-line-aligned (32-byte) addresses, so that a loop stays line-aligned in the cache and occupies the fewest possible cache lines; a loop executed many times then runs more efficiently.
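The effect of line-based caching can be illustrated with a toy direct-mapped cache model. The geometry below (64 lines, i.e. a 2 KB cache) is invented for illustration and is not the real ARM926EJ cache organization; only the 32-byte line size matches the text above.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_SHIFT 5     /* 32-byte cache line, as on ARM9E */
#define NUM_LINES  64    /* hypothetical 2 KB direct-mapped cache */

typedef struct {
    uint32_t tag[NUM_LINES];
    int      valid[NUM_LINES];
} cache_t;

/* Replay a sequence of byte addresses through the model and
 * return how many of them hit in the cache. */
static int count_hits(cache_t *c, const uint32_t *addr, size_t n) {
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addr[i] >> LINE_SHIFT;   /* line-granular address */
        uint32_t idx  = line % NUM_LINES;        /* direct-mapped slot */
        if (c->valid[idx] && c->tag[idx] == line) {
            hits++;                              /* hit: data already cached */
        } else {
            c->valid[idx] = 1;                   /* miss: refill the line */
            c->tag[idx]   = line;
        }
    }
    return hits;
}
```

With 32-byte lines, eight consecutive word accesses cost one miss followed by seven hits, which is why linear code and line-aligned loops are cache-friendly; accesses that stride by the cache size evict each other's lines and miss every time.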
Performance and efficiency improvements
ARM9E's performance improvement over ARM7 shows up not only in ARM9E's higher clock frequency and additional hardware features, but also in the execution efficiency of individual instructions. Execution efficiency can be measured in CPU clock cycles: running the same program, an ARM9E processor needs roughly 30% fewer clock cycles than an ARM7.
The efficiency gain comes mainly from ARM9E's improved load and store instructions. As we know, in a RISC-architecture processor roughly 30% of a program's instructions are loads and stores, so the efficiency of these instructions contributes most visibly to overall system efficiency. Two factors in ARM9E improve load/store efficiency:
1) The ARM9 core is a Harvard architecture with independent instruction and data buses, whereas the ARM7 core is a von Neumann architecture in which instructions and data share one bus.
2) ARM9's five-stage pipeline places memory access and register write-back in separate pipeline stages.
Together, these two factors make it possible to issue a load or store in every CPU clock cycle of the instruction stream. The table below compares load/store instructions on the ARM7 and ARM9 processors. Every store on ARM9 saves 1 cycle compared with ARM7, and loads save 2 cycles (assuming no interlocks; the compiler can eliminate most interlock opportunities through scheduling optimizations).
Taking all these factors together, the ARM9E processor is very powerful. In a real system design, however, the designer does not always push the processor to its maximum performance. Ideally, the processor and system clock frequencies are lowered until performance just meets the application's requirements, saving both power and cost. When estimating how much processing capacity a system can provide, many people use the DMIPS figure, which is also widely used to compare different processors.
However, using DMIPS to measure processor performance has serious defects. DMIPS does not literally mean millions of instructions per second; it is a unit of a CPU's relative performance when running a benchmark program called Dhrystone (in many contexts MIPS is also used as the unit for this figure). Because a benchmark-based measurement is easily gamed by targeted optimization, and because no organization supervises published DMIPS figures, the numbers must be treated with caution. For example, different compilations of the Dhrystone benchmark yield very different results on the same processor; Figure 4 shows the results of running the benchmark from 32-bit zero-wait-state memory on an ARM926EJ. ARM itself always quotes a conservative value as a CPU's nominal DMIPS rating; for example, ARM926EJ is rated at 1.1 DMIPS/MHz.
Figure 4: DMIPS values of the ARM926EJ processor under different test conditions.
Another shortcoming of DMIPS is that it cannot measure a processor's digital signal processing capability or the performance of its cache/MMU subsystem. The Dhrystone benchmark contains no DSP expressions, only integer arithmetic and string handling, and it is so small that it runs almost entirely inside the cache without ever touching external memory. This makes it a poor reflection of a processor's real performance in a real system.
A better evaluation approach is to look at the problem from the perspective of the whole system, not just the CPU itself. The best test vector for system-level evaluation is the user's own application, or a benchmark close to it, because that yields the most realistic answer to what the user actually needs.
DSP Computing Capability of ARM9E Processor
As applications diversify and grow more complex, multimedia and audio/video features are flourishing in embedded systems, and these applications demand considerable DSP processing power. Implementing such algorithms on a traditional RISC architecture is very uneconomical in the resources required (clock frequency, memory, and so on). One of the ARM9E processor's most important advantages is its lightweight DSP capability, which delivers very practical DSP performance at very low cost (only a small amount of hardware is added to the CPU).
Because a CPU's DSP capability is not directly reflected in a figure like DMIPS, and the earlier ARM7 processors had no comparable concept, this is an important point for everyone developing on the ARM9E processor.
The DSP extension instructions of ARM9E are listed in Table 2 and fall into three main categories:
1) Single-cycle 16×16 and 32×16 MAC (multiply-accumulate) operations. Because 16-bit operands are very common in digital signal processing, being able to operate directly on the 16-bit halves of a 32-bit register is very useful.
2) Saturating extensions of the ordinary arithmetic instructions. In a saturating operation, when a result exceeds an upper limit or falls below a lower limit, the result is clamped to that limit; saturation is widely used in audio data and video pixel processing. A single-cycle saturating instruction now replaces the "operate, compare, select" sequence that ordinary RISC instructions would need.
3) A count-leading-zeros (CLZ) instruction, which improves the performance of normalization, floating-point arithmetic, and division.
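The three instruction classes above can be sketched in portable C. On ARM9E each sketch below corresponds to a single instruction (QADD for saturating add, CLZ for leading zeros, and the SMLAxy family for 16-bit MACs); the C versions spell out the multi-instruction work a plain RISC sequence would otherwise perform.

```c
#include <stdint.h>

/* Saturating 32-bit add: clamp instead of wrapping on overflow. */
static int32_t sat_add32(int32_t a, int32_t b) {
    int64_t s = (int64_t)a + b;          /* widen to avoid overflow */
    if (s > INT32_MAX) return INT32_MAX; /* clamp at upper limit */
    if (s < INT32_MIN) return INT32_MIN; /* clamp at lower limit */
    return (int32_t)s;
}

/* Count leading zeros; used to normalize values for fixed-point
 * scaling, floating-point emulation, and division. CLZ(0) is 32. */
static int clz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while (!(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

/* 16x16 multiply-accumulate on the bottom halves of two 32-bit
 * registers, in the spirit of the ARM9E half-register MACs. */
static int32_t mac16(int32_t acc, int32_t rx, int32_t ry) {
    int16_t a = (int16_t)(rx & 0xFFFF);  /* bottom 16 bits of rx */
    int16_t b = (int16_t)(ry & 0xFFFF);  /* bottom 16 bits of ry */
    return acc + (int32_t)a * b;
}
```

Each helper collapses a "compute, test the range, pick the clamped value" or "shift-and-count loop" into what the hardware does in one cycle, which is where the DSP speedup comes from.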
Take the popular MP3 decoder as an example. The three front-end steps of the decoding process are the most computationally intensive: bitstream reading (unpacking), Huffman decoding, and dequantization of the samples (inverse transform). ARM9E's DSP instructions handle exactly these operations efficiently. For an MP3 file at 44.1 kHz @ 128 kbps, ARM7TDMI needs more than 20 MHz of processor budget, while ARM926EJ needs less than 10 MHz.
One fortunate aspect of migrating from an ARM7 to an ARM9 platform is that ARM9E is fully compatible with software written for ARM7, and the programming model and framework that developers face remain the same. Still, ARM9E adds many new capabilities; to make full use of these resources and optimize system performance, it pays to understand ARM9E in greater depth.