DPDK QoS framework - 3. Implementation of the Hierarchical Scheduling Module
The following figure shows the details of the internal data structure.

Running the enqueue and dequeue operations for the same output port on different CPU cores may significantly degrade the scheduler's performance, so it is not recommended.

The port enqueue and dequeue operations share access to the following data structures:

The expected performance degradation is caused by the following factors:

Therefore, the enqueue and dequeue operations of the scheduler must run from the same thread, which allows the queue and bitmap operations to be implemented without thread-safety (locking) overhead and keeps the scheduler data structures on the same CPU core. Keep the data structures local and avoid sharing them across threads.

Increasing the number of NIC ports only requires proportionally increasing the number of CPU cores used for traffic scheduling.

The packet enqueue operation involves the following steps:

It should be noted that these steps are strongly dependent on each other: step 2 cannot start before the result of step 1 is available, and step 3 cannot start before the result of step 2 is available, so none of the performance optimizations offered by the processor's out-of-order execution can be exploited.

When there are many packets and queues to process, the data structures needed for the current packet are very likely not in the L1 or L2 cache, so each of the three memory accesses above is likely to miss the L1/L2 cache. Three L1/L2 cache misses per packet is unacceptable from a performance standpoint.

The solution is to prefetch the required data structures in advance. There is an execution delay in the prefetch operation, during which the processor should not attempt to access the currently prefetched data structure, so the processor should perform other work. The only other work that can be done is to perform different steps of the enqueue operation flow on other input packets, so as to process the enqueue operation in a pipelined manner.

Figure 2 shows the pipelined implementation of the enqueue operation, with four pipeline stages, each handling two different input packets. No input packet can be part of more than one pipeline stage at any given time.
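
As a rough illustration of this software-pipelining idea (not the actual DPDK implementation), the sketch below interleaves four hypothetical stage helpers over a stream of packet contexts; the struct fields are pre-resolved here to keep the example minimal, whereas the real code derives the pointer that each following stage prefetches.

```c
#include <stddef.h>

/* Hypothetical per-packet context: in the real scheduler, stage 0 prefetches
 * the packet descriptor (mbuf), stage 1 the queue table entry, stage 2 the
 * queue write location, and stage 3 writes the packet into the queue. */
struct pkt_ctx {
    void *pkt;     /* packet descriptor                                     */
    void *queue;   /* queue table entry (pre-resolved for this sketch)      */
    void *slot;    /* queue storage location (pre-resolved for this sketch) */
};

static inline void stage0_prefetch_pkt(struct pkt_ctx *c)   { __builtin_prefetch(c->pkt); }
static inline void stage1_prefetch_queue(struct pkt_ctx *c) { __builtin_prefetch(c->queue); }
static inline void stage2_prefetch_slot(struct pkt_ctx *c)  { __builtin_prefetch(c->slot); }
static inline void stage3_write_pkt(struct pkt_ctx *c)      { *(void **)c->slot = c->pkt; }

/* Software-pipelined enqueue: at steady state, four different contexts are in
 * flight (one per stage), so each prefetch has time to complete before its
 * data is touched. The real implementation handles two packets per stage. */
void enqueue_pipelined(struct pkt_ctx *ctx, size_t n)
{
    for (size_t i = 0; i < n + 3; i++) {
        if (i >= 3)              stage3_write_pkt(&ctx[i - 3]);
        if (i >= 2 && i < n + 2) stage2_prefetch_slot(&ctx[i - 2]);
        if (i >= 1 && i < n + 1) stage1_prefetch_queue(&ctx[i - 1]);
        if (i < n)               stage0_prefetch_pkt(&ctx[i]);
    }
}
```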

The above describes a very basic congestion management scheme for the enqueue pipeline: when a queue is full, all packets destined for that queue are dropped, and no new packets can be enqueued until some packets are consumed by the dequeue operation. If RED/WRED is used to improve the enqueue pipeline, the enqueue/drop decision can be made per packet by looking at the queue occupancy and the packet priority (instead of indiscriminately enqueuing or dropping all packets).
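
A simplified, hypothetical enqueue/drop decision along these lines might look as follows; the thresholds are illustrative and do not correspond to the actual RED/WRED parameters:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: decide whether to enqueue or drop a packet based on the
 * current queue occupancy and the packet priority, instead of dropping all
 * packets once the queue is full. */
bool should_enqueue(uint32_t qlen, uint32_t qsize, int high_priority)
{
    if (qlen >= qsize)
        return false;                        /* tail drop: queue is full        */
    if (!high_priority && qlen >= (qsize * 3) / 4)
        return false;                        /* early-drop low-priority traffic */
    return true;                             /* otherwise enqueue               */
}
```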

The steps for scheduling the next packet from the current pipe are as follows:

The data structures accessed in these steps (pipe, queue, queue array, mbufs) can be prefetched into the cache before being used, thus avoiding cache misses. After issuing the prefetches for the current pipe (pipe A), the scheduler can immediately switch to another pipe (pipe B) and switch back once the prefetches for pipe A have completed; in this way the latency of the prefetch itself is used to make progress on several pipes in parallel.

The pipe dequeue state machine exploits the processor cache in this way to handle as many active pipes in parallel as possible, and therefore to send more packets.
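
The prefetch-and-switch pattern could be sketched as below; the pipe_ctx structure and the stage helper are placeholders, not the real pipe dequeue state machine:

```c
/* Hypothetical per-pipe dequeue context: 'stage' stands in for the position
 * in the pipe dequeue state machine. */
struct pipe_ctx {
    int stage;
};

/* Placeholder: run one step of the dequeue flow for this pipe and issue the
 * prefetch needed by the next step (pipe struct, queue array, queue element,
 * mbuf). Returns 0 once the pipe has no more work. */
static int pipe_run_next_stage(struct pipe_ctx *p)
{
    if (p->stage <= 0)
        return 0;
    p->stage--;            /* a real implementation would also prefetch here */
    return p->stage > 0;
}

/* Alternate between two pipes: while the data just prefetched for pipe A is
 * still in flight, useful work is done on pipe B, and vice versa. */
void dequeue_two_pipes(struct pipe_ctx *a, struct pipe_ctx *b)
{
    int a_busy = 1, b_busy = 1;

    while (a_busy || b_busy) {
        if (a_busy)
            a_busy = pipe_run_next_stage(a);
        if (b_busy)
            b_busy = pipe_run_next_stage(b);
    }
}
```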

The output port is a queue that is constantly filled with data sent by the scheduler. For 10 GbE, the port scheduler needs to fill 1.25 billion bytes per second. If the scheduler is not fast enough to fill the queue (assuming there are enough packets and credits to transmit), some bandwidth will be idle, resulting in waste.

In principle, the dequeue operation of the hierarchical scheduler should be triggered by the NIC transmitter: whenever the occupancy of the NIC send queue drops below a predefined threshold, the port scheduler is woken up (either interrupt-based or polling-based, i.e. by continuously monitoring the queue occupancy) to push more packets into the queue.

The scheduler needs to keep track of credit information that evolves over time, since credits have to be updated based on time (for example, sub-port and pipe traffic shaping, enforcement of traffic class upper limits, and so on).

Every time the scheduler decides to send a packet to the NIC transmitter, it advances its internal time reference accordingly. Because the time needed to transmit one byte on the physical interface is constant, it is convenient to keep the internal time reference in units of bytes. When a packet is scheduled for transmission, the time advances by (n + h), where n is the packet length in bytes and h is the per-frame overhead in bytes (framing header, CRC, and so on).

The scheduler needs to align its internal time reference with the pace of the port, to ensure that it never supplies the NIC with more data than the NIC send rate, thus preventing packet loss caused by the NIC send queue becoming full.

The scheduler needs to read the current time on every dequeue call. The CPU timestamp can be obtained by reading either the TSC register (Time Stamp Counter) or the HPET register (High Precision Event Timer). The current CPU timestamp is converted from CPU clock cycles to bytes using the formula: time_bytes = time_cycles / cycles_per_byte, where cycles_per_byte is the number of CPU cycles equivalent to the transmission time of one byte on the wire (for example, a 10GbE port with a 2 GHz CPU gives cycles_per_byte = 2G / (10G/8) = 1.6).

The scheduler maintains an internal time reference for the NIC time, which advances by the length of each scheduled packet (including frame overhead). On every dequeue call, the scheduler compares this internal NIC time reference against the current time:
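
As a minimal sketch of this bookkeeping, assuming the 2 GHz CPU / 10GbE example above, a hypothetical 24-byte frame overhead, and illustrative helper names (only rte_rdtsc() is a real DPDK call):

```c
#include <stdint.h>
#include <rte_cycles.h>        /* rte_rdtsc(): read the CPU Time Stamp Counter */

/* Illustrative numbers from the text: a 2 GHz CPU and a 10 GbE port. The port
 * transmits 10e9 / 8 = 1.25e9 bytes/s, so one byte on the wire corresponds to
 * 2e9 / 1.25e9 = 1.6 CPU cycles (bytes = cycles / 1.6, computed below as
 * cycles * 5 / 8 to stay in integer arithmetic). */
static inline uint64_t now_in_bytes(void)
{
    return rte_rdtsc() * 5 / 8;
}

#define FRAME_OVERHEAD 24      /* assumed per-frame overhead h, in bytes */

struct port_time {
    uint64_t nic_time;         /* internal NIC time reference, in bytes */
};

/* On each dequeue call: a packet may only be scheduled if the internal NIC
 * time reference has not run ahead of the current time (i.e. the port still
 * has capacity); the reference then advances by (n + h). */
int may_send_and_advance(struct port_time *pt, uint32_t pkt_len)
{
    uint64_t now = now_in_bytes();

    if (pt->nic_time > now)
        return 0;              /* sending now would exceed the port rate */

    pt->nic_time += pkt_len + FRAME_OVERHEAD;
    return 1;
}
```

A real implementation would typically also handle the case where the NIC time reference has fallen behind the current time, which corresponds to port bandwidth that could not be used.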

The scheduler round-trip delay (SRTD) is the time (in CPU cycles) between two consecutive examinations of the same pipe by the scheduler.

In order to keep up with the rate of the sending port (that is, to avoid wasting the bandwidth of the network card), the scheduler should be able to schedule packets faster than the actual transmission speed of the network card sender.

Assuming that the port bandwidth is not oversubscribed, the scheduler needs to keep up with the rate configured for the token bucket of each pipe. This means the pipe token bucket size should be set large enough to prevent it from overflowing due to a long SRTD, since overflow would result in lost (unused) pipe bandwidth.

The scheduler can make a valid scheduling decision and send the next packet for (sub-port S, pipe P, traffic class TC, queue Q) when all of the following conditions are met:

If all of the above conditions are met, a packet can be selected for transmission, and the necessary credits are subtracted from sub-port S, the traffic class TC of sub-port S, pipe P, and the traffic class TC of pipe P.

Since the greatest common divisor of all packet lengths is one byte, the unit of measurement for credits is the byte. The number of credits needed to transmit a packet of n bytes is (n + h), where h is the per-packet frame overhead in bytes.
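
Putting the last few paragraphs together, a minimal sketch of the credit check and subtraction might look like this (the structure and field names are assumptions, and the 24-byte frame overhead is only an example value):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical credit counters; in the real scheduler these live in the
 * sub-port and pipe run-time structures. */
struct credits {
    uint32_t subport_tb;   /* sub-port token bucket credits          */
    uint32_t subport_tc;   /* sub-port traffic-class period credits  */
    uint32_t pipe_tb;      /* pipe token bucket credits              */
    uint32_t pipe_tc;      /* pipe traffic-class period credits      */
};

#define FRAME_OVERHEAD 24  /* assumed per-frame overhead h, in bytes */

/* A packet of n bytes needs (n + h) credits at every level; it may only be
 * sent if all four credit counters can cover it, and then all four counters
 * are decremented. */
bool try_schedule_pkt(struct credits *c, uint32_t pkt_len)
{
    uint32_t needed = pkt_len + FRAME_OVERHEAD;

    if (c->subport_tb < needed || c->subport_tc < needed ||
        c->pipe_tb < needed || c->pipe_tc < needed)
        return false;      /* not enough credits at some level */

    c->subport_tb -= needed;
    c->subport_tc -= needed;
    c->pipe_tb -= needed;
    c->pipe_tc -= needed;
    return true;
}
```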

Traffic shaping at the sub-port and pipe levels is implemented with token buckets; each token bucket is implemented as a single saturated counter that tracks the number of available credits.

The following table shows the common parameters and operations of token bucket:

In order to realize the general operation of the above token bucket, the current design uses the persistent data structure given in the following table:

The bucket rate (bytes per second) can be calculated by the following formula:

Where r = port line speed in bytes per second.

The implementation of token bucket operation is shown in the following table:
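
A minimal sketch of such a token bucket is given below; the field names are assumptions, chosen so that the bucket rate works out to (tb_credits_per_period / tb_period) * r, with r the port line rate in bytes per second:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a token bucket built on a saturating credit counter. */
struct token_bucket {
    uint64_t tb_time;               /* time of last credit update (bytes)   */
    uint64_t tb_period;             /* credit update period (bytes)         */
    uint32_t tb_credits_per_period; /* credits added per period (bytes)     */
    uint32_t tb_size;               /* bucket size: credit saturation limit */
    uint32_t tb_credits;            /* currently available credits (bytes)  */
};

/* Refill: add credits for every full period elapsed, saturating at tb_size. */
void tb_update(struct token_bucket *tb, uint64_t now)
{
    uint64_t n_periods = (now - tb->tb_time) / tb->tb_period;
    uint64_t credits = tb->tb_credits + n_periods * tb->tb_credits_per_period;

    tb->tb_credits = (credits > tb->tb_size) ? tb->tb_size : (uint32_t)credits;
    tb->tb_time += n_periods * tb->tb_period;
}

/* Consume: a packet of n bytes (plus framing overhead h) may only be sent if
 * enough credits are available. */
bool tb_consume(struct token_bucket *tb, uint32_t n_plus_h)
{
    if (tb->tb_credits < n_plus_h)
        return false;
    tb->tb_credits -= n_plus_h;
    return true;
}
```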

The traffic classes of the same pipe are handled in strict priority order, and their scheduling is implemented by the pipe dequeue state machine, which selects the queues in ascending order. Therefore queue 0 (associated with TC 0, the highest-priority TC) is handled before queue 1 (TC 1, lower priority than TC 0), which is handled before queue 2 (TC 2, lower priority than TC 1), and so on, until the queues of all TCs except the lowest-priority TC have been handled. Finally, queues 12-15 (the best-effort TC, the lowest-priority TC) are handled.
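
A strict-priority queue scan consistent with this description could be sketched as follows; the queue count and the caller-supplied predicate are assumptions, and the WRR applied among queues 12-15 is not shown:

```c
#include <stdint.h>

#define N_QUEUES_PER_PIPE 16   /* queues 0-11: one queue per higher-priority TC,
                                  queues 12-15: best-effort TC (lowest priority) */

/* Strict priority between TCs: scan the pipe's queues in ascending index order
 * so that the highest-priority TC is always considered first. 'can_send' is a
 * caller-supplied predicate (active queue with enough credits); within queues
 * 12-15 the real scheduler additionally applies WRR, which is not shown here. */
int next_queue_to_serve(uint32_t pipe_id,
                        int (*can_send)(uint32_t pipe_id, uint32_t queue_idx))
{
    for (uint32_t q = 0; q < N_QUEUES_PER_PIPE; q++)
        if (can_send(pipe_id, q))
            return (int)q;
    return -1;                 /* nothing to send from this pipe */
}
```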

The traffic classes at the pipe and sub-port levels are not traffic shaped, so no token bucket is maintained in this context. The upper limit for the TCs at the pipe and sub-port levels is enforced by periodically refilling a credit counter, from which credits are consumed every time a packet is scheduled for that pipe/sub-port, as shown in the following two tables.

Persistent data structure for the sub-port/pipe TC upper-limit enforcement:
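
A minimal sketch of this periodic refill, with assumed field names, is shown below; unlike the token bucket, the credits are reset to the per-period quota rather than accumulated:

```c
#include <stdint.h>

/* Sketch of the sub-port/pipe traffic-class upper limit. */
struct tc_limit {
    uint64_t tc_time;               /* start of the next enforcement period (bytes) */
    uint64_t tc_period;             /* enforcement period length (bytes)            */
    uint32_t tc_credits_per_period; /* upper limit per period (bytes)               */
    uint32_t tc_credits;            /* credits left in the current period (bytes)   */
};

/* At a period boundary, refill the credits without carrying over any excess. */
void tc_update(struct tc_limit *tc, uint64_t now)
{
    if (now >= tc->tc_time) {
        tc->tc_credits = tc->tc_credits_per_period;  /* reset, do not accumulate */
        tc->tc_time = now + tc->tc_period;
    }
}

/* Consume (n + h) credits when a packet is scheduled for this TC. */
int tc_consume(struct tc_limit *tc, uint32_t n_plus_h)
{
    if (tc->tc_credits < n_plus_h)
        return 0;
    tc->tc_credits -= n_plus_h;
    return 1;
}
```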

The evolution of the WRR design for the lowest-priority TC (the best-effort TC), from simple to complex, is shown in the following table.
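
As one simple point on that simple-to-complex spectrum (not the scheduler's actual algorithm), a "smooth" weighted round robin over the four best-effort queues could look like this:

```c
#include <stdint.h>

#define N_BE_QUEUES 4          /* the four best-effort queues (12-15) of a pipe */

/* Each active queue accumulates its weight, the queue with the largest
 * accumulated value is served, and the served queue is then charged the total
 * active weight, which yields service proportional to the weights. */
struct wrr_state {
    int32_t weight[N_BE_QUEUES];
    int32_t tokens[N_BE_QUEUES];
};

int wrr_pick(struct wrr_state *w, const int queue_active[N_BE_QUEUES])
{
    int best = -1;
    int32_t total = 0;

    for (int q = 0; q < N_BE_QUEUES; q++) {
        if (!queue_active[q])
            continue;
        w->tokens[q] += w->weight[q];
        total += w->weight[q];
        if (best < 0 || w->tokens[q] > w->tokens[best])
            best = q;
    }
    if (best >= 0)
        w->tokens[best] -= total;
    return best;               /* -1 when no best-effort queue is active */
}
```

A byte-level variant, charging each pick by the packet length instead of one unit, is closer to what a packet scheduler actually needs.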

Oversubscription of traffic class X at the sub-port level is a configuration-time condition that occurs when more bandwidth is allocated to traffic class X by the sub-port's member pipes than is allocated to the same traffic class at the parent sub-port level.

Oversubscription for a specific sub-port and traffic class is entirely the result of the pipe- and sub-port-level configuration, rather than of the dynamic evolution of the traffic load at run time (as congestion is).

If the overall current demand of the sub-port member pipes for traffic class X is low, the oversubscription is not a problem, since the demand of all member pipes for traffic class X can be fully satisfied. However, this is no longer possible when the aggregate demand for traffic class X across all the sub-port member pipes exceeds the limit configured at the sub-port level.

The following table summarizes some possible solutions to this problem, and the current implementation is based on the third solution.

Typically, sub-port TC oversubscription is enabled only for the lowest-priority traffic class, which is normally used for best-effort traffic, with the management plane preventing this condition from occurring for the other (higher-priority) TCs.

To simplify the implementation, it is also assumed that the upper limit of the sub-port's lowest-priority TC is set to 100% of the sub-port rate, and that the upper limit of the lowest-priority TC of each member pipe is set to 100% of the pipe rate.

The algorithm first computes a watermark, which is periodically updated based on the current demand of the sub-port member pipes; its purpose is to limit the amount of lowest-priority (best-effort) TC traffic that each pipe is allowed to send. The watermark is computed at the sub-port level at the beginning of each TC upper-limit enforcement period, and the same value is used by all the sub-port member pipes throughout the current period. The following explains how the watermark computed at the sub-port level at the beginning of each period is propagated to all the sub-port member pipes.

At the beginning of the current enforcement period (which coincides with the end of the previous period), the watermark is adjusted based on how much bandwidth was allocated to the best-effort TC at the beginning of the previous period but not consumed by the sub-port member pipes.

If the sub-port has unused best-effort TC bandwidth, the watermark for the current period is increased, to encourage the sub-port member pipes to consume more bandwidth. Otherwise, the watermark is decreased, to enforce equal consumption of best-effort TC bandwidth among the sub-port member pipes.

The watermark is increased or decreased in small increments, so several enforcement periods may be needed to reach an equilibrium state. This state can change at any time as the demand of the sub-port member pipes for the best-effort TC changes, for example when demand increases (the watermark needs to be lowered) or decreases (the watermark can be raised).

When demand is low, the watermark should be set high, so that it does not prevent the sub-port member pipes from consuming more bandwidth. The highest value of the watermark is chosen as the highest rate configured for a sub-port member pipe. The following tables describe the watermark operation.

Propagation of the watermark from the sub-port level to the member pipes, at the beginning of each TC upper-limit enforcement period:

Watermark computation:
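
A conceptual sketch of this per-period watermark update and its propagation to the member pipes is given below; the increment/decrement amounts, the clamping bounds, and the field names are illustrative rather than the scheduler's actual formulas:

```c
#include <stdint.h>

struct subport_wm {
    uint32_t wm;               /* current watermark (bytes per period)          */
    uint32_t wm_min;           /* lower clamp, e.g. a minimum per-pipe quantum  */
    uint32_t wm_max;           /* upper clamp: highest configured pipe rate     */
    uint32_t be_quota;         /* sub-port best-effort credits per period       */
    uint32_t be_consumed;      /* best-effort bytes consumed in the last period */
};

/* Called at the start of every TC upper-limit enforcement period. */
void watermark_update(struct subport_wm *s)
{
    if (s->be_consumed < s->be_quota) {
        /* Unused best-effort bandwidth: raise the watermark so the member
         * pipes are encouraged to consume more in the next period. */
        s->wm += (s->be_quota - s->be_consumed) / 2;
    } else {
        /* Demand met or exceeded the quota: lower the watermark to push the
         * member pipes towards equal best-effort consumption. */
        s->wm -= s->wm / 4;
    }
    if (s->wm < s->wm_min) s->wm = s->wm_min;
    if (s->wm > s->wm_max) s->wm = s->wm_max;

    s->be_consumed = 0;        /* start accounting for the new period */
}

/* Propagation: each member pipe's best-effort credits for the new period are
 * capped by the sub-port watermark. */
uint32_t pipe_be_credits(const struct subport_wm *s, uint32_t pipe_quota)
{
    return pipe_quota < s->wm ? pipe_quota : s->wm;
}
```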

In order to schedule a packet, the scheduler has to examine packets and credits in multiple queues; the more queues the scheduler has to examine, the lower its performance.

The scheduler maintains a bitmap of active queues, which allows it to skip inactive queues directly. However, to detect whether a specific pipe has enough credits, the pipe has to be run through the pipe dequeue state machine, which consumes CPU time regardless of the scheduling outcome (no packet produced, or at least one packet produced).

This scenario highlights the importance of the policing feature for scheduler performance: if a pipe does not have enough credits, its packets should be dropped as soon as possible (before they reach the hierarchical scheduler), so that the pipe queues can be marked as inactive and the dequeue side can skip the pipe without spending cycles examining its credits.

The performance of the port scheduler is optimized for a large number of queues. If the number of queues is small, the port scheduler is expected to perform worse than a small set of message-passing queues for the same level of active traffic.