Basics 1. Audio and video learning framework
Basics 2. Color space models: RGB, YUV, HSV
Basics 3. Image coding: BMP
Basics 4. Basic concepts of audio
Basics 5. Audio data acquisition
Basics 6. Audio coding: PCM
Basics 7. Audio coding
Basics 8. Audio coding: MP3/AAC
For more information about MP3 encoding, please refer to the following link:
/p/58df71a19901
AAC (Advanced Audio Coding) appeared in 1997, originally as an audio coding technology based on MPEG-2. It was jointly developed by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony, and other companies to replace the MP3 format. In 2000 the MPEG-4 standard was released, and AAC was re-integrated with other technologies (PS, SBR). To distinguish it from traditional MPEG-2 AAC, AAC with the SBR or PS features is also called MPEG-4 AAC.
AAC is a new generation of lossy audio compression technology. Through additional coding techniques (such as PS and SBR), it derives three main profiles: LC-AAC, HE-AAC, and HE-AAC v2. LC-AAC is traditional AAC, used mainly at medium and high bit rates (>=80 kbps); HE-AAC (equivalent to AAC+SBR) is used mainly at medium and low bit rates; and HE-AAC v2 (equivalent to AAC+SBR+PS) is used mainly at low bit rates.
AAC has nine specifications to meet the needs of different applications:
MPEG-2 AAC LC (Low Complexity): relatively simple, with no gain control, but it improves coding efficiency and strikes a balance between coding efficiency and sound quality at medium bit rates.
MPEG-2 AAC Main: the main profile.
MPEG-2 AAC SSR (Scalable Sampling Rate): variable sampling rate.
MPEG-4 AAC LC (Low Complexity): the audio in the MP4 files common on mobile phones is typically of this profile.
MPEG-4 AAC Main: includes all functions except gain control; has the best sound quality.
MPEG-4 AAC SSR (Scalable Sampling Rate): variable sampling rate.
MPEG-4 AAC LTP (Long Term Prediction): long-term prediction.
MPEG-4 AAC LD (Low Delay): low-delay coding.
MPEG-4 AAC HE (High Efficiency): suited to low bit-rate coding; supported by the Nero AAC encoder.
At present, LC and HE (suited to low bit rates) are the most widely used. The popular Nero AAC encoder supports only LC, HE, and HEv2, and the encoded AAC audio reports its profile as LC. HE-AAC is actually AAC(LC)+SBR technology, and HEv2 is AAC(LC)+SBR+PS technology.
**HEv1 and HEv2 can be represented simply by this figure:**
**(AAC in the figure refers to the original AAC-LC)**
* * He: "High efficiency". HE-AAC v 1 (also known as AACPlusV 1, SBR), AAC(LC)+SBR technology is realized by container method. SBR actually stands for spectral band replication. Simply put, the main spectrum of music is concentrated in the low frequency band, and the high frequency band is very small, but it is very important, which determines the sound quality. If the whole frequency band is encoded, if the high frequency band is protected, the low frequency band will be encoded too finely and the file will be huge; If the main components of low frequency are retained and the high frequency components are lost, the sound quality will be lost. SBR cuts the frequency spectrum, encodes the low frequency separately to save the main components, and amplifies the high frequency separately to save the sound quality, which perfectly solves this contradiction while reducing the file size and "balancing" the sound quality.
**HEv2:** the container holds HE-AAC v1 plus PS technology. PS stands for Parametric Stereo. A stereo file is twice the size of a mono file, but the two channels are largely similar. By Shannon's information theory, this correlation should be removed to reduce the file size. PS therefore stores the full information of one channel and then uses just a few bytes of parameters to describe how the other channel differs from it.
(1) AAC is an audio compression algorithm with a high compression ratio, far exceeding that of older audio compression algorithms such as AC-3 and MP3, yet its quality is comparable to uncompressed CD audio.
(2) Like other similar audio coding algorithms, AAC uses transform coding, but AAC uses filter banks with finer resolution, which allows it to achieve a higher compression ratio.
(3) AAC adopts newer technologies such as temporal noise shaping, backward-adaptive linear prediction, joint stereo, and quantized Huffman coding, which further improve the compression ratio.
(4) AAC supports a wider range of sampling rates and bit rates: 1 to 48 audio channels, up to 15 low-frequency (LFE) channels, multilingual compatibility, and up to 15 embedded data streams.
(5) AAC supports a wider range of sampling rates, from 8 kHz up to 96 kHz, much wider than MP3's 16 kHz to 48 kHz range.
(6) Unlike MP3 and WMA, AAC loses almost none of the very high and very low frequency components of the sound, and its spectral structure is closer to the original audio than WMA's, so its fidelity is better. Professional listening tests show that AAC sounds clearer and closer to the original than WMA.
(7) AAC uses optimized algorithms to achieve higher decoding efficiency, so it needs less processing power to decode.
ADIF: Audio Data Interchange Format. The characteristic of this format is that the start of the audio data can be located with certainty, and decoding cannot begin partway through the stream; it must start at the clearly defined beginning. Hence this format is commonly used for disk files.
ADTS: Audio Data Transport Stream. The characteristic of this format is that it is a bit stream with sync words, so decoding can start anywhere in the stream. Its characteristics are similar to the MP3 stream format.
Simply put, ADTS can be decoded starting from any frame because every frame carries its own header information, whereas ADIF has only a single header at the start, so all the data must be available before decoding begins; the two header formats also differ. The audio streams produced by encoders today are generally in ADTS format. The specific structure of each is as follows:
The ADIF format of AAC is shown in the following figure:
The general format of AAC ADTS is shown in the following figure:
The figure shows the simplified structure of one ADTS frame; the blank rectangles on both sides represent the data before and after that frame.
Header information of ADIF:
The ADIF header is located at the beginning of the AAC file, followed by the continuous raw data blocks.
The fields that make up the ADIF header are as follows:
Fixed header information of ADTS:
Variable header information of ADTS:
(1) The purpose of frame synchronization is to find the position of the frame header in the bit stream. ISO/IEC 13818-7 specifies that the sync word of the AAC ADTS frame header is the 12-bit string 1111 1111 1111 (0xFFF).
(2) The ADTS header information consists of two parts: the fixed header information, followed by the variable header information. The data in the fixed header is the same in every frame, while the variable header information changes from frame to frame.
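To make the header layout concrete, here is a minimal sketch in C that parses the main fixed and variable header fields from a 7-byte ADTS header. The field widths follow ISO/IEC 13818-7, but the struct and function names are ours (not from any particular library), and CRC handling for the protection_absent == 0 case is omitted:

```c
#include <stdint.h>
#include <stdio.h>

/* Parsed ADTS header fields (header is 7 bytes when protection_absent == 1). */
typedef struct {
    unsigned profile;            /* 2 bits: 0 Main, 1 LC, 2 SSR               */
    unsigned sampling_freq_idx;  /* 4 bits: index into the sample-rate table  */
    unsigned channel_cfg;        /* 3 bits: channel configuration             */
    unsigned frame_length;       /* 13 bits: whole frame length incl. header  */
    unsigned num_raw_blocks;     /* 2 bits: raw data blocks in frame, minus 1 */
} AdtsHeader;

static const int kSampleRates[16] = {
    96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050,
    16000, 12000, 11025, 8000, 7350, 0, 0, 0
};

/* Returns 0 on success, -1 if the 12-bit sync word 0xFFF is not present. */
int parse_adts_header(const uint8_t *b, AdtsHeader *h)
{
    if (b[0] != 0xFF || (b[1] & 0xF0) != 0xF0)   /* syncword 1111 1111 1111 */
        return -1;
    /* Fixed header */
    h->profile           = (b[2] >> 6) & 0x3;
    h->sampling_freq_idx = (b[2] >> 2) & 0xF;
    h->channel_cfg       = ((b[2] & 0x1) << 2) | ((b[3] >> 6) & 0x3);
    /* Variable header */
    h->frame_length      = ((b[3] & 0x3) << 11) | (b[4] << 3) | ((b[5] >> 5) & 0x7);
    h->num_raw_blocks    = b[6] & 0x3;
    return 0;
}

int main(void)
{
    /* Hand-made example: 44.1 kHz stereo AAC LC, frame length 384 bytes. */
    uint8_t hdr[7] = {0xFF, 0xF1, 0x50, 0x80, 0x30, 0x1F, 0xFC};
    AdtsHeader h;
    if (parse_adts_header(hdr, &h) == 0)
        printf("profile=%u rate=%d ch=%u frame_len=%u\n",
               h.profile, kSampleRates[h.sampling_freq_idx],
               h.channel_cfg, h.frame_length);
    return 0;
}
```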
In AAC, a raw data block can be composed of the following different elements:
SCE: Single Channel Element. A single channel element basically consists of just one ICS (individual channel stream). A raw data block most likely consists of up to 16 SCEs.
CPE: Channel Pair Element. A two-channel element consisting of two ICSs that may share side information, plus some joint stereo coding information.
CCE: Coupling Channel Element. Represents a block of multi-channel joint stereo information, or dialogue information for multilingual programs.
LFE: Low Frequency Element. Contains a low-sampling-frequency channel for bass enhancement.
DSE: Data Stream Element. Contains additional information that is not audio.
PCE: Program Config Element. Contains the channel configuration information. It may appear in the ADIF header.
FIL: Fill Element. Contains extension information, such as SBR and dynamic range control data.
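For reference, the 3-bit id_syn_ele codes that identify these elements in the bit stream are fixed by the standard, and a decoder simply loops over elements until it reads the END marker. A runnable toy sketch (the bit reader is hand-rolled for the example, and the toy stream contains only bare element IDs, with the payloads that would normally follow each ID omitted):

```c
#include <stdint.h>
#include <stdio.h>

/* 3-bit syntactic element IDs in an AAC raw data block (ISO/IEC 13818-7). */
enum { ID_SCE, ID_CPE, ID_CCE, ID_LFE, ID_DSE, ID_PCE, ID_FIL, ID_END };

/* Tiny MSB-first bit reader, just enough for this example. */
typedef struct { const uint8_t *data; unsigned pos; } BitReader;

static unsigned read_bits(BitReader *br, unsigned n)
{
    unsigned v = 0;
    while (n--) {
        v = (v << 1) | ((br->data[br->pos >> 3] >> (7 - (br->pos & 7))) & 1);
        br->pos++;
    }
    return v;
}

int main(void)
{
    static const char *names[8] = {"SCE", "CPE", "CCE", "LFE",
                                   "DSE", "PCE", "FIL", "END"};
    /* Toy stream of bare IDs: SCE(000), CPE(001), FIL(110), END(111). */
    const uint8_t stream[] = {0x07, 0x70};
    BitReader br = {stream, 0};
    unsigned id;
    while ((id = read_bits(&br, 3)) != ID_END)
        printf("element: %s\n", names[id]); /* a real decoder parses the payload here */
    return 0;
}
```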
AAC decoding process
(Figure: AAC decoding flowchart)
As shown in the figure:
After the main control module starts running, it puts part of the AAC bit stream into the input buffer and locates the start of a frame by searching for the sync word. Once a frame is found, noiseless decoding is performed according to the syntax described in ISO/IEC 13818-7; this is essentially Huffman decoding. After dequantization, joint stereo processing, perceptual noise substitution (PNS), temporal noise shaping (TNS), the inverse modified discrete cosine transform (IMDCT), and spectral band replication (SBR), the PCM streams of the left and right channels are obtained, which the main control module then puts into the output buffer for output to the sound playback device.
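The flow can be condensed into a structural sketch. Every type and function below is a hypothetical placeholder named after the module it stands for; the point is only the order of the stages described above:

```c
/* Hypothetical top-level decode loop mirroring the described flow. */
void decode_aac_stream(Bitstream *in, PcmBuffer *out)
{
    Frame f;
    while (find_syncword(in)) {      /* locate the 0xFFF frame header     */
        parse_frame_header(in, &f);
        huffman_decode(&f);          /* noiseless decoding per 13818-7    */
        dequantize(&f);              /* sign-preserving 4/3-power         */
        apply_joint_stereo(&f);      /* M/S and intensity stereo          */
        apply_pns(&f);               /* perceptual noise substitution     */
        apply_tns(&f);               /* temporal noise shaping            */
        imdct_overlap_add(&f);       /* frequency domain -> time domain   */
        apply_sbr(&f);               /* spectral band replication         */
        emit_pcm(out, &f);           /* left/right PCM to output buffer   */
    }
}
```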
Technical analysis:
1. Main control module:
The main task of the so-called main control module is to operate the input and output buffers and to call the other modules so that they work together.
The input and output buffers both have their interfaces provided by the DSP control module. The data stored in the output buffer is the decoded PCM data, which represents the amplitude of the sound. It consists of a fixed-length buffer whose head pointer is obtained by calling an interface function of the DSP control module. When the output buffer is full, interrupt handling is invoked and the data is output through the I2S interface to the connected audio DAC chip (a stereo audio DAC with a DirectDrive headphone amplifier), which produces the analog sound.
2. Noiseless decoding (Huffman decoding):
Noiseless coding is Huffman coding; its function is to further reduce the redundancy of the scale factors and of the quantized spectrum.
The scale factors and the quantized spectrum information are Huffman coded. The global gain is coded as an 8-bit unsigned integer; the first scale factor is differentially coded against the global gain value and then Huffman coded using the scale factor codebook, and each subsequent scale factor is differentially coded against the previous one. The noiseless coding of the quantized spectrum uses two kinds of partitioning of the spectral coefficients. One is the division into 4-tuples and 2-tuples, which determines whether one value decoded from the Huffman table covers 4 coefficients or 2. The other is the division into sections, which determines which Huffman table should be used; each section contains several scale factor bands, and a single Huffman table is used within each section.
- Sections
Noiseless coding divides the 1024 input quantized spectral coefficients into several sections, and all the coefficients within a section are coded with the same Huffman table. For coding efficiency, it is best for the section boundaries to coincide with the scale factor band boundaries. Therefore the information transmitted for each section must include: the section length, the scale factor bands it covers, and the Huffman table used.
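A sketch of reading one section for a long window follows: the 4-bit codebook index and the length-with-escape coding follow 13818-7 (for long windows the length field is 5 bits with escape value 31; short windows use a 3-bit field with escape value 7). BitReader and read_bits are the hand-rolled helpers from the element-loop sketch above:

```c
#define ESC_LEN_LONG 31  /* sect_len_incr escape value for long windows */

/* Read one section: which Huffman codebook it uses and how many
   scale factor bands it spans. */
void read_section(BitReader *br, unsigned *sect_cb, unsigned *sect_len)
{
    unsigned incr;
    *sect_cb  = read_bits(br, 4);  /* Huffman codebook index               */
    *sect_len = 0;
    do {                           /* lengths >= 31 continue with another  */
        incr = read_bits(br, 5);   /* 5-bit increment                      */
        *sect_len += incr;
    } while (incr == ESC_LEN_LONG);
}
```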
- Grouping and interleaving
Grouping means that consecutive spectral coefficients sharing the same set of scale factors are grouped together, regardless of which window the coefficients belong to, so that they share scale factors and achieve better coding efficiency. Doing so inevitably causes interleaving: the coefficients, originally arranged in the order c[group][window][scale factor band][coefficient index], are rearranged so that coefficients with the same scale factors are put together: c[group][scale factor band][window][coefficient index]. As a result, the coefficients of the same window become interleaved.
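The rearrangement is easy to see in code. This self-contained toy example (a single group is assumed and the dimensions are invented) de-interleaves coefficients from the transmitted order c[scale factor band][window][index] back to the natural order c[window][scale factor band][index]:

```c
#include <stdio.h>

#define WINDOWS 2  /* short windows in the group (toy value) */
#define SFBS    3  /* scale factor bands         (toy value) */
#define COEFS   4  /* coefficients per band      (toy value) */

int main(void)
{
    float tx[SFBS][WINDOWS][COEFS];  /* transmitted: sfb-major (interleaved) */
    float nat[WINDOWS][SFBS][COEFS]; /* natural:     window-major            */

    /* Fill the transmitted array with recognizable values: 100*sfb+10*win+i. */
    for (int s = 0; s < SFBS; s++)
        for (int w = 0; w < WINDOWS; w++)
            for (int i = 0; i < COEFS; i++)
                tx[s][w][i] = 100 * s + 10 * w + i;

    /* De-interleave: coefficients of the same window come back together. */
    for (int w = 0; w < WINDOWS; w++)
        for (int s = 0; s < SFBS; s++)
            for (int i = 0; i < COEFS; i++)
                nat[w][s][i] = tx[s][w][i];

    printf("nat[1][2][3] = %.0f\n", nat[1][2][3]); /* prints 213 */
    return 0;
}
```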
- Handling large quantized values
AAC has two ways of handling quantized values that are too large: an escape flag in the Huffman coding, or the pulse escape method. The former is similar to the MP3 approach: when very large quantized values appear, a special Huffman table is used, and its use implies that an escape value and sign will follow the Huffman coding. With the pulse escape method, a large value is reduced by a difference so that it becomes a small value, which is then coded with a Huffman table, followed by a pulse structure that helps restore the difference.
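For the escape flag method, here is a sketch of recovering the true magnitude once the Huffman stage has produced the escape value: per 13818-7, a run of N ones terminated by a zero is followed by an (N+4)-bit word, and the magnitude is 2^(N+4) plus that word. read_bits is again the hand-rolled helper from above:

```c
/* Decode one AAC escape sequence (called after the escape codebook
   yields the escape magnitude 16). */
unsigned decode_escape_value(BitReader *br)
{
    unsigned n = 4;
    while (read_bits(br, 1) == 1)         /* count leading 1s, stop at 0 */
        n++;
    return (1u << n) | read_bits(br, n);  /* 2^n plus an n-bit offset    */
}
```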
3. Scale factor decoding and inverse quantization
In AAC encoding, the spectral coefficients are quantized by a non-uniform quantizer, so decoding must perform the inverse operation: keep the sign and apply a 4/3-power computation. The basic method of adjusting quantization noise in the frequency domain is noise shaping with scale factors. A scale factor is an amplitude gain value applied to all the spectral coefficients in a scale factor band. The scale factor mechanism changes the bit allocation of the quantization noise produced by the non-uniform quantizer across the frequency domain.
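In code, the operation is compact. A minimal sketch, assuming the standard scale factor offset of 100 and the band gain 2^(0.25·(sf − 100)):

```c
#include <math.h>

#define SF_OFFSET 100  /* scale factor offset used by AAC */

/* Inverse-quantize one coefficient: keep the sign, raise the magnitude
   to the 4/3 power, then apply the scale factor band's gain. */
double inverse_quantize(int q, int scale_factor)
{
    double mag  = pow(fabs((double)q), 4.0 / 3.0);
    double gain = pow(2.0, 0.25 * (scale_factor - SF_OFFSET));
    return (q < 0 ? -mag : mag) * gain;
}
```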
- Scale factor bands
According to the auditory characteristics of the human ear, the spectral lines are divided into several groups, each corresponding to its own scale factors; these groups are called scale factor bands. To reduce the side information when short windows are used, consecutive short windows can be grouped, i.e., several short windows are transmitted together as one, and the scale factors are then applied to all of the grouped windows.
4. Joint stereo
The purpose of joint stereo is to perform a certain rendering of the original samples so that the sound becomes more "pleasant".
5. Perceptual Noise Substitution (PNS)
The perceptual noise substitution module simulates noise through parametric coding. When noise-like components are detected in the audio signal, they are not quantized and coded; instead, a few parameters tell the decoder that this is a certain kind of noise, and the decoder then generates that type of noise from a random sequence.

In concrete terms, the PNS module examines the signal components below 4 kHz in each scale factor band. A component is judged to be noise if it is neither tonal nor strongly varying in energy over time; the tonality and energy variation of the signal are computed in the psychoacoustic model.
When decoding, PNS is used if Huffman codebook 13 (NOISE_HCB) is encountered. Because M/S stereo decoding and PNS decoding are mutually exclusive, the parameter ms_used can be used to indicate whether the two channels use the same PNS: if ms_used is 1, both channels use the same random vector to generate the noise signal. The PNS energy is conveyed by noise_nrg; when PNS is used, this energy signal is transmitted in place of the band's scale factor. Noise energy coding, like the scale factors, uses differential coding, and the first value is likewise the global gain. It is interleaved with the intensity stereo position values and the scale factors, but each is differentially decoded independently of the others: the next noise energy value is differentially decoded against the previous noise energy value, not against an intensity stereo position or a scale factor. The random vector produces, within the scale factor band, the average energy distribution computed from noise_nrg. This technology is used only in MPEG-4 AAC.
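A heavily simplified sketch of the substitution step: fill the band with a random vector, then scale it so the band's energy matches the transmitted target. The 2^(noise_nrg/4) mapping mirrors the scale factor convention; real decoders differ in the exact normalization:

```c
#include <math.h>
#include <stdlib.h>

/* Fill one scale factor band with pseudo-random noise whose RMS level
   matches the decoded target (normalization convention is illustrative). */
void pns_fill_band(float *spec, int n, int noise_nrg)
{
    float energy = 0.0f;
    for (int i = 0; i < n; i++) {
        spec[i] = (float)rand() / RAND_MAX - 0.5f;  /* random vector   */
        energy += spec[i] * spec[i];
    }
    float target = powf(2.0f, 0.25f * noise_nrg);   /* 2^(noise_nrg/4) */
    float scale  = target / sqrtf(energy / n + 1e-9f);
    for (int i = 0; i < n; i++)
        spec[i] *= scale;
}
```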
6. Temporal Noise Shaping (TNS)
This remarkable technology can shape the distribution of quantization noise in the time domain through prediction in the frequency domain, and it contributes greatly to sound quality when quantizing certain special, dramatically changing signals. TNS is used to control the temporal shape of the quantization noise within each transform window, and it is implemented by a filtering process applied to each channel. Traditional transform coding schemes often run into trouble with signals that change drastically in the time domain, especially speech: the quantization noise distribution is controlled in the frequency domain, but in the time domain it is spread as a constant over the whole transform block. If the signal changes dramatically within the block but the coder does not switch to short blocks, this uniformly spread noise becomes audible. The principle of TNS exploits the duality of the time and frequency domains and the time-frequency symmetry of LPC (linear predictive coding): coding in one domain is equivalent to coding in the other, and predictive coding in one domain increases resolution in the other. Since quantization noise arises in the frequency domain and reduces the time-domain resolution, the predictive coding is done here in the frequency domain. In AACplus, which is based on the AAC LC profile, the TNS filter order is limited to 12.
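A sketch of the decoder side: an all-pole (IIR) filter run along the frequency axis over one region of spectral coefficients, inverting the FIR prediction filter applied at the encoder. Sign conventions for the coefficients vary between implementations, so treat this as illustrative:

```c
/* TNS synthesis filtering over n spectral coefficients; lpc[1..order]
   are the decoded prediction coefficients (lpc[0] is the implicit 1). */
void tns_synthesis(float *spec, int n, const float *lpc, int order)
{
    for (int i = 0; i < n; i++) {
        float acc = spec[i];
        for (int j = 1; j <= order && i - j >= 0; j++)
            acc -= lpc[j] * spec[i - j];  /* predict from lower bins */
        spec[i] = acc;
    }
}
```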
7. Inverse Modified Discrete Cosine Transform (IMDCT)
The conversion of the audio data from the frequency domain back to the time domain is realized mainly by feeding the frequency-domain data into a bank of IMDCT filters. The IMDCT output is windowed and overlap-added to finally obtain the time-domain values.
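A naive O(N²) reference sketch of the IMDCT and the windowed overlap-add (real decoders use fast FFT-based IMDCTs, and scaling conventions vary between references):

```c
#include <math.h>

#define N  2048                    /* long-window transform size (output) */
#define PI 3.14159265358979323846

/* Naive IMDCT: N/2 spectral coefficients -> N time-domain samples. */
void imdct(const float *X, float *y)
{
    for (int n = 0; n < N; n++) {
        double acc = 0.0;
        for (int k = 0; k < N / 2; k++)
            acc += X[k] * cos((2.0 * PI / N) *
                              (n + 0.5 + N / 4.0) * (k + 0.5));
        y[n] = (float)(2.0 / N * acc);
    }
}

/* Window the IMDCT output, then overlap-add with the saved second half
   of the previous frame; win[] is the sine or KBD window in use. */
void overlap_add(const float *y, const float *win,
                 float *overlap /* N/2 saved state */, float *pcm /* N/2 out */)
{
    for (int n = 0; n < N / 2; n++) {
        pcm[n]     = overlap[n] + y[n] * win[n];
        overlap[n] = y[n + N / 2] * win[n + N / 2];
    }
}
```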
8. Spectral Band Replication (SBR)
As described earlier: the main spectrum of music is concentrated in the low frequency band, while the high band has very small amplitude but is very important, because it determines the sound quality. Encoding the whole band uniformly would mean either encoding the low band too finely to protect the high band, producing a huge file, or keeping the main low-frequency components and losing the high-frequency ones, sacrificing sound quality. SBR cuts the spectrum: the low band is coded separately to preserve the main components, and the high band is handled separately to preserve sound quality, resolving this contradiction by balancing file size and sound quality.
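As a deliberately oversimplified illustration of the idea only (real SBR operates on QMF subbands driven by transmitted envelope data, not on raw spectral bins like this):

```c
/* Toy SBR-style reconstruction: regenerate the missing high band by
   transposing the decoded low band upward and shaping it with per-bin
   envelope gains recovered from the side information. */
void sbr_toy_reconstruct(float *spec, int n_low, int n_total,
                         const float *env_gain /* one gain per high bin */)
{
    for (int i = n_low; i < n_total; i++)
        spec[i] = spec[i - n_low] * env_gain[i - n_low]; /* copy up + shape */
}
```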
9. Parametric Stereo (PS)
A stereo file is twice the size of a mono file, yet the sounds of the two channels are largely similar. By Shannon's information theory, this similarity should be removed to reduce the file size. PS technology therefore stores the full information of one channel and then uses a few bytes of parameters to describe how the other channel differs from it.
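A toy sketch of that reconstruction: given the mono downmix and a single transmitted level-difference parameter, rebuild left and right so they differ by the transmitted ratio while preserving the downmix (real PS also carries phase and correlation parameters, and works per band):

```c
/* Toy parametric-stereo reconstruction from a mono downmix m = (L+R)/2
   and one linear level ratio r = L/R transmitted as a parameter. */
void ps_toy_reconstruct(const float *mono, int n, float ratio,
                        float *left, float *right)
{
    float gl = 2.0f * ratio / (1.0f + ratio); /* so that L/R == ratio */
    float gr = 2.0f / (1.0f + ratio);         /* and (L+R)/2 == mono  */
    for (int i = 0; i < n; i++) {
        left[i]  = mono[i] * gl;
        right[i] = mono[i] * gr;
    }
}
```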