1 Introduction of Digital Audio Name: Yao-Cheng Chuang Phone: 0919005578 Email: r93087@csie.ntu.edu.tw

2 History and Comparison Speech and audio history.

3 Speech and Audio Speech is the set of sounds humans can utter, while audio is everything humans can hear. The basic bandwidth of speech is about 4 kHz; the basic bandwidth of audio is about 22.05 kHz. Research on speech coding started earlier than research on audio coding.

4 SPL: Sound Pressure Level

5 Speech Codec The first speech codec standard is PCM (Pulse Code Modulation). It uses simple sampling and quantization to represent speech digitally and runs at 64 kbps. It is also called CCITT G.711. (CCITT: International Telephone and Telegraph Consultative Committee)
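To make the idea concrete, here is a minimal Python sketch of uniform sampling and quantization; the tone, rate, and bit depth are illustrative choices, and note that G.711 itself uses logarithmic (mu-law/A-law) companding rather than the uniform quantizer assumed here.

```python
import numpy as np

# One second of a 1 kHz tone, sampled at the classic 8 kHz telephony rate.
fs = 8000                                # samples per second
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)         # "analog" signal in [-1, 1]

# Uniform 8-bit quantization: map [-1, 1] onto the integer codes 0..255.
bits = 8
levels = 2 ** bits
codes = np.clip(((x + 1) / 2 * levels).astype(int), 0, levels - 1)

print(fs * bits, "bps")                  # 64000 -> the 64 kbps of G.711
```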

6 The goal of a speech codec is a low bit-rate. ADPCM (Adaptive Differential PCM), also called CCITT G.721, is the representative 32 kbps codec. Because neighboring speech samples are usually similar, we can encode their differences to compress the original data.
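A toy sketch of the differencing idea (real ADPCM adapts its predictor and quantizer step size, which this fixed version omits):

```python
import numpy as np

def diff_encode(samples):
    """Replace each sample with its difference from the previous one."""
    return np.diff(samples, prepend=0)

def diff_decode(diffs):
    """Recover the original samples by accumulating the differences."""
    return np.cumsum(diffs)

x = np.array([100, 102, 105, 104, 101])
d = diff_encode(x)                        # [100, 2, 3, -1, -3]
assert (diff_decode(d) == x).all()        # small residuals cost fewer bits
```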

7 Later, CCITT G.723 and G.726 appeared. They are also ADPCM but support several bit-rates, such as 40 kbps, 32 kbps, 24 kbps, and 16 kbps. CCITT G.727 and G.728 run at 16 kbps and are representative of middle bit-rates; they use backward-adaptive CELP techniques (LD-CELP), which emphasize short delay. Plain CELP (Code Excited Linear Prediction) runs at about 8 kbps.

8 MOS: Mean Opinion Score

9 Audio Codec After speech codecs, many companies and committees invested in audio codecs. ISO formulated a suite of video and audio standards called MPEG. Dolby developed AC-1, AC-2, and AC-3. ISO: International Organization for Standardization. MPEG: Moving Pictures Experts Group. AC-3: Audio Codec 3.

10 DAB: Digital Audio Broadcast DCC: Digital Compact Cassette ISDN: Integrated Services Digital Network MD: MiniDisc

11 Why Transform? Two main reasons.

12 Benefit of Transformation There are two main reasons to transform information or data from one domain to another: 1. Data compression. 2. Some operations can only be performed in a particular domain.

13 Data Compression

14 Data Compression (cont.)

17 Disadvantage Plain transformation is not a good method for audio data compression on its own. Our ears are more sensitive at some frequencies (e.g., 1 kHz to 5 kHz), and this kind of data compression does not take psychoacoustic factors into account.

18 Frequency Domain Human ears perceive sounds according to their frequency. Some operations must be performed in the frequency domain. Many psychoacoustic studies are based on the frequency domain.

19 Pulse-Code Modulation Raw data of sound.

20 Modulation Modulation is a means of encoding information for transmission or storage. Techniques such as amplitude modulation (AM) and frequency modulation (FM) have long been used to modulate carrier frequencies with analog audio information for radio broadcast.

21 Amplitude Modulation

22 Frequency Modulation
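A hedged sketch of both schemes; the carrier frequency, modulation depth, and frequency deviation below are arbitrary illustrative values.

```python
import numpy as np

fs = 48000                                 # sample rate of the simulation
t = np.arange(fs) / fs
message = np.sin(2 * np.pi * 5 * t)        # slow 5 Hz "audio" message
fc = 1000                                  # carrier frequency in Hz

# AM: the message varies the carrier's amplitude (envelope).
am = (1 + 0.5 * message) * np.cos(2 * np.pi * fc * t)

# FM: the message varies the carrier's instantaneous frequency;
# integrating the frequency (via cumsum) gives the phase.
inst_freq = fc + 50 * message              # deviate by up to +/-50 Hz
fm = np.cos(2 * np.pi * np.cumsum(inst_freq) / fs)
```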

23 PWM / PPM pulse-width modulation pulse-position modulation

24 PAM / PNM pulse-amplitude modulation pulse-number modulation

25 PCM pulse-code modulation It is the most commonly used modulation method for digital audio.

26 Lossless and Lossy Compression Two main models of compression.

27 Terminology E(·): encoding algorithm. D(·): decoding algorithm. M: original data. m = E(M) is the encoding of M; M' = D(m) is the decoding of m. If M = M', we call the algorithm lossless compression; otherwise, lossy compression.

28 Compression Ratio Compression ratio p = (size(M) - size(m)) / size(M) * 100%. Generally, lossy compression achieves a better compression ratio than lossless compression.
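A minimal sketch of the ratio formula, using Python's built-in zlib as the lossless codec:

```python
import zlib

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Percentage of the original size saved: (M - m) / M * 100."""
    return (len(original) - len(compressed)) / len(original) * 100

data = b"digital audio " * 1000
packed = zlib.compress(data)
assert zlib.decompress(packed) == data     # M == M', so zlib is lossless
print(f"{compression_ratio(data, packed):.1f}% saved")
```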

30 Psychoacoustics and Human Ear Sounds of Human feeling.

31 Terminology Loudness: sound loudness is a subjective term describing the strength of the ear's perception of a sound. Intensity: sound intensity is defined as the sound power per unit area. The basic units are watts/m^2 or watts/cm^2.

32 Threshold of Hearing This is the audibility curve. Below the curve, we cannot hear anything. Human ears can hear sounds in the range 20-20,000 Hz. Many sound intensity measurements are made relative to a standard threshold-of-hearing intensity: I0 = 10^-12 watts/m^2 = 10^-16 watts/cm^2.

33 Intensity Level Decibel (dB): The sound intensity I1 may be expressed in decibels above the standard threshold of hearing I0. Intensity level = 10 log10(I1 / I0) dB, where I0 is the threshold of hearing (10^-12 watts/m^2) and I1 is the intensity we want to measure.
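A direct translation of the formula into Python:

```python
import math

I0 = 1e-12                                 # threshold of hearing, watts/m^2

def intensity_level_db(i1: float) -> float:
    """Intensity expressed in decibels above the hearing threshold I0."""
    return 10 * math.log10(i1 / I0)

print(intensity_level_db(1e-12))           # 0.0   -> just barely audible
print(intensity_level_db(1.0))             # 120.0 -> near the threshold of feeling
```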

34 Threshold of Feeling This is the upper-bound curve of what humans can bear; above this curve, human ears can be hurt. It is not a horizontal line either: human ears are more sensitive at lower frequencies, so the curve has a trough there.

36 Equal-Loudness Curve Along any single equal-loudness curve, humans hear the same loudness. Equal-loudness curves are not horizontal lines. Between the threshold of hearing and the threshold of feeling there are infinitely many equal-loudness curves.

38 Human Hearing

42 Sound Masking Time / frequency sound masking.

43 Frequency Masking If many tones play simultaneously, some tones are masked by others. We can draw a frequency masking curve; sounds under the curve cannot be heard. The curve's slope is steep at low frequencies but gentle at high frequencies.

45 Frequency Masking (cont.) The louder the masking sound, the larger the masked area. If we exploit frequency masking, we can reduce the number of coding bits.
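A deliberately crude sketch of the bit-saving idea: any spectral line below the masking threshold can be dropped. Real encoders derive the threshold from psychoacoustic spreading functions; the flat toy threshold here is an assumption for illustration only.

```python
import numpy as np

def bins_needing_bits(spectrum_db, mask_db):
    """Keep only spectral lines that rise above the masking threshold."""
    audible = spectrum_db > mask_db
    return audible.sum(), np.where(audible, spectrum_db, -np.inf)

spectrum = np.array([60.0, 35.0, 20.0, 55.0, 10.0])   # line levels in dB
mask     = np.array([30.0, 35.0, 30.0, 30.0, 30.0])   # toy threshold in dB
n, kept = bins_needing_bits(spectrum, mask)
print(n, "of", len(spectrum), "lines need bits")      # 2 of 5
```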

47 Time Masking When a sound is played, it can generate pre-masking and post-masking. Post-masking lasts longer than pre-masking. The louder the sound, the longer the masking.

49 MP3 MPEG-1 Layer-3

50 Introduction MPEG: Moving Pictures Experts Group MP3: MPEG-1 Layer-3 Why is MP3 so popular? Open standard Availability of hardware and software Near-CD (Compact Disc) quality Fast Internet access for universities and businesses

51 MP3 Format An MPEG audio file is divided into smaller parts called frames. Each frame is independent and has its own header and audio information; there is no file header. Therefore, you can cut any part of an MPEG audio file and play it correctly.

52 The frame header consists of the first four bytes (32 bits) of a frame: aaaaaaaa aaabbccd eeeeffgh iijjklmm We can read several pieces of information from the frame header, such as: What are the version and layer? Is the frame protected by a CRC (Cyclic Redundancy Check)? What are the bit-rate and sampling frequency?
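A sketch of pulling those fields out of the four header bytes, following the standard MPEG audio header layout; the table lookups that turn the indices into actual kbps and Hz values are omitted here.

```python
def parse_mp3_header(frame: bytes) -> dict:
    """Decode the main fields of the 32-bit frame header
    (bit layout: aaaaaaaa aaabbccd eeeeffgh iijjklmm)."""
    b0, b1, b2, b3 = frame[0], frame[1], frame[2], frame[3]
    assert b0 == 0xFF and (b1 & 0xE0) == 0xE0, "11-bit sync word not found"
    return {
        "version_id":       (b1 >> 3) & 0b11,  # b: 3 means MPEG-1
        "layer":            (b1 >> 1) & 0b11,  # c: 1 means Layer III
        "crc_protected":    (b1 & 1) == 0,     # d: 0 means a CRC follows
        "bitrate_index":    b2 >> 4,           # e: index into a bitrate table
        "samplerate_index": (b2 >> 2) & 0b11,  # f: 44.1/48/32 kHz selector
        "channel_mode":     b3 >> 6,           # i: stereo/joint/dual/mono
    }
```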

53 The tag describes the MPEG audio file. It contains information about the artist, title, album, publishing year, genre, and comments. It is exactly 128 bytes long and is located at the end of the audio data. AAABBBBB BBBBBBBB BBBBBBBB BBBBBBBB BCCCCCCC CCCCCCCC CCCCCCCC CCCCCCCD DDDDDDDD DDDDDDDD DDDDDDDD DDDDDEEE EFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFG
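The letter blocks above correspond to the standard ID3v1 layout: a 3-byte "TAG" marker (A), 30-byte title, artist, and album fields (B, C, D), a 4-byte year (E), a 30-byte comment (F), and a 1-byte genre index (G). A small sketch of reading it:

```python
def parse_id3v1(path: str):
    """Read the 128-byte ID3v1 tag from the end of an MP3 file."""
    with open(path, "rb") as f:
        f.seek(-128, 2)                  # whence=2: relative to end of file
        tag = f.read(128)
    if tag[:3] != b"TAG":                # A: 3-byte identifier
        return None                      # no ID3v1 tag present
    text = lambda raw: raw.rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   text(tag[3:33]),      # B: 30 bytes
        "artist":  text(tag[33:63]),     # C: 30 bytes
        "album":   text(tag[63:93]),     # D: 30 bytes
        "year":    text(tag[93:97]),     # E: 4 bytes
        "comment": text(tag[97:127]),    # F: 30 bytes
        "genre":   tag[127],             # G: index into the genre list
    }
```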

54 MP3 Encoder MDCT: Modified Discrete Cosine Transform FFT: Fast Fourier Transform

55 MP3 Encoder 768 kbps = 32,000 samples/second * 24 bits/sample

56 MP3 Decoder iMDCT: inverse Modified Discrete Cosine Transform

57 MP3 Decoder

58 Psychoacoustic Principles Critical band Sound masking: Time masking Frequency masking

60 Filter Bank Hybrid filter bank: polyphase plus MDCT (Modified Discrete Cosine Transform). The polyphase stage produces 32 sub-band channels, and the MDCT splits each sub-band into 18 finer channels, for 32 * 18 = 576 frequency lines.

61 MDCT DFT DCT
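For reference, a direct (non-fast, unwindowed) implementation of the MDCT definition; real encoders apply a window and use a fast algorithm, both omitted in this sketch.

```python
import numpy as np

def mdct(x):
    """Direct O(N^2) MDCT: 2N time samples in, N frequency lines out."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ x

# An MP3 long block transforms 36 subband samples into 18 lines,
# matching the 32 subbands * 18 = 576 frequency lines above.
print(mdct(np.random.randn(36)).shape)    # (18,)
```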

63 CELP Code Excited Linear Prediction

64 Background Over the years, many speech coding techniques have been developed, starting from PCM and ADPCM (Adaptive Differential Pulse Code Modulation) in the 1960s, through linear prediction in the 1970s, to CELP in the late 1980s and 1990s. Because speech spectra are similar at nearby samples, prediction methods work well.

65 Person Model For voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. For fricatives and plosives (unvoiced sounds), your vocal cords do not vibrate but remain open.

66 The shape of your vocal tract determines the sound you make. The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms). The amount of air coming from your lungs determines the loudness of your voice.

67 Math Model

68 Vocal Tract → H(z) (LPC (Linear Predictive Coding) filter) Air → u(n) (innovations) Vocal Cord Vibration → V (voiced) Vocal Cord Vibration Period → T (pitch period) Fricatives and Plosives → UV (unvoiced) Air Volume → G (gain)
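A minimal sketch of this source-filter picture using scipy; the two-pole filter below stands in for a real vocal tract and is made up purely for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(voiced, gain, a, pitch_period=100, n=8000):
    """Drive the all-pole filter H(z) = G / A(z) with the excitation u(n)."""
    if voiced:                               # V: impulse train with period T
        u = np.zeros(n)
        u[::pitch_period] = 1.0
    else:                                    # UV: white noise
        u = np.random.randn(n)
    return lfilter([gain], a, u)             # 1/A(z) shapes the spectrum

# In a real coder the coefficients of A(z) come from LPC analysis
# of the input speech; this toy filter is invented for the demo.
speech = synthesize(voiced=True, gain=1.0, a=[1.0, -0.9, 0.5])
```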

69 LPC It stands for Linear Prediction Coefficients: each sample s(n) is predicted as a weighted sum of the previous p samples, s_pred(n) = a_1 s(n-1) + ... + a_p s(n-p), and e(n) = s(n) - s_pred(n) is the prediction error. LPC is the basic technique behind CELP. Because CELP uses prediction, its bit-rate can be lower.
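A sketch of estimating the coefficients from one frame via the autocorrelation (Yule-Walker) method; production coders solve the same system with the faster Levinson-Durbin recursion rather than a direct solve.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate prediction coefficients from the frame's autocorrelation
    (Yule-Walker normal equations, solved directly for clarity)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])     # the a_k in s_pred(n)

# Stand-in for one ~30 ms speech frame (noise added to keep R well posed).
frame = np.sin(0.3 * np.arange(240)) + 0.01 * np.random.randn(240)
a = lpc_coefficients(frame, order=10)
```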

70 CELP Encoder

71 AC-3 Audio Codec 3

72 What Is AC-3? AC-3 is a multichannel music compression technology developed by Dolby Laboratories. Dolby Laboratories uses the term Dolby Digital for this digital system in the film and theater industries, and the term Dolby Surround AC-3 in the home theater market.

73 AC-3 can carry from 1 to 5.1 channels. It provides five full-range channels (3 Hz to 20,000 Hz): three front channels (left, center, and right) plus two surround channels. A sixth bass-only effects channel (3 Hz to 120 Hz) is sometimes called the "Low Frequency Effects" (LFE) channel.

74 How Does AC-3 Work? It uses lossy compression. Like MP3 or AAC, AC-3 exploits properties of human hearing to achieve its compression. Input uncompressed PCM samples must be at 32, 44.1, or 48 kHz, with up to 20 bits per sample.

75 AC-3 Encoder

76 AC-3 Decoder

77 AAC MPEG-2 Advanced Audio Coding

78 Advertisement Because of its exceptional performance and quality, Advanced Audio Coding (AAC) is at the core of the MPEG-4 and 3GPP (3rd Generation Partnership Project) specifications and is the new audio codec of choice for the Internet, wireless, and digital broadcast arenas. AAC compresses much more efficiently than older formats such as MP3, yet delivers quality rivaling that of uncompressed CD (Compact Disc) audio.

79 Why AAC? The driving force behind AAC was the quest for an efficient coding method for surround signals, such as the 5-channel signals (left, right, center, left-surround, right-surround) used in cinemas today. One aim of AAC was a considerable decrease in the necessary bit-rate.

80 Low Delay Low-delay audio coding is needed whenever two-way communication runs over low-bandwidth channels, e.g., live broadcasts on TV (television) or radio stations, or mobile phone networks (3G: 3rd Generation). Both AAC (in its Low Delay profile) and CELP offer low-delay operation.

81 AAC vs. MP3 MPEG-2 AAC is the direct continuation of the highly successful MPEG-1 Layer-3 coding method. The crucial differences between MPEG-2 AAC and its predecessor, ISO/MPEG Audio Layer-3, are as follows:

82 Quantization: Allowing finer control of quantization resolution lets the given bit-rate be used more efficiently. Prediction: a technique well established in speech coding systems; it benefits from the fact that certain types of audio signals are easy to predict. Bit-stream format: the information to be transmitted undergoes entropy coding to keep redundancy as low as possible. Optimizing these coding methods, together with a flexible bit-stream structure, has made further improvement in coding efficiency possible.

84 WMA Windows Media Audio

85 What Is WMA? It is an audio format from Microsoft. Its file size is about half that of an MP3 file carrying the same data, with similar sound quality. Because the format is proprietary, little is known about the details of its codec.

86 The Difference between ASF and WMA/WMV The only differences between ASF files and WMA or WMV files are the file extensions and the MIME types. The MIME type for a WMV file is video/x-ms-wmv, for WMA it is audio/x-ms-wma, and for ASF it is video/x-ms-asf. The basic internal structure of the files is identical. MIME: Multipurpose Internet Mail Extensions. WMV: Windows Media Video. ASF: Advanced Systems Format (formerly Active Streaming Format).

87 MIDI Musical Instrument Digital Interface

88 What Is MIDI? MIDI is a method of communication between digital instruments. It was created in 1982. Unlike speech or audio coding, MIDI is more like a kind of musical score. It is unrelated to codecs.

89 We can write musical notes into a MIDI file; the computer then looks up each note in a table to find the corresponding sound. By simply changing the table, we can render the same notes as violin, piano, or other instruments. A MIDI file is much smaller than a general audio file.
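The table-lookup idea works because each note is just a number; for example, the standard equal-temperament mapping from a MIDI note number to a frequency is:

```python
def midi_note_to_hz(note: int) -> float:
    """Equal-temperament pitch for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

print(midi_note_to_hz(69))   # 440.0    -> A4
print(midi_note_to_hz(60))   # ~261.63  -> middle C
```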

90 Use What Is Fitting MPEG-1 Layer-2 at 192 kbps was significantly better than AAC at 96 kbps in 7 of 8 cases, and better than AAC at 128 kbps in 6 of 8 cases. After two cascaded encodings, the quality of AAC was much inferior to Layer-2. It should also be noted that there is a significant difference in processing delay between Layer-2 (approximately 70 ms) and AAC (about 300 ms).

91 Reference K. C. Pohlmann, Principles of Digital Audio, Fourth Edition, McGraw-Hill, New York, 2000. Bing-Fei Wu (吳炳飛), Audio Coding Technical Manual (in Chinese), Chuan Hwa Book Co., Taipei, 2004. AudioCoding.com, "Welcome to the World of AudioCoding," http://faac.sourceforge.net/oldsite/wiki/, 2005.

