1 Digital Audio Compression CIS 465 Spring 2013

2 Speech Compression Compression of voice data ◦ We have previously mentioned several methods that are used to compress voice data  mu-law and A-law companding  ADPCM and delta modulation ◦ These are examples of methods which work in the time domain (as opposed to the frequency domain)  Often they are not even considered compression methods
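As an illustration of the time-domain methods above, μ-law companding maps each sample through a logarithmic curve before quantization, so quiet passages keep more resolution than loud ones. A minimal sketch (function names are ours; μ = 255 is the value used in North American/Japanese telephony):

```python
import numpy as np

MU = 255.0  # mu-law parameter used in G.711 (North America / Japan)

def mu_law_compress(x, mu=MU):
    """Map samples in [-1, 1] through the logarithmic mu-law curve."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Invert the companding curve to recover the original samples."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```

Note how a quiet sample like 0.01 is boosted to roughly 0.23 before uniform quantization, which is exactly where the bit savings come from.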

3 Speech Compression Although the previous techniques are generally applied to speech data, they are not designed specifically for such data Vocoders, by contrast, are designed for speech ◦ Can’t be used with other analog signals ◦ Model speech so that the salient features can be captured in as few bits as possible ◦ Linear Predictive Coders model the speech waveform in time ◦ There are also channel vocoders and formant vocoders ◦ In electronic music, vocoders allow a voice to modulate a musical source (e.g. via synthesizer)
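The "linear" in Linear Predictive Coding is literal: each sample is modeled as a weighted sum of the previous p samples, so only the weights and a (small) prediction residual need to be coded. A toy sketch, fitting an order-2 predictor to a pure tone by least squares (a real vocoder uses higher orders and the Levinson-Durbin recursion, not `lstsq`):

```python
import numpy as np

fs, f0, p = 8000, 440.0, 2            # sample rate, tone frequency, predictor order
n = np.arange(256)
s = np.sin(2 * np.pi * f0 / fs * n)   # stand-in "speech" signal: a pure tone

# Each row holds the p previous samples; the target is the current sample.
A = np.column_stack([s[p - 1 - i : len(s) - 1 - i] for i in range(p)])
coeffs, *_ = np.linalg.lstsq(A, s[p:], rcond=None)

residual = s[p:] - A @ coeffs         # what actually needs to be transmitted
```

For a sinusoid the recurrence s[n] = 2·cos(ω)·s[n−1] − s[n−2] is exact, so here the residual is essentially zero; for real speech the residual is small but nonzero and is coded cheaply.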

4 General Audio Compression If we want to compress general audio (not just speech), different techniques are needed ◦ In particular, music compression is a more general form of audio compression We make use of psychoacoustical modeling ◦ Enables perceptual encoding based upon an analysis of how the ear and brain perceive sound ◦ Perceptual encoding exploits audio elements that the human ear cannot hear well

5 Psychoacoustics If you have been listening to very loud music, you may have trouble afterwards hearing soft sounds (that normally you could hear) ◦ Temporal masking A loud sound at one frequency (a lead guitar) may drown out a sound at another frequency (the singer) ◦ Frequency masking

6 Equal-Loudness Relations If we play two pure tones, sinusoidal sound waves, with the same amplitude but different frequencies ◦ One may sound louder than the other ◦ The ear does not hear low or high frequencies as well as mid-range ones (speech) ◦ This can be shown with equal-loudness curves: contours of constant perceived loudness plotted on axes of sound pressure level (true loudness) and frequency

7 Equal-Loudness Relations

8 Threshold of Hearing The following image is a plot of the threshold of human hearing for pure tones – at loudness levels below the curve, we do not hear a tone
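A commonly used analytic fit to this curve is Terhardt's approximation, which gives the threshold in dB SPL as a function of frequency (the function name is ours):

```python
import math

def quiet_threshold_db(f_hz):
    """Terhardt's approximation to the threshold of hearing in quiet, in dB SPL."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The formula reproduces the curve's shape: high thresholds at the low and high extremes, with a dip (greatest sensitivity) around 3–4 kHz.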

9 Threshold of Hearing A loud sound can mask other sounds at nearby frequencies as shown below

10 Frequency Masking We can determine how a pure tone at a particular frequency affects our ability to hear tones at nearby frequencies Then, if a signal can be decomposed into frequencies, for those frequencies that are only partially masked, only the audible part will be used to set the quantization-noise thresholds

11 Critical Bands Human hearing range divides into critical bands  Human auditory system cannot resolve sounds better than within about one critical band when other sounds are present  Critical bandwidth represents the ear’s resolving power for simultaneous tones  At lower frequencies the bands are narrower than at higher frequencies  The band is the section of the inner ear which responds to a particular frequency

12 Critical Bands

13 Critical Bands Generally, the audio frequency range for hearing (20 Hz – 20 kHz) can be partitioned into about 24 critical bands (25 are typically used for coding applications) ◦ The previous slide does not show several of the highest frequency critical bands ◦ The critical band at the highest audible frequency is over 4000 Hz wide ◦ The ear is not very discriminating within a critical band
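The ~24-band partition corresponds to the Bark scale. Zwicker's formula, one of several published approximations, converts a frequency in Hz to a critical-band (Bark) number:

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation: critical-band (Bark) number for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)
```

The 20 Hz – 20 kHz hearing range maps to roughly 0–24.6 Bark, matching the "about 24 critical bands" figure above; note how slowly the value grows at high frequencies, where the bands are widest.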

14 Temporal Masking A loud tone causes the hearing receptors in the inner ear to become saturated, and they require time to recover ◦ This leads to the temporal masking effect ◦ After the loud tone we cannot immediately hear another tone – post-masking  The length of the masking depends on the duration of the masking tone ◦ A masking tone can also block sounds played just before – pre-masking (shorter time)

15 Temporal Masking MPEG audio compression takes advantage of both temporal and frequency masking to transmit masked frequency components using fewer bits

16 MPEG Audio Compression MPEG (Moving Picture Experts Group) is a family of standards for compression of both audio and video data ◦ MPEG-1 (1991) CD quality audio ◦ MPEG-2 (1994) Multi-channel surround sound ◦ MPEG-4 (1998) Also includes MIDI, speech, etc. ◦ MPEG-7 (2003) Not compression – searching ◦ MPEG-21 (2004) Not compression – digital rights management

17 MPEG Audio Compression MPEG-1 defined three downward compatible layers of audio compression ◦ Each layer offers more complexity in the psychoacoustic model used and hence better compression ◦ Increased complexity leads to increased delay ◦ Compatibility achieved by shared file header information ◦ Layer 1 – used for the Digital Compact Cassette (DCC) ◦ Layer 2 – proposed for digital audio broadcasting ◦ Layer 3 – music (MPEG-1 Layer 3 == mp3)

18 MPEG Audio Compression MPEG audio compression relies on quantization, masking, and critical bands ◦ The encoder uses a bank of 32 filters to decompose the signal into sub-bands  Uniform width – not exactly aligned to critical bands  Overlapping ◦ A Fourier transform is used for the psychoacoustic model ◦ Layer 3 adds a DCT to the sub-band filtering, so that Layers 1 and 2 work in the temporal domain and Layer 3 in the frequency domain

19 MPEG Audio Compression ◦ PCM input is filtered into 32 bands ◦ The PCM is also FFT-transformed for the psychoacoustic model ◦ Windows of samples (384, 576, or 1152) are coded at a time
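The real encoder uses a polyphase filter bank, but the core idea of splitting the spectrum into 32 uniform bands and measuring per-band energy can be sketched with a plain FFT (an illustration under simplifying assumptions, not the actual MPEG polyphase filter):

```python
import numpy as np

fs, N, n_bands = 32000, 512, 32
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5250.0 * t)       # test signal: a tone at 5250 Hz

spectrum = np.abs(np.fft.rfft(x)) ** 2    # power spectrum, bins 0..N/2
bins_per_band = (N // 2) // n_bands       # 8 bins per 500 Hz band here
band_energy = np.array([
    spectrum[b * bins_per_band : (b + 1) * bins_per_band].sum()
    for b in range(n_bands)
])                                        # Nyquist bin ignored for simplicity
```

With a 32 kHz sampling rate each band spans fs/64 = 500 Hz, so the 5250 Hz tone concentrates its energy in band 10; the psychoacoustic model then decides how coarsely each such band may be quantized.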

20 MPEG Audio Compression Since the sub-bands overlap, aliasing may occur ◦ This is overcome by the use of a quadrature mirror filter bank  Attenuation slopes of adjacent bands are mirror images

21 MPEG Audio Algorithm The PCM audio data is assembled into frames ◦ Header – sync code of twelve 1 bits ◦ SBS format – describes how many sub-band samples (SBS) are in the frame ◦ The sub-band samples themselves (384 in Layer 1, 1152 in Layers 2 and 3) ◦ Ancillary data – e.g. multi-lingual data or surround-sound data

22 MPEG Audio Algorithm The sampling rate determines the frequency range That range is divided up into 32 overlapping bands The frames are sent through a corresponding bank of 32 filters If X is the number of samples per frame, each filter produces X/32 samples ◦ These are still samples in the temporal domain

23 MPEG Audio Algorithm The Fourier transform is performed on a window of samples surrounding the samples in the frame (either 1024 or 2*1024 samples) ◦ This feeds into the psychoacoustic model (along with the subband samples) ◦ Analyze tonal and nontonal elements in each band ◦ Determine spreading functions (how much each band affects another)

24 MPEG Audio Algorithm Find the masking threshold and signal-to-mask ratios (SMR) for each band The scaling factor for each band is the maximum amplitude of the samples in that band The bit-allocation algorithm takes the SMRs and scaling factors and determines how many bits can be allocated (quantization granularity) for each band ◦ In MP3, bits can be moved from band to band as needed to ensure a minimum amount of compression while achieving higher quality
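One common way to realize bit allocation is a greedy loop: repeatedly give one more bit to the band whose quantization noise currently exceeds its masking threshold by the most, on the rule of thumb that each extra bit lowers quantization noise by about 6 dB. A simplified sketch (the standardized Layer encoders use tabulated quantizer classes, not this loop):

```python
def allocate_bits(smr_db, bit_pool, db_per_bit=6.0):
    """Greedy bit allocation: feed the band with the worst noise-to-mask ratio."""
    bits = [0] * len(smr_db)
    nmr = list(smr_db)                # noise-to-mask ratio with zero bits assigned
    for _ in range(bit_pool):
        worst = max(range(len(nmr)), key=lambda i: nmr[i])
        bits[worst] += 1
        nmr[worst] -= db_per_bit      # each bit buys ~6 dB of noise reduction
    return bits
```

Bands whose SMR is already at or below zero (fully masked) receive no bits at all, which is where the compression comes from.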

25 MPEG Audio Algorithm Layer 1 has 12 samples encoded per band per frame Layer 2 has 3 groups of 12 (36 samples) per frame Layer 3 has non-equal frequency bands Layer 3 also performs a Modified DCT on the filtered data, so we are in the frequency (not time) domain Layer 3 does non-uniform quantization followed by Huffman coding ◦ All of these modifications make for better (if more complex) performance for MP3
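The Modified DCT is a "lapped" transform: each block of 2N windowed samples yields only N coefficients, and overlap-adding the inverse-transformed blocks cancels the time-domain aliasing exactly (the TDAC property). A small numerical sketch with the sine window (a direct O(N²) implementation for clarity; real codecs use N = 576 and fast algorithms):

```python
import numpy as np

N = 8                                   # coefficients per block (MP3 uses 576)
n = np.arange(2 * N)
k = np.arange(N)
w = np.sin(np.pi / (2 * N) * (n + 0.5))                     # sine window
C = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))  # (2N, N) MDCT basis

def mdct(x):
    return (w * x) @ C                  # 2N windowed samples -> N coefficients

def imdct(X):
    return w * ((2.0 / N) * (C @ X))    # N coefficients -> 2N aliased samples

x = np.random.default_rng(0).standard_normal(3 * N)
y0 = imdct(mdct(x[0:2 * N]))            # two 50%-overlapping blocks
y1 = imdct(mdct(x[N:3 * N]))
middle = y0[N:] + y1[:N]                # overlap-add cancels the aliasing
```

The overlap-added `middle` reproduces x[N:2N] exactly: the aliasing introduced by critically sampling each block is cancelled by the neighboring block, which is why the MDCT gives frequency-domain coding without a 2× increase in data.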

26 Stereo Encoding MPEG codes stereo data in several different ways ◦ Joint stereo ◦ Intensity stereo ◦ Etc. ◦ We are not discussing these

27 MPEG File Format MPEG audio files do not have a file-level header (so you can start playing/processing anywhere in the file) ◦ They consist of a sequence of frames ◦ Each frame has a header followed by audio data

28 MPEG File Format

29 ID3 ID3 is a metadata container most often used in conjunction with the MP3 audio file format. It allows information such as the title, artist, album, track number, year, genre, and other information about the file to be stored in the file itself. An ID3v1 tag occupies the last 128 bytes of the file
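Because ID3v1 uses fixed offsets, reading a tag is just slicing those last 128 bytes: a 3-byte "TAG" magic, then 30/30/30/4/30-byte text fields and a 1-byte genre index. A sketch (function name is ours; field layout follows the ID3v1 specification):

```python
def parse_id3v1(data: bytes):
    """Parse an ID3v1 tag from the last 128 bytes of an MP3 file, if present."""
    tag = data[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return None                       # no ID3v1 tag appended

    def text(field: bytes) -> str:
        # Fields are NUL- or space-padded; ID3v1 predates Unicode, so Latin-1.
        return field.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title":   text(tag[3:33]),
        "artist":  text(tag[33:63]),
        "album":   text(tag[63:93]),
        "year":    text(tag[93:97]),
        "comment": text(tag[97:127]),
        "genre":   tag[127],              # index into the ID3v1 genre table
    }
```

Note the 30-byte fields are why long titles are truncated in ID3v1; the later ID3v2 format (variable-length frames at the start of the file) removed that limit.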

30 Bit Rates Audio (or video) compression schemes can be characterized as either constant bit rate (CBR) or variable bit rate (VBR) ◦ In general, higher compression can be achieved with VBR (at the cost of added complexity for encode/decode) ◦ MPEG-1 Layers 1 and 2 are CBR only ◦ MP3 is either VBR or CBR ◦ Average Bit Rate (ABR) is a compromise

31 MPEG-2 AAC MPEG-2 (which is used for encoding DVDs) has an audio component as well The MPEG-2 AAC (Advanced Audio Coding) standard was aimed at transparent sound reproduction for theatres ◦ 320 kbps for five channels (left, right, center, left-surround and right-surround) ◦ 5.1 channel systems include a low-frequency enhancement channel (“woofer”) ◦ AAC can also deliver high-quality stereo sound at bitrates less than 128 kbps

32 MPEG-2 AAC AAC is the default audio format for (e.g.) YouTube, iPod (iTunes), PS3, Nintendo DSi, etc. Compared to MP3 ◦ More sampling frequencies ◦ More channels ◦ More efficient, simpler filterbank (pure MDCT) ◦ Arbitrary bit rates and variable frame lengths ◦ Etc.

33 MPEG-4 Audio MPEG-4 audio integrates a number of audio components into one standard ◦ Speech compression ◦ Text-to-speech ◦ MIDI ◦ MPEG-4 AAC (similar to MPEG-2 AAC) ◦ Alternative coders (perceptual coders and structured coders)

