James D. Johnston Microsoft Corporation Audio Architect

Conversion – Issues in Hearing, Sampling, Quantization, and Implementation
Copyright © 2006 by James D. Johnston. All rights reserved.

Basic Hearing Issues Parts of the ear.
Their contribution to the hearing process. What is loudness? What is intensity? How “loud” can you hear? How “quiet” can you hear? How “high” is high? How “low” is low? What do two ears do for us?

Fundamental Divisions of the Ear
Outer ear: Head and shoulders, Pinna, Ear Canal
Middle ear: Eardrum, 3 bones and spaces
Inner ear: Cochlea, Organ of Corti, Basilar Membrane, Tectorial Membrane, Inner Hair Cells, Outer Hair Cells

Functions of the Outer Ear
In a word, HRTF’s. HRTF means “head related transfer functions”, which are defined as the transfer functions of the body, head, and pinna as a function of direction. Sometimes people refer to HRIR’s, or “head related impulse responses”, which are the same information expressed in the time, as opposed to frequency, domain. HRTF’s provide information on directionality above and beyond that of binaural hearing. HRTF’s also provide disambiguation of front/back, up/down sensations along the “cone of confusion”. The ear canal resonates somewhere between 1 and 4 kHz, resulting in some increased sensitivity at the point of resonance. This resonance is about 1 octave wide, give or take.

Middle Ear
The middle ear acts primarily as a highpass filter (at about 700 Hz) followed by an impedance-matching mechanical transformer. The middle ear is affected by muscle activity, and can also provide some level clamping and protection at high levels. You don’t want to be exposed to sound at that kind of level.

Inner Ear (Cochlea)
In addition to the balance mechanism, the inner ear is where most sound is transduced into neural information. The inner ear is a mechanical filterbank, implementing a filter whose center-frequency tuning goes from high to low as one goes farther into the cochlea. The bandpass nature is actually due to coupled tuning of two highpass filters, along with detectors (inner hair cells) that detect the difference between the two highpass (HP) filters.

Critical Bandwidths
The bandwidth of a filter is referred to as the “Critical Band” or “Equivalent Rectangular Bandwidth” (ERB). ERB’s and Critical Bands (measured in units of “Barks”, after Barkhausen) are reported as slightly different. ERB’s are narrower at all frequencies. ERB’s are probably closer to the right bandwidths; note the narrowing of the filters on the “Bark” scale in the previous slide at high Barks (i.e. high frequencies). I will use the term “Critical Band” in this talk, by habit. Nonetheless, I encourage the use of a decent ERB scale. Bear in mind that both Critical Band(widths) and ERB’s are useful, valid measures, and that you may wish to use one or the other, depending on your task. There is no established “ERB” scale to date; rather, researchers disagree quite strongly, especially at low frequencies. It is likely that leading-edge effects as well as filter bandwidths lead to these differences. The physics suggests that the lowest critical bands or ERB’s are not as narrow as the literature suggests.

What are the main points?
The cochlea functions as a mechanical time/frequency analyzer. Center frequency changes as a function of the distance from the entrance end of the cochlea. High frequencies are closest to the entrance. At higher center frequencies, the filters are roughly a constant fraction of an octave bandwidth. At lower center frequencies, the filters are close to uniform bandwidth. The filter bandwidth, and therefore the filter time response length varies by a factor of about 40:1.

What happens as a function of Level?
As level rises, the ear desensitizes itself by many dB. As level rises, the filter responses in the ear shift slightly in position vs. frequency. The ear, with a basic 30dB SNR (1000^.5) in the detector, can range over at least 120dB of level.

What does this mean?
The internal experience, called Loudness, is a highly nonlinear function of level, spectrum, and signal timing. The external measurement in the atmosphere, called Intensity, behaves according to physics, and is somewhat close to linear. The moral? There’s nothing linear involved.

Some points on the SPL Scale
(x is very loud pipe organ, o is threshold of pain/damage, + is moderate barometric pressure change)

Edge effects and the Eardrum
The eardrum’s HP filter desensitizes the ear below 700Hz or so. The exact frequency varies by individual. This means that we are not deafened by the loudness of weather patterns, for instance. At both ends of the cochlea, edge effects lessen the compression mechanisms in the ear. These two effects, coupled with the ear’s compression characteristics, result in the kind of response shown in the next slide.

Fletcher and Munson’s famous “equal loudness curves”.
The best one-picture summary of hearing in existence.

What’s quiet, and what’s loud?
As seen from the previous graph, the ear can hear to something below 0dB SPL at the ear canal resonance. As we approach 120dB, the filter responses of the ear start to broaden, and precise frequency analysis becomes difficult. As we approach 120dB SPL, we also approach the level at which near-instantaneous injury to the cochlea occurs. Air is made of discrete molecules. As a result, the noise floor of the atmosphere at STP is approximately 6dB SPL white noise in the range of 20Hz-20kHz. This noise may JUST be audible at the point of ear canal resonance. Remember that the audibility of such noise must be calculated inside of an ERB or critical band, not broadband.
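The point about calculating audibility inside one band, not broadband, can be put into numbers. This is a minimal sketch that takes the 6dB SPL, 20Hz-20kHz white-noise figure from above and scales it to a single band. The `erb_hz` helper uses the Glasberg and Moore approximation, which is an assumption here: the talk names no ERB formula, and as noted earlier there is no agreed ERB scale.

```python
import math

def band_level_db(total_db, total_bw_hz, band_bw_hz):
    """Level of flat (white) noise inside a sub-band, given its broadband level."""
    return total_db + 10.0 * math.log10(band_bw_hz / total_bw_hz)

def erb_hz(fc_hz):
    """Glasberg-Moore ERB approximation (an assumed formula, not from the talk)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

BROADBAND_DB = 6.0            # dB SPL, the figure quoted in the talk
FULL_BW_HZ = 20000.0 - 20.0   # 20 Hz to 20 kHz

for fc in (1000.0, 4000.0):
    bw = erb_hz(fc)
    level = band_level_db(BROADBAND_DB, FULL_BW_HZ, bw)
    print(f"{fc:.0f} Hz: ERB about {bw:.0f} Hz, noise in band about {level:.1f} dB SPL")
```

Under these assumptions the noise inside a single band lands well below the 6dB broadband figure, which is why only the most sensitive region near the ear canal resonance comes into question.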

So, what’s “high” and what’s “low” in frequency?
First, the ear is increasingly insensitive to low frequencies, as shown in the Fletcher curve set. This is due to both basilar membrane and eardrum effects. 20Hz is usually mentioned as the lowest frequency detected by the hearing apparatus. Low frequencies at high levels are easily perceived by skin, chest, and abdominal areas as well as the hearing apparatus.

At higher frequencies, all of the detection ability above kHz lies in the very first small section of the basilar membrane. While some young folks have been said to show supra-20kHz hearing ability (and this is likely true due to their smaller ear, ear canal, and lack of exposure damage), in the modern world, this first section of the basilar membrane appears to be damaged very quickly by “environmental” noise. At very high levels, high (and ultrasonic) signals can be perceived by the skin. You probably don’t want to be exposed to that kind of level.

What about binaural issues?
Binaurally, with broadband signals, we can distinguish 10 microsecond shifts in left vs. right stimuli of the right characteristics. While this has implications in block-processed algorithms with pre-echo, it does not generally relate substantially to ADC and DAC hardware that is properly clocked.

The results?
For presentation (NOT capture, certainly not processing), a range of 6dB SPL (flat noise floor) to 120dB (maximum you should hear, also maximum most systems can achieve) should be more than sufficient. This is about 19 bits. An input signal range of 20Hz to 20kHz is probably enough. There are, however, some filter issues that will be raised later that may affect your choice of sampling frequency.
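The "about 19 bits" figure follows from dividing the 6dB to 120dB range by the roughly 6.02dB that each bit is worth. A minimal sketch of that arithmetic, ignoring the small constant offset that depends on signal type (the function name is illustrative):

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)   # about 6.02 dB per bit

def bits_for_range(floor_db, ceiling_db):
    """Number of bits whose step-to-full-scale ratio spans the given dB range."""
    return (ceiling_db - floor_db) / DB_PER_BIT

print(f"{bits_for_range(6.0, 120.0):.1f} bits")
```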

Sampling and Quantization

Continuous domain vs. sampled domain: Sampling, Aliasing
Discrete level (quantized) vs. noisy continuous domain: Quantization, Dithering
Time/frequency duality: FFT’s (DFT’s too)

What do “Analog” and “Digital” really mean?
The domain we commonly refer to as “analog” is a time-continuous (at least to mortal eyes) domain, with “continuous” level resolution limited by physical properties of material. The level resolution and time resolution are never exact due to basic physics. The “digital” domain is a sampled, quantized domain. That means that we only know the value of the signal at specified times, and that the level of the signal occupies one of a set of discrete levels. The set of levels, and the times that the signal has a value, however, are exact in the digital framework. (Although there may, of course, be errors in acquisition.)

Properties of the analog domain
The time domain is continuous. This means that any frequency limits come from physical processes, not from mathematical restrictions. However, physics places some very strong constraints on such signals:
All signals have finite energy
All signals have finite bandwidth
All signals have finite duration
All signals have a finite noise floor
All four of the points above are very important!

A reminder about Duality in the Fourier domain
Multiplying two signals means that you convolve their (full, complex) spectra. Convolving two signals means that you multiply their (full, complex) spectra. These two properties of Fourier Analysis (other commonly used transforms obey them as well) are very important. Remember them even if you don’t know anything about Fourier Analysis.
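Both properties can be checked numerically on a small example. This sketch uses a naive DFT and circular convolution written out by hand, so the sequences `x` and `h` and the helper names are illustrative only. With this unnormalized DFT convention, the multiply-in-time case needs a 1/N scale factor on the convolved spectra.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (unnormalized)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
h = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
n = len(x)

# Convolving two signals <-> multiplying their spectra.
lhs = dft(circular_convolve(x, h))
rhs = [a * b for a, b in zip(dft(x), dft(h))]
conv_error = max(abs(p - q) for p, q in zip(lhs, rhs))

# Multiplying two signals <-> convolving their spectra (scaled by 1/N).
lhs2 = dft([a * b for a, b in zip(x, h)])
rhs2 = [v / n for v in circular_convolve(dft(x), dft(h))]
mult_error = max(abs(p - q) for p, q in zip(lhs2, rhs2))

print(conv_error, mult_error)   # both should be at rounding-error level
```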

Fourier Domain Properties
Please remember the properties. I don’t want or expect anyone to understand all the details, but please remember the PROPERTIES. Fourier analysis is valid on all finite energy, finite-bandwidth signals. That describes all real-world audio signals that we care to deal with. (The only counterexamples occur in astrophysics and particle physics, neither of which a listener can be comfortably seated near.)

So, let’s sample that analog signal.
Sampling means capturing the value of the signal at a periodic rate. This means that we MULTIPLY the signal by the specific impulse at a regular interval. Quite obviously, that’s not what actual hardware does. Most use a track/hold, or other capture method. The result, in the sampled domain, is the same. That means that we CONVOLVE the signal spectrum and the sampling spectrum. This means that the spectrum repeats at every multiple of the sampling frequency. Hence, we have the Nyquist criterion, later proven by Shannon.

Three examples: 2*b<fs, 2*b>fs, 2*b=fs
In each case, top is the signal spectrum (same for all 3), middle is the sampling spectrum, and bottom is the result.

The Nyquist Criterion
Simply put, if we wish to sample a signal of bandwidth ‘B’, we must sample it at a rate of at least 2B. If you think about this briefly, that follows from the previous slide, where the spectrum (extending to +-B from DC) will overlap if you sample it at a lower frequency. This overlap is called “aliasing”.
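Because the spectrum repeats at every multiple of the sampling frequency, a tone above fs/2 produces exactly the same samples as a tone folded back below fs/2. A minimal sketch, with an illustrative function name and example frequencies chosen for this note:

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Frequency at which a real tone at f_hz shows up after sampling at fs_hz."""
    f = f_hz % fs_hz            # the spectrum repeats every fs
    return min(f, fs_hz - f)    # fold into the range 0..fs/2

FS = 48000.0
tone = 30000.0                  # above fs/2, so it must alias
alias = alias_frequency(tone, FS)

# The sampled values of the two cosines are indistinguishable.
worst = max(abs(math.cos(2 * math.pi * tone * n / FS) -
                math.cos(2 * math.pi * alias * n / FS)) for n in range(64))
print(alias, worst)
```

Once the samples are taken there is no way to tell the two tones apart, which is why the band limiting has to happen before the sampler.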

A Graphical Example
Top to bottom: sampling train; spectrum of sampling train; sine wave below half the sampling rate and resulting samples; their spectra; sine wave as far above half the sampling rate and resulting samples; their spectra.

What would I hear if there was a demo of aliasing?
Aliasing and imaging (imaging, as we will see shortly, is the reconstruction version of aliasing) sound awful. Aliasing in general is anharmonic, and remarkably annoying. THEREFORE: Filtering is a requirement. It’s not an option. The presence of a filter has consequences that we will discuss later. This leads us to the sampling theorem.

Hence, the Sampling Theorem
We must limit the bandwidth of the signal to fs/2, where fs is the sampling frequency. (This is saying the same thing as the Nyquist conjecture, restated in terms of the data to be captured, rather than the sampling rate.) While this does not mean dc to fs/2, that’s what we do in audio, since we want signals close to dc. (There are other sampling methods that sample other regions of frequency.) This means that we must band limit the signal into the sampler. An anti-alias filter is not just a good idea, it’s mathematically necessary.

Consequences of the Sampling Theorem
We must band limit the signal in order to avoid aliasing. Any out-of-band signals will alias back into the base band. That ^^ has consequences far beyond the initial sampling of the material. We’ll get to that later when we talk about things like clipping and nonlinearities, or jitter. This means that we know what times the samples were taken, so we can “reconstruct” that periodicity later, without an error that always grows with time, distance, number of copies, etc.

Ok, but we represent those samples as binary values, right?
Yes, we do. That’s called “quantization”. That’s the other necessary process in digitization of a signal. Quantization is why digital signals can be saved and re-saved without degradation in terms of level. It’s also why digital PCM signals have a fixed, unchanging noise floor. (There are other possibilities; we’ll talk about those later.)

So, quantization is like rounding, right? Well, Let’s see!
Using rounding only: Original, quantized, error, and spectrum of original and error.

To drive the point even farther home:
That’s right, we have to dither quantizers. It’s not just a “good idea”.

Dither? What’s that?
As the spectra (and error waveform of the second slide) show, the error of an undithered quantizer is highly correlated to the original signal. Dither consists of adding some random function BEFORE the quantization so that the noise is decorrelated. The first kind of dither people tried was called “uniform”. Let’s see how that works out.

Uniform +- ½ step-size dither.
Not bad, but notice the noise level coming and going around the zero crossing?

Hence, Triangular PDF Dither:
Now, the noise stays constant over all amplitudes.
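The difference between the two dithers can be checked with a small simulation. This sketch quantizes a constant input sitting at various positions between two steps and measures the variance of the total error. The 1 LSB step size, the trial count, and the function names are assumptions for illustration. The triangular dither is built as the difference of two uniform random numbers, which gives a triangular PDF spanning +/-1 step.

```python
import math
import random

def quantize(v):
    """Round to the nearest step (step size = 1 LSB)."""
    return math.floor(v + 0.5)

def no_dither(rng):
    return 0.0

def uniform_dither(rng):
    return rng.random() - 0.5           # uniform, +/- 1/2 step

def triangular_dither(rng):
    return rng.random() - rng.random()  # triangular PDF, +/- 1 step

def error_variance(x, dither, trials=50_000, seed=1):
    """Variance of (quantized - input) for a constant input x, in LSB squared."""
    rng = random.Random(seed)
    errs = [quantize(x + dither(rng)) - x for _ in range(trials)]
    mean = sum(errs) / trials
    return sum((e - mean) ** 2 for e in errs) / trials

for x in (0.0, 0.25, 0.5):
    print(x,
          round(error_variance(x, no_dither), 3),
          round(error_variance(x, uniform_dither), 3),
          round(error_variance(x, triangular_dither), 3))
```

With uniform dither the error variance depends on where the input sits between steps, which is the noise "coming and going". With triangular dither it stays near a quarter of a step squared wherever the input sits.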

To recap: Notice, even in this very mild case, where harmonics do not alias over each other, in addition to eliminating tonal components and preserving information, dither RAISES the noise floor, and lowers the PERCEIVED noise floor.

To summarize: A digital signal is sampled and quantized.
Sampling requires anti-aliasing filters. Quantization requires TPD (triangular PDF dither). Dither and anti-aliasing are not options!

What about reconstruction?
Yes, that convolution theorem applies again, this time usually convolving a “square pulse” with the digital signal. This leads to a form of signal “images”. While “images” and “aliases” come about by mathematically similar processes, people persist in having different names for them. Some (many in the high end) omit the anti-imaging filter, and imagine that there is a ‘beating’ problem. If you haven’t heard about this “problem” yet, you will at some point. The next few graphs show why it isn’t so.

Basically, the same thing happens.
Top line: sine wave below fs/2. Next line: sine wave plus first alias pair (blue) and just the aliases (red). Each line adds another pair, except the last, which adds the first 100 pairs. The gain of the red waveform is greatly increased in order to make it visible. Notice that after 100 alias pairs are added in, the original waveform has the familiar “stairstep”.

Notice, at the bottom, the “beating” that some audio enthusiasts complain about. Notice that “beating” only happens when aliases are added.

Low frequency reconstruction example.

Elements of reconstruction:
In reconstruction, filtering is also necessary to remove the “image” signals that originate in the same fashion as aliases arise in sampling. In reconstruction, the waveform is sometimes a “step” rather than an impulse, so other compensation is sometimes necessary to get a flat frequency response. Why? Again, using the “step” in time (convolving) means that you multiply the signal by the frequency response of the “step” in the frequency domain, leading to a rolloff like sin(x)/x. This rolloff can be as much as -3.92dB at fs/2, and can cause audible “softness” if not corrected somehow. Modern converters of the delta-sigma variety do not use a step at the final sampling frequency (although they certainly use a “step” it’s at a much higher frequency). Their design, however, introduces other issues. That discussion comes later.
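The -3.92dB figure is the sin(x)/x response of a step that lasts one full sample period, evaluated at fs/2. A minimal sketch, assuming a hold of exactly one sample period (the function name and the example sample rate are illustrative):

```python
import math

def step_hold_gain_db(f_hz, fs_hz):
    """Gain of a one-sample-wide hold, sin(x)/x with x = pi * f / fs, in dB."""
    x = math.pi * f_hz / fs_hz
    if x == 0.0:
        return 0.0
    return 20.0 * math.log10(abs(math.sin(x) / x))

FS = 48000.0
for f in (1000.0, 10000.0, 20000.0, FS / 2):
    print(f"{f:.0f} Hz: {step_hold_gain_db(f, FS):.2f} dB")
```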

How to Build converters
A quick survey of methods. No real commercial converter is described.

Baseband Conversion
[Block diagram: Audio Input, Filter, A to D converter, Sampling Clock]
This is the basic block diagram for any PCM converter. In this converter, the filter is outside the converter, and the quantizer is part of the sampling mechanism. This method is not very common any more, but we will discuss its properties before moving on to oversampling converters.

Spectrum of Signal and Noise
[Figure panels: Original; Spectrum of Original; Quantized and Dithered; Spectrum of Quantized, Dithered Signal. Red 8 bits, green 9 bits.] (Note: to make scaling easier, I will use 7/8/9 bit quantization.)

NOTICE THAT NOISE SPREADS OVER THE ENTIRE OUTPUT SPECTRUM.
Things to notice.
The noise floor is flat.
If you sum up all of the energy in the noise floor, you will wind up with the SNR you expect.
Notice that practically all of the noise is IN BAND when there is no oversampling.
It’s really hard to see quantization at even very noisy levels in a waveform plot.
Each bit of quantization is worth 6.02dB of signal to noise ratio. 1 more bit will drop the noise floor by 6dB. 1 less bit raises the noise floor by 6dB.

Why so much emphasis on how the noise spectrum spreads out?
Therein lies the beginnings of oversampling.

[Figure panels: Sine wave, original sampling rate; Spectrum of 8 bit quantization; Sine wave, 4x sampling rate; Spectrum of 7 bit quantization, shown in same bandwidth, 4x oversampled; Full spectrum of 7 bit quantized signal at 4x sample rate.]
Notice that the noise has 4x the bandwidth, but only ¼ of it falls in the original passband.

That demonstrates the most trivial form of oversampling.
This trivial form of oversampling provides the equivalent of 1 bit, in-band, for every 4x the sampling frequency, i.e. 3dB per doubling. Now we move on to more sophisticated forms of oversampling, with the noise spectrum shaped as well.
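The 3dB-per-doubling rule is just the fraction of a flat noise floor that stays in the original band. A minimal sketch of the arithmetic, with illustrative function names:

```python
import math

def oversampling_gain_db(osr):
    """In-band SNR gain when a flat noise floor is spread over osr times the bandwidth."""
    return 10.0 * math.log10(osr)

def equivalent_bits(osr):
    """The same gain expressed in bits, at about 6.02 dB per bit."""
    return oversampling_gain_db(osr) / (20.0 * math.log10(2.0))

for osr in (2, 4, 16, 64):
    print(f"{osr}x: {oversampling_gain_db(osr):.2f} dB, {equivalent_bits(osr):.2f} bits")
```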

Noise Shaping
[Block diagram: Error Signal, summing node (+/-), H(s), Quantizer, Output Bits, Quantized Signal]
This is the basic form of a noise-shaper. I’m not going to do a full mathematical analysis for the sake of time. What H(s) does is shape the noise floor. This can be done with or without oversampling. Two examples will follow. The output bits of this system are PCM. ALL OVERSAMPLED SYSTEMS ARE PCM SYSTEMS AT THEIR HEART!!!!

Adding this H(s) introduces some Issues, of course.
The values of H(s) must be carefully controlled in order to ensure stability. H(s) has storage in it, so quantization noise gets stored. This means that you have “more noise” than just the basic quantization noise. So, there is a penalty, especially if there is a lot of “storage” or “memory” in the noise shaper, as well as a gain. The shape of the noise is closely related to the inverse of H(s). I won’t try to present a full analysis, that’s for the hardware engineer and chip designer.

An example of noise shaping with no oversampling:
NOTE: This is an example, nobody uses this particular H(z), and in fact I’ve not even tested it for stability!!!! The point is simple, you CAN do noise shaping even with no oversampling, and some DAC’s do it, to attempt to match zero loudness curves. We can discuss the utility of that in Q/A.

What’s the point?
Within limits, using a noise-shaping system, you can move the noise around in frequency. You can, for instance, push lots of the noise up to high frequencies. That is one of the reasons for oversampling.
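Moving the noise around can be shown with the simplest shaper there is. This is a minimal sketch, not the H(s) of the talk: a first-order error-feedback quantizer that subtracts the previous sample's quantization error before quantizing, so the output error becomes the difference of successive errors and is pushed toward high frequencies. The test signal, the block length, and the block-average "lowpass" are all assumptions made to keep the example short.

```python
import math

def quantize(v):
    """Round to the nearest step (step size = 1 LSB)."""
    return math.floor(v + 0.5)

def plain_quantizer(x):
    return [quantize(v) for v in x]

def first_order_shaper(x):
    """Error feedback: subtract the previous quantization error before quantizing."""
    out, err = [], 0.0
    for v in x:
        u = v - err        # input minus the error made on the previous sample
        y = quantize(u)
        err = y - u        # error made on this sample
        out.append(y)
    return out

def low_freq_error_rms(x, y, block=16):
    """RMS of block-averaged error, a crude stand-in for noise left after lowpass filtering."""
    errs = [b - a for a, b in zip(x, y)]
    means = [sum(errs[i:i + block]) / block
             for i in range(0, len(errs) - block + 1, block)]
    return math.sqrt(sum(m * m for m in means) / len(means))

x = [3.3 * math.sin(2.0 * math.pi * 0.00123 * n) for n in range(4096)]
plain_rms = low_freq_error_rms(x, plain_quantizer(x))
shaped_rms = low_freq_error_rms(x, first_order_shaper(x))
print(plain_rms, shaped_rms)
```

The total error power of the shaped output is not smaller. It has been moved to high frequencies, where a later filter can remove it.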

What’s another reason for Oversampling?
You get to control the response of the initial anti-aliasing/anti-imaging filter digitally. As most everyone knows by now, high-order analog filters have a variety of problems:
They are hard to manufacture. That means expensive.
Their long-term performance is very hard to assure. That means that they tend to annoy the customer.
Since they are IIR filters, they have startling phase problems near the transition and stop bands. That annoys the customer, too.

What sort of oversampling does the filter issue lead us to?
[Block diagram: 4x oversampling, digital FIR filter, 5th order analog filter.] This is shown as an example.

The results?
The 13th order analog filter (with horrible phase response) is replaced by a 5th order analog filter. The first, sharp antialiasing filter is now a digital filter, with deterministic behavior and performance. All it takes is “MIPS”. Nowadays, MIPS are cheap. The filter is trustworthy. It won’t drift, oscillate, distort, etc., if it’s designed and implemented properly. Its characteristics are exactly known.

Remember the Filters in the ear?
Your ear is a time analyzer. It will analyze the filter taps, etc, AS THEY ARRIVE, it doesn’t wait for the whole filter to arrive. If the filter has substantial energy that leads the main peak, this may be able to affect the auditory system. In Codecs this is a known, classic problem, and one that is hard to solve. In some older rate converters, the pre-echo was quite audible. The point? We have to consider how the ear will analyze any anti-aliasing filter. Two examples follow.

An example of a filter with passband ripple and barely enough stop band rejection.

An example of a good, longer filter, with less passband ripple.

An interesting result
Trying to use the shortest possible filter (i.e. minimizing MIPS) results in a worse time response from the point of view of the auditory system. Passband ripple means that there are “tails” on the filter.

Another interesting result
Sharper filters have more “ringing”, and may have more auditory problems. A filter that cuts off over 2.05 kHz must necessarily have a wider main lobe than the narrowest (in time) cochlear filter, since df * dt >= 1. A filter that cuts off over 4kHz will have a main lobe a bit smaller than the narrowest cochlear filter. This suggests that for higher sampling rates, we do not want the ‘fastest’ filter, but rather a filter with a wider transition band and a narrower time response.
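The df * dt >= 1 bound turns a transition width directly into a minimum main-lobe duration. A minimal sketch, assuming the passband ends at 20kHz and the stopband starts at fs/2 (that assumption reproduces the 2.05kHz and 4kHz widths above for 44.1kHz and 48kHz; the 96kHz row is added for comparison):

```python
def transition_width_hz(fs_hz, passband_edge_hz=20000.0):
    """Transition band from the passband edge up to fs/2 (assumed filter spec)."""
    return fs_hz / 2.0 - passband_edge_hz

def min_main_lobe_s(transition_hz):
    """Lower bound on main-lobe duration from df * dt >= 1."""
    return 1.0 / transition_hz

for fs in (44100.0, 48000.0, 96000.0):
    df = transition_width_hz(fs)
    print(f"fs {fs:.0f} Hz: transition {df:.0f} Hz, main lobe at least {1e3 * min_main_lobe_s(df):.3f} ms")
```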

Two examples:

Is this audible?
That’s a good question. Since we are stuck, in general, with the filters our ADC’s and DAC’s use, it’s dreadfully hard to actually run this listening test. How would I do that?
Get a DAC with a SLOW rolloff running at 4x (192K).
Make a DC to 20 K Gaussian pulse at 192kHz.
Downsample by zeroing 3 of every 4 samples and multiplying the others by 4.
Generate a third signal with a TIGHT filter.
Compare the three signals in a listening test.
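The pulse and the zeroing step of that test can be sketched in a few lines. The pulse width `SIGMA_S` and the pulse length are assumptions, since the talk gives no values. The multiply-by-4 keeps the low-frequency gain the same as the original, which the sums below confirm.

```python
import math

FS_HIGH = 192_000.0   # Hz, the 4x rate named in the talk
SIGMA_S = 20e-6       # assumed Gaussian width in seconds (not from the talk)

def gaussian_pulse(n_samples=257):
    """Gaussian pulse centred in the buffer, sampled at FS_HIGH."""
    centre = n_samples // 2
    sigma_samples = SIGMA_S * FS_HIGH
    return [math.exp(-0.5 * ((n - centre) / sigma_samples) ** 2)
            for n in range(n_samples)]

def zero_three_of_four(x):
    """Keep every 4th sample, scaled by 4, and zero the other three."""
    return [4.0 * v if n % 4 == 0 else 0.0 for n, v in enumerate(x)]

pulse = gaussian_pulse()
decimated = zero_three_of_four(pulse)
print(sum(pulse), sum(decimated))   # DC gain should match closely
```

The zeroed version still runs at 192kHz, so it can be played through the same slow-rolloff DAC as the original pulse.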

Is this 4x oversampling what people do?
Not generally. That’s what they did for a while, until MIPS got even cheaper. What they did next was go to more oversampling. A LOT more oversampling? Yes. It uses more digital and less analog. There are a whole variety of circuitry and linearity reasons, and almost all of them point toward much more oversampling and less “analog” hardware.

Massive oversampling:
Remember: One gets 3dB per doubling of Fs from oversampling with a flat noise floor. If we also put a single integrator with its zero at 20kHz into H(s), we will see that the increased SNR available is db/doubling of Fs. There will be some cost in the form of a constant negative term to this SNR, which is overcome by very moderate levels of oversampling. Each additional order of integration adds another 6dB/doubling of the sampling frequency. On the next page are some curves:

[Figure: Noise shape vs. order, orders 1 through 7, integrator pole at w=1.] (Note: Curves as examples only. Real-world circuit considerations limit these curves.)

Right. So what does that do for me, anyhow?
Remember, nearly the same amount of noise is being shaped in each case. As there is more “space” under the curve at high frequencies, more of the noise moves to high frequencies. That means there is LESS noise at low frequencies. Therefore, if we FILTER OUT the high frequencies, we wind up with a lower sampling rate signal with a higher SNR.
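The "3dB per doubling, plus 6dB per doubling for each order, minus a constant" behaviour can be seen in the usual textbook approximation for a pure differentiator noise transfer function, (1 - z^-1)^order, at large oversampling ratios. That transfer function is an assumption made for this sketch. It is not the integrator with a knee at 20kHz used elsewhere in this talk, so these numbers are not expected to reproduce the talk's own figures.

```python
import math

def shaped_snr_gain_db(order, osr):
    """Large-OSR approximation of in-band SNR gain for an assumed NTF (1 - z^-1)^order."""
    slope_term = 10.0 * math.log10((2 * order + 1) * osr ** (2 * order + 1))
    constant_penalty = 10.0 * math.log10(math.pi ** (2 * order))
    return slope_term - constant_penalty

for order in (0, 1, 2, 3):
    per_doubling = shaped_snr_gain_db(order, 128) - shaped_snr_gain_db(order, 64)
    print(f"order {order}: {per_doubling:.2f} dB per doubling, "
          f"{shaped_snr_gain_db(order, 64):.1f} dB at 64x")
```

Order 0 is the flat-noise case. Each extra order steepens the slope and enlarges the constant penalty, which matters only at low oversampling ratios.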

Some Examples (low order)
Original SNR 0dB. Base before upsampling = 48kHz

SNR vs. order vs downsampling rate for that (ideal) system.
1x    0 (dB)
2x    5.9  8.9  11.9  15.0  17.8  29,8
4x    11.8  23.7  30.0  35.6  41.6
8x    17.6  26.5  35.3  44.1  53.8  61.9
16x   23.1  34.8  46.5  58.1  69.8  81.5
32x   28.1  40.3  57.0  71.2  85.5  99.8
64x   32.1  47.8  66.3  82.8  99.4  116.0
128x  34.7  50.6  74.0  92.1  110.8  129.3

Real converters
IC designers have found that having 4-bit flash converters (16 levels, 24dB SNR) inside a delta-sigma converter is often the cheapest way to achieve the required results with present-day digital circuitry. The sampling rate can be lower, so the circuitry and power run slower. The filters can be shorter. The flash converter takes some space, but less power and space than additional DSP circuitry.

Details about those examples
All of the examples have ‘n’ integrators with a knee at 20kHz. This is not necessarily the optimum solution; it is used as an example. The examples are theoretically calculated, so there is no component or electrical error involved. Nothing is ever this good in the real world. Are you surprised?

SUMMARY

Auditory system characteristics
Everything must be considered within the relevant cochlear filter bandwidth. 0dB SPL is slightly below atmospheric noise level. 120dB SPL is a good maximum, even that level is very dangerous for hearing. High frequency issues may be due to actual hearing, to filter time response issues, or both. Gradual filters are safer than steep filters.

Quantization and Sampling
Antialiasing and antiimaging filters are not just a good idea, they’re a requirement. Dithering is not just a good idea, it’s a requirement. There are many ways to quantize and sample.

Converter Technology
Converter technology exists to do proper, clean conversions that operate over the advisable part of the human hearing range, both in frequency and level. There is no basic mathematical difference in the result of SAR vs. a Delta-Sigma converter in terms of what it delivers to the PCM system; the differences are due to circuitry and cost issues. Capturing the direct delta-sigma waveform (single or multibit) can be done. One group of high-rate proponents sells such a system. The only things this results in, practically, are the removal of the sharp anti-aliasing filter and the retention of the high-frequency noise. This also makes it hard to process or capture the signal. For instance, an IIR filter that does bass boost might take 128 bits of arithmetic width to implement for a 1-bit data input. Most electronics then require a filter to protect them from the HF noise. Tweeters and power amps in particular do not like this kind of input at all.

How to test converters
Noise in the presence of low signal
Noise in the presence of maximum signal
Single tone source
Broadband (i.e. something like the “room correction” noise) stimuli
Multitones that test for aliasing
One could do another 2 hours on how to test converters. Is there a demand?
