1 Copyright (c) 2006 by James D. Johnston. Permission granted for any educational use. Audio vs. Video What’s the difference? J.D. Johnston With contributions from David Workman

2 First, a basic reminder of the perceptual aspects: What stands out for “audio”? What stands out for “video”? How are the perceptual systems the same? How are they different?

3 Audio vs. Video Perceptual Systems [diagram]: the audio chain runs Frequency Analysis → Loudness Analysis → Feature Analysis → Feature Extraction (with a mechanical connection at the front end); the video chain runs Brightness Analysis → Feature Analysis → Feature Extraction. Per-stage data rates fall from megabytes through kilobytes down to 1-2 bytes. Byte measurements are PER SECOND.

4 [Diagram] Audio chain: Frequency Analysis → Loudness Analysis → Feature Analysis → Feature Extraction, with time components in the analysis of roughly 20+ kHz, ~50 Hz, and ~5 Hz at successive stages (leaving out some parts of time analysis involving adaptation).

5 [Diagram] Video chain: Brightness Analysis → Feature Analysis → Feature Extraction, with time components of roughly ~50 Hz and ~5 Hz. Again, leaving out adaptation.

6 What’s the Point? The eye is a weak time analyzer, in terms of “response time” or Hz. The ear is a very good time/frequency analyzer. There are, of course, other differences: –The eye primarily receives data spatially –The ear primarily receives data as a function of time –The eye has no intrinsic time axis –The ear has an intrinsic time analyzer –The ear analyzes pressure (1 variable per ear) –The eye analyzes 4 variables (RGBY) over space with varying resolution for each variable depending on direction of sight.

7 Regarding each of those steps: Most theories of memory have humans possessing 3 levels of memory. Different people call them different things, but they can be summarized as follows for the auditory system: –Loudness memory (persists up to 200 ms) –Feature memory (persists for seconds) –Long term memory (persists indefinitely) At each step, from Loudness → Feature and from Feature → Long Term, much information is lost as loudness is turned into features, and features into auditory objects.

8 The auditory effect of a Glitch: There’s no getting around the frequency analysis! [Plots: spectrum of the glitch-free signal vs. the glitched signal.]

9 The result? Because of the inherent frequency analysis, the loudness of the glitch can be literally ~10 times that of the original signal, give or take. The glitch is VERY perceptible and becomes the primary feature. It’s very roughly equal to having a photographic strobe go off in your face.
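To make that concrete, here is a minimal sketch (not from the original deck; the sample rate, tone frequency, and 1 ms dropout are illustrative assumptions) showing how a short dropout in a pure tone spreads energy across the whole spectrum, exactly where the ear's frequency analysis will find it:

```python
# Illustrative sketch: a 1 ms dropout in a 1 kHz tone raises a broadband
# "skirt" in the spectrum, i.e. energy at frequencies the original signal
# never contained.
import numpy as np

fs = 48000                               # sample rate in Hz (assumed)
t = np.arange(fs // 10) / fs             # 100 ms of signal
clean = np.sin(2 * np.pi * 1000 * t)     # glitch-free 1 kHz tone

glitched = clean.copy()
glitched[2400:2448] = 0.0                # 1 ms dropout (48 samples)

def spectrum_db(x):
    """Windowed magnitude spectrum, normalized to 0 dB at the peak."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return 20 * np.log10(mag / mag.max() + 1e-12)

clean_db = spectrum_db(clean)
glitch_db = spectrum_db(glitched)

# Look well away from the 1 kHz tone (bins are 10 Hz wide here):
print("clean signal,    5-6 kHz region:", clean_db[500:600].max(), "dB")
print("glitched signal, 5-6 kHz region:", glitch_db[500:600].max(), "dB")
```

Far from the tone, the glitched spectrum sits tens of dB above the clean one: that broadband energy is the "event" the slide describes.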

10 What happens when we repeat a frame? A good example is the 2:3 pull-up that happens during telecine: 24 fps content is padded out to 30 fps by repeating “fields” (1/60th of a second) of video (not “frames”, which are 1/30th of a second) in a 2:3 pattern. The motion looks smooth unless you know what to look for. Repeating a single full frame is usually not noticeable, but repeating frames in a sequence will give a very obvious “judder” to the image, as the sketch below illustrates.
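A rough sketch of that field-repetition pattern (purely illustrative; field-parity handling in real telecine gear is more involved):

```python
# 2:3 pulldown sketch: every pair of 24 fps film frames becomes 2 + 3 = 5
# video fields, so 4 film frames -> 10 fields of 60-field/s (30 fps
# interlaced) video.
def pulldown_2_3(frames):
    fields = []
    for i, frame in enumerate(frames):
        repeats = 2 if i % 2 == 0 else 3          # 2, 3, 2, 3, ...
        for _ in range(repeats):
            parity = "top" if len(fields) % 2 == 0 else "bottom"
            fields.append((frame, parity))
    return fields

print(pulldown_2_3(["A", "B", "C", "D"]))
# [('A','top'), ('A','bottom'), ('B','top'), ('B','bottom'), ('B','top'),
#  ('C','bottom'), ('C','top'), ('D','bottom'), ('D','top'), ('D','bottom')]
```

Frames B and D each persist for three field times while A and C persist for two; that slight cadence irregularity is what a trained eye can spot.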

11 Well? It’s not a great thing to do, but under most circumstances it doesn’t introduce the equivalent of a “WHITE FLASH” or something of that sort, even though it’s not imperceptible.

12 What happens when we drop a frame? Dropped frames are usually more noticeable than repeated frames. Dropped frames in a regular pattern (e.g., decimating 30 frames/s to 15 frames/s) are LESS objectionable than repeated frames in a 2:3 pull-down pattern. Still not equivalent to having a “strobe light” flashed at you.

13 Therefore: A glitch in audio creates an event that is very “big”. The glitch, in and of itself, pushes a great deal of different information down the perceptual chain. A glitch in video is usually not “big”: adjacent frames don’t change that much in most cases (and are “events” in their own right when they do), the frame-to-frame difference isn’t large, and there is no inherent time analysis to capture the glitch.

14 Dynamic Range of Perception Audio provides at least 120dB of level change over which loudness changes are perceived under normal circumstances. The eye has a similar range over which brightness changes are perceived under normal circumstances. HOWEVER:

15 Adaptation in Audition: –Adaptation takes place over millisecond-to-second timescales. –A local transducer SNR of 30dB, give or take, is scaled rapidly over at least a 120dB range. –The frequency analysis can have 90dB of range at any given instant. –Linearity need not apply. Adaptation in Vision: –Adaptation takes place over timescales of 1 second to 1000 seconds. –The local transducer SNR is the same. –In the short term, the eye is more “linear”.

16 Stereo Response Two eyes to see 3D, two ears to hear stereo … Cover one eye and you still have “hints” about how far away things are Plug one ear and you still have “hints” about how far away things are Both can be fooled HOWEVER:

17 “Stereo” in Audition Stereo sensations in the auditory system come about from two sources: –Interaural TIME difference –Interaural LEVEL difference The speed of sound allows for biologically meaningful delay between the ears. Direction cues must be derived from the time delay.
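As a back-of-the-envelope illustration of the time-difference cue (a simple spherical-head model; the head radius and the Woodworth-style formula are assumptions, not taken from the slides):

```python
# Interaural time difference (ITD) for a distant source at a given azimuth,
# using the classic Woodworth approximation: ITD = (r / c) * (theta + sin theta).
import math

SPEED_OF_SOUND = 343.0    # m/s, room temperature
HEAD_RADIUS = 0.0875      # m, a typical adult value (assumed)

def itd_seconds(azimuth_deg):
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(f"{az:3d} degrees off-axis -> {itd_seconds(az) * 1e6:5.0f} microseconds")
```

The delays top out around 650-700 microseconds: tiny on a human timescale, but comfortably within the resolution of the auditory system's time analysis.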

18 Stereo Vision The speed of light does not allow for biologically perceptible time delays between the eyes. The eyes, however, are spatial receptors. –Parallax between the two eyes provides distance cues. –Direction cues are BUILT IN to the 2D image.
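A minimal sketch of the parallax cue mentioned above, using the standard pinhole-stereo relation Z = f * B / d; the baseline and focal length below are illustrative assumptions:

```python
# Distance from binocular disparity: Z = f * B / d, where f is the focal
# length (in pixels), B the baseline between the two viewpoints, and d the
# disparity (in pixels) between the left and right images.
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    return focal_px * baseline_m / disparity_px

# Human-ish interocular baseline (~6.5 cm) and an arbitrary focal length.
for d in (40.0, 20.0, 10.0, 5.0):
    z = depth_from_disparity(focal_px=800.0, baseline_m=0.065, disparity_px=d)
    print(f"disparity {d:5.1f} px -> distance {z:5.2f} m")
```

Because distance is inversely proportional to disparity, the cue is strong up close and fades quickly for far-away objects.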

19 Color Vision: Rod cells detect luminance; Cone cells detect chrominance. There are many more Rod cells than Cone cells, and correspondingly much better luma resolution. Chroma perception drops off at low light levels. Luma perception “bleaches out” at high light levels. Therefore, day vision and night vision are substantially different.

20 “Color Audition” Doesn’t exist. This variable does not exist in the auditory system. The auditory system works approximately the same for loud vs. quiet sounds. The changes are second order except at the thresholds of hearing and pain. This slide intentionally left blank.

21 Gamma Correction: Video cameras have a gamma curve applied to the sensing element. Originally, this was to “pre-compensate” for non-linearities in the CRT display; the opposite gamma curve is applied in a TV set to restore the linear levels. Gamma curves also compensate for the difference in differential sensitivity of the eye at brighter and darker levels. Any “harmonic distortion” doesn’t matter; the eye is a spatial receptor.
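A simple sketch of that encode/decode pairing, assuming a plain power law with gamma = 2.2 (real transfer functions such as Rec. 709 or sRGB add a linear segment near black):

```python
# Gamma pre-correction at the camera and the complementary curve at the
# display cancel out, leaving an end-to-end linear light transfer.
GAMMA = 2.2

def encode(linear):            # camera side
    return linear ** (1.0 / GAMMA)

def decode(coded):             # display side
    return coded ** GAMMA

for level in (0.01, 0.1, 0.5, 1.0):
    coded = encode(level)
    print(f"linear {level:4.2f} -> coded {coded:4.2f} -> displayed {decode(coded):4.2f}")
```

The coded values devote more of the available range to dark levels, which is exactly where the eye's differential sensitivity is greatest.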

22 Gamma correction in audio? That would introduce a great deal of nonlinear distortion to an audio signal. The frequency-domain components would be audible (the ear is a frequency analyzer), and a varying noise floor would be exposed by the 90+dB mechanical filtering mechanisms in the ear.
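To see why, here is a small sketch (assumed sample rate and tone frequency) that pushes a pure tone through a gamma-style curve and inspects the spectrum:

```python
# Applying a power-law "gamma" to audio creates harmonics: an odd,
# memoryless nonlinearity turns a single 1 kHz tone into 1, 3, 5, ... kHz.
import numpy as np

fs = 48000
t = np.arange(fs) / fs                                  # 1 second of signal
tone = np.sin(2 * np.pi * 1000 * t)                     # pure 1 kHz tone

warped = np.sign(tone) * np.abs(tone) ** (1.0 / 2.2)    # "gamma" on the waveform

mag = np.abs(np.fft.rfft(warped * np.hanning(len(warped))))
mag_db = 20 * np.log10(mag / mag.max() + 1e-12)

for harmonic in (1, 3, 5, 7):                           # bins are 1 Hz wide here
    print(f"{harmonic} kHz component: {mag_db[1000 * harmonic]:6.1f} dB")
```

Those added components are exactly what the ear's frequency analysis is built to hear, which is why a trick that is harmless for video is ruinous for audio.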

23 Conclusions:

24 The main differences: Time/Frequency vs. spatial analysis Dynamic range vs. time Linearity Stereo senses Color (the ear hasn’t any)

25 Spatial vs. Frequency Analysis: The eye is primarily a spatial analyzer; its frequency analysis happens at higher (cognitive) levels and is “slow” compared to audio. Accuracy in the spatial domain is of primary importance. The ear is first and foremost a frequency analyzer; accuracy in the time domain and the time/frequency domain is of primary importance.

26 Dynamic Range vs. Time: The ear adapts on a 0.001 to 1 second basis. Adaptation is effectively instantaneous. Unless the sound is loud enough to cause temporary threshold shift (and that’s loud), there is little “memory”. The eye adapts on a 1 to 1000 second basis (enter a cave from bright sunshine, and see how long it takes you to adapt).

27 Dynamic Range vs. Frequency The ear’s mechanical filters can provide in excess of 90dB separation at different frequencies at any given instant, leading to PCM depths of at least 16 bits. The eye can handle much less range at a given instant, leading to 8 to 10 bit PCM video.
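A quick arithmetic check of those bit-depth figures, using the 6.02 dB-per-bit rule of thumb (not from the original slides):

```python
# Each PCM bit contributes roughly 20 * log10(2) ~= 6.02 dB of dynamic range.
import math

db_per_bit = 20 * math.log10(2)

for bits in (8, 10, 16, 24):
    print(f"{bits:2d}-bit PCM -> ~{bits * db_per_bit:5.1f} dB")
```

8 to 10 bits (about 48 to 60 dB) covers the eye's instantaneous range, while the ear's 90+ dB of simultaneous separation pushes audio to 16 bits or more.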

28 Linearity The ear isn’t linear. I think I said this once or twice, but I’ll say it again. The eye, given a fixed overall illumination level, is more linear. The ear, however, requires linear capture and reproduction, because nonlinearities introduce frequency components that were not present in the original signal. The eye, not having the frequency analyzer, does not suffer nearly as much from this problem. Video capture can use “gamma” or ‘warped’ capture curves as a result.

29 Stereoscopy: The eye implicitly captures direction; it is a spatial sensor. The ear derives direction information from time and level information. The eye can derive distance information from parallax, due to its spatial ability. The ear can, barely, abstract some idea of distance from tertiary cues; it can do pretty well once you learn the acoustics of your location.

30 Yeah, so? For hearing, keep the time domain intact. Errors in the time domain, drop-outs, etc., create very annoying events. Don’t make errors that create new frequencies! For vision, keep the spatial domain intact. The time domain, while important, is secondary. Don’t worry about making new “frequencies” (the sensor isn’t that sensitive), BUT GET YOUR EDGES RIGHT.

