
1 Speech and Audio Processing and Coding (cont.) Dr Wenwu Wang, Centre for Vision, Speech and Signal Processing, Department of Electronic Engineering

Timbre Perception (Ack. S. Zielinski) 2

What is Timbre? According to the American Standards Association, timbre is “that attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are dissimilar”. Musically, it is “the quality of a musical note which distinguishes different types of musical instruments”. More loosely, it can be defined as “everything that is not loudness, pitch or spatial perception”. Perceptual attributes and their main physical correlates: Loudness: amplitude (frequency dependent). Pitch: fundamental frequency. Spatial perception: IID, IPD (interaural intensity and phase differences). Timbre: ??? 3

Physical Parameters Timbre relates to: the static spectrum (e.g. the harmonic content of the spectrum); the envelope of the spectrum (e.g. the peaks in the LPC spectrum, which correspond to formants); the dynamic (time-evolving) spectrum; phase; … 4
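As an illustration of the "envelope of spectrum" point above, here is a minimal Python sketch (an added example, not part of the original slides) that estimates an LPC spectral envelope for a single frame. The frame length, LPC order and sample rate are illustrative choices, and the synthetic test frame is only a stand-in for a real vowel segment.

```python
# Minimal sketch (not from the slides): estimate a spectral envelope via LPC.
# The peaks of gain / |A(f)| approximate the formants mentioned above.
import numpy as np

def lpc_envelope(frame, order=12, sr=8000, n_fft=512):
    """Return (frequencies in Hz, LPC spectral envelope) for one frame."""
    x = frame * np.hamming(len(frame))                      # taper the analysis frame
    r = np.correlate(x, x, mode="full")[len(x) - 1:]        # autocorrelation r[0..N-1]

    # Levinson-Durbin recursion for prediction coefficients a[0..order], a[0] = 1
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / e                                        # reflection coefficient
        prev = a.copy()
        a[1:m + 1] = prev[1:m + 1] + k * prev[m - 1::-1]    # coefficient update
        e *= 1.0 - k * k                                    # residual prediction error

    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    envelope = np.sqrt(e) / np.abs(np.fft.rfft(a, n_fft))   # gain / |A(e^{j2*pi*f})|
    return freqs, envelope

# Example use with a synthetic harmonic frame (illustrative stand-in for a vowel):
sr = 8000
t = np.arange(0, 0.032, 1.0 / sr)
frame = sum(np.sin(2 * np.pi * k * 100 * t) / k for k in range(1, 30))
f, env = lpc_envelope(frame, order=12, sr=sr)
```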

Static Spectrum 5

Spectrum Envelope Formants affect the sensation of timbre 6

Spectrum Envelope (cont) Formants determine not only timbre, but also the recognition of vowels 7

Spectrum Envelope (cont) This figure shows what the spectral envelope looks like for a trumpet sound 8

Spectrum Envelope (cont) The spectral envelopes of the flute (upper figure) and the piano (lower figure) show that the envelope differs between musical instruments. 9

Dynamic Spectrum This figure shows how the spectrum of a trumpet sound evolves over time 10

Phase The two magnitude spectra above are identical, while the corresponding waveforms are completely different. The timbres of the two sounds are nevertheless almost identical, so phase affects timbre only to a very small extent. This also suggests that human hearing is largely insensitive to phase differences. 11
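A small Python sketch (an added illustration, not from the slides) of the point above: two harmonic complexes built with identical component amplitudes but different component phases have the same magnitude spectrum yet very different waveforms. The fundamental frequency, duration and number of harmonics are arbitrary choices.

```python
# Sketch: identical magnitude spectra, different phases, different waveforms.
import numpy as np

sr, f0, dur = 16000, 200, 0.5
t = np.arange(int(sr * dur)) / sr
harmonics = range(1, 11)
phases = np.random.default_rng(0).uniform(0, 2 * np.pi, 10)

# Same amplitudes (1/k), different phase patterns
sine_phase   = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in harmonics)
random_phase = sum(np.sin(2 * np.pi * k * f0 * t + p) / k
                   for k, p in zip(harmonics, phases))

# Magnitude spectra are (numerically) identical; waveforms are not
S1 = np.abs(np.fft.rfft(sine_phase))
S2 = np.abs(np.fft.rfft(random_phase))
print(np.allclose(S1, S2))                         # True
print(np.max(np.abs(sine_phase - random_phase)))   # clearly nonzero
```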

Demos for Timbre Perception Resources: Audio Box CD from Univ. of Victoria Examples of differences in timbres 12

Auditory Masking 13

What is masking? Masking: one sound is made inaudible by another. Simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another, simultaneous sound (the masker); that is, the signal and the masker are present at the same time. It is also known as frequency masking or spectral masking: if two sounds occupy the same frequency band, they can be perceived clearly when presented separately, but not when presented simultaneously, e.g. tones at 440 Hz and 450 Hz. Non-simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) that precedes or follows the signal; that is, the two are not present at the same time. 14

What is masking? (cont) 15

Simultaneous Masking On-frequency masking: the masker and the signal lie within the same auditory filter band, and the louder sound masks the quieter one. Off-frequency masking: the masker and the signal lie in different frequency bands; the masking effect is weaker than in on-frequency masking. (Source: figures from Wikipedia, 2010) 16

Simultaneous Masking (cont) In off-frequency masking, the amount by which the masker raises the threshold of the signal is much smaller than in on-frequency masking; however, the masker still has some masking effect on the signal, as shown in the figure above. To achieve the same masking effect as in on-frequency masking, the masker level must be greater in off-frequency masking. (Source: figures from Wikipedia, 2010) 17

Demos for Simultaneous Masking (Frequency Domain Masking) Resources: Audio Box CD from Univ. of Victoria. A single tone is played, followed by the same tone together with a higher-frequency tone. The higher-frequency tone is reduced in intensity first by 12 dB, then in steps of 5 dB. The sequence is repeated twice; the second time, the frequency separation between the tones is increased. Pure tones mask higher frequencies better than lower frequencies, and this demo tries to mask high frequencies. The next demo shows that a tone of greater intensity masks a broader range of tones than a tone of lesser intensity: a single tone is played, followed by the same tone together with a higher-frequency tone, and the higher-frequency tone is reduced in intensity first by 10 dB, then in steps of 3 dB. The sequence is repeated twice, the second time with the intensity of the single tone increased by 28 dB. Pure tones mask higher frequencies better than lower frequencies; this demo tries to mask low frequencies. 18

The Amount of Masking In the example above, the amount of masking is 16 dB: the difference between the masked threshold and the unmasked threshold. Note that the threshold for a masked signal is raised relative to the threshold when the signal is not masked (for example, when the signal is heard in a quiet environment). (Source: figures from Wikipedia, 2010) 19

Masking Reveals the Frequency Resolution of the Auditory System Frequency selectivity, also known as frequency resolution, refers to the ability of the human auditory system to separate the different frequency components of a complex sound. Recall the concept of the critical bandwidth: two sounds with sufficiently different frequencies (pitches) can be heard as two separate tones. This separation is performed by the filtering process of the cochlea, where the complex sound is (band-pass) filtered and decomposed into individual frequency components (sinusoids), which are then coded independently in the auditory nerve. Masking is commonly used to quantify and characterise the frequency resolution of the auditory system: the auditory system cannot separate two frequencies if the sound at one frequency is masked by the sound at the other. Masking therefore reveals the limits of the frequency resolution of the human auditory system. 20

Use Masking to Estimate the Critical Band The original experiment by Fletcher (1940) measured the threshold for detecting a sinusoidal signal as a function of the bandwidth of a band-pass noise masker. Conditions: the noise was centred at the signal frequency, and the noise power density was kept constant. Findings: at first, the threshold increases as the noise bandwidth increases; however, it flattens off as the bandwidth increases further. This is due to the critical bandwidth: once the noise bandwidth exceeds the bandwidth of the auditory filter, the threshold ceases to increase even though the total noise power keeps increasing. The power-spectrum model of masking assumes (Moore, 1995): the auditory system behaves as a bank of overlapping linear band-pass filters; the listener detects the signal using the single filter whose centre frequency is close to that of the signal; the signal is masked only by the noise components that pass through that auditory filter; and the threshold corresponds to a certain signal-to-noise (signal-to-masker) ratio at the filter output. A small numerical sketch of this model follows. 21
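The sketch below is an added illustration, not part of the slides; the noise spectrum level, critical bandwidth and threshold signal-to-noise criterion are assumed values. It shows how the power-spectrum model predicts Fletcher's band-widening result: the predicted threshold rises with noise bandwidth and then flattens once the bandwidth exceeds the critical band.

```python
# Sketch of the power-spectrum model prediction for Fletcher's band-widening
# experiment (rectangular "critical band" filter; all parameter values assumed).
import numpy as np

def predicted_threshold_db(noise_bandwidth_hz,
                           noise_spectrum_level_db=40.0,  # dB/Hz, assumed
                           critical_bandwidth_hz=160.0,   # ~critical band at 1 kHz, assumed
                           criterion_snr_db=0.0):         # threshold SNR at filter output, assumed
    """Signal level at threshold: only the noise inside the auditory filter masks it."""
    effective_bw = np.minimum(noise_bandwidth_hz, critical_bandwidth_hz)
    noise_power_in_filter_db = noise_spectrum_level_db + 10 * np.log10(effective_bw)
    return noise_power_in_filter_db + criterion_snr_db

bandwidths = np.array([25, 50, 100, 200, 400, 800, 1600])   # Hz
print(predicted_threshold_db(bandwidths))
# Threshold rises with bandwidth up to ~160 Hz, then flattens off,
# mirroring Fletcher's finding described above.
```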

Psychophysical Tuning Curves Psychophysical tuning curves (PTCs) provide a method for estimating the shape of the auditory filter. The PTCs above were determined in simultaneous masking, using sinusoidal signals at 10 dB SPL. For each curve, the diamond below it shows the frequency and level of the signal. The masker was a sinusoid with a fixed starting-phase relationship to the signal. The masker level required for threshold (i.e. to just mask the signal) is plotted as a function of masker frequency on a logarithmic scale. The dashed line represents the absolute threshold for the signal. Figure from (Moore, 1995). 22

Shape of the Auditory Filter The shape of the auditory filter centred at 1 kHz, plotted for input sound levels ranging from 20 to 90 dB SPL/ERB. The output level of the filter is plotted as a function of frequency. On the low-frequency side, the filter becomes progressively less sharply tuned with increasing sound level; on the high-frequency side, the sharpness of tuning increases slightly with increasing sound level. At moderate sound levels the filter is approximately symmetric on the linear frequency scale used. Figure from (Moore, 1995) 23
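For reference, a simple symmetric rounded-exponential (roex) approximation to the auditory filter is sketched below. This is an added illustration with assumed parameter values (centre frequency 1 kHz, ERB of about 132 Hz), and it deliberately ignores the level-dependent asymmetry described above.

```python
# Sketch (simplified, parameters assumed): the symmetric roex auditory filter shape.
import numpy as np

def roex_filter(f, fc=1000.0, erb=132.0):
    """Power response W(g) = (1 + p*g) * exp(-p*g), with g = |f - fc| / fc."""
    g = np.abs(f - fc) / fc
    p = 4.0 * fc / erb          # p chosen so the filter's equivalent rectangular bandwidth is `erb`
    return (1.0 + p * g) * np.exp(-p * g)

f = np.linspace(500, 1500, 1001)
response_db = 10 * np.log10(roex_filter(f))   # plot against f to see the filter shape
```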

Bark Scale Proposed in 1961 by Eberhard Zwicker and named after Heinrich Barkhausen, who proposed the first subjective measurement of loudness. The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing. The band edges are (in Hz): 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.
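A closed-form approximation is often more convenient than the table of band edges. The sketch below (an added example) uses Traunmüller's (1990) formula to convert Hz to Bark; the agreement with the listed band edges is close in the mid-range and drifts by roughly half to three-quarters of a Bark at the extremes.

```python
# Sketch: Traunmüller's (1990) approximation for converting frequency in Hz to Bark.
def hz_to_bark(f_hz: float) -> float:
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

for edge in (100, 510, 1270, 2700, 6400, 15500):
    print(edge, round(hz_to_bark(edge), 2))
# Prints approximately 0.77, 5.01, 10.01, 15.0, 19.99, 23.27 Bark
# (the nominal band-edge values are 1, 5, 10, 15, 20, 24).
```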

Non-Simultaneous Masking Forward masking: the masker precedes the masked tone. Backward masking: the masker follows the masked tone. [Figure: timing diagrams showing the gap T between masker and masked tone; for forward masking T cannot be as long as 20-30 ms, and for backward masking T cannot be more than about 10 ms.] 25

Forward Masking The left figure shows the amount of forward masking of a 2 kHz signal as a function of the time delay between the signal and the end of the noise masker. Each curve represents a different noise level. The results for each spectrum level fall on a straight line when the signal delay is plotted on a logarithmic scale. The right figure shows the same thresholds plotted as a function of masker level. The slopes of these growth-of-masking functions decrease with increasing signal delay. Figures from (Moore, 1995) 26

Forward Masking Forward masking is greater the nearer in time the signal occurs to the masker. Increments in masker level do not produce equal increments in the amount of forward masking, i.e. the slope of the growth-of-masking function is less than 1. This is in contrast to simultaneous masking, where the slope is close to 1. 27

PTC Comparisons Comparison of the psychophysical tuning curves determined by simultaneous masking (triangles) and forward masking (squares). The masker level required for threshold is plotted as a function of the deviation of the masker frequency from the centre frequency, divided by the centre frequency. The centre frequencies are given in kHz. Figures from (Moore et al., 1984) 28

Demos for Non-simultaneous Masking (Time Domain Masking) Resources: Audio Box CD from Univ. of Victoria. Forward masking: a masking tone is played, followed after a 100 ms delay by a tone one semitone lower. Both tones can be heard even as the second tone is decreased in 3 dB steps. Forward masking: a masking tone is played, followed after a 10 ms delay by a tone one semitone lower; masking occurs in this demo. How many steps are audible before the second tone is masked? Backward masking: the initial tone is masked by the one that follows; the time delay is 100 ms. Backward masking: the time delay is decreased but is still more than 10 ms. Backward masking: the time delay is below 10 ms; masking occurs. How many steps are audible? 29

Examples of Modern Audio Formats
MP3: MPEG-1 or MPEG-2 Audio Layer 3 (or III), a patented lossy audio codec. It is a common format for consumer audio storage, as well as a standard for digital audio compression for the transfer and playback of music on digital audio players.
Ogg Vorbis: a lossy audio codec developed by the Xiph.Org Foundation (formerly the Xiphophorus company). Free and open source.
AAC: Advanced Audio Coding, an audio compression format specified in MPEG-2 and MPEG-4, and the successor to MPEG-1's "MP3" format.
WMA: Windows Media Audio, an audio codec developed by Microsoft.
MP2: MPEG-1 Audio Layer II (or MPEG-2 Audio Layer II), a lossy audio compression format defined by ISO/IEC alongside MPEG-1 Audio Layer I and MPEG-1 Audio Layer III (MP3). While MP3 is much more popular for PC and internet applications, MP2 remains a dominant standard for audio broadcasting.
ATRAC: Adaptive Transform Acoustic Coding, a family of proprietary audio compression algorithms developed by Sony. ATRAC allowed a relatively small disc such as the MiniDisc to have the same running time as a CD while storing the audio with minimal loss in perceptible quality. 30

Auditory Scene Analysis 31

Demos for Sequential Organisation Resources: Audio Box CD from Univ. of Victoria. In this demo, the sound is perceived as a single stream of notes: C4 G4 F4 B3. If the time delay is further decreased, we no longer hear a melody; we hear only rhythmic beats, and the auditory system is now hearing four groups of one note each. As the notes are sped up, rhythmic beats played as a melody begin to be heard; the auditory system is now hearing two groups of two notes. 32

Demo for Speech Segregation Resources: Audio Box CD from Univ. of Victoria. This demo begins with the two melodies of "Camptown Races" and "Yankee Doodle" at the same pitch. Each time the interleaved melody is played, one of the songs is shifted in pitch until eventually the two melodies become distinguishable. A second demo adjusts the amplitude of the two songs while leaving the pitch constant. A third demo plays the two melodies at the same pitch but with different timbres; the two melodies are instantly distinguishable. 33

Segregation of a melody from interfering tones Track 1 in Bregman’s ASA Demonstration 34

Segregation of a melody from interfering tones Track 5 in Bregman’s ASA Demonstration 35

Segregation of high notes from low ones in a sonata by Telemann Track 6 in Bregman’s ASA Demonstration 36

Streaming in African xylophone music Track 7 in Bregman’s ASA Demonstration 37

Effects of a timbre difference between the two parts in African xylophone music Track 9 in Bregman’s ASA Demonstration 38

Stream segregation of vowels and diphthongs Track 11 in Bregman’s ASA Demonstration 39

Stream segregation of high and low bands of noise Track 14 in Bregman’s ASA Demonstration 40

Apparent Continuity Track 28 in Bregman’s ASA Demonstration 41

Perceptual continuation Track 29 in Bregman’s ASA Demonstration 42