The self-to-other ratio applied as a phonation detector for voice accumulation
Svante Granqvist, Royal Institute of Technology, KTH, Stockholm, Sweden

Abstract
Binaural microphones were used to detect phonation in a human subject (figure 1). This detection was used to split the audio waveform into two (actually four) separate channels: one for voiced segments and one for the background noise. Because the subject's own voice is almost always louder than the background noise at the subject's ears, the channel with the voiced segments can be used to extract speaking level, F0 and phonation time. The Background channel can be used to estimate the background noise level. The method has previously been used as part of a voice accumulator in other studies [Södersten et al. 2001].

Example
This experimental recording was made in a laboratory environment where speech-shaped noise was played back through loudspeakers and a female speaker wore the microphones during a "conversation" with the author. The resulting level curves and switched audio can be seen in figures 2 and 3.

Web
This poster and sound samples are also available on the web:

Figure 1. The two microphones are attached near the ears of the subject.
Figure 2. Example: switched levels. Note the peaks in the S/O ratio channel.
Figure 3. Example: switched audio.

References
Ternström S. (1994) Hearing myself with others: Sound levels in choral performance measured with separation of one's own voice from the rest of the choir. J Voice 1994;8(4):
Södersten M., Hammarberg B., Granqvist S., Szabo A. (2001) Vocal behaviour and vocal loading factors for pre-school teachers at work studied with binaural DAT-recordings. Submitted for publication.
Signal processing
From the two microphone signals, five level signals are derived (figure 5):
1. The level at the left microphone (L level)
2. The level at the right microphone (R level)
3. The level of the difference signal (L-R level)
4. The level of the sum signal (L+R level)
5. The S/O ratio [Ternström, 1994], which is the difference between channels 3 and 4.
The sum and difference channels are high-pass filtered at 1 kHz before level extraction; see the section High-pass filter below.

Normally, the level in the S/O ratio channel correlates strongly with the instances of phonation (see figure 2) and can thus be used as a control signal for switching the audio and level signals. Two separate thresholds are applied to control the Self and Other switching. Typically, the Self signal will contain the voiced portions of the recording, with all pauses and unvoiced segments removed, whereas the Other signal will contain these pauses and unvoiced segments.

There are, however, instances when improved control is needed. This is achieved in the post-processing blocks to the right in figure 5. The most important feature is the construction of a Background control signal from the Other control signal (figure 6). Using this control signal rather than the Other control signal removes more of the subject's voice from the output, which is essential for estimating the background noise level. Similarly, a Talk channel can be derived by including short pauses and unvoiced segments (figure 7).

Figure 5. Schematic of the signal processing in the computer program "Aura".

Computer program
The binaural stereo recordings are used as input to the computer program "Aura" (figure 4). The signals are processed, and a number of channels can be selected to appear in the output files. The output files can contain either switched audio or switched level curves.

Figure 4. The computer program "Aura", which implements the method.
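The level extraction and switching described under Signal processing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the code of "Aura". The input signals are assumed to be high-pass filtered already. The sign of the S/O ratio is assumed to be sum level minus difference level, so that the in-phase own voice gives a positive value. The frame length, the thresholds, and the helpers `fill_short_gaps` (for a Talk-like signal) and `shrink` (for a Background-like signal) are hypothetical choices made for this example; the actual post-processing steps are those shown in figures 6 and 7.

```python
import math

def level_db(frame):
    """RMS level of one frame in dB re 1.0; the floor avoids log10(0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))

def so_ratio_db(left, right, frame_len):
    """Per-frame S/O ratio, assumed here to be L+R level minus L-R level."""
    ratios = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        sum_level = level_db([a + b for a, b in zip(l, r)])
        diff_level = level_db([a - b for a, b in zip(l, r)])
        ratios.append(sum_level - diff_level)
    return ratios

def switch(ratios, self_threshold_db, other_threshold_db):
    """Two separate thresholds: frames between them belong to neither mask."""
    self_mask = [r > self_threshold_db for r in ratios]
    other_mask = [r < other_threshold_db for r in ratios]
    return self_mask, other_mask

def fill_short_gaps(mask, max_gap):
    """Talk-like mask: bridge interior False runs of at most max_gap frames."""
    out = list(mask)
    n = len(mask)
    i = 0
    while i < n:
        if mask[i]:
            i += 1
            continue
        j = i
        while j < n and not mask[j]:
            j += 1
        if i > 0 and j < n and (j - i) <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out

def shrink(mask, margin):
    """Background-like mask: keep a frame only if all neighbours within margin are True."""
    n = len(mask)
    return [all(mask[max(0, i - margin):min(n, i + margin + 1)]) for i in range(n)]
```

With such helpers, a Talk-like control signal would be `fill_short_gaps(self_mask, max_gap)` and a Background-like control signal would be `shrink(other_mask, margin)`. Shrinking the Other segments trades recording time for a lower risk of the subject's voice leaking into the background estimate.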
High-pass filter
The fundamental idea of the method is that sound from ambient sources arrives uncorrelated at the two microphones, so the levels of the sum and difference signals will be approximately equal. For low-frequency sounds, however, the signals appear in phase because the wavelength is large compared to the distance between the microphones, and they will thus be misinterpreted as voicing from the subject. The 1 kHz high-pass filter reduces this effect and thus improves the accuracy of the switching.

The need for the high-pass filter was verified with the following experiment. A subject was positioned in the diffuse field from two loudspeakers in a standard laboratory environment. The subject was then rotated 360 degrees, and long-time average spectra (LTAS) were used to analyse the spectral properties of the Self and Other channels. The results confirm a rise in the S/O ratio at low frequencies (figure 8), even though the subject did not phonate during the experiment.

Figure 6. The steps to derive a Background channel from the Other channel by modifying the instances of switching.
Figure 7. The steps to derive a Talk channel from the Self channel by modifying the instances of switching.
Figure 8. A diffuse field yields a high S/O ratio at low frequencies even though no phonation occurs. The consequences of this effect are reduced by applying a high-pass filter to the signals.
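The low-frequency effect described under High-pass filter can be illustrated with a textbook idealisation. The sketch below assumes two omnidirectional microphones in free space (no head between them) in an ideal diffuse field, where the coherence between the two signals is sin(kd)/(kd). The sum and difference powers are then proportional to 1 + coherence and 1 - coherence. The spacing of 0.17 m, the speed of sound, the function names, and the sign convention (sum level minus difference level) are assumptions made for this example, not values from the poster.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def diffuse_coherence(freq_hz, spacing_m):
    """Coherence of an ideal diffuse field between two free-field points: sin(kd)/(kd)."""
    kd = 2.0 * math.pi * freq_hz * spacing_m / SPEED_OF_SOUND
    return math.sin(kd) / kd

def diffuse_so_ratio_db(freq_hz, spacing_m=0.17):
    """Sum level minus difference level (dB) with no phonation, in the idealised model."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    g = diffuse_coherence(freq_hz, spacing_m)
    # Sum power is proportional to 1 + g, difference power to 1 - g.
    return 10.0 * math.log10((1.0 + g) / (1.0 - g))

if __name__ == "__main__":
    for f in (100, 250, 500, 1000, 2000):
        print(f, "Hz:", round(diffuse_so_ratio_db(f), 1), "dB")
```

In this idealised model the ratio is large at 100 Hz and close to 0 dB around 1 kHz for the assumed spacing, which is in line with the trend reported in figure 8. A real head changes the details, so the numbers should be read as qualitative only.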