This week I have been looking at the effect different speech samples have on the performance of Codec 2. One factor is microphone placement. In radio (from broadcast to two way HF/VHF) we tend to use microphones placed close to our lips. In telephony, hands free or more distant microphone placement has become common.
People trying FreeDV over the air have obtained poor results from using built-in laptop microphones, but good results from USB headsets.
So why does microphone placement matter?
Today I put this question to the codec2-dev and digital voice mailing lists, and received many fine ideas. I also chatted to such luminaries as Matt VK5ZM and Mark VK5QI on the morning drive time 70cm net. I’ve also been having an ongoing discussion with Glen, VK1XX, on this and other Codec 2 source audio conundrums.
A microphone is a bit like a radio front end:
We assume linearity (the microphone signal isn’t clipping).
Imagine we take exactly the same mic and try it 2cm and then 50cm away from the speaker’s lips. As we move it away the signal power drops and (given the same noise figure) the SNR must decrease.
Adding extra gain after the microphone doesn’t help the SNR, just like adding gain down the track in a radio receiver doesn’t help the SNR.
When we are very close to a microphone, the low frequencies tend to be boosted. This is known as the proximity effect. This is where the analogy to radio signals falls over. Oh well.
A microphone 50cm away picks up multi-path reflections from the room, laptop case, and other surfaces that start to become significant compared to the direct path. Summing a delayed version of the original signal will have an impact on the frequency response and add reverb, just like multipath on an HF or VHF radio signal. These effects may be really hard to remove.
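To put rough numbers on the SNR points above, here is a minimal Python sketch. It assumes the direct sound follows the free field inverse square law (6dB per doubling of distance), which a real voice, mic, and room will only approximate. The signal and noise levels are made up purely for illustration.

```python
import math

def level_change_db(d_near_cm, d_far_cm):
    """Change in direct path level when the mic moves from d_near to d_far.

    Assumption: free field point source, so level follows the inverse
    square law. Real rooms and real mouths will differ.
    """
    return -20.0 * math.log10(d_far_cm / d_near_cm)

def snr_db(signal_db, noise_db, gain_db=0.0):
    """SNR after applying gain_db to everything coming out of the mic.

    The gain lifts signal and noise by the same amount, so it cancels.
    """
    return (signal_db + gain_db) - (noise_db + gain_db)

# Hypothetical levels, chosen only for illustration
SIGNAL_AT_2CM_DB = -20.0
MIC_NOISE_DB = -70.0

drop_db = level_change_db(2.0, 50.0)
signal_at_50cm_db = SIGNAL_AT_2CM_DB + drop_db

print(f"level change 2cm -> 50cm: {drop_db:.1f} dB")
print(f"SNR at 2cm : {snr_db(SIGNAL_AT_2CM_DB, MIC_NOISE_DB):.1f} dB")
print(f"SNR at 50cm: {snr_db(signal_at_50cm_db, MIC_NOISE_DB):.1f} dB")
print(f"SNR at 50cm plus 28 dB of gain: "
      f"{snr_db(signal_at_50cm_db, MIC_NOISE_DB, 28.0):.1f} dB")
```

Under these assumptions the signal drops by roughly 28dB between 2cm and 50cm, and no amount of gain after the mic wins that SNR back.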
Science in my Lounge Room 1 – Proximity Effect
I couldn’t resist – I wanted to demonstrate this model in the real world. So I dreamed up some tests using a couple of laptops, a loudspeaker, and a microphone.
To test the proximity effect I constructed a wave file with two sine waves at 100Hz and 1000Hz, and played it through the speaker. I then sampled using the microphone at different distances from the speaker. The proximity effect predicts the 100Hz tone should fall off faster than the 1000Hz tone with distance. I measured the power of each tone using Audacity (spectrum feature).
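For anyone wanting to repeat the test, a two tone wave file like the one described can be built with nothing but the Python standard library. This is a sketch, not the tool actually used for the post: the 8kHz sample rate, 16 bit mono format, duration, amplitude, and the function names are all assumptions.

```python
import math
import struct
import wave

def two_tone_samples(f1=100.0, f2=1000.0, fs=8000, seconds=5.0, amp=0.4):
    """Return 16 bit samples of two equal amplitude sine waves summed.

    amp=0.4 per tone keeps the peak of the sum at 0.8 of full scale,
    so the file itself never clips.
    """
    n = int(fs * seconds)
    samples = []
    for i in range(n):
        t = i / fs
        s = amp * (math.sin(2 * math.pi * f1 * t) +
                   math.sin(2 * math.pi * f2 * t))
        samples.append(int(round(s * 32767)))
    return samples

def write_wav(path, samples, fs=8000):
    """Write samples as a mono, 16 bit, little endian PCM wave file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(fs)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Example use (file name is arbitrary):
#   write_wav("two_tone.wav", two_tone_samples())
```

Both tones start at the same level, so any difference measured at the mic comes from the speaker, the room, and the mic rather than from the file.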
This spreadsheet shows the results over a couple of runs (levels in dB).
So in Test 1, we can see the 100Hz tone falls off 4dB faster than the 1000Hz tone. That seems a bit small, could be experimental error. So I tried again with the mic just inside the speaker aperture (hence -1cm) and the difference increased to 8dB, just as expected. Yayyy, it worked!
Apparently this effect can be as large as 16dB for some microphones. Apparently radio announcers use this effect to add gravitas to their voice, e.g. leaning closer to the mic when they want to add drama.
In my case it means unwanted extra low frequency energy messing with Codec 2 when some microphones are placed very close.
Science in my Lounge Room 2 – Multipath
So how can I test the multipath component of my model above? Can I actually see the effects of reflections? I set up my loudspeaker on a coffee table and played a 300 to 3000 Hz swept sine wave through it. I sampled close up and with the mic 25cm away.
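A swept sine like this is easy to generate. The sketch below assumes a linear sweep at an 8kHz sample rate. The post doesn't say whether the sweep was linear or logarithmic, so treat the sweep law, duration, and amplitude as illustrative choices.

```python
import math

def swept_sine(f_start=300.0, f_end=3000.0, fs=8000, seconds=10.0, amp=0.8):
    """Linear sine sweep from f_start to f_end as floats in [-amp, amp].

    The instantaneous frequency is f_start + k*t, so the phase is the
    integral of that: 2*pi*(f_start*t + 0.5*k*t^2).
    """
    n = int(fs * seconds)
    k = (f_end - f_start) / seconds  # sweep rate in Hz per second
    out = []
    for i in range(n):
        t = i / fs
        phase = 2 * math.pi * (f_start * t + 0.5 * k * t * t)
        out.append(amp * math.sin(phase))
    return out

sweep = swept_sine()
print(len(sweep), "samples")
```

Scaling the samples by 32767 and writing them out with the standard `wave` module gives a file that can be played through the speaker.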
The idea is to get a reflection off the coffee table. The direct and reflected wave will be half a wavelength out of phase at some frequency, which should cause a notch in the spectrum.
Let’s take a look at the frequency response close up and at 25cm:
Hmm, they are both a bit of a mess. Apparently I don’t live in an anechoic chamber. Hmmm, that might be handy for kids’ parties. Anyway, I can observe:
- The signal falls off a cliff at about 1000Hz. Well, that will teach me to use a speaker with an active crossover for these sorts of tests. It’s part of a system that normally has two other little speakers plugged into the back.
- They both have a resonance around 500Hz.
- The close sample is about 18dB stronger. Given both have the same noise level, that’s 18dB better SNR than the 25cm sample. Any additional gain after the microphone will increase the noise as much as the signal, so the SNR won’t improve.
OK, let’s look at the reflections:
A bit of Googling reveals reflections of acoustic waves from solid surfaces are in phase (not reversed 180 degrees). Also, the angle of incidence is the same as the angle of reflection. Just like light.
Now the microphone and speaker aperture are both 16cm off the table, with the mic 25cm away. A couple of right angle triangles, a bit of Pythagoras, and I make the reflected path length 40.6cm. This means a path difference of 40.6 - 25 = 15.6cm. So when wavelength/2 = 15.6cm, we should get a notch in the spectrum, as the two waves will cancel. Now v = f(wavelength), and v = 340m/s, so we expect a notch at f = 340/(2*0.156) = 1090Hz.
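The geometry above is quick to check in Python. This sketch assumes the speaker aperture and mic are at the same height over a flat table, so the bounce point sits midway between them. The function names are mine.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the value used in the post

def reflected_path_m(height_m, separation_m):
    """Length of the path that bounces once off a flat surface.

    Assumes source and mic are both height_m above the surface. With
    angle of incidence equal to angle of reflection, the bounce point
    is midway, giving two equal right angle triangles.
    """
    return 2.0 * math.hypot(separation_m / 2.0, height_m)

def first_notch_hz(path_difference_m, v=SPEED_OF_SOUND):
    """Frequency where the path difference is half a wavelength."""
    return v / (2.0 * path_difference_m)

direct_m = 0.25
reflected_m = reflected_path_m(0.16, direct_m)
difference_m = reflected_m - direct_m
notch_hz = first_notch_hz(difference_m)

print(f"reflected path : {reflected_m * 100:.1f} cm")
print(f"path difference: {difference_m * 100:.1f} cm")
print(f"first notch    : {notch_hz:.0f} Hz")
```

Carrying full precision through gives a notch a hertz or so below the 1090Hz figure, which comes from rounding the path difference to 15.6cm. Either way that is far less than the measurement error in a lounge room.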
Looking at a zoomed version of the 25cm spectrum:
I can see several notches: 460Hz, 1050Hz, 1120Hz, and 1300Hz. I’d like to think the 1050Hz notch is the one predicted above.
Can we explain the other notches? I looked around the room to see what else could be reflecting. The walls and ceiling are a bit far away (which means low freq notches). Hmm, what about the floor? It’s big, and it’s flat. I measured the path length directly under the table as 1.3m. This table summarises the possible notch frequencies:
Note that notches will occur at any frequency where the path difference is an odd number of half wavelengths, so wavelength/2, 3(wavelength)/2, 5(wavelength)/2 and so on. Hence we get a comb effect along the frequency axis.
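The comb of notch frequencies for a given path difference can be listed with a few lines of Python. In this sketch I assume the 1.3m figure is the length of the reflected path via the floor and that the direct path is still 25cm, giving a path difference of 1.05m. The table figure of 15.6cm is the path difference worked out earlier.

```python
def notch_frequencies(path_difference_m, f_max_hz, v=340.0):
    """Notch frequencies up to f_max_hz for one reflection.

    Notches fall where the path difference is an odd number of half
    wavelengths: f = (2k + 1) * v / (2 * path_difference), k = 0, 1, 2...
    """
    notches = []
    k = 0
    while True:
        f = (2 * k + 1) * v / (2.0 * path_difference_m)
        if f > f_max_hz:
            break
        notches.append(f)
        k += 1
    return notches

table_notches = notch_frequencies(0.156, 1500.0)
floor_notches = notch_frequencies(1.3 - 0.25, 1500.0)

print("table:", [round(f) for f in table_notches])
print("floor:", [round(f) for f in floor_notches])
```

The short table path puts only one notch below 1500Hz, while the much longer floor path packs several into the same range.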
OK, I can see the predicted notches at 486Hz and 1133Hz, which means the 1050Hz notch is probably the one off the table. I can’t explain the 1300Hz notch, and there is no sign of the predicted notch at 810Hz. With a little imagination we can see a notch around 1460Hz. Hey, that’s not bad at all for a first go!
If I was super keen I’d try a few variations like the height above the table and see if the 1050Hz notch moves. But it’s Friday, and nearly time to drink red wine and eat pizza with my friends. So that’s enough lounge room acoustics for now.
How to break a low bit rate speech codec
Low bit rate speech codecs make certain assumptions about the speech signal they compress. For example the time varying filter used to transmit the speech spectrum assumes the spectrum varies slowly in frequency, and doesn’t have any notches. In fact, as this filter is “all pole” (IIR), it can only model resonances (peaks) well, not zeros (notches). Codecs like mine tend to fall apart (the decoded speech sounds bad) when the input speech violates these assumptions.
This helps explain why clean speech from a nicely placed microphone is good for low bit rate speech codecs.
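The all pole limitation can be seen numerically. The sketch below compares a two path (direct plus echo) response, which has an exact zero, with an all pole filter, whose gain can never drop below 1/(1 + sum of |a_k|) because the denominator is a finite sum. The second order resonator coefficients are hypothetical and are not Codec 2’s actual filter, which uses more poles.

```python
import cmath
import math

def all_pole_gain(a, w):
    """|1/A(e^jw)| where A(z) = 1 + a[0]*z^-1 + a[1]*z^-2 + ..."""
    A = 1 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return 1.0 / abs(A)

def two_path_gain(delay, w):
    """|1 + z^-delay|: direct path plus an equal strength echo."""
    return abs(1 + cmath.exp(-1j * w * delay))

# Hypothetical resonator: pole pair at radius 0.9, angle pi/4
r, theta = 0.9, math.pi / 4
a = [-2 * r * math.cos(theta), r * r]

# Frequency grid from 0 to pi (half the sample rate)
ws = [math.pi * k / 512 for k in range(513)]

gain_floor = 1.0 / (1.0 + sum(abs(x) for x in a))
min_all_pole = min(all_pole_gain(a, w) for w in ws)
min_two_path = min(two_path_gain(4, w) for w in ws)

print(f"all pole: lowest gain {min_all_pole:.3f} (floor {gain_floor:.3f})")
print(f"two path: lowest gain {min_two_path:.2e}")
```

The echo drives the response essentially to zero at its notch, while the all pole filter can only produce gentle valleys between its peaks. To fit a deep notch it would need many more poles than a low bit rate codec can afford to transmit.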
Now Skype and (mobile) phones do work quite well in “hands free” mode, with rather distant microphone placement. I often use Skype with my internal laptop microphone. Why is this OK?
Well, the codecs used have a much higher bit rate, e.g. 10,000 bit/s rather than 1,000 bit/s. This gives them the luxury of using algorithms that can, to some extent, code arbitrary waveforms as well as speech. Algorithms like CELP are a hybrid of model based coding (like Codec 2) and waveform based coding (like PCM). So they faithfully follow the crappy mic signal, and don’t fall over completely.