Speech and Audio Processing (Part 2)
Basic Parameter Extraction
There are a number of very basic speech parameters which can be easily calculated, such as:
- Short Time Energy
- Short Time Zero Crossing Count (ZCC)
- Pitch Period
All of the above parameters are typically estimated for frames of speech between 10 and 20 milliseconds in length.
Short-Time Energy
The short-time energy of speech may be computed by dividing the speech signal into frames of \(N\) samples and computing the sum squared values of the samples in each frame.
Splitting the signal into frames can be achieved by multiplying the signal by a suitable window function \(w(n)\), where \(n=0,1,2,\dots,N-1\) and \(w(n) = 0\) for \(n\) outside the range \([0,N-1]\).
Rectangular Window
A simple rectangular window is suitable for our purposes. For a window starting at sample \(m\), the short-time energy \(E_m\) is defined as:
\[E_m = \sum\limits_n [s(n)w(m-n)]^2\]
\[w(n) = \begin{cases} 1 & 0 \leq n \leq N-1\\ 0 & \text{otherwise} \end{cases}\]
If a non-rectangular window is employed, the same quantity can be written exactly equivalently (since \([s(n)w(m-n)]^2 = [s(n)]^2[w(m-n)]^2\)) as the squared signal filtered by the squared window:
\[E_m = \sum\limits_n [s(n)]^2 h(m-n)\]
\[h(n) = [w(n)]^2\]
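To make that concrete, here's a minimal sketch of the \(h(n) = [w(n)]^2\) formulation, assuming a Hamming window (an arbitrary choice of tapered window; the function name and interface here are ours, not a standard API):
import numpy as np

def short_time_energy(s, m, N):
    # energy of the N-sample window ending at sample m (requires m >= N-1)
    frame = s[m - N + 1:m + 1]
    h = np.hamming(N) ** 2                # h(n) = w(n)^2
    return np.sum(frame ** 2 * h[::-1])   # weight h(m-n); Hamming is symmetric,
                                          # so the reversal is cosmetic here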
To see how window choice affects the short-time energy, observe that if \(h(n)\) were very long and of constant amplitude, \(E_m\) would change very little with time.
Such a window would be the equivalent of a very narrowband lowpass filter. Clearly what is desired is some lowpass filtering so that the short-time energy reflects the amplitude variations of the speech signal.
We want a short duration window to be responsive to rapid amplitude changes, but a window that is too short will not provide sufficient averaging to produce a smooth energy function.
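To see the tradeoff for yourself, here's a minimal sketch (assuming a signal array and sampling rate, as loaded in the code below) that computes the energy contour at different frame lengths:
import numpy as np

def frame_energies(x, fs, millisPerFrame):
    # energy of consecutive non-overlapping frames of x
    N = int(fs * millisPerFrame / 1000)   # frame length in samples
    nFrames = len(x) // N
    return [np.sum(x[k * N:(k + 1) * N] ** 2) for k in range(nFrames)]

# e.g., once a signal is loaded (as below):
#   frame_energies(signal, fs, 5)     # responsive but jittery
#   frame_energies(signal, fs, 100)   # smooth but sluggish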
Python Code
We'll start our code by importing global functionality:
from __future__ import division # so, 1/2 == 0.5 (forces real-valued division)
The scipy.io.wavfile code allows for basic manipulation of wave file audio.
from scipy.io import wavfile
# pick one file name:
###fName = '003_A_male_167Hz.wav'
fName = '003_CatInTheHat.wav'
fs, signal = wavfile.read(fName)
signal = signal / max(abs(signal)) # scale signal
assert min(signal) >= -1 and max(signal) <= 1
print 'fs ==> ', fs, 'Hz' # sampling rate
print 'len(signal) ==> ', len(signal), 'samples'
Since \(fs\) and \(len(signal)\) are the same, the duration of the signal is one second. Here are some plots of the signal we are dealing with:
from numpy import linspace
import matplotlib.pyplot as plt
_, (sp1, sp2) = plt.subplots(1, 2, figsize=(16, 4))
# plot raw signal
sp1.plot(signal)
sp1.set_title('Raw Signal')
sp1.set_xlabel('SAMPLE\n(a)')
sp1.autoscale(tight=True)
# plot spectrogram
sp2.specgram(signal)
sp2.set_title('Spectrogram')
sp2.set_xlabel('TIME\n(b)')
nSecs = len(signal) / fs
ticksPerSec = 3
nTicks = int(nSecs * ticksPerSec) + 1   # add 1 to include time=0
xTickMax = sp2.get_xticks()[-1]
sp2.set_xticks(linspace(0, xTickMax, nTicks))
sp2.set_xticklabels([round(x, 2) for x in linspace(0, nSecs, nTicks)])
sp2.set_ylabel('FREQ')
maxFreq = fs / 2
nTicks = int(maxFreq / 1000) + 1        # add 1 to include freq=0
sp2.set_yticks(linspace(0, 1, nTicks))
sp2.set_yticklabels(linspace(0, maxFreq, nTicks));
sp2.autoscale(tight=True)
Figure 1 -- (a) raw audio signal; (b) spectrogram of signal.
If you want to hear it:
from IPython import display
display.HTML("<audio controls><source src='{}'></audio>".format('files/003_CatInTheHat.wav'))
First we'll compute some statistics:
assert fs % 1000 == 0
sampsPerMilli = int(fs / 1000)
millisPerFrame = 20
sampsPerFrame = sampsPerMilli * millisPerFrame
nFrames = int(len(signal) / sampsPerFrame) # number of non-overlapping _full_ frames
print 'samples/millisecond ==> ', sampsPerMilli
print 'samples/[%dms]frame ==> ' % millisPerFrame, sampsPerFrame
print 'number of frames ==> ', nFrames
Now we'll process each frame in turn:
from numpy import zeros

STEs = []   # list of short-time energies
for k in range(nFrames):
    startIdx = k * sampsPerFrame
    stopIdx = startIdx + sampsPerFrame
    window = zeros(signal.shape)
    window[startIdx:stopIdx] = 1   # rectangular window
    STE = sum((signal ** 2) * (window ** 2))
    STEs.append(STE)
Here's a plot of the short-time energies:
plt.plot(STEs)
plt.title('Short-Time Energy')
plt.ylabel('ENERGY')
plt.xlabel('FRAME')
plt.autoscale(tight=True);
Figure 2 -- Short-time energy per frame.
There appears to be some silence, and some non-silence components of this utterance.
For instance, it looks like frames 0-33 are silence. That corresponds to samples 0 through 5439 (time 0-0.34 seconds), which makes sense when you examine figure 1.
Unfortunately, trying to determine voiced and unvoiced segments from the short-time energy plot is more problematic. To do that, additional features need to be examined.
NOTE: The contours of the short-time energy plot track the raw signal plot [of figure 1].
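As a rough illustration, silence frames could be flagged with a simple energy threshold; this is just a sketch, and the 5% figure is an arbitrary choice, not a tuned value:
threshold = 0.05 * max(STEs)   # arbitrary: 5% of the peak frame energy
isSilence = [ste < threshold for ste in STEs]
print 'silent frames ==> ', sum(isSilence), 'of', len(STEs)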
As an alternative, a recursive lowpass filter \(H(z)\) can also be used to calculate the short-time energy:
\(H(z) = \frac{1}{1-az^{-1}}\), where \(0 < a < 1\)
It can be easily verified that the frequency response \(H(\theta)\) has the desired lowpass property. Such a filter can be implemented by a simple difference equation:
\(E(n) = a E(n-1) + [s(n)]^2\), where \(E(n)\) is the energy at time instant \(n\).
The value of \(a\) can be calculated using \(a = e^{-2 \pi f_c / f_s}\), where \(f_c\) is the cut-off frequency and \(f_s\) is the sampling frequency (e.g. \(f_c = 20\) Hz and \(f_s = 16\) kHz).
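For instance, with those example values, \(a = e^{-2 \pi \cdot 20 / 16000} = e^{-0.00785} \approx 0.9922\); values of \(a\) this close to 1 give the filter a long memory, and hence heavy smoothing.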
Here's the code for the alternative method:
from numpy import exp, pi

fc = 20   # cutoff frequency, Hz
a = exp(-fc * 2 * pi / fs)
STEs = []
for n in range(len(signal)):
    if n == 0:
        STEs.append(a * 0 + signal[n] ** 2)   # base-case: E(-1) == 0
    else:
        STEs.append(a * STEs[n - 1] + signal[n] ** 2)
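Incidentally, this recursion is a standard first-order IIR filter, so it can also be computed in one vectorized call; here's a sketch using scipy.signal.lfilter (not part of the original code):
from scipy.signal import lfilter

# E(n) = a*E(n-1) + s(n)^2, i.e. an IIR filter with b = [1], a = [1, -a]
STEs_vectorized = lfilter([1.0], [1.0, -a], signal ** 2)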
And here's the plot of the results:
plt.plot(STEs)
plt.title('Short-Time Energy')
plt.ylabel('ENERGY')
plt.xlabel('SAMPLE')
plt.autoscale(tight=True);
Figure 3 -- Alternative plot of short-time energy.
Figure 3 appears to be a "more jagged" version of figure 2, but is easier to calculate and lends itself to real-time processing. Note the differences in the axes values.
Short-Time Zero Crossing Count
The short-time zero crossing count (ZCC) for a block of \(N\) samples of speech is defined as:
\[ZCC_i = \frac{1}{2}\sum_{k=1}^{N-1}\left|\operatorname{sign}(s[k]) - \operatorname{sign}(s[k-1])\right|\]
The ZCC essentially counts how many times the signal crosses the time axis during the frame. It "reflects" the frequency content of the frame of speech. (High ZCC implies high frequency.)
It is essential that any constant DC offset is removed from the signal prior to the ZCC computation. (DC is removed for the entire signal by subtracting the average of all the samples.)
Python Code
First we'll remove any DC offset:
from numpy import mean

DC = mean(signal)
newSignal = signal - DC   # create a new signal, preserving the old one
print 'DC ==> ', DC
print 'mean(newSignal) ==> ', mean(newSignal)
Note that the mean of the new signal is really close to zero.
Now we'll calculate the zero crossing counts:
assert fs % 1000 == 0
sampsPerMilli = int(fs / 1000)
millisPerFrame = 20
sampsPerFrame = sampsPerMilli * millisPerFrame
nFrames = int(len(signal) / sampsPerFrame) # number of non-overlapping _full_ frames
print 'samples/millisecond ==> ', sampsPerMilli
print 'samples/[%dms]frame ==> ' % millisPerFrame, sampsPerFrame
print 'number of frames ==> ', nFrames
from numpy import sign

ZCCs = []   # list of short-time zero crossing counts
for i in range(nFrames):
    startIdx = i * sampsPerFrame
    stopIdx = startIdx + sampsPerFrame
    s = newSignal[startIdx:stopIdx]   # /s/ is the frame, named to correspond to the equation
    ZCC = 0
    for k in range(1, len(s)):
        ZCC += 0.5 * abs(sign(s[k]) - sign(s[k - 1]))
    ZCCs.append(ZCC)
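For reference, the inner loop can be vectorized with numpy; this sketch is equivalent to the loop above (including its handling of exact-zero samples):
import numpy as np

ZCCs_fast = [0.5 * np.sum(np.abs(np.diff(np.sign(
                 newSignal[i * sampsPerFrame:(i + 1) * sampsPerFrame]))))
             for i in range(nFrames)]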
And here's a plot of the results:
plt.plot(ZCCs)
plt.title('Short-Time Zero Crossing Counts')
plt.ylabel('ZCC')
plt.xlabel('FRAME')
plt.autoscale(tight=True);
Figure 4 -- Short-time zero crossing counts.
The areas of the figure with very high (relative) zero crossing counts likely correspond to unvoiced regions of the signal. This method suffers in a noisy environment.
Uses of Energy and ZCC
Automated speech "end point" detection.
- Needs to be able to operate with background noise
- Needs to be able to ignore "short" background noises and intra-word silences
Voiced/unvoiced speech detection.
- High energy + low ZCC ==> voiced
- Low energy + high ZCC ==> unvoiced
Parameters on which simple speech recognition/speaker verification/identification systems could be based.
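Here's a minimal sketch of such a frame classifier; the function and the thresholds are illustrative assumptions (not tuned values), and it assumes per-frame STE and ZCC lists of equal length (i.e., the frame-based energies of figure 2, not the per-sample recursion):
def classify_frame(ste, zcc, steThresh, zccThresh):
    if ste < steThresh:
        return 'silence'
    elif zcc < zccThresh:
        return 'voiced'     # high energy + low ZCC
    else:
        return 'unvoiced'   # low energy + high ZCC

# e.g.: labels = [classify_frame(e, z, 0.05 * max(STEs), 0.5 * max(ZCCs))
#                 for e, z in zip(STEs, ZCCs)]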
Pitch Period Estimation
Pitch period is equal to the inverse of the fundamental frequency of vibration of the vocal cords.
It only makes sense to speak about the pitch period of a voiced frame of speech.
A number of techniques are used to determine pitch period in both the time and frequency domains.
Time Domain Method
Since pitch frequency is typically below 600-700 Hz, the speech signal is first lowpass filtered to remove components above this frequency range.
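Here's a sketch of that preprocessing step, assuming a 4th-order Butterworth filter from scipy.signal and a 700 Hz cutoff (both arbitrary but reasonable choices):
from scipy.signal import butter, filtfilt

bLP, aLP = butter(4, 700.0 / (fs / 2), btype='low')   # cutoff normalized to Nyquist
lowpassed = filtfilt(bLP, aLP, signal)                # zero-phase lowpass filtering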
The two most commonly used techniques are:
- Short-time autocorrelation function
- Average magnitude difference function (AMDF)
During voiced speech, the speech signal is "quasi-periodic".
Either technique attempts to determine the period (in samples) between "repetitions" of the voiced speech signal.
Autocorrelation Function
Correlation is a very commonly used technique in DSP to determine the "time difference" between two signals, where one is a "nearly perfect" delayed version of the other.
Autocorrelation is the application of the same technique to determine the unknown "period" of a quasi-periodic signal such as speech.
The autocorrelation function for a delay value of \(k\) samples is:
\(\phi(k) = \frac{1}{N} \sum\limits_{n=0}^{N-1} s[n]s[n-k]\), where \(k\) is the candidate delay (in samples) and ranges over the plausible pitch periods, from [say] 20 to 400; samples falling outside the frame are taken as zero.
Clearly, \(\phi(k=0)\) would be equal to the average energy of the signal \(s[n]\) over the \(N\) sample frame.
If \(s[n]\) was perfectly periodic with a period of \(P\) samples, then \(s[n+P]=s[n]\).
Therefore, \(\phi(k=P) = \phi(k=0) = \text{Average Energy}\).
While this is not exactly true for speech signals, the autocorrelation function with \(k\) equal to the pitch would result in a large value.
For the \(k\) values between 0 and \(P\), the various terms (\(s[n]s[n-k]\)) in the autocorrelation function would tend to be a mixture of positive and negative values.
These would tend to cancel each other out, forcing the autocorrelation sum to yield a very low value for \(\phi(k)\).
Thus, for a given frame of \(N\) samples (of voiced speech), a plot of \(\phi(k)\) versus \(k\) would exhibit distinct peaks at \(k\) values of \(0,P,2P,...\) where \(P\) is the pitch period.
The graph of \(\phi(k)\) would contain small values between these peaks.
Thus, the pitch period of the frame is simply determined by measuring the distance (in samples) between the peaks of the autocorrelation graph.
# set up voiced frame times (based on name of file)
if fName == '003_A_male_167Hz.wav':
    startTime = 0.45   # of frame, units=seconds
    stopTime = 0.75    # ditto
elif fName == '003_CatInTheHat.wav':
    startTime = 1.28   # of frame, units=seconds
    stopTime = 1.36    # ditto
else:
    assert False

# extract the frame
startIdx = int(startTime * fs)
stopIdx = int(stopTime * fs)
s = signal[startIdx:stopIdx]   # /s/ is the frame, named to correspond to the equation
We'll take a look at what we're dealing with (i.e., the frame):
plt.plot(s)
plt.title('Frame')
plt.ylabel('AMPLITUDE')
plt.xlabel('SAMPLE')
plt.autoscale(tight=True);
Figure 5 -- Raw frame.
And now for the actual autocorrelation:
phis = []
N = len(s)
for k in range(0, 400):
    phi = 0
    for n in range(k, N):
        phi += s[n] * s[n - k]
    phi *= 1 / N
    phis.append(phi)
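For larger frames, numpy can do this in one call; this sketch is equivalent to the loops above because the full correlation of s with itself places lag 0 at index N-1:
import numpy as np

full = np.correlate(s, s, mode='full')    # lag 0 lives at index N-1
phis_fast = full[N - 1:N - 1 + 400] / N   # lags 0..399, scaled by 1/N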
And here's what the autocorrelation looks like:
plt.plot(phis)
plt.title('Autocorrelation of Frame')
plt.ylabel('PHI')
plt.xlabel('DELAY INDEX')
plt.autoscale(tight=True);
Figure 6 -- Autocorrelation plot.
For the 003_A_male_167Hz.wav file:
Looking at figure 6, two peaks of interest occur at delay indices 89 and 185 (approximately). The difference is 96 samples, which makes the pitch period \(96 / fs = 96 / 16000 = 0.006\), which makes the fundamental frequency \(1 / 0.006 = 166.67\) (which is exactly what the file name suggests).
For the 003_CatInTheHat.wav file:
Looking at figure 6, two peaks of interest occur at delay indices 149 and 299 (approximately). The difference is 150 samples, which makes the pitch period \(150 / fs = 150 / 16000 = 0.009375\), which makes the fundamental frequency \(1 / 0.009375 = 106.67\) (which is exactly what a SFS analysis suggests).
Of course, having a completely automated solution is desirable.
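A minimal sketch of one automated approach: take the lag of the largest autocorrelation value within a plausible pitch range (the 50-400 Hz bounds are our assumption):
import numpy as np

minLag = int(fs / 400)   # highest pitch considered: 400 Hz
maxLag = int(fs / 50)    # lowest pitch considered: 50 Hz
peakLag = minLag + np.argmax(phis[minLag:maxLag])
print 'estimated pitch ==> ', fs / peakLag, 'Hz'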
Average Magnitude Difference Function
The AMDF is similar in purpose to the autocorrelation function, but behaves in the opposite sense: it dips to a minimum (rather than peaking) at delays equal to multiples of the pitch period.
For a delay of \(k\) samples, the AMDF is defined as:
\[D(k) = \frac{1}{N} \sum_{n=0}^{N-1} |s[n]-s[n-k]|\]
Here's the AMDF code:
Ds = []
N = len(s)
for k in range(0, 400):
    D = 0
    for n in range(k, N):
        D += abs(s[n] - s[n - k])
    D *= 1 / N
    Ds.append(D)
And here's what it looks like:
plt.plot(Ds)
plt.title('Average Magnitude Difference Function of Frame')
plt.ylabel('D')
plt.xlabel('DELAY INDEX')
plt.autoscale(tight=True);
Figure 7 -- AMDF plot.
Notice that figure 7 looks like a vertically flipped version of figure 6.
The AMDF is actually easier to compute than the autocorrelation, and it looks like the troughs might actually be easier to pick out automatically.
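Here's the matching trough-picking sketch for the AMDF, using the same assumed 50-400 Hz pitch range as above:
import numpy as np

minLag = int(fs / 400)
maxLag = int(fs / 50)
troughLag = minLag + np.argmin(Ds[minLag:maxLag])
print 'estimated pitch ==> ', fs / troughLag, 'Hz'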
That's enough for now...