Speech and Audio Processing (Part 2)

Basic Parameter Extraction

There are a number of very basic speech parameters which can be easily calculated, such as:

  • Short Time Energy
  • Short Time Zero Crossing Count (ZCC)
  • Pitch Period

All of the above parameters are typically estimated for frames of speech between 10 and 20 milliseconds in length.

Short-Time Energy

The short-time energy of speech may be computed by dividing the speech signal into frames of \(N\) samples and computing the sum squared values of the samples in each frame.

Splitting the signal into frames can be achieved by multiplying the signal by a suitable window function \(w(n)\) where \(n=0,1,2,3,...,N-1\) (\(w(n) = 0\) for \(n\) outside the range \([0,N-1]\)).

Rectangular Window

A simple rectangular window is suitable for our purposes. For a window starting at sample \(m\), the short-time energy \(E_m\) is defined as:

\[E_m = \sum\limits_n [s(n)w(m-n)]^2\]

\[w(n) = \begin{cases} 1 & 0 \leq n \leq N-1\\ 0 & \text{otherwise} \end{cases}\]

If a non-rectangular window is employed, the short-time energy can equivalently be computed as the squared signal weighted by \(h(n)\):

\[E_m = \sum\limits_n [s(n)]^2 h(m-n)\]

\[h(n) = [w(n)]^2\]
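For example, with a Hamming window, the \(h(n) = [w(n)]^2\) formulation looks like this (a self-contained sketch on a synthetic tone, not part of the notebook's processing below):

 from numpy import arange, sin, pi, hamming

# synthetic example (purely illustrative): 100 Hz sine sampled at 16 kHz
fs = 16000
t = arange(fs) / float(fs)
s = sin(2 * pi * 100 * t)

N = 320                               # 20 ms frame
m = 1600                              # frame start index (example value)
w = hamming(N)                        # non-rectangular window w(n)
h = w ** 2                            # h(n) = [w(n)]^2
E_m = ((s[m:m + N] ** 2) * h).sum()   # short-time energy of the frame at m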

To see how the window choice affects the short-time energy, observe that if \(h(n)\) were very long and of constant amplitude, \(E_m\) would change very little with time.

Such a window would be the equivalent of a very narrowband lowpass filter. Clearly what is desired is some lowpass filtering so that the short-time energy reflects the amplitude variations of the speech signal.

We want a short duration window to be responsive to rapid amplitude changes, but a window that is too short will not provide sufficient averaging to produce a smooth energy function.

Python Code

We'll start our code by importing global functionality:

In [1]:
 from __future__ import division      # so, 1/2 == 0.5 (forces real-valued division)
%pylab inline                         # pulls numpy and matplotlib (plt, zeros, sign, mean, exp, pi, ...) into the namespace for the cells below

The scipy.io.wavfile module allows for basic reading and writing of WAV audio files.

In [2]:
 from scipy.io import wavfile

# pick one file name:
###fName = '003_A_male_167Hz.wav'
fName = '003_CatInTheHat.wav'

fs, signal = wavfile.read(fName)
signal = signal / max(abs(signal)) # scale signal
assert min(signal) >= -1 and max(signal) <= 1
print 'fs ==> ', fs, 'Hz' # sampling rate
print 'len(signal) ==> ', len(signal), 'samples'
 fs           ==>  16000 Hz
len(signal) ==> 48398 samples

Dividing \(len(signal)\) by \(fs\) gives the duration of the signal: \(48398 / 16000 \approx 3.02\) seconds. Here are some plots of the signal we are dealing with:

In [3]:
 from numpy import linspace

_, (sp1, sp2) = plt.subplots(1, 2, figsize=(16, 4))

# plot raw signal
sp1.plot(signal)
sp1.set_title('Raw Signal')
sp1.set_xlabel('SAMPLE\n(a)')
sp1.autoscale(tight=True)

# plot spectrogram
sp2.specgram(signal)
sp2.set_title('Spectrogram')
sp2.set_xlabel('TIME\n(b)')
nSecs = len(signal) / fs
ticksPerSec = 3
nTicks = int(nSecs * ticksPerSec) + 1    # add 1 to include time=0
xTickMax = sp2.get_xticks()[-1]
sp2.set_xticks(linspace(0, xTickMax, nTicks))
sp2.set_xticklabels([round(x, 2) for x in linspace(0, nSecs, nTicks)])
sp2.set_ylabel('FREQ')
maxFreq = fs / 2
nTicks = int(maxFreq / 1000) + 1         # add 1 to include freq=0
sp2.set_yticks(linspace(0, 1, nTicks))
sp2.set_yticklabels(linspace(0, maxFreq, nTicks))
sp2.autoscale(tight=True)
Figure 1 -- (a) raw audio signal; (b) spectrogram of signal.

If you want to hear it:

In [4]:
 from IPython import display
display.HTML("<audio controls><source src='{}'></audio>".format('files/003_CatInTheHat.wav'))
Out[4]:

First we'll compute some statistics:

In [6]:
 assert fs % 1000 == 0

sampsPerMilli = int(fs / 1000)
millisPerFrame = 20
sampsPerFrame = sampsPerMilli * millisPerFrame
nFrames = int(len(signal) / sampsPerFrame) # number of non-overlapping _full_ frames

print 'samples/millisecond ==> ', sampsPerMilli
print 'samples/[%dms]frame ==> ' % millisPerFrame, sampsPerFrame
print 'number of frames ==> ', nFrames
 samples/millisecond  ==>  16
samples/[20ms]frame ==> 320
number of frames ==> 151

Now we'll process each frame in turn:

In [7]:
 STEs = []                                    # list of short-time energies
for k in range(nFrames):
    startIdx = k * sampsPerFrame
    stopIdx = startIdx + sampsPerFrame
    window = zeros(signal.shape)
    window[startIdx:stopIdx] = 1             # rectangular window
    STE = sum((signal ** 2) * (window ** 2))
    STEs.append(STE)

Here's a plot of the short-time energies:

In [8]:
 plt.plot(STEs)
plt.title('Short-Time Energy')
plt.ylabel('ENERGY')
plt.xlabel('FRAME')
plt.autoscale(tight=True);
Figure 2 -- Short-time energy per frame.

There appears to be some silence, and some non-silence components of this utterance.

For instance, it looks like frames 0-33 are silence. That corresponds to samples 0 through 10879 (time 0-0.68 seconds), which makes sense when you examine figure 1.

Unfortunately, trying to determine voiced and unvoiced segments from the short-time energy plot is more problematic. To do that, additional features need to be examined.

NOTE: The contours of the short-time energy plot track the raw signal plot [of figure 1].

As an alternative, a recursive lowpass filter \(H(z)\) can also be used to calculate the short-time energy:

\(H(z) = \frac{1}{1-az^{-1}}\), where \(0 \le a \le 1\)

It can be easily verified that the frequency response \(H(\theta)\) has the desired lowpass property. Such a filter can be implemented by a simple difference equation:

\(E(n) = a E(n-1) + [s(n)]^2\), where \(E(n)\) is the energy at time instant \(n\).

The value of \(a\) can be calculated using \(a = e^{-2 \pi f_c / f_s}\), where \(f_c\) is the cut-off frequency and \(f_s\) is the sampling frequency (e.g. \(f_c = 20\) Hz and \(f_s = 16\) kHz).
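As a quick check of that lowpass claim, the magnitude response of \(H(z)\) can be plotted with scipy.signal.freqz (a sketch, using the example values \(f_c = 20\) Hz and \(f_s = 16\) kHz):

 from scipy.signal import freqz

fc = 20                                   # cut-off frequency, Hz (example value)
a = exp(-fc * 2 * pi / fs)                # filter coefficient

# H(z) = 1 / (1 - a z^-1)  ==>  numerator [1], denominator [1, -a]
w, H = freqz([1], [1, -a])
plt.plot(w * fs / (2 * pi), abs(H))       # convert rad/sample to Hz
plt.title('Magnitude Response of H(z)')
plt.xlabel('FREQ (Hz)')
plt.ylabel('|H|');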

Here's the code for the alternative method:

In [9]:
 fc = 20
a = exp(-fc * 2 * pi / fs)
STEs = []
for n in range(len(signal)):
    if n == 0:
        STEs.append(a * 0 + signal[n] ** 2)          # base-case
    else:
        STEs.append(a * STEs[n - 1] + signal[n] ** 2)

And here's the plot of the results:

In [10]:
 plt.plot(STEs)
plt.title('Short-Time Energy')
plt.ylabel('ENERGY')
plt.xlabel('SAMPLE')
plt.autoscale(tight=True);
Figure 3 -- Alternative plot of short-time energy.

Figure 3 appears to be a "more jagged" version of figure 2, but is easier to calculate and lends itself to real-time processing. Note the differences in the axes values.

Short-Time Zero Crossing Count

The short-time zero crossing count (ZCC) for a block of N samples of speech is defined as:

\[ZCC_i = \sum_{k=1}^{N-1}0.5|sign(s[k])-sign(s[k-1])|\]

The ZCC essentially counts how many times the signal crosses the time axis during the frame, so it "reflects" the frequency content of the frame of speech (a high ZCC implies high-frequency content). For example, a pure tone at frequency \(f\) crosses zero \(2f\) times per second, so a 1 kHz tone would produce a ZCC of about 40 in a 20 ms frame.

It is essential that any constant DC offset is removed from the signal prior to the ZCC computation. (DC is removed for the entire signal by subtracting the average of all the samples.)

Python Code

First we'll remove any DC offset:

In [11]:
 DC = mean(signal)
newSignal = signal - DC # create a new signal, preserving old
print 'DC ==> ', DC
print 'mean(newSignal) ==> ', mean(newSignal)
 DC               ==>  1.73124831013e-06
mean(newSignal) ==> -9.07564002932e-20

Note that the mean of the new signal is really close to zero.

Now we'll calculate the zero crossing counts:

In [12]:
 assert fs % 1000 == 0

sampsPerMilli = int(fs / 1000)
millisPerFrame = 20
sampsPerFrame = sampsPerMilli * millisPerFrame
nFrames = int(len(signal) / sampsPerFrame) # number of non-overlapping _full_ frames

print 'samples/millisecond ==> ', sampsPerMilli
print 'samples/[%dms]frame ==> ' % millisPerFrame, sampsPerFrame
print 'number of frames ==> ', nFrames
 samples/millisecond  ==>  16
samples/[20ms]frame ==> 320
number of frames ==> 151

In [13]:
 ZCCs = []                                    # list of short-time zero crossing counts
for i in range(nFrames):
    startIdx = i * sampsPerFrame
    stopIdx = startIdx + sampsPerFrame
    s = newSignal[startIdx:stopIdx]          # /s/ is the frame, named to correspond to the equation
    ZCC = 0
    for k in range(1, len(s)):
        ZCC += 0.5 * abs(sign(s[k]) - sign(s[k - 1]))
    ZCCs.append(ZCC)

And here's a plot of the results:

In [14]:
 plt.plot(ZCCs)
plt.title('Short-Time Zero Crossing Counts')
plt.ylabel('ZCC')
plt.xlabel('FRAME')
plt.autoscale(tight=True);
Figure 4 -- Short-time zero crossing counts.

The areas of the figure with very high (relative) zero crossing counts likely correspond to unvoiced regions of the signal. This method suffers in a noisy environment.

Uses of Energy and ZCC

Automated speech "end point" detection.

  • Needs to be able to operate with background noise
  • Needs to be able to ignore "short" background noises and intra-word silences

Voiced/unvoiced speech detection.

  • High energy + low ZCC ==> voiced
  • Low energy + high ZCC ==> unvoiced
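A minimal sketch of how these rules might be combined into a per-frame labeller, assuming the per-frame STEs and ZCCs lists computed earlier and purely illustrative threshold values (in practice the thresholds would be tuned, e.g. from a stretch of known background noise):

 # rough frame-by-frame labelling -- assumes STEs is the per-frame list from
# In [7] (re-run that cell if it was overwritten by the per-sample version in
# In [9]) and ZCCs is the per-frame list from In [13]
energyThreshold = 0.1 * max(STEs)          # hypothetical: below this ==> silence
zccThreshold = 50                          # hypothetical: above this ==> "high" ZCC

labels = []
for energy, zcc in zip(STEs, ZCCs):
    if energy < energyThreshold and zcc < zccThreshold:
        labels.append('silence')
    elif energy >= energyThreshold and zcc < zccThreshold:
        labels.append('voiced')            # high energy + low ZCC
    else:
        labels.append('unvoiced')          # high ZCC (fricatives, etc.)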

Parameters on which simple speech recognition/speaker verification/identification systems could be based.

Pitch Period Estimation

Pitch period is equal to the inverse of the fundamental frequency of vibration of the vocal cords.

It only makes sense to speak about the pitch period of a voiced frame of speech.

A number of techniques are used to determine pitch period in both the time and frequency domains.

Time Domain Method

Since pitch frequency is typically less than 600-700Hz, the speech signals are first lowpass filtered to remove components above this frequency range.
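The notebook code below skips this pre-filtering step, but a minimal sketch of it, assuming a 700 Hz cut-off and the fs and signal variables loaded earlier, could use a Butterworth filter from scipy.signal:

 from scipy.signal import butter, lfilter

cutoff = 700                                   # Hz, assumed upper limit of the pitch range
b, a = butter(4, cutoff / (fs / 2.0))          # 4th-order lowpass, normalized cut-off
lpSignal = lfilter(b, a, signal)               # filtered copy used only for pitch estimation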

The two most commonly used techniques are:

  • Short-time autocorrelation function
  • Average magnitude difference function (AMDF)

During voiced speech, the speech signal is "quasi-periodic".

Either technique attempts to determine the period (in samples) between "repetitions" of the voiced speech signal.

Autocorrelation Function

Correlation is a very commonly used technique in DSP to determine the "time difference" between two signals, where one is a "nearly perfect" delayed version of the other.

Autocorrelation is the application of the same technique to determine the unknown "period" of a quasi-periodic signal such as speech.

The autocorrelation function for a delay value of \(k\) samples is:

\(\phi(k) = \frac{1}{N} \sum\limits_{n=0}^{N-1} s[n]s[n-k]\), where the delay \(k\) varies from [say] 20 to 400 samples (the range of candidate pitch periods).

Clearly, \(\phi(k=0)\) would be equal to the average energy of the signal \(s[n]\) over the \(N\) sample frame.

If \(s[n]\) was perfectly periodic with a period of \(P\) samples, then \(s[n+P]=s[n]\).

Therefore, \(\phi(k=P) = \phi(k=0) = \text{Average Energy}\).

While this is not exactly true for speech signals, the autocorrelation function with \(k\) equal to the pitch period would still result in a large value.

For the \(k\) values between 0 and \(P\), the various terms (\(s[n]s[n-k]\)) in the autocorrelation function would tend to be a mixture of positive and negative values.

These would tend to cancel each other out, forcing the autocorrelation sum to yield a very low value for \(\phi(k)\).

Thus, for a given frame of \(N\) samples (of voiced speech), a plot of \(\phi(k)\) versus \(k\) would exhibit distinct peaks at \(k\) values of \(0,P,2P,...\) where \(P\) is the pitch period.

The graph of \(\phi(k)\) would contain small values between these peaks.

Thus, the pitch period of the frame is simply determined by measuring the distance (in samples) between the peaks of the autocorrelation graph.

In [15]:
 # setup voiced frame times (based on name of file)
if fName == '003_A_male_167Hz.wav':
    startTime = 0.45                         # of frame, units=seconds
    stopTime = 0.75                          # ditto
elif fName == '003_CatInTheHat.wav':
    startTime = 1.28                         # of frame, units=seconds
    stopTime = 1.36                          # ditto
else:
    assert False

# extract the frame
startIdx = int(startTime * fs)
stopIdx = int(stopTime * fs)
s = signal[startIdx:stopIdx]                 # /s/ is the frame, named to correspond to the equation

We'll take a look at what we're dealing with (i.e., the frame):

In [16]:
 plt.plot(s)
plt.title('Frame')
plt.ylabel('AMPLITUDE')
plt.xlabel('SAMPLE')
plt.autoscale(tight=True);
Figure 5 -- Raw frame.

And now for the actual autocorrelation:

In [17]:
 phis = []
N = len(s)
for k in range(0, 400):
    phi = 0
    for n in range(k, N):
        phi += s[n] * s[n - k]
    phi *= 1 / N
    phis.append(phi)

And here's what the autocorrelation looks like:

In [18]:
 plt.plot(phis)
plt.title('Autocorrelation of Frame')
plt.ylabel('PHI')
plt.xlabel('DELAY INDEX')
plt.autoscale(tight=True);
Figure 6 -- Autocorrelation plot.

For the 003_A_male_167Hz.wav file:
Looking at figure 6, two peaks of interest occur at delay indices 89 and 185 (approximately). The difference is 96 samples, which makes the pitch period \(96 / fs = 96 / 16000 = 0.006\) seconds and the fundamental frequency \(1 / 0.006 \approx 166.7\) Hz (which is essentially what the file name suggests).

For the 003_CatInTheHat.wav file:
Looking at figure 6, two peaks of interest occur at delay indices 149 and 299 (approximately). The difference is 150 samples, which makes the pitch period \(150 / fs = 150 / 16000 = 0.009375\) seconds and the fundamental frequency \(1 / 0.009375 \approx 106.7\) Hz (which is in line with what an SFS analysis suggests).

Of course, having a completely automated solution is desirable.
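One simple (and admittedly naive) automated sketch: restrict the search to a plausible range of pitch periods and take the delay of the largest autocorrelation value in that range, assuming the phis list and fs from above:

 from numpy import argmax

# search only over lags corresponding to roughly 50-400 Hz pitch
minLag = int(fs / 400)                     # shortest period of interest
maxLag = int(fs / 50)                      # longest period of interest

P = minLag + argmax(phis[minLag:maxLag])   # estimated pitch period, in samples
f0 = fs / P                                # estimated fundamental frequency, Hz
print 'pitch period ==> ', P, 'samples'
print 'fundamental ==> ', f0, 'Hz'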

Average Magnitude Difference Function

The AMDF is similar in spirit to the autocorrelation function, but behaves in the opposite sense: it dips, rather than peaks, at delays equal to multiples of the pitch period.

For a delay of \(k\) samples, the AMDF is defined as:

\[D(k) = \frac{1}{N} \sum_{n=0}^{N-1} |s[n]-s[n-k]|\]

Here's the AMDF code:

In [19]:
 Ds = []
N = len(s)
for k in range(0, 400):
    D = 0
    for n in range(k, N):
        D += abs(s[n] - s[n - k])
    D *= 1 / N
    Ds.append(D)

And here's what it looks like:

In [20]:
 plt.plot(Ds)
plt.title('Average Magnitude Difference Function of Frame')
plt.ylabel('D')
plt.xlabel('DELAY INDEX')
plt.autoscale(tight=True);
Figure 7 -- AMDF plot.

Notice that figure 7 looks like a vertically flipped version of figure 6.

The AMDF is actually easier to compute than the autocorrelation, and it looks like the troughs might actually be easier to pick out automatically.
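A matching sketch for the AMDF simply looks for the deepest trough over the same range of delays (again assuming the Ds list and fs from above):

 from numpy import argmin

minLag = int(fs / 400)
maxLag = int(fs / 50)
P = minLag + argmin(Ds[minLag:maxLag])     # delay of the deepest trough
print 'pitch period ==> ', P, 'samples (AMDF estimate)'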

That's enough for now...

 

Speech and Audio Processing (Part 1)

The next several [IPython] Notebook entries deal with working with speech audio files (specifically .wav files) in the [scientific] Python environment.

Introduction to Speech Processing

Speech processing is the application of digital signal processing (DSP) techniques to the processing and/or analysis of speech signals.

Applications of speech processing include:
  • Speech Coding
  • Speech Recognition
  • Speaker Verification/Identification
  • Speech Enhancement
  • Speech Synthesis (Text-to-Speech Conversion)

Process of Speech Production

The speech production process begins when the talker formulates a message in her mind to transmit to the listener via speech.

The next step is the conversion of the message into "language code". This is the process of converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody markers denoting the duration, loudness, and pitch associated with the sounds.

The Mechanism of Speech Production

In order to apply DSP techniques to speech processing problems, it is important to understand the fundamentals of the speech production process.

Speech signals are composed of a sequence of sounds that are produced as a result of acoustical excitation of the vocal tract when air is expelled from the lungs.

Classification of Speech Sounds

In speech processing, speech sounds are divided into two broad classes, depending on the role of the vocal cords in the speech production mechanism.
  • Voiced speech is produced when the vocal cords play an active role (i.e., vibrate) in the production of sound. For example, the voiced sounds /a/, /e/, /i/.
  • Unvoiced speech is produced when the vocal cords are inactive. For example, the unvoiced sounds /s/, /f/.

Voiced Speech

Voiced speech occurs when air flows through the vocal cords into the vocal tract in discrete "puffs" rather than as a continuous flow.

The vocal cords vibrate at a particular frequency, which is called the fundamental frequency of the sound.
  • 50 - 200Hz for male speakers
  • 150 - 300Hz for female speakers
  • 200 - 400Hz for child speakers

Unvoiced Speech

For unvoiced speech, the vocal cords are held open and air flows continuously through them.

However, the vocal tract is narrowed, resulting in a turbulent flow of air along the tract.

Examples include the unvoiced fricatives /f/ and /s/.

Unvoiced speech is characterized by high frequency components.

Other Sound Classes

Nasal Sounds
  • Vocal tract coupled acoustically with nasal cavity through velar opening
  • Sound radiated from nostrils as well as lips
  • Examples include /m/, /n/, /ing/
Plosive Sounds
  • Characterized by a complete closure/constriction towards the front of the vocal tract
  • Build up of pressure behind closure, sudden release
  • Examples include /p/, /t/, /k/

Resonant Frequencies of Vocal Tract

The vocal tract is a non-uniform acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.

The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw, and velum.

The spectrum of vocal tract response consists of a number of resonant frequencies.

These frequencies are called formants.

Three to four formants are typically present below 4 kHz for speech.

Formant Frequencies

Speech normally exhibits one formant per kHz.

For voiced speech, the magnitude of the lower formant frequencies is successively larger than the magnitude of the higher formant frequencies.

For unvoiced speech, the magnitude of the higher formant frequencies is successively larger than the magnitude of the lower formant frequencies.

Basic Assumptions of Speech Processing

The basic assumption of nearly all speech processing systems is that the source of excitation and the vocal tract system are independent.

Therefore, it is a reasonable approximation to model the source of excitation and the vocal tract system separately.

The vocal tract changes shape rather slowly in continuous speech and it is reasonable to assume that the vocal tract has a fixed set of parameters over a time interval [on the order of 10 milliseconds].

Thus, once every 10 milliseconds [on average], the vocal tract configuration is varied, producing new vocal tract parameter values (resonant frequencies).

Speech Sounds

Phonemes are the smallest distinctive units of speech sounds. For example, /d/ and /b/ are distinct phonemes, as in dark versus bark.

It is important to realize that phonemes are abstract linguistic units and may not be directly observed in the speech signal.

Different speakers producing the same string of phonemes convey the same information yet sound different as a result of the differences in dialect and vocal tract length and shape.

There are approximately 40 phonemes in English.

Model of Speech Production

To develop an accurate model for how speech is produced it is necessary to develop a digital filter-based model of the human speech production mechanism.

The model must accurately represent:
  • The excitation mechanism of the speech production system
  • The operation of the vocal tract
  • The lip/nasal radiation process
  • Both voiced and unvoiced speech [for 10-30 milliseconds]

Excitation Process

The excitation process must take into account:
  • The voiced/unvoiced nature of speech
  • The operation of the glottis
  • the "energy" of the speech signal

in a given 10-30 millisecond frame of speech.

The nature of the excitation function of the model will differ depending on the nature of the speech sounds being produced.
  • For voiced speech, the excitation will be a train of unit impulses spaced at intervals of the pitch period.
  • For unvoiced speech, the excitation will be a random noise-like signal.

The next stage in the excitation process is a model of the pulse shaping operation of the glottis. [This is only used for voiced speech.]

Finally, the "energy" of the sound is modelled by a gain factor. Typically the gain factor for voiced speech will be in the region of 10 times that of unvoiced speech.
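A minimal sketch of the two excitation types for a single frame, with an assumed frame length, pitch period, and gain values (the glottal pulse shaping stage is omitted):

 from numpy import zeros
from numpy.random import randn

frameLen = 320                               # e.g. 20 ms at a 16 kHz sampling rate
pitchPeriod = 150                            # samples -- assumed (about 107 Hz at 16 kHz)

# voiced: train of unit impulses spaced at the pitch period, scaled by a gain factor
voicedGain = 10.0                            # illustrative -- roughly 10x the unvoiced gain
voicedExcitation = zeros(frameLen)
voicedExcitation[::pitchPeriod] = voicedGain

# unvoiced: random noise-like signal with a smaller gain
unvoicedGain = 1.0
unvoicedExcitation = unvoicedGain * randn(frameLen)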

Vocal Tract Model

The vocal tract can be modelled acoustically as a series of short cylindrical tubes.

The model consists of N lossless tubes each of length L and cross sectional area A.

Total length = N * L (for example, a typical adult vocal tract of about 17 cm modelled with N = 10 tubes gives L = 1.7 cm per tube).

The acoustic model can be converted into a time varying digital filter model.

For either voiced or unvoiced speech, the underlying spectrum of the vocal tract will exhibit distinct frequency peaks. These are known as the formant frequencies of the vocal tract.

Ideally, the vocal tract model should implement at least three or four of the formants.

Vocal Tract Model -- Voiced Speech

For voiced speech, the vocal tract model can be adequately represented by an "all-pole" model.

Typically, two poles are required for each resonance, or formant frequency.

The all-pole model can be viewed as a cascade of 2nd order resonators (2 poles each).
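As a sketch of that idea, an all-pole model with three assumed formant frequencies and bandwidths can be built as a cascade of 2nd-order resonators, with the pole radius set from the bandwidth, and its spectrum inspected with scipy.signal.freqz:

 from numpy import exp, cos, pi, log10, polymul
from scipy.signal import freqz
import matplotlib.pyplot as plt

fs = 16000                                   # Hz, assumed sampling rate
formants = [500, 1500, 2500]                 # Hz, assumed formant frequencies
bandwidths = [80, 120, 160]                  # Hz, assumed formant bandwidths

A = [1.0]                                    # denominator polynomial of the all-pole model
for F, B in zip(formants, bandwidths):
    r = exp(-pi * B / fs)                    # pole radius set by the bandwidth
    theta = 2 * pi * F / fs                  # pole angle set by the formant frequency
    A = polymul(A, [1, -2 * r * cos(theta), r ** 2])   # one 2nd-order (2-pole) resonator

w, H = freqz([1], A)                         # H(z) = 1 / A(z)
plt.plot(w * fs / (2 * pi), 20 * log10(abs(H)))
plt.title('All-Pole Vocal Tract Model (3 formants)')
plt.xlabel('FREQ (Hz)')
plt.ylabel('GAIN (dB)')
plt.show()

The peaks of the resulting spectrum sit at the chosen formant frequencies, which is the behaviour the vocal tract model is meant to capture.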

Use of the Vocal Tract Model

The model of the vocal tract can be made to be a very accurate model of speech production for short (10-30 millisecond) frames of speech.

It is widely used in modern low bit-rate speech coding algorithms, as well as speech synthesis and speech recognition/speaker identification systems.

It is necessary to develop a technique which allows the coefficients of the model to be determined for a given frame of speech.

The most commonly used technique is called Linear Predictive Coding (LPC).
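As a sketch of the autocorrelation method of LPC analysis, the hypothetical helper below computes the predictor coefficients for a single frame of speech (the frame s and the model order are assumptions, and scipy's Toeplitz machinery stands in for the Levinson-Durbin recursion usually used):

 from numpy import array
from numpy.linalg import solve
from scipy.linalg import toeplitz

def lpc_coefficients(s, order=10):
    """Return LPC coefficients a[1..order] for one frame /s/ (a 1-D numpy array of samples)."""
    N = len(s)
    # autocorrelation values r[0..order] of the frame
    r = array([(s[k:] * s[:N - k]).sum() for k in range(order + 1)])
    R = toeplitz(r[:order])                  # Toeplitz matrix of the normal equations
    return solve(R, r[1:order + 1])          # predictor coefficients a[1..order]

The formant frequencies discussed earlier can then be estimated from the angles of the roots of the resulting prediction polynomial.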