Signal Processing using Python - Part 1


Hi guys!!

This post is for the people who love signal processing. Currently Matlab is one of the most widely used tools in the signal processing community, but enough of Matlab, really!!! These days almost everyone knows how to use Matlab.

Python, on the other hand, is another very powerful language which can also be used for signal/image processing.

Well, here's how to get started with signal processing in Python.

1) You are gonna need some Python libraries such as numpy, scipy and matplotlib (pylab comes bundled with matplotlib). These are available for free. Ubuntu and Debian users can install them by typing this in the terminal:

sudo apt-get install python3-numpy python3-scipy python3-matplotlib

2) Numpy is the numerical library of Python, which includes modules for N-dimensional arrays, Fourier transforms (DFT) etc. Scipy is the scientific library, used here for importing the .wav file. Matplotlib is Python's 2D plotting library.

In this post I am gonna start with a simple piece of code:

Computing the Spectrogram of an audio signal.

A spectrogram, or sonogram, is a visual representation of the spectrum of frequencies in a sound as it varies with time. Spectrograms are sometimes called spectral waterfalls, voiceprints, or voicegrams.

The procedure for finding the spectrogram of a signal is as follows:

  • Read the signal from a .wav file into a numpy array (1D for a mono file, 2D for a multi-channel one).
  • Divide the signal into overlapping frames, keeping each frame size at, say, 25 ms, with a new frame starting every 10 ms.
  • Take the DFT of each frame; together these make up the short time Fourier transform of the signal.
  • Compute the power spectrum of each frame, i.e. the square of the absolute value of the DFT of each frame.

Explanation of the Python code:

  • import numpy
    import matplotlib.pyplot as plt
    import scipy.io.wavfile  # This library is used for reading the .wav file (assumed module, the name was lost)
  • fs, signal = scipy.io.wavfile.read('w1.wav')  # input wav file, change here (assumed call)
    # fs = sampling frequency, signal is the numpy array where the data of the wav file is written (assumed mono)
  • length = len(signal)  # the length of the wav file. This gives the number of samples, not the length in time
    window_hop_length = 0.01  # 10 ms, change here
    overlap = int(fs * window_hop_length)  # hop between frames in samples (assumed definition)
    print("overlap=", overlap)
    window_size = 0.025  # 25 ms, change here
    framesize = int(window_size * fs)  # frame length in samples (assumed definition)
    print("framesize=", framesize)
    nfft_length = framesize  # length of DFT, change here
    number_of_frames = length // overlap  # assumed definition
    print("number of frames are =", number_of_frames)
  • frames = numpy.zeros((number_of_frames, framesize))  # This declares a 2D matrix, with rows equal to the number of frames, and columns equal to the framesize or the length of each DFT
    for k in range(0, number_of_frames):
        for i in range(0, framesize):
            if k * overlap + i < length:  # assumed loop body: copy the samples, leave zeros past the end
                frames[k][i] = signal[k * overlap + i]
  • fft_matrix = numpy.zeros((number_of_frames, framesize), dtype=complex)  # declares another 2D matrix to store the DFT of each windowed frame
    abs_fft_matrix = numpy.zeros((number_of_frames, framesize))  # declares another 2D matrix to store the power spectrum
  • for k in range(0, number_of_frames):
        fft_matrix[k] = numpy.fft.fft(frames[k])  # computes the DFT
        abs_fft_matrix[k] = abs(fft_matrix[k]) ** 2 / max(abs(fft_matrix[k]))  # computes the power spectrum, scaled by the peak magnitude of the frame
  • t = range(len(abs_fft_matrix))  # This code segment simply plots the power spectrum obtained above
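For comparison, here is a minimal vectorized sketch of the same framing and power spectrum steps. It is not the code from this post: the function name spectrogram, the use of numpy.fft.rfft (which keeps only the non-negative frequencies) and the choice to drop the incomplete last frame instead of zero-padding it are all assumptions made for illustration. It runs on a synthetic tone so that no .wav file is needed.

```python
import numpy as np

def spectrogram(signal, fs, window_size=0.025, window_hop_length=0.01):
    """Return a (number_of_frames, framesize // 2 + 1) array of power spectra."""
    framesize = int(round(window_size * fs))      # samples per frame
    hop = int(round(window_hop_length * fs))      # samples between frame starts
    # Only complete frames are kept (assumption: no zero-padding at the end).
    number_of_frames = 1 + max(0, (len(signal) - framesize) // hop)
    # Row k of idx holds the sample indices that belong to frame k.
    idx = hop * np.arange(number_of_frames)[:, None] + np.arange(framesize)[None, :]
    frames = signal[idx]
    # Square of the absolute value of the DFT of each frame.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Synthetic test signal: one second of a 1 kHz tone sampled at 8 kHz.
fs = 8000
time = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * time)

power = spectrogram(tone, fs)
framesize = int(round(0.025 * fs))
freqs = np.fft.rfftfreq(framesize, d=1 / fs)   # frequency of each DFT bin in Hz
peak_hz = freqs[power[0].argmax()]             # strongest frequency in the first frame
```

Plotting power.T with plt.imshow would then give the usual picture, with time along one axis and frequency along the other.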





Support Vector Machine Revisited - An Intuitive Approach. ASR Part 2


Support Vector Machine: In layman's terms, it's a mathematical method for finding a line (a hyperplane actually) separating 2 sets/classes of data.

[Figure: svm 1]

The whole idea in machine learning is to make a prediction for a given input vector based on some training data that we already have.

In this post I am gonna try and explain SVM in a more intuitive sense rather than through a purely mathematical approach.

SVMs are mostly used for pattern recognition problems, i.e. given a set of input data, we try to find a measure of similarity in the data. One such measure is the dot product of 2 vectors. For vectors of unit length, the dot product gives the cosine of the angle between them. However, by itself it's not that useful.

For SVM, we'll use an entity called a kernel function. The kernel function gives an idea about the similarity between 2 vectors. For example, with a Gaussian (RBF) kernel, if the 2 vectors are very similar the kernel function gives a result close to 1, and if the 2 vectors are very different, the result is close to 0.
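As a small illustration of a kernel that behaves this way, here is a sketch of the Gaussian (RBF) kernel. The function name rbf_kernel, the parameter gamma and the example vectors are assumptions picked for the example, not something this post prescribes.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * |x - z|^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])     # identical vectors
far = rbf_kernel([1.0, 2.0], [10.0, -7.0])    # very different vectors
```

Identical vectors give exactly 1, and the value falls towards 0 as the vectors move apart; gamma controls how quickly that happens.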

An SVM essentially finds an optimal hyperplane that separates 2 classes of data. Let's say we have 2 classes of data with the labels +1 (the positive class) and -1 (the negative class); then the hyperplane classifier would be

y=sgn(w.x +b),

where w is the weight vector orthogonal to the hyperplane, b is the offset of the hyperplane and x is the feature vector. If the quantity in the brackets is >0 then the label assigned to x is +1, and if the quantity in the brackets is <0 then the label assigned to x is -1. This is actually how you use a hyperplane classifier to make a prediction.
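The decision rule y = sgn(w.x + b) can be sketched in a few lines. The values of w and b below are hypothetical, chosen only to have something to run; in practice they come out of training.

```python
import numpy as np

def predict(w, b, x):
    """Label a feature vector with +1 or -1 using y = sgn(w.x + b)."""
    # A point exactly on the hyperplane (w.x + b == 0) is given -1 here,
    # which is an arbitrary choice for this sketch.
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([0.5, 0.5])   # hypothetical weight vector
b = 0.0                    # hypothetical offset

label_a = predict(w, b, [2.0, 2.0])     # lies on the positive side
label_b = predict(w, b, [-1.0, -3.0])   # lies on the negative side
```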

As can be seen in the above figure, the minimum distance between the hyperplane and the nearest points is 1/|w|.

The job of the SVM is to find out w and b.

This is done by solving the following quadratic optimization problem.

minimize (1/2)|w|^2 over w and b, subject to y_i(w.x_i + b) >= 1 for every training point (x_i, y_i)

The problem can be justified by the following arguments.

  1. Minimizing |w| in essence means maximizing the distance between the hyperplane and the nearest point, since that distance is equal to 1/|w|.
  2. The constraint >=1 can be explained as follows. Let's say it was >=0. Then scaling w and b by any k (0<k<1) would still satisfy the constraint while decreasing |w|, so whatever w we pick, a smaller one is also allowed, and we would never reach a true minimum of the objective function. Requiring >=1 rules this out. In essence any other positive number would work just as well as 1.

Since it is a constrained optimization problem, we will have to use the concept of Lagrange multipliers to solve it.

The alphas are called the Lagrange multipliers.

The Lagrangian is maximized with respect to the alphas and minimized with respect to w and b.


[Figure: the Lagrangian and its derivatives (lagrange differentiate.JPG)]
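For reference, here is the textbook form of the Lagrangian and of the two conditions obtained by differentiating it. This assumes the standard hard-margin problem, i.e. minimizing (1/2)|w|^2 subject to y_i(w.x_i + b) >= 1, with one multiplier alpha_i per training point; the norm written as |w| in the text appears as \lVert w \rVert here.

```latex
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2
  - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right],
  \qquad \alpha_i \ge 0

\frac{\partial L}{\partial w} = 0
  \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i

\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_i \alpha_i y_i = 0
```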

The alphas are always greater than or equal to 0. This is because at the optimal point the gradients of the objective function and of the constraint point in the same direction, hence alpha has to be non-negative.

Plugging these conditions back into the original Lagrangian, we can eliminate w and b and get the dual form.

As it turns out, the dual form is actually a lot easier to solve, since it is a convex optimization problem in the alphas alone.

[Figure: dual form]
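Written out, the dual takes the following textbook form. It is obtained by substituting w = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0 into the Lagrangian of the standard hard-margin problem, which is assumed here.

```latex
\max_{\alpha} \;\; \sum_i \alpha_i
  - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0
```

Notice that the training points enter only through the dot products x_i . x_j, which is exactly where a kernel function can be swapped in.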

Due to the KKT complementarity conditions on the Lagrange multipliers (alphas), for each training point either its alpha will be 0 or y(w.x+b) will be exactly 1. The points with non-zero alphas are called SUPPORT VECTORS.

Why the name support vectors? In the final equation for the hyperplane, in which w is written in terms of the alphas, only the points with non-zero alphas contribute, or in essence support the hyperplane!!!

One last and important thing: how to visualize an SVM in a real world scenario?

Think of the hyperplane as a solid sheet in mechanical equilibrium (since it is the optimum hyperplane). Let each support vector x_i push on the sheet with a force of magnitude alpha_i in the direction y_i(w/|w|). Then the 2 conditions derived above say that the total force and the total torque sum to zero, thus maintaining mechanical equilibrium.
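Here is a tiny numeric sketch of how only the support vectors end up in w. The three training points and their alphas are a toy problem worked out by hand for this example (they are not from a real data set); the third point lies well away from the boundary, so its alpha is 0.

```python
import numpy as np

# Toy training set (hypothetical): two points close to the boundary, one far away.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.25, 0.25, 0.0])   # dual solution for this toy set, found by hand

# w = sum_i alpha_i y_i x_i : the point with alpha = 0 contributes nothing.
w = (alpha * y) @ X

# b follows from any support vector s, where y_s (w.x_s + b) = 1 holds exactly.
s = 0
b = y[s] - np.dot(w, X[s])

margins = y * (X @ w + b)             # equals 1 on support vectors, more elsewhere
support = np.flatnonzero(alpha > 0)   # indices of the support vectors
```

Dropping the third point from the training set would leave w and b unchanged, which is the sense in which the other two points support the hyperplane.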

[Figure: final svm]

That's a lot to take in for one post!! In my next post I am gonna tell how to train the SVM, i.e. find the parameters w and b.


Feature vector for Automatic Speech Recognition (ASR)


Hi guys!!

Today I am gonna talk about how to go about making a speaker recognition system. Voice and speaker recognition is a growing field, and sooner or later almost everything will be controlled by voice (the Google Glass is just a start!!!). I am gonna start from the basics and try to keep it as simple as I can.

Well, the first step in voice/speech recognition is to extract the feature vector of a voice signal. (By feature vector I mean a set of attributes that define the signal.) In the case of voice recognition it consists of attributes like pitch, number of zero crossings of the signal, loudness, beat strength, frequency, harmonic ratio, energy etc.

In most cases the pitch and the loudness of the signal are the most important factors. The fundamental or first harmonic of any tone is perceived as its pitch.

In this article I am gonna talk about a certain class of feature vectors called the Mel Frequency Cepstrum Coefficients (MFCCs). (BIG WORDS HUH!!)

Let me break them down into simple terms:

a) Mel: this is actually a scale that relates perceived pitch to frequency, as shown in the figure.

[Figure: Mel scale (Source: Wikipedia)]

The formula to convert from the frequency scale (in Hz) to Mel is: m = 2595 log(1 + (f/700)) (where the log is to the base 10).

b) Cepstrum: It's just a fancy term for the inverse Fourier transform (FT) of the log spectrum of any signal.
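The Mel conversion formula above, together with its inverse, can be sketched as follows. The function names hz_to_mel and mel_to_hz are just names chosen for this example.

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to Mel: m = 2595 log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the formula above: f = 700 (10^(m/2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)              # comes out very close to 1000 Mel
round_trip = mel_to_hz(hz_to_mel(4000.0))
```

The constants are chosen so that 1000 Hz maps to roughly 1000 Mel; above that, equal steps in Mel correspond to larger and larger steps in Hz.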

Steps for finding the MFCCs:

  • Decompose the signal into short frames, say 25 ms each. Keep in mind to break it into overlapping frames. (I'll explain the reasons behind these steps below.)
  • Find the power spectrum of each frame.
  • Apply the Mel filterbank to each power spectrum and sum the energy in each filter.
  • Take the log of the above energies.
  • Take the DCT (Discrete Cosine Transform) of the log energies.
  • Keep the lower 12-13 DCT coefficients for each frame. These are your MFCCs.

What do the above steps signify and why are they done?

The signal is framed so that we get an accurate representation of what is actually happening throughout the signal, since the signal is changing constantly; over a short enough frame it can be treated as roughly stationary. The frame size should not be too small, because then there are too few samples to get a reliable spectral estimate. It also should not be too large, since then the signal will change too much within that time frame.

By power spectrum I mean the square of the modulus of the Discrete Fourier Transform. In simple terms the power spectrum identifies which frequencies are present in the frame. This estimate of the power spectrum is also called the periodogram. One thing to keep in mind before taking the Fourier transform is to multiply each frame by a HAMMING WINDOW. A HAMMING WINDOW is simply an envelope that tapers the frame smoothly towards its edges, which reduces the spectral leakage caused by cutting the signal abruptly.
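A minimal sketch of the windowing and power spectrum step for a single frame is given below. The function name power_spectrum, the 1/nfft scaling and the use of numpy.fft.rfft (non-negative frequencies only) are assumptions of this example; other scalings are also in use.

```python
import numpy as np

def power_spectrum(frame, nfft=None):
    """Periodogram of one frame: |DFT(frame * Hamming window)|^2 / nfft."""
    frame = np.asarray(frame, dtype=float)
    if nfft is None:
        nfft = len(frame)
    windowed = frame * np.hamming(len(frame))   # taper the edges of the frame
    return np.abs(np.fft.rfft(windowed, n=nfft)) ** 2 / nfft

# One 25 ms frame of a 1 kHz tone sampled at 8 kHz (200 samples).
fs = 8000
n = np.arange(200)
frame = np.sin(2 * np.pi * 1000 * n / fs)

spec = power_spectrum(frame)
peak_bin = int(spec.argmax())     # bins are fs / 200 = 40 Hz apart
peak_hz = peak_bin * fs / 200
```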


[Figure: Power spectrum (periodogram) of a signal]


The human ear cannot tell the difference between 2 closely spaced frequencies. This effect becomes more pronounced as the frequencies increase. This is where the Mel Filterbank comes into the picture.

The Mel Filterbank is simply a set of triangular filters in the frequency domain. The first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations. This roughly gives how much energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and how wide to make them.


[Figure: a bank of 26 Mel filters ranging from 300 to 4000 Hertz]

The Mel Filterbank gives an idea of how much energy is present in each frequency region. This is found by multiplying the power spectrum of each frame by each Mel filter and summing the result.

For example, suppose we had 26 Mel filters. (The number of filters chosen is completely up to you; there's no hard and fast rule for it. But usually about 20-40 filters are chosen.) To find the energy in each filter we multiply the power spectrum of each frame by each of the Mel filters and add up the products, which gives one number per filter.
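Here is a sketch of how such a bank of triangular filters could be built and applied. The function name mel_filterbank, the 512-point DFT, the 8 kHz sampling rate and the flat dummy power spectrum are all assumptions made for the example; the 26 filters between 300 and 4000 Hertz follow the numbers used in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs, low_hz=300.0, high_hz=4000.0):
    """Return an (n_filters, nfft // 2 + 1) matrix of triangular filters."""
    # n_filters + 2 points equally spaced on the Mel scale; filter m uses
    # points m, m + 1 and m + 2 as its left edge, peak and right edge.
    mel_points = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.arange(nfft // 2 + 1) * fs / nfft   # frequency of each DFT bin
    bank = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        # Triangle: the smaller of the two slopes, clipped at zero.
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

fs, nfft = 8000, 512
bank = mel_filterbank(26, nfft, fs)

power = np.ones(nfft // 2 + 1)   # flat dummy power spectrum of one frame
energies = bank @ power          # multiply by each filter and sum: 26 numbers
```

With a flat spectrum the later filters collect more energy than the first ones, simply because they are wider in Hertz.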

So, after multiplying you would have 26 coefficients for each frame. Then the log of these 26 coefficients is taken. Why LOG? Well, this is motivated by human hearing. We do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with.

But why just LOG and not some other function like RMS? Taking the log allows us to use cepstral mean subtraction, which is a channel normalization technique. A channel's effect multiplies the spectrum, so after the log it becomes an additive offset that can be subtracted out. It is usually done when the channel varies too much. It subtracts the mean of each cepstral coefficient, taken over the frames, from that coefficient.

And finally the Discrete Cosine Transform (DCT). Since the Mel filters overlap one another, the log energies of neighbouring filters are correlated. The DCT is done to de-correlate these energies. This will be useful when we use Hidden Markov Models (HMMs) [I am gonna explain what these are later]. After taking the DCT, the lower 12-13 coefficients are kept. These coefficients are called the Mel frequency cepstral coefficients.

But why the lower 12-13 coefficients? This is because the higher DCT coefficients represent fast changes in the filterbank energies, and these coefficients degrade the performance of ASR.
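The last two steps, the log followed by the DCT, can be sketched as below for one frame. The DCT-II is written out from its definition without any normalization, and the names dct2 and mfcc_from_energies, the small constant added before the log and the dummy input are assumptions of this example; library implementations usually add a scaling factor.

```python
import numpy as np

def dct2(x):
    """Unnormalised DCT-II of a 1D array, written out from its definition."""
    n = len(x)
    k = np.arange(n)[:, None]   # index of the output coefficient
    i = np.arange(n)[None, :]   # index of the input sample
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return basis @ x

def mfcc_from_energies(filterbank_energies, n_coeffs=13):
    """Log of the filterbank energies, then DCT, keeping the lower coefficients."""
    energies = np.asarray(filterbank_energies, dtype=float)
    log_energies = np.log(energies + 1e-10)   # small constant avoids log(0)
    return dct2(log_energies)[:n_coeffs]

# Dummy input: 26 equal filterbank energies, so the log energies are all 1.
energies = np.full(26, np.e)
mfcc = mfcc_from_energies(energies)
```

For this flat input everything lands in coefficient 0 and the rest are essentially zero, which shows that the higher coefficients only respond to variation across the filters.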

From each frame we get a feature vector of length 12-13.

So, there you go, we have the feature vector which can be used for speech recognition and music classification. You can find the Matlab code for the MFCCs at my github site.

In my next post I am gonna tell how to use this feature vector for speaker classification and recognition.