A single-perceptron text classifier applied to sentiment analysis
Text classification is one of the basic tasks of natural language processing and can be used to solve a wide range of problems, such as sentiment analysis and social media data analysis. In this article we will build a text classifier from scratch by training a single perceptron model on a publicly available labeled text dataset.
Dataset
The dataset we will use to train and test our classifier is publicly available at: https://www.kaggle.com/c/si650winter11/data
It contains sentences labeled 0 or 1, such as the following:
0 Brokeback Mountain was boring.
1 The Da Vinci Code book is just awesome.
0 is the label of negative sentiments and 1 is the label of positive sentiments.
Source code
The source code is available in this git repository: https://github.com/slrbl/perceptron-text-classification-from-scracth
Feature extraction & the bag of words
Unlike structured datasets, text data does not come with an explicit list of features on which to do feature engineering and build a model. To extract features from text dynamically, we will use a method called bag of words. Given an input text, the bag of words algorithm creates a list whose elements are the distinct words found in it.
Let's consider the following two sentences:
we are excited about this classifier.
We will test the classifier with real data.
The bag of words corresponding to the above text is the following list:
['we', 'are', 'excited', 'about', 'this', 'classifier', 'will', 'test', 'the', 'with', 'real', 'data']
Using this list, we count the occurrences of each word in other texts from the same context; the resulting vectors are the features:
Text1: The classifier will build is useful
Text2: We will test the model with real data
Text1 features: [ 0 0 0 0 0 1 1 0 1 0 0 0 ]
Text2 features: [ 1 0 0 0 0 0 1 1 1 1 1 1 ]
For our classifier, we will create one big bag of all the words in all the text inputs and apply it to extract features from each single input.
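As an illustration, the word bag and the count vectors from the example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the code from the repository: the helper names tokenize, build_word_bag and extract_features are hypothetical, and punctuation handling is reduced to dropping periods.

```python
def tokenize(text):
    # Lowercase, drop periods, and split on whitespace
    return text.lower().replace('.', '').split()


def build_word_bag(texts):
    # Collect each distinct word once, in order of first appearance
    word_bag = []
    for text in texts:
        for word in tokenize(text):
            if word not in word_bag:
                word_bag.append(word)
    return word_bag


def extract_features(text, word_bag):
    # Count how many times each word of the bag occurs in the text
    words = tokenize(text)
    return [words.count(word) for word in word_bag]


word_bag = build_word_bag([
    "we are excited about this classifier.",
    "We will test the classifier with real data.",
])
text1_features = extract_features("The classifier will build is useful", word_bag)
text2_features = extract_features("We will test the model with real data", word_bag)
print(word_bag)
print(text1_features)
print(text2_features)
```

Words that are not in the bag, such as "model" or "useful", are simply ignored, so every text is mapped to a vector of the same length.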
The perceptron
A perceptron is the fundamental unit of a neural network. Given an input X and a function f(X) = w.X + b that produces an output Y = f(X), a perceptron can be trained to learn the coefficients w and b that best predict Y.
[Figure: a single perceptron]
As you may have already understood, our goal is to find the weights w and the bias b that best predict Y. To do so, we need labeled data (supervised learning) and an implementation of an algorithm called back-propagation.
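To make the formula concrete, here is a small sketch of a single forward pass: the weighted sum w.X + b is computed and then squashed into the range (0, 1) by a sigmoid, which is the activation used later in this article. The weights, bias and input values below are made up for illustration.

```python
import math


def sigmoid(z):
    # Map any real number to the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    # Weighted sum of the inputs plus the bias, followed by the sigmoid
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical values: three features, three weights and a bias
x = [1.0, 0.0, 2.0]
w = [0.5, -0.3, 0.25]
b = -1.0
y = forward(x, w, b)
print(y)  # z = 0.5 + 0.0 + 0.5 - 1.0 = 0.0, so the output is 0.5
```

An output above 0.5 can then be read as a positive prediction and an output below 0.5 as a negative one.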
Back-propagation
As its name indicates, back-propagation is an algorithm that optimises the weights to fit the outputs by performing backward calculation passes. A good explanation of back-propagation and the mathematical theory behind it can be found at
but to keep things as simple as possible, the calculation steps can be condensed as follows:
1. Initialisation of the weights at zero or at random values
2. A forward calculation step to get the sigmoid value (Y)
3. Calculation of the cost function (error), which measures the difference between the actual output (label) and the predicted output
4. Calculation of the derivatives dw and db
5. Calculation of the new w and b
6. Go back to step 2 until the cost function is as close as possible to 0
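For reference, steps 2 to 5 can be written out as equations. This is a sketch that assumes a sigmoid activation and the cross-entropy cost, which is what the train_model function shown later in this article computes. The notation is introduced here for illustration: m is the number of training units, x^{(i)} and y^{(i)} are the i-th input and label, a^{(i)} is the predicted output, and \alpha is the learning rate.

```latex
\begin{aligned}
a^{(i)} &= \sigma\left(w^{T} x^{(i)} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - a^{(i)}\right) \right] \\
dw &= \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right) x^{(i)} \\
db &= \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right) \\
w &\leftarrow w - \alpha \, dw, \qquad b \leftarrow b - \alpha \, db
\end{aligned}
```

Because there is a single unit, the backward pass reduces to one application of the chain rule per parameter.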
Implementation
The goal of our implementation is to build a text classifier able to distinguish between a negative and a positive sentence.
Example:
The movie was very long, it was one of my worst experiences: Negative
Machine learning is awesome: Positive
Data preparation
The two functions clean_text and shuffle_data are in charge of preparing the raw text data for model training.
clean_text takes as inputs the lines of a raw text file and a 'remove_list'. It removes the characters present in remove_list to reduce noise in the training data, and lowercases each line.
shuffle_data takes as inputs the output of clean_text (or the raw lines of a text file) and a shuffle factor. It swaps two randomly chosen lines, shuffle_factor times.
from random import randint


def clean_text(data, remove_list):
    # Remove the unwanted characters from each line, then lowercase it
    cleaned_text = []
    for data_unit in data:
        for char in remove_list:
            data_unit = data_unit.replace(char, '')
        data_unit = data_unit.lower()
        cleaned_text.append(data_unit)
    return cleaned_text


def shuffle_data(data_list, shuffle_factor):
    # Swap two randomly chosen lines, shuffle_factor times (in place)
    m = len(data_list)
    for _ in range(shuffle_factor):
        rnd_index = randint(0, m - 1)
        rnd_index2 = randint(0, m - 1)
        data_list[rnd_index], data_list[rnd_index2] = data_list[rnd_index2], data_list[rnd_index]
    return data_list
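As an alternative, Python's standard library offers random.shuffle, which reorders a list in place in a single call. The sketch below is not the approach used in the repository; the sample lines and the seed are arbitrary and only there to make the run repeatable.

```python
import random

# Hypothetical labelled lines: a 0/1 label, a tab, then the sentence
lines = [
    "0\tbrokeback mountain was boring",
    "1\tthe da vinci code book is just awesome",
    "1\tmachine learning is awesome",
    "0\tthe movie was very long",
]

random.seed(42)            # fixed seed so that the order is reproducible
shuffled = list(lines)     # copy, so the original list is left untouched
random.shuffle(shuffled)   # in-place shuffle of the copy
print(shuffled)
```

With this approach every line takes part in the shuffle, whereas with random swaps a small shuffle_factor may leave many lines at their original position.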
Getting the features: the word bag
The aim of the function get_data is to convert the text data prepared by clean_text and shuffle_data into numeric data using the bag of words. It takes as inputs the data_list output by the previously described functions, the start and end indices of the lines to use, and an optional word_bag. If word_bag is provided, the function uses it to extract the features from the input data (this is the case when get_data is given test data).
If it is not provided, the function generates it from the input data (this is the case with training data). In summary, the word_bag generated from the training data is used to extract the features from the testing data.
import numpy as np


def get_data(data_list, start, end, word_bag=None):
    # Build the word bag only when none is supplied (training data)
    get_w_b = word_bag is None
    if get_w_b:
        word_bag = []
    # Lines to process; slicing up to end keeps X and Y aligned with m rows
    data_units = data_list[start:end]
    m = len(data_units)
    phrases = []
    Y = np.zeros(shape=(m, 1))
    for i, data_unit in enumerate(data_units):
        splitted_data_unit = data_unit.split('\t')
        # Get Y (the outputs)
        Y[i, 0] = int(splitted_data_unit[0])
        if get_w_b:
            for word in splitted_data_unit[1].split(' '):
                if word not in word_bag:
                    word_bag.append(word)
    # Get X (the inputs): one row per line, one column per word of the bag
    X = np.zeros(shape=(m, len(word_bag)))
    for i, data_unit in enumerate(data_units):
        phrases.append(data_unit)
        # Count the occurrences of each word of the line
        word_count = {}
        for word in data_unit.split('\t')[1].split(' '):
            word_count[word] = word_count.get(word, 0.) + 1.
        for word in word_count:
            # Words that are not in the bag are ignored
            if word in word_bag:
                X[i, word_bag.index(word)] = word_count[word]
    if get_w_b:
        return word_bag, X, Y, phrases
    return X, Y, phrases
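One design note: get_data looks words up with "word in word_bag" and word_bag.index(word), both of which scan the whole list and become slow as the vocabulary grows. A possible alternative, sketched below with hypothetical helper names (build_index, vectorize) and plain Python lists instead of NumPy arrays, keeps a dictionary that maps each word to its column. The sample lines are made up.

```python
def build_index(lines):
    # Map each distinct word to a column number, in order of first appearance
    index = {}
    for line in lines:
        for word in line.split('\t')[1].split():
            if word not in index:
                index[word] = len(index)
    return index


def vectorize(lines, index):
    # Return (X, Y): the word counts per line and the 0/1 labels
    X, Y = [], []
    for line in lines:
        label, text = line.split('\t')
        row = [0] * len(index)
        for word in text.split():
            column = index.get(word)  # None for words unseen in training
            if column is not None:
                row[column] += 1
        X.append(row)
        Y.append(int(label))
    return X, Y


train_lines = ["0\tthe movie was boring", "1\tthe book is just awesome"]
test_lines = ["1\tthe movie is awesome indeed"]

index = build_index(train_lines)
X_train, Y_train = vectorize(train_lines, index)
X_test, Y_test = vectorize(test_lines, index)
print(X_test, Y_test)
```

As in get_data, the index is built from the training lines only and then reused for the test lines, so a test word such as "indeed" that never appeared in training contributes nothing.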
Model training and back-propagation
The function train_model takes as inputs:
m: the number of training units
ITERATIONS: the number of iterations used to optimize the cost
LEARNING_RATE: the value that scales the updates of w and b based on their derivatives
Y: the outputs (labels)
X: the inputs, with one column per training unit
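import math
from random import randint

import numpy as np


def train_model(m, ITERATIONS, LEARNING_RATE, Y, X):
    # X has one column per training unit: shape (number of features, m)
    Y = Y.T
    # Weights vector initialization
    w = np.zeros((X.shape[0], 1))
    # Weights random initialization, between -0.1 and 0.1
    for j in range(X.shape[0]):
        w[j, 0] = randint(-10, 10) * 0.01
    # Bias initialization
    b = 0
    # Cost initialization
    cost = 0
    for i in range(ITERATIONS):
        # Forward step: predicted outputs, shape (1, m)
        A = sigmoid(np.dot(w.T, X) + b)
        # Cross-entropy cost
        cost = (-1. / m) * np.sum((Y * np.log(A)) + (1. - Y) * np.log(1. - A))
        print("Iteration: " + str(i) + ", cost: " + str(cost))
        # Stop early if the cost is no longer a number
        if math.isnan(cost):
            return w, b
        # Derivatives of the cost with respect to w and b
        dw = (1. / m) * np.dot(X, (A - Y).T)
        db = (1. / m) * np.sum(A - Y)
        # Update of w and b
        w = w - dw * LEARNING_RATE
        b = b - db * LEARNING_RATE
    return w, b


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))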
Training and testing the model
The functions described above are the most important ones. Other functions, which make predictions using the trained model and measure accuracy, can be found in the source code repository.
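For orientation, a prediction step could look like the sketch below. It is an assumption about how such a function might be written, not a copy of the repository code: the name predict, the 0.5 threshold and the parameter values are illustrative, and X is expected in the same layout as in train_model, with one column per sentence.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(w, b, X, threshold=0.5):
    # Predicted output for each column of X, shape (1, m)
    A = sigmoid(np.dot(w.T, X) + b)
    # 1 (positive) when the output reaches the threshold, else 0 (negative)
    return (A >= threshold).astype(int)


# Hypothetical trained parameters for a bag of three words
w = np.array([[2.0], [-3.0], [0.5]])
b = -0.5
# Two sentences as columns, each holding the counts of the three words
X = np.array([[1, 0],
              [0, 2],
              [1, 0]])
predictions = predict(w, b, X)
print(predictions)
```

Comparing such predictions with the true labels gives the counts of true and false positives and negatives from which accuracy statistics are derived.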
To train and test the model, run the following command:
python train_and_test.py -t ../sentiement_analysis.txt -i 1000 -r 1 -s 10000
The model will take some time to train, depending on the number of iterations you choose. The training/testing data split coefficient is 0.7 and can be modified in the file train_and_test.py. Once the model is trained and tested, you get accuracy statistics such as the following:
Data inputs number: 2124
True positives: 1124.0
True negatives: 967.0
False positives: 18.0
False negatives: 15.0
Accuracy: 0.984463276836
Recall: 0.986830553117
Precision: 0.984238178634
F1: 0.985532661114
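These four ratios follow directly from the confusion counts. The short sketch below recomputes them from the numbers above; the function name compute_metrics is hypothetical and is not taken from the repository.

```python
def compute_metrics(tp, tn, fp, fn):
    # Standard definitions based on the confusion counts
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    recall = tp / float(tp + fn)
    precision = tp / float(tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1


accuracy, recall, precision, f1 = compute_metrics(tp=1124, tn=967, fp=18, fn=15)
print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("F1:", f1)
```

Recall is the share of actual positive sentences that were found, precision is the share of predicted positives that were correct, and F1 is their harmonic mean.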