Friday, January 19, 2018

NLP Sentiment analysis with a single perceptron from scratch

A single-perceptron text classifier applied to sentiment analysis


Text classification is one of the basics of natural language processing and can be used to solve a wide range of problems, for example sentiment analysis and social media data analysis. In this article we will build a text classifier from scratch by training a single perceptron model on a publicly available labeled text dataset.

Dataset

The dataset we will use to train and test our classifier is publicly available at: https://www.kaggle.com/c/si650winter11/data
It contains sentences labeled 0 or 1, such as the following:


 0    Brokeback Mountain was boring.
 1    The Da Vinci Code book is just awesome.


0 is the label for negative sentiment and 1 is the label for positive sentiment.

Source code 

The source code is available in this Git repository: https://github.com/slrbl/perceptron-text-classification-from-scracth

Feature extraction & the bag of words

Unlike structured datasets, text data does not come with an explicit list of features we can rely on for feature engineering and model building. To extract features dynamically from a text we will use a method called bag of words. Given an input text, the bag of words algorithm builds a list whose elements are the distinct words found in the text.

Let's consider the following two sentences:

We are excited about this classifier.
We will test the classifier with real data.

The bag of words corresponding to the above text is the following list:

['we', 'are', 'excited', 'about', 'this', 'classifier', 'will', 'test', 'the', 'with', 'real', 'data']

Using this list, we count the occurrences of each word in other texts from the same context; the resulting vectors are the features:

Text1: The classifier we will build is useful
Text2: We will test the model with real data

Text1 features: [ 1 0 0 0 0 1 1 0 1 0 0 0 ]
Text2 features: [ 1 0 0 0 0 0 1 1 1 1 1 1 ]

For our classifier, we will build one big bag containing all the words from all the text inputs and use it to extract the features from each single input.
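
To make the idea concrete, here is a minimal sketch of the bag of words step applied to the two example sentences. The names build_word_bag and vectorize are illustrative only; in the actual classifier this work is done by the get_data function shown further down.

 def build_word_bag(texts):
    # Collect every distinct word, in order of first appearance
    bag = []
    for text in texts:
        for word in text.lower().replace('.', '').split():
            if word not in bag:
                bag.append(word)
    return bag

 def vectorize(text, bag):
    # Count how many times each bag word occurs in the text
    features = [0] * len(bag)
    for word in text.lower().replace('.', '').split():
        if word in bag:
            features[bag.index(word)] += 1
    return features

 bag = build_word_bag(['We are excited about this classifier.',
                       'We will test the classifier with real data.'])
 print(bag)
 # ['we', 'are', 'excited', 'about', 'this', 'classifier', 'will', 'test', 'the', 'with', 'real', 'data']
 print(vectorize('The classifier we will build is useful', bag))
 # [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]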

The perceptron

A perceptron is the fundamental unit of a neural network. Given an input X and a function f(X) = w.X + b that produces an output Y = f(X), a perceptron can be trained to learn the coefficients w and b that best predict Y.


A single Perceptron 

As you may have already understood, our goal is to find the weights w that best predict Y. To do so we will need labeled data (supervised learning) and to implement an algorithm called back-propagation.

Back-propagation

As its name indicates, back-propagation is an algorithm that optimises the weights to fit the outputs by performing backward calculation passes. A good explanation of back-propagation and the mathematical theory behind it can be found elsewhere, but to keep things as simple as possible the calculation steps can be condensed as follows:
1. Initialise the weights at zero or at random values
2. Make a forward calculation pass to get the sigmoid output (the predicted Y)
3. Calculate the cost function (error), which measures how far the predicted output is from the actual output (label)
4. Calculate the derivatives dw and db
5. Calculate the new w and b
6. Go back to step 2 until the cost function is as close as possible to 0
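
Concretely, for the sigmoid unit used here, these steps correspond to the following formulas (they are exactly what the train_model function shown below implements, with m the number of training examples and α the learning rate):

$$A = \sigma(w^{T}X + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$cost = -\frac{1}{m}\sum \big( Y\log A + (1 - Y)\log(1 - A) \big)$$
$$dw = \frac{1}{m}\, X\,(A - Y)^{T}, \qquad db = \frac{1}{m}\sum (A - Y)$$
$$w \leftarrow w - \alpha\, dw, \qquad b \leftarrow b - \alpha\, db$$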

Implementation

The goal of our implementation is to build a text classifier able to distinguish between a negative and a positive sentence.

Example:
The movie was very long, it was one of my worst experiences: Negative 
Machine learning is awesome: Positive

Data preparation 

The two functions clean_text and shuffle_data are in charge of preparing the raw text data for model training.
clean_text takes as inputs the raw text data and a remove_list. It removes the characters present in remove_list to reduce noise in the training data and lowercases each line.
shuffle_data takes as input the output of clean_text (or raw text data) and a shuffle factor. It randomly swaps pairs of lines shuffle_factor times.


 from random import randint

 def clean_text(data, remove_list):
    # Remove noisy characters and lowercase every line of the raw data
    cleaned_text = []
    for data_unit in data:
        for char in remove_list:
            data_unit = data_unit.replace(char, '')
        data_unit = data_unit.lower()
        cleaned_text.append(data_unit)
    return cleaned_text

 def shuffle_data(data_list, shuffle_factor):
    # Randomly swap two lines, shuffle_factor times
    m = len(data_list)
    for i in range(shuffle_factor):
        rndIndex = randint(0, m - 1)
        rndIndex2 = randint(0, m - 1)
        tmp = data_list[rndIndex]
        data_list[rndIndex] = data_list[rndIndex2]
        data_list[rndIndex2] = tmp
    return data_list
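
A minimal usage sketch: the file name, the characters in remove_list and the shuffle factor below are illustrative assumptions, not necessarily the exact values used in train_and_test.py.

 # Hypothetical example: read the labeled file, clean it and shuffle it
 with open('sentiment_analysis.txt') as input_file:
     raw_lines = input_file.readlines()

 remove_list = ['.', ',', '!', '?', '"']  # assumed noise characters
 cleaned = clean_text(raw_lines, remove_list)
 shuffled = shuffle_data(cleaned, 10000)  # 10000 random swaps
 print(shuffled[0])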


Getting the features: the word bag 

The aim of the function get_data is to convert the text data prepared by clean_text and shuffle_data into numeric data using the bag of words. It takes as inputs the data_list produced by the previously described functions and an optional word_bag. If word_bag is provided, the function uses it to extract the features from the input data (this is the case when get_data is applied to test data).

If it is not provided, the function generates it from the input data (this is the case with training data). In summary, the word_bag generated from the training data is reused to extract the features from the testing data.


 import numpy as np

 def get_data(data_list, start, end, word_bag):
    phrases = []
    # Build the word bag only if none was provided (training mode)
    get_w_b = False
    if word_bag is None:
        word_bag = []
        get_w_b = True
    i = 0
    m = end - start
    Y = np.zeros(shape=(m, 1))
    for data_unit in data_list[start:end]:
        splitted_data_unit = data_unit.split('\t')
        # Get Y (the outputs)
        Y[i, 0] = int(splitted_data_unit[0])
        words = splitted_data_unit[1].split(' ')
        if get_w_b:
            for word in words:
                if word not in word_bag:
                    word_bag.append(word)
        i += 1
    # Get X (the inputs): one row of word counts per phrase
    X = np.zeros(shape=(len(Y), len(word_bag)))
    i = 0
    for data_unit in data_list[start:end]:
        phrases.append(data_unit)
        splitted_data_unit = data_unit.split('\t')[1].split(' ')
        x = [0] * len(word_bag)
        word_count = {}
        for word in splitted_data_unit:
            # In test mode, ignore words that are not in the training word bag
            if not get_w_b and word not in word_bag:
                continue
            if word not in word_count:
                word_count[word] = 1.
            else:
                word_count[word] += 1.
        for word in word_count:
            if word in word_bag:
                x[word_bag.index(word)] = word_count[word]
        X[i] = x
        i += 1
    if get_w_b:
        return word_bag, X, Y, phrases
    else:
        return X, Y, phrases
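
A minimal sketch of how get_data might be called with a 0.7 train/test split; the variable names below are illustrative and not necessarily those used in train_and_test.py.

 # Hypothetical usage: build the word bag on the training slice,
 # then reuse it to vectorize the test slice
 split = int(0.7 * len(shuffled))
 word_bag, X_train, Y_train, train_phrases = get_data(shuffled, 0, split, None)
 X_test, Y_test, test_phrases = get_data(shuffled, split, len(shuffled), word_bag)
 print(X_train.shape, X_test.shape)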


Model training and back-propagation 

The function train_model takes as inputs:
m: the number of training examples
ITERATIONS: the number of iterations used to optimise the cost
LEARNING_RATE: the value used to update w and b based on their derivatives
Y: the outputs (labels)
X: the inputs


  
 import math
 import numpy as np
 from random import randint

 def train_model(m, ITERATIONS, LEARNING_RATE, Y, X):
    # Transpose so that each column is one training example
    X = X.T
    Y = Y.T
    # Weights vector initialization with small random values
    w = np.zeros((X.shape[0], 1))
    for j in range(X.shape[0]):
        w[j, 0] = randint(-10, 10) * 0.01
    # Bias initialization
    b = 0
    # Cost initialization
    cost = 0
    for i in range(ITERATIONS):
        # Forward pass: sigmoid activation
        A = sigmoid(np.dot(w.T, X) + b)
        # Cross-entropy cost
        cost = (-1. / m) * np.sum((Y * np.log(A)) + (1. - Y) * np.log(1. - A))
        print("Iteration: " + str(i) + ", cost: " + str(cost))
        if math.isnan(cost):
            return w, b
        # Backward pass: gradients of the cost with respect to w and b
        dw = (1. / m) * np.dot(X, (A - Y).T)
        db = (1. / m) * np.sum(A - Y)
        # Gradient descent update
        w = w - dw * LEARNING_RATE
        b = b - db * LEARNING_RATE
    return w, b

 def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))




Training and testing the model

The functions described above are the most important ones. Other functions that make predictions using the trained model and measure accuracy can be found in the source code repository.
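
As a rough illustration, a prediction step with the trained w and b could look like the sketch below. predict_sentiment is a hypothetical helper written for this article; the actual prediction and accuracy functions in the repository may differ.

 def predict_sentiment(sentence, w, b, word_bag, remove_list):
    # Hypothetical helper: vectorize one sentence with the training word bag
    # and apply the learned sigmoid unit (probability > 0.5 means positive)
    cleaned = clean_text([sentence], remove_list)[0]
    x = np.zeros((len(word_bag), 1))
    for word in cleaned.split(' '):
        if word in word_bag:
            x[word_bag.index(word), 0] += 1.
    probability = sigmoid(np.dot(w.T, x) + b)[0, 0]
    return 1 if probability > 0.5 else 0

 print(predict_sentiment('Machine learning is awesome', w, b, word_bag, remove_list))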

To train and test the model, launch the following command:



 python train_and_test.py -t ../sentiement_analysis.txt -i 1000 -r 1 -s 10000


The model will take some time to train, depending on the number of iterations you choose. The training/testing data split coefficient is 0.7 and can be modified in the file train_and_test.py. Once trained and tested, you get accuracy statistics like the following:


 Data inputs number: 2124
 True positives: 1124.0
 True negatives: 967.0
 False positives: 18.0
 False negatives: 15.0
 Accuracy: 0.984463276836
 Recall: 0.986830553117
 Precision: 0.984238178634
 F1: 0.985532661114
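
For reference, these statistics follow the standard confusion-matrix definitions; here is a minimal sketch (written independently of the repository's own accuracy code) that reproduces the values above from the raw counts.

 def classification_metrics(tp, tn, fp, fn):
    # Standard definitions of accuracy, recall, precision and F1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

 print(classification_metrics(1124., 967., 18., 15.))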

