Friday, January 19, 2018

NLP Sentiment analysis with a single perceptron from scratch

A single-perceptron text classifier applied to sentiment analysis


Text classification is one of the basics of natural language processing and can be used to solve a wide range of problems, for example sentiment analysis and social media data analysis. In this article we will build a text classifier from scratch by training a single perceptron model on a publicly available labeled text dataset.

Dataset

The dataset we will use to train and test our classifier is publicly available at: https://www.kaggle.com/c/si650winter11/data
It contains sentences labeled 0 or 1, such as the following:


 0    Brokeback Mountain was boring.
 1    The Da Vinci Code book is just awesome.


0 is the label for negative sentiment and 1 is the label for positive sentiment.

Source code 

The source code is available in this Git repository: https://github.com/slrbl/perceptron-text-classification-from-scracth

Feature extraction & the bag of words

Unlike structured datasets, text data does not come with an explicit list of features we can rely on for feature engineering and model building. To extract features dynamically from a text we will use a method called bag of words. Given an input text, the bag of words algorithm builds a list whose elements are the distinct words found in the text.

Let's consider the following two sentences:

We are excited about this classifier.
We will test the classifier with real data.

The bag of words corresponding to the above text is the following list:

['we', 'are', 'excited', 'about', 'this', 'classifier', 'will', 'test', 'the', 'with', 'real', 'data']

Using this list, we count the occurrences of each word in other texts from the same context; the resulting vectors are the features:

Text1: The classifier we will build is useful
Text2: We will test the model with real data

Text1 features: [ 1 0 0 0 0 1 1 0 1 0 0 0 ]
Text2 features: [ 1 0 0 0 0 0 1 1 1 1 1 1 ]

For our classifier, we will build one big bag containing all the words from all the text inputs and use it to extract the features from each single input.
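
To make the idea concrete, here is a minimal sketch of the bag of words step applied to the two example sentences. The names build_word_bag and vectorize are illustrative only; in the actual classifier this work is done by the get_data function shown further down.

 def build_word_bag(texts):
    # Collect every distinct word, in order of first appearance
    bag = []
    for text in texts:
        for word in text.lower().replace('.', '').split():
            if word not in bag:
                bag.append(word)
    return bag

 def vectorize(text, bag):
    # Count how many times each bag word occurs in the text
    features = [0] * len(bag)
    for word in text.lower().replace('.', '').split():
        if word in bag:
            features[bag.index(word)] += 1
    return features

 bag = build_word_bag(['We are excited about this classifier.',
                       'We will test the classifier with real data.'])
 print(bag)
 # ['we', 'are', 'excited', 'about', 'this', 'classifier', 'will', 'test', 'the', 'with', 'real', 'data']
 print(vectorize('The classifier we will build is useful', bag))
 # [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]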

The perceptron

A perceptron is the fundamental unit of a neural network. Given an input X and a function f(X) = w.X + b that produces an output Y = f(X), a perceptron can be trained to learn the coefficients w and b that best predict Y.


A single Perceptron 

As you may have already understood, our goal is to find the weights w that best predict Y. To do so we will need labeled data (supervised learning) and to implement an algorithm called back-propagation.

Back-propagation

As its name indicates, back-propagation is an algorithm that optimises the weights to fit the outputs by performing backward calculation passes. A good explanation of back-propagation and the mathematical theory behind it can be found elsewhere, but to keep things as simple as possible the calculation steps can be condensed as follows:
1. Initialise the weights at zero or at random values
2. Make a forward calculation pass to get the sigmoid output (the predicted Y)
3. Calculate the cost function (error), which measures how far the predicted output is from the actual output (label)
4. Calculate the derivatives dw and db
5. Calculate the new w and b
6. Go back to step 2 until the cost function is as close as possible to 0
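
Concretely, for the sigmoid unit used here, these steps correspond to the following formulas (they are exactly what the train_model function shown below implements, with m the number of training examples and α the learning rate):

$$A = \sigma(w^{T}X + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$cost = -\frac{1}{m}\sum \big( Y\log A + (1 - Y)\log(1 - A) \big)$$
$$dw = \frac{1}{m}\, X\,(A - Y)^{T}, \qquad db = \frac{1}{m}\sum (A - Y)$$
$$w \leftarrow w - \alpha\, dw, \qquad b \leftarrow b - \alpha\, db$$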

Implementation

The goal of our implementation is to build a text classifier able to distinguish between a negative and a positive sentence.

Example:
The movie was very long, it was one of my worst experiences: Negative 
Machine learning is awesome: Positive

Data preparation 

The two functions clean_text and shuffle_data are in charge of preparing the raw text data for model training.
clean_text takes as inputs the raw text data and a remove_list. It removes the characters present in remove_list to reduce noise in the training data and lowercases each line.
shuffle_data takes as input the output of clean_text (or raw text data) and a shuffle factor. It randomly swaps pairs of lines shuffle_factor times.


 from random import randint

 def clean_text(data, remove_list):
    # Remove noisy characters and lowercase every line of the raw data
    cleaned_text = []
    for data_unit in data:
        for char in remove_list:
            data_unit = data_unit.replace(char, '')
        data_unit = data_unit.lower()
        cleaned_text.append(data_unit)
    return cleaned_text

 def shuffle_data(data_list, shuffle_factor):
    # Randomly swap two lines, shuffle_factor times
    m = len(data_list)
    for i in range(shuffle_factor):
        rndIndex = randint(0, m - 1)
        rndIndex2 = randint(0, m - 1)
        tmp = data_list[rndIndex]
        data_list[rndIndex] = data_list[rndIndex2]
        data_list[rndIndex2] = tmp
    return data_list
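
A minimal usage sketch: the file name, the characters in remove_list and the shuffle factor below are illustrative assumptions, not necessarily the exact values used in train_and_test.py.

 # Hypothetical example: read the labeled file, clean it and shuffle it
 with open('sentiment_analysis.txt') as input_file:
     raw_lines = input_file.readlines()

 remove_list = ['.', ',', '!', '?', '"']  # assumed noise characters
 cleaned = clean_text(raw_lines, remove_list)
 shuffled = shuffle_data(cleaned, 10000)  # 10000 random swaps
 print(shuffled[0])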


Getting the features: the word bag 

The aim of the function get_data is to convert the text data prepared by clean_text and shuffle_data into numeric data using the bag of words. It takes as inputs the data_list produced by the previously described functions and an optional word_bag. If word_bag is provided, the function uses it to extract the features from the input data (this is the case when get_data is applied to test data).

If it is not provided, the function generates it from the input data (this is the case with training data). In summary, the word_bag generated from the training data is reused to extract the features from the testing data.


 import numpy as np

 def get_data(data_list, start, end, word_bag):
    phrases = []
    # Build the word bag only if none was provided (training mode)
    get_w_b = False
    if word_bag is None:
        word_bag = []
        get_w_b = True
    i = 0
    m = end - start
    Y = np.zeros(shape=(m, 1))
    for data_unit in data_list[start:end]:
        splitted_data_unit = data_unit.split('\t')
        # Get Y (the outputs)
        Y[i, 0] = int(splitted_data_unit[0])
        words = splitted_data_unit[1].split(' ')
        if get_w_b:
            for word in words:
                if word not in word_bag:
                    word_bag.append(word)
        i += 1
    # Get X (the inputs): one row of word counts per phrase
    X = np.zeros(shape=(len(Y), len(word_bag)))
    i = 0
    for data_unit in data_list[start:end]:
        phrases.append(data_unit)
        splitted_data_unit = data_unit.split('\t')[1].split(' ')
        x = [0] * len(word_bag)
        word_count = {}
        for word in splitted_data_unit:
            # In test mode, ignore words that are not in the training word bag
            if not get_w_b and word not in word_bag:
                continue
            if word not in word_count:
                word_count[word] = 1.
            else:
                word_count[word] += 1.
        for word in word_count:
            if word in word_bag:
                x[word_bag.index(word)] = word_count[word]
        X[i] = x
        i += 1
    if get_w_b:
        return word_bag, X, Y, phrases
    else:
        return X, Y, phrases
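
A minimal sketch of how get_data might be called with a 0.7 train/test split; the variable names below are illustrative and not necessarily those used in train_and_test.py.

 # Hypothetical usage: build the word bag on the training slice,
 # then reuse it to vectorize the test slice
 split = int(0.7 * len(shuffled))
 word_bag, X_train, Y_train, train_phrases = get_data(shuffled, 0, split, None)
 X_test, Y_test, test_phrases = get_data(shuffled, split, len(shuffled), word_bag)
 print(X_train.shape, X_test.shape)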


Model training and back-propagation 

The function train_model takes as inputs:
m: the number of training examples
ITERATIONS: the number of iterations used to optimise the cost
LEARNING_RATE: the value used to update w and b based on their derivatives
Y: the outputs (labels)
X: the inputs


  
 import math
 import numpy as np
 from random import randint

 def train_model(m, ITERATIONS, LEARNING_RATE, Y, X):
    # Transpose so that each column is one training example
    X = X.T
    Y = Y.T
    # Weights vector initialization with small random values
    w = np.zeros((X.shape[0], 1))
    for j in range(X.shape[0]):
        w[j, 0] = randint(-10, 10) * 0.01
    # Bias initialization
    b = 0
    # Cost initialization
    cost = 0
    for i in range(ITERATIONS):
        # Forward pass: sigmoid activation
        A = sigmoid(np.dot(w.T, X) + b)
        # Cross-entropy cost
        cost = (-1. / m) * np.sum((Y * np.log(A)) + (1. - Y) * np.log(1. - A))
        print("Iteration: " + str(i) + ", cost: " + str(cost))
        if math.isnan(cost):
            return w, b
        # Backward pass: gradients of the cost with respect to w and b
        dw = (1. / m) * np.dot(X, (A - Y).T)
        db = (1. / m) * np.sum(A - Y)
        # Gradient descent update
        w = w - dw * LEARNING_RATE
        b = b - db * LEARNING_RATE
    return w, b

 def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))




Training and testing the model

The functions described above are the most important ones. Other functions that make predictions using the trained model and measure accuracy can be found in the source code repository.
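
As a rough illustration, a prediction step with the trained w and b could look like the sketch below. predict_sentiment is a hypothetical helper written for this article; the actual prediction and accuracy functions in the repository may differ.

 def predict_sentiment(sentence, w, b, word_bag, remove_list):
    # Hypothetical helper: vectorize one sentence with the training word bag
    # and apply the learned sigmoid unit (probability > 0.5 means positive)
    cleaned = clean_text([sentence], remove_list)[0]
    x = np.zeros((len(word_bag), 1))
    for word in cleaned.split(' '):
        if word in word_bag:
            x[word_bag.index(word), 0] += 1.
    probability = sigmoid(np.dot(w.T, x) + b)[0, 0]
    return 1 if probability > 0.5 else 0

 print(predict_sentiment('Machine learning is awesome', w, b, word_bag, remove_list))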

To train and test the model, launch the following command:



 python train_and_test.py -t ../sentiement_analysis.txt -i 1000 -r 1 -s 10000


The model will take some time to train, depending on the number of iterations you choose. The training/testing data split coefficient is 0.7 and can be modified in the file train_and_test.py. Once trained and tested, you get accuracy statistics like the following:


 Data inputs number: 2124
 True positives: 1124.0
 True negatives: 967.0
 False positives: 18.0
 False negatives: 15.0
 Accuracy: 0.984463276836
 Recall: 0.986830553117
 Precision: 0.984238178634
 F1: 0.985532661114
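
For reference, these statistics follow the standard confusion-matrix definitions; here is a minimal sketch (written independently of the repository's own accuracy code) that reproduces the values above from the raw counts.

 def classification_metrics(tp, tn, fp, fn):
    # Standard definitions of accuracy, recall, precision and F1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

 print(classification_metrics(1124., 967., 18., 15.))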

