Tuesday, March 7, 2017

Web attacks detection using machine learning

This article describes an experiment in applying classifiers to detect intrusions and suspicious activities in HTTP server logs. The classifiers used are Logistic Regression and a Decision Tree Classifier, both implemented with the sklearn Python library.

1. Used Datasets

The datasets used are available at http://www.secrepo.com/self.logs/. For this experiment, I used the October 2016 to January 2017 access logs as training data and the February 2017 access logs for testing.

2. Data preparation

Classification is a machine learning method. Classifiers can be built with both supervised and unsupervised learning algorithms. In this article we will implement supervised classifiers, which means they must be trained on labeled data before being used to make predictions. The training data therefore has to be labeled, and we have to choose the features we will use for prediction. Data preparation consists of extracting the desired features from the raw HTTP server log files and labeling each unit of data with one of two labels: 1 if it is considered an attack and 0 for normal behavior.

a. Features extraction

The following features are chosen:
  • HTTP return code 
  • URL length
  • Number of parameters in the query 
These features are extracted from the raw log file using the following function, which takes the raw log file name as input and returns a dictionary of features.


 #Retrieve data from an HTTP log file (access_log)
 import re

 def extract_data(log_file_name):
        regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'
        data={}
        log_file=open(log_file_name,'r')
        for log_line in log_file:
                match=re.match(regex,log_line)
                if match is None:
                        #Skip lines that do not match the expected format
                        continue
                log_line=match.groups()
                size=str(log_line[4]).rstrip('\n')
                return_code=log_line[3]
                url=log_line[2]
                param_number=len(url.split('&'))
                url_length=len(url)
                if '-' in size:
                        size=0
                else:
                        size=int(size)
                if int(return_code)>0:
                        charcs={}
                        charcs['size']=size
                        charcs['param_number']=param_number
                        charcs['length']=url_length
                        charcs['return_code']=int(return_code)
                        data[url]=charcs
        log_file.close()
        return data
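
To make the feature extraction concrete, here is the same regex applied to a single made-up log line (the IP address, date and user agent below are invented for illustration):

```python
import re

# The same regex as in extract_data, applied to one hypothetical log line
regex = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'
log_line = '1.2.3.4 - - [01/Feb/2017:10:00:00 +0000] "GET /self.logs/?C=D%3BO%3DA HTTP/1.1" 200 1234 "-" "Mozilla/5.0"'

groups = re.match(regex, log_line).groups()
url = groups[2]                     # 'GET /self.logs/?C=D%3BO%3DA HTTP/1.1'
return_code = int(groups[3])        # 200
url_length = len(url)               # 36
param_number = len(url.split('&'))  # 1, since the query contains no '&'

print(url_length, param_number, return_code)
```

These are exactly the values of the first sample row shown in the "Ready data example" section below.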



b. Data labeling

Labeling consists of assigning a label to each unit of data; this label indicates whether the corresponding log line is considered an attack. Labeling should normally be done manually by an experienced security engineer. In this example it is done automatically, using a function that looks for specific patterns in each URL and decides whether it represents an attack. This automation is done purely for experimental purposes; it is really better to label your data manually.

The labeling function is the following:


 def label_data(data,labeled_data):
        patterns=['honeypot','%3b','xss','sql','union','%3c','%3e','eval']
        for w in data:
                attack='0'
                if any(pattern in w.lower() for pattern in patterns):
                        attack='1'
                data_row=str(data[w]['length'])+','+str(data[w]['param_number'])+','+str(data[w]['return_code'])+','+attack+','+w+'\n'
                labeled_data.write(data_row)
        print str(len(data))+' rows have been successfully saved'
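
The labeling rule itself can be isolated into a small helper. The sketch below (label_url is a name introduced here for illustration, not part of the original code) labels two example request lines:

```python
# Pattern-based labeling rule: '1' if the URL contains a suspicious
# pattern, '0' otherwise (same pattern list as in label_data).
patterns = ['honeypot', '%3b', 'xss', 'sql', 'union', '%3c', '%3e', 'eval']

def label_url(url):
    # The comparison is case-insensitive, as in label_data
    return '1' if any(p in url.lower() for p in patterns) else '0'

print(label_url('GET /self.logs/?C=D%3BO%3DA HTTP/1.1'))              # '1' (contains '%3b')
print(label_url('GET /self.logs/access.log.2015-12-19.gz HTTP/1.1'))  # '0'
```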


c. Ready data example

The following is a sample of the data, ready to be used to train our classifiers. Have a look at the legend below for a clearer idea.


 36,1,200,1,GET /self.logs/?C=D%3BO%3DA HTTP/1.1
 47,1,200,0,GET /self.logs/error.log.2016-11-05.gz HTTP/1.1
 48,1,200,0,GET /self.logs/access.log.2015-12-19.gz HTTP/1.1
 38,1,404,0,GET /access.log.2015-03-03.gz HTTP/1.1
 47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1
 47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.0
 48,1,200,0,GET /self.logs/access.log.2015-06-26.gz HTTP/1.1
 47,1,200,0,GET /self.logs/error.log.2016-03-09.gz HTTP/1.1

 LEGEND (fields in order):

 URL LENGTH
 PARAMETERS NUMBER
 HTTP RETURN CODE
 LABEL (1: attack, 0: no attack)
 REQUEST (method, URL and protocol)


3. Decision Tree Classifier

Decision Tree is a classification algorithm. Like all classifiers, a decision tree needs real-world training data to make predictions. The data we prepared using the functions described in the previous sections will be used as training data; the testing data is generated with the same functions.

In this experiment I used the Decision Tree implementation available in the sklearn Python machine learning library. Here is the implementation of the Decision Tree classifier:


 from utilities import *
 from sklearn import tree

 #Get training features and labels
 training_features,training_labels=get_data_details(traning_data)

 #Get testing features and labels
 testing_features,testing_labels=get_data_details(testing_data)

 ### DECISION TREE CLASSIFIER
 print "\n\n=-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-\n"

 #Instantiate the classifier
 attack_classifier=tree.DecisionTreeClassifier()

 #Train the classifier
 attack_classifier=attack_classifier.fit(training_features,training_labels)

 #Get predictions for the testing data
 predictions=attack_classifier.predict(testing_features)

 print "The precision of the Decision Tree Classifier is: "+str(get_occuracy(testing_labels,predictions,1))+"%"
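
get_occuracy() is defined in the project's utilities.py (linked in section 5) and is not reproduced here. Judging from the output in section 6 (32 predicted out of 43 real attacks giving 74.41%), the reported figure appears to be the ratio of predicted attacks to real attacks. A pure-Python sketch of such a metric, under that assumption (attack_ratio is a name introduced here, not the actual implementation):

```python
# Sketch of a metric in the spirit of get_occuracy (an assumption based on
# the printed output, NOT the real utilities.py code): the ratio of
# predicted attacks to real attacks, as a percentage.
def attack_ratio(real_labels, predicted_labels, attack_label=1):
    real = sum(1 for label in real_labels if label == attack_label)
    predicted = sum(1 for label in predicted_labels if label == attack_label)
    return round(100.0 * predicted / real, 2)

# Toy example: 4 real attacks, 3 of them predicted
real = [1, 0, 1, 1, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 0]
print(attack_ratio(real, pred))  # 75.0
```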



4. Logistic Regression Classifier

Like the Decision Tree, Logistic Regression is implemented in this experiment using sklearn:


 from utilities import *
 from sklearn import linear_model

 #Get training features and labels
 training_features,training_labels=get_data_details(traning_data)

 #Get testing features and labels
 testing_features,testing_labels=get_data_details(testing_data)

 ### LOGISTIC REGRESSION CLASSIFIER
 print "\n\n=-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-\n"

 attack_classifier = linear_model.LogisticRegression(C=1e5)
 attack_classifier.fit(training_features,training_labels)

 predictions=attack_classifier.predict(testing_features)
 print "The precision of the Logistic Regression Classifier is: "+str(get_occuracy(testing_labels,predictions,1))+"%"


5. Utilities

The functions get_data_details() and get_occuracy() used in both Logistic Regression and Decision Tree are implemented in a separate file: https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning/blob/master/utilities.py.
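
utilities.py is not reproduced in this article. As an illustration only, a CSV loader in the spirit of get_data_details() might look like the sketch below; the function name load_labeled_csv and its body are assumptions based on the labeled data format from section 2, not the actual implementation:

```python
# Hypothetical loader for the labeled data format
# (length,param_number,return_code,label,request).
def load_labeled_csv(lines):
    features, labels = [], []
    for line in lines:
        parts = line.strip().split(',')
        if len(parts) < 4:
            continue  # skip malformed lines
        # First three fields are the numeric features
        features.append([int(parts[0]), int(parts[1]), int(parts[2])])
        # Fourth field is the label (1: attack, 0: normal)
        labels.append(int(parts[3]))
    return features, labels

sample = [
    '36,1,200,1,GET /self.logs/?C=D%3BO%3DA HTTP/1.1',
    '47,1,200,0,GET /self.logs/error.log.2016-11-05.gz HTTP/1.1',
]
features, labels = load_labeled_csv(sample)
print(features)  # [[36, 1, 200], [47, 1, 200]]
print(labels)    # [1, 0]
```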

6. Testing and comparing the precision



 root@enigmater:~/intrusion-detection-with-machine-learning$ python ./decision-tree-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv
  

 =-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-

 Real number of attacks:43.0
 Predicted number of attacks:32.0
 The precision of the Decision Tree Classifier is: 74.41%

 root@enigmater:~/intrusion-detection-with-machine-learning$ python ./logistic-regression-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv

 =-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-

 Real number of attacks:43.0
 Predicted number of attacks:5.0
 The precision of the Logistic Regression Classifier is: 11.62%


7. Source code

All the source code and some testing data are available in https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning.


16 comments:

  1. hi sir,

    I require some clarification on the above blog. In the data labeling, you are using a Python script to detect the attacks. If we are able to detect attacks using a regex, why do we need ML for detection? Also, different algorithms produce different predictions, so how can we find the right algorithm for our own problem? Kindly suggest!

  2. Thanks for this comment Jenish!
    Yes, I recognize that it is a bit confusing. Actually, the aim of using those regexes is to generate labeled data to use for the model training. As explained above, you should label your data manually: instead of using the label_data function, just take your data set and manually assign the value 0 or 1 to each line based on your experience. Concerning the algorithm choice, you should experiment with both on your own data and decide which one fits your system better.

    1. Hello sir,
      is it possible to use machine learning for log correlation?

  3. Hello Sir,
    Really nice blog. I have been working on ML and deep learning stuff for quite some months; it's very interesting to see its applications for security.
    What should be the approach when there are 5 or more different types of log data, over 100 terabytes in total? How can I build only one model for different types of log data and predict intrusions using ML? Is it possible?

    1. Thanks Anish. Just to clarify, it's not about deep learning here; the models used are traditional ML algorithm implementations.
      If you really want to use a single model for different data types, your big challenge will be finding a way to unify the different log types into one format. In your place I would use one model per log type; I think this will give better accuracy.
      Concerning the data load question, it's more a "big data" than an ML problem. There are a lot of solutions that could help, such as a MapReduce technology (distributing the computation over multiple CPUs/GPUs) like Hadoop. I hope this answer helped you!

  4. Hello,
    Can you please provide a step by step guide to clone and execute the codes of this repository ?
    Thank you

  5. $git clone https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning
    The execution is explained in section 6.

    1. how to run the label.py file as in what are the arguments that are to be passed?

  6. Hello,
    What are the parameters passed for the execution of the decision_tree_classifier.py file? I'm coming across a ValueError every time I execute it. Please help.

  7. Hi,

    I am working on a classification problem.
    I have error logs generated by a system, which looks like :

    '''
    ERROR FC HBF12005xeN010 failed this_pd Skipping FC
    SUCCESS Testing FC SHB12005xeF010 SCSIDB_entry With_scsidb_entry SID_support NA Instant_Init_Unmap Device_Speed NA Device_Type NA Min_spare_Percentage NA Medium_Type NA Total_chnklet_count cages_supported NA
    Port last known topology change n private_loop
    '''


    I have about 10K log files each classified to one of the 15 different labels.

    I have tried word-embeddings by using word2vec and fasttext but the max accuracy achieved is only 0.65

    Can decision tree be useful here? Any advice would be really helpful!
    Thanks.

  8. Hey,
    Great Article.
    Could you kindly also give a guide on how to run it ?
    I'm currently struggling getting all the libraries required to run this on windows.
    Kind Regards

    1. Details about how to run it are now available in the Github repo README :)

  9. thank you for the article, it's so helpful. I'm a beginner in machine learning and I have a question please:
    if we eliminate the URL from the features, how can the model make predictions when the labeling is based on specific patterns in the URL? The model will only be trained on the url_length, the parameter number and the return code!
    I would appreciate it if you can answer me; I need it for my project. Thank you

  10. Hello, I am a student from China. I get a problem when running your code: the connection to the target computer is actively refused and it cannot be accessed. How can I solve it?

  11. Did your problem get solved? Can you please contact me here, because I need help in running it? eabuhadda@smail.ucas.edu.ps
