Tuesday, March 7, 2017

Web attack detection using machine learning

This article describes an experiment in applying classifiers to detect intrusions and suspicious activities in HTTP server logs. The classifiers used are Logistic Regression and a Decision Tree, both implemented with the scikit-learn (sklearn) Python library.

1. Used Datasets

The datasets are available at http://www.secrepo.com/self.logs/. For this experiment, I used the October 2016 to January 2017 access logs as training data and the February 2017 access logs for testing.

2. Data preparation

Classification is a machine learning method. Classifiers can be implemented using both supervised and unsupervised learning algorithms. In this article we implement supervised classifiers, which means they need to be trained with labeled data before being used to make predictions. Thus, the training data has to be labeled, and we have to choose the features we will use for prediction. Data preparation consists of extracting the wanted features from the raw HTTP server log files and labeling each unit of data with one of two labels: 1 if it is considered an attack and 0 for normal behavior.

a. Features extraction

The following features are chosen:
  • HTTP return code 
  • URL length
  • Number of parameters in the query 
These features are extracted from the raw log file using the following function, which takes the raw log file name as input and returns a dictionary of features keyed by request.


 import re

 # Retrieve data from an HTTP log file (access_log)
 def extract_data(log_file):
        # Common/combined log format: IP - - [date] "request" code size "referer" "user-agent"
        regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'
        data={}
        log_file=open(log_file,'r')
        for log_line in log_file:
                match=re.match(regex,log_line)
                if match is None:
                        # Skip lines that do not match the expected log format
                        continue
                log_line=match.groups()
                size=str(log_line[4]).rstrip('\n')
                return_code=log_line[3]
                url=log_line[2]
                param_number=len(url.split('&'))
                url_length=len(url)
                if '-' in size:
                        size=0
                else:
                        size=int(size)
                if (int(return_code)>0):
                        charcs={}
                        charcs['size']=int(size)
                        charcs['param_number']=int(param_number)
                        charcs['length']=int(url_length)
                        charcs['return_code']=int(return_code)
                        data[url]=charcs
        return data
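
As a quick usage sketch (the access log file name here is only an example), the function can be called as follows; each key of the returned dictionary is a request string and each value holds its features:

 # Usage sketch (hypothetical file name)
 data = extract_data('access_log')
 for url in data:
        print url, data[url]['length'], data[url]['param_number'], data[url]['return_code']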



b. Data labeling

Labeling consists of attributing a label to each unit of data; this label indicates whether the corresponding log line is considered an attack or not. Labeling should normally be done manually by an experienced security engineer; in this example it is done automatically by a function that looks for specific patterns in each URL and decides whether it is an attack. This automation is done only for experimental purposes; you will get much better results if you label your data manually.

The labeling function is the following:


 def label_data(data,labeled_data):
        # Patterns that flag a request as a probable attack
        patterns=['honeypot','%3b','xss','sql','union','%3c','%3e','eval']
        for w in data:
                attack='0'
                if any(pattern in w.lower() for pattern in patterns):
                        attack='1'
                # Write one CSV row per request: length,param_number,return_code,label,request
                data_row=str(data[w]['length'])+','+str(data[w]['param_number'])+','+str(data[w]['return_code'])+','+attack+','+w+'\n'
                labeled_data.write(data_row)
        print str(len(data))+' rows have been successfully saved to '+labeled_data.name
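
Chaining the two functions looks like the following sketch (file names are hypothetical): extract the features from a raw log, open a destination CSV file, then label and save the rows.

 # Usage sketch (hypothetical file names)
 data = extract_data('access_log')
 labeled_data = open('labeled_data.csv', 'w')
 label_data(data, labeled_data)
 labeled_data.close()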


c. Ready data example

The following is a sample of data ready to be used to train our classifiers. Have a look at the legend below for a clearer idea.


 36,1,200,1,GET /self.logs/?C=D%3BO%3DA HTTP/1.1
 47,1,200,0,GET /self.logs/error.log.2016-11-05.gz HTTP/1.1
 48,1,200,0,GET /self.logs/access.log.2015-12-19.gz HTTP/1.1
 38,1,404,0,GET /access.log.2015-03-03.gz HTTP/1.1
 47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1
 47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.0
 48,1,200,0,GET /self.logs/access.log.2015-06-26.gz HTTP/1.1

 47,1,200,0,GET /self.logs/error.log.2016-03-09.gz HTTP/1.1

 LEGEND (column order):

 URL LENGTH
 PARAMETERS NUMBER
 HTTP RETURN CODE
 LABEL (1: attack, 0: no attack)
 HTTP REQUEST


3. Decision Tree Classifier

Decision Tree is a classification algorithm. Like all supervised classifiers, a decision tree needs real-world training data before it can make predictions. The data we prepared using the functions described in the previous sections will be used as training data; the testing data is generated with the same functions.

In this experiment I used the Decision Tree implementation available in the scikit-learn Python machine learning library. Here is the classifier script:


 from utilities import *

 #Get training features and labels
 training_features,traning_labels=get_data_details(traning_data)

 #Get testing features and labels
 testing_features,testing_labels=get_data_details(testing_data)

 ### DECISION TREE CLASSIFIER
 print "\n\n=-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-\n"

 #Instantiate the classifier
 attack_classifier=tree.DecisionTreeClassifier()

 #Train the classifier
 attack_classifier=attack_classifier.fit(training_features,traning_labels)

 #Get predictions for the testing data
 predictions=attack_classifier.predict(testing_features)

 print "The precision of the Decision Tree Classifier is: "+str(get_occuracy(testing_labels,predictions,1))+"%"



4. Logistic Regression Classifier

Like the Decision Tree, the Logistic Regression classifier is implemented in this experiment using scikit-learn:


 from utilities import *

 #Get training features and labels
 training_features,traning_labels=get_data_details(traning_data)

 #Get testing features and labels
 testing_features,testing_labels=get_data_details(testing_data)

 ### LOGISTIC REGRESSION CLASSIFIER
 print "\n\n=-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-\n"

 attack_classifier = linear_model.LogisticRegression(C=1e5)
 attack_classifier.fit(training_features,traning_labels)

 predictions=attack_classifier.predict(testing_features)
 print "The precision of the Logistic Regression Classifier is: "+str(get_occuracy(testing_labels,predictions,1))+"%"


5. Utilities

The functions get_data_details() and get_occuracy() used in both Logistic Regression and Decision Tree are implemented in a separate file: https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning/blob/master/utilities.py.
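
For readers who just want to follow along, here is a minimal sketch of what these helpers might look like, based on the labeled data layout from section 2.c and the output shown in section 6. It is an assumption, not the actual content of utilities.py; refer to the repository for the real implementation.

 # utilities.py - hypothetical sketch, see the repository for the actual file
 import sys
 from sklearn import tree, linear_model  # re-exported for the classifier scripts

 # The classifier scripts receive the training and testing CSV files as arguments
 traning_data = sys.argv[1]   # spelling kept to match the scripts above
 testing_data = sys.argv[2]

 def get_data_details(csv_file):
        # Each row: url_length,param_number,return_code,label,request
        features = []
        labels = []
        for line in open(csv_file, 'r'):
                fields = line.strip().split(',')
                if len(fields) < 5:
                        continue
                features.append([int(fields[0]), int(fields[1]), int(fields[2])])
                labels.append(int(fields[3]))
        return features, labels

 def get_occuracy(real_labels, predictions, attack_label):
        # Compare the number of predicted attacks with the number of real attacks
        real = float(sum(1 for l in real_labels if int(l) == attack_label))
        predicted = float(sum(1 for p in predictions if int(p) == attack_label))
        print 'Real number of attacks:' + str(real)
        print 'Predicted number of attacks:' + str(predicted)
        return round(100 * predicted / real, 2)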

6. Testing and comparing the precision



 root@enigmater:~/intrusion-detection-with-machine-learning$ python ./decision-tree-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv
  

 =-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-

 Real number of attacks:43.0
 Predicted number of attacks:32.0
 The precision of the Decision Tree Classifier is: 74.41%

 root@enigmater:~/intrusion-detection-with-machine-learning$ python ./logistic-regression-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv

 =-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-

 Real number of attacks:43.0
 Predicted number of attacks:5.0
 The precision of the Logistic Regression Classifier is: 11.62 %
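
A note on how to read these figures: based on the output above, the reported percentage appears to be the ratio of predicted attacks to real attacks. This is an interpretation of the printed numbers, not a statement about the exact code in utilities.py:

 # Interpretation of the reported figures (an assumption, not the exact utilities.py code)
 print 100 * 32 / 43.0   # about 74.4%, reported as 74.41% for the Decision Tree
 print 100 * 5 / 43.0    # about 11.6%, reported as 11.62% for Logistic Regression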


7. Source code

All the source code and some testing data are available at https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning.


Monday, February 27, 2017

Get your first ELK stack up and running



1. Create a new folder to install the ELK stack



$ mkdir ELK
$ cd ELK 


2. Elasticsearch



$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.2.1.tar.gz
$ tar -xzvf elasticsearch-5.2.1.tar.gz 

Launch Elasticsearch
Elasticsearch will not run if you launch it as root:

$ cd elasticsearch-5.2.1
$ ./bin/elasticsearch
[2017-02-27T16:45:41,518][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root

Create a new user (elk, for example), log in to your system as that user, and launch Elasticsearch:

$ su elk
$ ./elasticsearch
[2017-02-27T16:47:51,296][WARN ][o.e.b.JNANatives         ] unable to install syscall filter: 
java.lang.UnsupportedOperationException: seccomp unavailable: requires kernel 3.5+ with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER compiled in
at org.elasticsearch.bootstrap.SystemCallFilter.linuxImpl(SystemCallFilter.java:350) ~[elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.SystemCallFilter.init(SystemCallFilter.java:638) ~[elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.JNANatives.tryInstallSystemCallFilter(JNANatives.java:215) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Natives.tryInstallSystemCallFilter(Natives.java:99) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:110) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:203) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.cli.SettingCommand.execute(SettingCommand.java:54) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:89) [elasticsearch-5.2.1.jar:5.2.1]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:82) [elasticsearch-5.2.1.jar:5.2.1]
[2017-02-27T16:47:51,382][INFO ][o.e.n.Node               ] [] initializing ...
[2017-02-27T16:47:51,442][INFO ][o.e.e.NodeEnvironment    ] [sCFy_Rh] using [1] data paths, mounts [[/home (/dev/sdb1)]], net usable_space [39.2gb], net total_space [96.8gb], spins? [possibly], types [ext3]
[2017-02-27T16:47:51,443][INFO ][o.e.e.NodeEnvironment    ] [sCFy_Rh] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-02-27T16:47:51,452][INFO ][o.e.n.Node               ] node name [sCFy_Rh] derived from node ID [sCFy_RhPR4OqWdbBSCk3IQ]; set [node.name] to override
[2017-02-27T16:47:51,454][INFO ][o.e.n.Node               ] version[5.2.1], pid[15866], build[db0d481/2017-02-09T22:05:32.386Z], OS[Linux/2.6.18-407.el5/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_66/25.66-b17]
[2017-02-27T16:47:52,192][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [aggs-matrix-stats]
[2017-02-27T16:47:52,192][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [ingest-common]
[2017-02-27T16:47:52,192][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [lang-expression]
[2017-02-27T16:47:52,192][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [lang-groovy]
[2017-02-27T16:47:52,193][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [lang-mustache]
[2017-02-27T16:47:52,193][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [lang-painless]
[2017-02-27T16:47:52,193][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [percolator]
[2017-02-27T16:47:52,193][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [reindex]
[2017-02-27T16:47:52,193][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [transport-netty3]
[2017-02-27T16:47:52,194][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] loaded module [transport-netty4]
[2017-02-27T16:47:52,195][INFO ][o.e.p.PluginsService     ] [sCFy_Rh] no plugins loaded
[2017-02-27T16:47:53,948][INFO ][o.e.n.Node               ] initialized
[2017-02-27T16:47:53,948][INFO ][o.e.n.Node               ] [sCFy_Rh] starting ...
[2017-02-27T16:47:54,104][INFO ][o.e.t.TransportService   ] [sCFy_Rh] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2017-02-27T16:47:54,110][WARN ][o.e.b.BootstrapChecks    ] [sCFy_Rh] max file descriptors [8192] for elasticsearch process is too low, increase to at least [65536]
[2017-02-27T16:47:54,110][WARN ][o.e.b.BootstrapChecks    ] [sCFy_Rh] max virtual memory areas vm.max_map_count [65536] is too low, increase to at least [262144]
[2017-02-27T16:47:54,110][WARN ][o.e.b.BootstrapChecks    ] [sCFy_Rh] system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
[2017-02-27T16:47:57,155][INFO ][o.e.c.s.ClusterService   ] [sCFy_Rh] new_master {sCFy_Rh}{sCFy_RhPR4OqWdbBSCk3IQ}{MHeR3-oJSYywtA-dUCwAzw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-02-27T16:47:57,294][INFO ][o.e.h.HttpServer         ] [sCFy_Rh] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2017-02-27T16:47:57,294][INFO ][o.e.n.Node               ] [sCFy_Rh] started
[2017-02-27T16:47:57,368][INFO ][o.e.g.GatewayService     ] [sCFy_Rh] recovered [2] indices into cluster_state
[2017-02-27T16:47:57,608][INFO ][o.e.c.r.a.AllocationService] [sCFy_Rh] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[logstash-2017.02.27][3], [.kibana][0]] ...]).

At this point Elasticsearch is correctly running.

3. Kibana



$ wget https://artifacts.elastic.co/downloads/kibana/kibana-5.2.1-linux-x86_64.tar.gz
$ tar -xzvf kibana-5.2.1-linux-x86_64.tar.gz 

Update the Kibana config file located at kibana-5.2.1-linux-x86_64/config/kibana.yml as follows:

server.host: "x.x.x.x"

# Enables you to specify a path to mount Kibana at if you are running behind a proxy. This only affects
# the URLs generated by Kibana, your proxy is expected to remove the basePath value before forwarding requests
# to Kibana. This setting cannot end in a slash.
#server.basePath: ""

# The maximum payload size in bytes for incoming server requests.
#server.maxPayloadBytes: 1048576

# The Kibana server's name.  This is used for display purposes.
#server.name: "your-hostname"

# The URL of the Elasticsearch instance to use for all your queries.

elasticsearch.url: "http://localhost:9200"


Launch Kibana


$ ./kibana-5.2.1-linux-x86_64/bin/kibana
  log   [15:59:57.638] [info][status][plugin:kibana@5.2.1] Status changed from uninitialized to green - Ready
  log   [15:59:57.700] [info][status][plugin:elasticsearch@5.2.1] Status changed from uninitialized to yellow - Waiting for Elasticsearch
  log   [15:59:57.730] [info][status][plugin:console@5.2.1] Status changed from uninitialized to green - Ready
  log   [15:59:57.946] [info][status][plugin:timelion@5.2.1] Status changed from uninitialized to green - Ready
  log   [15:59:57.950] [info][listening] Server running at http://10.81.163.152:5601
  log   [15:59:57.951] [info][status][ui settings] Status changed from uninitialized to yellow - Elasticsearch plugin is yellow
  log   [15:59:58.069] [info][status][plugin:elasticsearch@5.2.1] Status changed from yellow to green - Kibana index ready
  log   [15:59:58.070] [info][status][ui settings] Status changed from yellow to green - Ready



At this point Kibana is correctly running; no index patterns have been created yet.


4. Logstash

$ wget https://artifacts.elastic.co/downloads/logstash/logstash-5.2.1.tar.gz
$ tar -xzvf logstash-5.2.1.tar.gz 
$ cd logstash-5.2.1


Create a simple Logstash config file with the following content.


a. Use STDIN as input



input { stdin { } }

output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}


b. Use a Filebeat feed as input



input {
   beats {
     type => beats
     port => 5044
   }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}


Save the first configuration as my-first-logstash-config.conf.
It tells Logstash to use stdin as input and the currently running instance of Elasticsearch as output.

To get more details about Filebeat, refer to the next section.

Launch Logstash 


$ ./bin/logstash -f ./my-first-logstash-config.conf
Sending Logstash's logs to /home/SIEM_2/logstash-5.2.1/logs which is now configured via log4j2.properties
The stdin plugin is now waiting for input:
[2017-02-27T17:10:41,028][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2017-02-27T17:10:41,034][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2017-02-27T17:10:41,126][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<URI::HTTP:0x447dc06c URL:http://localhost:9200/>}
[2017-02-27T17:10:41,128][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2017-02-27T17:10:41,188][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>50001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"_all"=>{"enabled"=>true, "norms"=>false}, "dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword"}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date", "include_in_all"=>false}, "@version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2017-02-27T17:10:41,202][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>[#<URI::Generic:0x3da3fd70 URL://localhost:9200>]}
[2017-02-27T17:10:41,205][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>250}
[2017-02-27T17:10:41,208][INFO ][logstash.pipeline        ] Pipeline main started
[2017-02-27T17:10:41,253][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}

This is a data sample       
{
    "@timestamp" => 2017-02-27T16:11:07.294Z,
      "@version" => "1",
          "host" => "x.x.x.x",
       "message" => "This is a data sample"


At this point Logstash is working correctly and we have started feeding Elasticsearch with data. We can now create an index pattern in Kibana and visualize the data.




5. Filebeat

Filebeat allows you to track changes in the files you define in its configuration. It can be configured to feed either Elasticsearch or Logstash.


filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/*.log
    #- c:\programdata\elasticsearch\logs\*

output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]



The above configuration makes Filebeat track all changes in the files matching /var/log/*.log. The changes are fed to Logstash, which is listening on port 5044.

Source
https://www.elastic.co/products