Log parsing with Python
-
What is log parsing?
Parsing is the process of breaking a log message down into smaller chunks of data and placing each chunk into its own named field by following a set of rules.
Once a log is parsed, it is easy to search it for specific words or numbers and to count them. This is especially useful for HTTP status code analysis: given the status code values in, for example, Apache logs, you can count how often each HTTP status (such as 200, 302, 304 and 404) occurs. That gives you a good overview of the HTTP responses served by, for example, a company’s website.
The graph shows the same data as the preceding table, just in visual form.
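The status code counting described above can be sketched in a few lines. This is a minimal sketch rather than a definitive implementation: it assumes entries in the Apache format used by the exercise log later in this document, where the status code is the 9th whitespace-separated field (index 8). The names sample_log and count_status_codes are hypothetical, and the three sample entries are copied from that exercise log.

```python
from collections import Counter

# Three entries copied from the exercise log later in this document.
sample_log = """\
157.55.39.21 - - [23/Sep/2020:12:07:00 +0200] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.156 - - [23/Sep/2020:12:16:30 +0200] "GET /apache-log/access.log:80 HTTP/1.1" 404 230 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
185.153.46.94 - - [23/Sep/2020:12:44:16 +0200] "GET /apache-log/access.log HTTP/1.1" 200 10891488 "-" "python-requests/2.24.0" "-"
"""


def count_status_codes(log_text):
    """Count how often each HTTP status code occurs (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # Assumption: the status code is the 9th whitespace-separated field.
        if len(fields) > 8:
            counts[fields[8]] += 1
    return counts


status_counts = count_status_codes(sample_log)
for status, count in sorted(status_counts.items()):
    print(status, count)
# 200 2
# 404 1
```

Counts like these are the raw numbers behind a table or graph of HTTP responses.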
-
Logfile parsing with a fixed separator
Log files can be formatted with a fixed separator (for example a space or a tab). In addition, empty values can be represented with a placeholder (for example “-”).
Therefore, you can split each line into “columns” and address the individual values. Here is an example of two Apache log file entries:
157.55.39.21 2020/10/10-05:37:42 eth1 INPUT image1232.png 30542 option1
212.140.130.19 2020/10/10-05:37:42 eth0 INPUT image2342.png 98312 -
In this example we can see that the file has a fixed format, so if you want to retrieve the IP address, you can find it in column 1, while the file size is in column 6.
You can access these positions by first splitting the log into its individual lines.
Printing those lines should simply output the two entries above, one per line.
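A minimal sketch of this first step, assuming the two sample entries are held in a string named log (the variable name is an assumption; in practice the text would usually be read from a file):

```python
# The two sample entries from the text, held in a string named `log`
# (the name is an assumption made for this sketch).
log = """157.55.39.21 2020/10/10-05:37:42 eth1 INPUT image1232.png 30542 option1
212.140.130.19 2020/10/10-05:37:42 eth0 INPUT image2342.png 98312 -"""

lines = log.splitlines()  # one list element per log entry
for line in lines:
    print(line)
# Prints the two entries unchanged, one per line.
```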
Now we can split every line into its columns and access them individually, for example by printing the first column of each line.
Doing so prints all the IP addresses. The index of the first element of a list in Python is always 0, so column 1 is index 0. The output should therefore be the two addresses 157.55.39.21 and 212.140.130.19.
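Splitting each line into columns could look like the following sketch. It assumes a single space as the separator and reuses the two sample entries; the names log and ip_addresses are hypothetical.

```python
log = """157.55.39.21 2020/10/10-05:37:42 eth1 INPUT image1232.png 30542 option1
212.140.130.19 2020/10/10-05:37:42 eth0 INPUT image2342.png 98312 -"""

ip_addresses = []
for line in log.splitlines():
    columns = line.split(" ")        # fixed separator: a single space
    ip_addresses.append(columns[0])  # column 1 is index 0
    print(columns[0])
# 157.55.39.21
# 212.140.130.19
```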
Now let’s suppose we want the sum of all the file sizes in the logfile. We look at the 6th column (index 5) and accumulate its values in a running total.
Please note that the file size must also be converted to int, because splitting a line yields strings. For the two entries above, that should give us a total of 128854.
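The summation might be sketched as follows, again assuming the two sample entries and a space separator; total_size is a hypothetical name.

```python
log = """157.55.39.21 2020/10/10-05:37:42 eth1 INPUT image1232.png 30542 option1
212.140.130.19 2020/10/10-05:37:42 eth0 INPUT image2342.png 98312 -"""

total_size = 0
for line in log.splitlines():
    columns = line.split(" ")
    # Column 6 (index 5) is a string, so convert it to int before adding.
    total_size += int(columns[5])

print(total_size)  # 128854
```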
You could expand on that. If, for example, you only want to calculate the total file size for eth0, you could add a condition on the 3rd column (index 2 in the list).
For the two entries above, the output would be 98312.
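A sketch of the filtered sum, under the same assumptions as before (two sample entries, space separator); eth0_size is a hypothetical name.

```python
log = """157.55.39.21 2020/10/10-05:37:42 eth1 INPUT image1232.png 30542 option1
212.140.130.19 2020/10/10-05:37:42 eth0 INPUT image2342.png 98312 -"""

eth0_size = 0
for line in log.splitlines():
    columns = line.split(" ")
    # Only count entries whose 3rd column (index 2) is eth0.
    if columns[2] == "eth0":
        eth0_size += int(columns[5])

print(eth0_size)  # 98312
```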
-
Apache log parser with Python
Log parsing is a common way to identify offending IP addresses during a DDoS attack on your website. To reveal the attacker, you must collect data about the number of requests from each IP address.
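Collecting the number of requests per IP address can be sketched with collections.Counter. This is one possible approach, not necessarily the one the author had in mind. It assumes the client address is the first whitespace-separated field of each entry; the four sample entries are copied from the exercise log later in this document, and the names sample_log and requests_per_ip are hypothetical.

```python
from collections import Counter

# Four entries copied from the exercise log later in this document.
sample_log = """\
111.119.187.30 - - [23/Sep/2020:12:34:27 +0200] "GET /server.php HTTP/1.1" 401 230 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
111.119.187.30 - - [23/Sep/2020:12:35:06 +0200] "GET /apache-log/access.log HTTP/1.1" 200 5099992 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
111.119.187.30 - - [23/Sep/2020:12:36:11 +0200] "GET /apache-log/access.log HTTP/1.1" 500 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
185.153.46.94 - - [23/Sep/2020:12:44:16 +0200] "GET /apache-log/access.log HTTP/1.1" 200 10891488 "-" "python-requests/2.24.0" "-"
"""

# Assumption: the client IP address is the first field of every entry.
requests_per_ip = Counter(
    line.split()[0] for line in sample_log.splitlines() if line.strip()
)

# most_common() lists the busiest addresses first.
for ip, count in requests_per_ip.most_common():
    print(ip, count)
# 111.119.187.30 3
# 185.153.46.94 1
```

On a real log, addresses with an unusually high count at the top of this list would be the first candidates to investigate.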
-
What is Python RegEx?
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like.
The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.
Repeating Things
Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.
The first metacharacter for repeating things that we’ll look at is *. * doesn’t match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Using Regular Expressions
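The * metacharacter described above can be tried out directly with the re module. A minimal sketch, assuming the illustrative pattern ca*t, in which the a may occur zero or more times; the candidate strings and the names pattern and results are chosen for this example only.

```python
import re

# "a*" means: the character "a" zero or more times.
pattern = re.compile(r"ca*t")

results = {}
for candidate in ["ct", "cat", "caaat", "cot"]:
    # fullmatch() succeeds only if the whole string matches the pattern.
    results[candidate] = pattern.fullmatch(candidate) is not None
    print(candidate, results[candidate])
# ct True
# cat True
# caaat True
# cot False
```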
Now that we’ve looked at some simple regular expressions, how do we use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.
Compiling Regular Expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:
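As an illustration of the flags argument, here is a minimal sketch that compiles the pattern ab* with re.IGNORECASE. The pattern and the test strings are chosen for illustration only.

```python
import re

# Compile once; re.IGNORECASE makes matching case-insensitive.
p = re.compile(r"ab*", re.IGNORECASE)

print(p.match("abbb").group())  # abbb
print(p.match("ABBB").group())  # ABBB
print(p.match("xyz"))           # None
```

The compiled object p can be reused for many matches without passing the pattern string again.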
The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.
Putting REs in strings keeps the Python language simpler but has one disadvantage which is the topic of the next section.
In Python a regular expression search is typically written as match = re.search(pattern, string).
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded, for example when searching for the pattern 'word:' followed by a 3-letter word.
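The search-then-test idiom can be sketched like this, using the pattern 'word:' followed by three word characters. The sample string and the variable names are assumptions made for the example.

```python
import re

text = "an example word:cat!!"

# \w matches one word character, so \w\w\w matches a 3-letter word.
match = re.search(r"word:\w\w\w", text)

if match:
    print("found", match.group())  # found word:cat
else:
    print("did not find")
```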
Regular Expressions in Python.
-
Exercise
Now try to apply what you have learned by extracting the following data from the logfile below:
- All IP addresses
- Filenames where the response code isn’t 200
- Total file size of the 200 requests
First make sure that you understand the format of this Apache logfile. Then you can start digging into the data.
193.106.31.130 - - [23/Sep/2020:11:57:15 +0200] "POST /administrator/index.php HTTP/1.0" 200 4481 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)" "-"
3.120.223.25 - - [23/Sep/2020:11:59:42 +0200] "GET /apache-log/access.log HTTP/1.1" 200 20899424 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.121.24.234 - - [23/Sep/2020:11:59:42 +0200] "GET /apache-log/access.log HTTP/1.1" 200 9763544 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
3.121.24.234 - - [23/Sep/2020:11:59:43 +0200] "GET /apache-log/access.log HTTP/1.1" 200 14853136 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
37.170.98.124 - - [23/Sep/2020:12:04:23 +0200] "GET /apache-log/access.log HTTP/1.1" 200 1219290675 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0" "-"
157.55.39.21 - - [23/Sep/2020:12:07:00 +0200] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
3.120.223.25 - - [23/Sep/2020:12:14:40 +0200] "GET /apache-log/access.log HTTP/1.1" 200 12050672 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:12:14:41 +0200] "GET /apache-log/access.log HTTP/1.1" 200 29412072 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:12:14:41 +0200] "GET /apache-log/access.log HTTP/1.1" 200 15149096 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
40.77.167.156 - - [23/Sep/2020:12:16:30 +0200] "GET /apache-log/access.log:80 HTTP/1.1" 404 230 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
18.195.155.52 - - [23/Sep/2020:12:29:40 +0200] "GET /apache-log/access.log HTTP/1.1" 200 18831288 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.121.24.234 - - [23/Sep/2020:12:29:41 +0200] "GET /apache-log/access.log HTTP/1.1" 200 9698360 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:12:29:41 +0200] "GET /apache-log/access.log HTTP/1.3" 400 17035456 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
111.119.187.30 - - [23/Sep/2020:12:34:27 +0200] "GET /server.php HTTP/1.1" 401 230 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
111.119.187.30 - - [23/Sep/2020:12:35:06 +0200] "GET /apache-log/access.log HTTP/1.1" 200 5099992 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
111.119.187.30 - - [23/Sep/2020:12:36:11 +0200] "GET /apache-log/access.log HTTP/1.1" 500 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36" "-"
185.153.46.94 - - [23/Sep/2020:12:44:16 +0200] "GET /apache-log/access.log HTTP/1.1" 200 10891488 "-" "python-requests/2.24.0" "-"
18.195.155.52 - - [23/Sep/2020:12:44:39 +0200] "GET /apache-log/access.log HTTP/1.1" 200 18894616 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.121.24.234 - - [23/Sep/2020:12:44:41 +0200] "GET /admin.php HTTP/1.1" 401 16233064 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:12:44:42 +0200] "GET /apache-log/access.log HTTP/1.1" 200 10560928 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
3.121.24.234 - - [23/Sep/2020:12:59:40 +0200] "GET /apache-log/access.log HTTP/1.1" 200 18148896 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
18.195.155.52 - - [23/Sep/2020:12:59:40 +0200] "GET /apache-log/access.log HTTP/1.1" 200 18072776 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:12:59:40 +0200] "GET /apache-log/access.log HTTP/1.1" 200 9004248 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
185.153.46.94 - - [23/Sep/2020:13:07:26 +0200] "GET /apache-log/access.log HTTP/1.1" 200 145984 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36" "-"
18.195.155.52 - - [23/Sep/2020:13:14:41 +0200] "GET /apache-log/error.log HTTP/1.1" 404 230 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
18.195.155.52 - - [23/Sep/2020:13:14:41 +0200] "GET /apache-log/access.log HTTP/1.1" 200 17451152 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 7 Build/KRT16M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" "-"
3.120.223.25 - - [23/Sep/2020:13:14:45 +0200] "GET /apache-log/access.log HTTP/1.1" 200 10061336 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36" "-"
185.103.121.10 - - [23/Sep/2020:13:19:47 +0200] "GET /apache-log/access.log HTTP/1.1" 200 26709316 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36" "-"
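As one possible starting point for understanding the format, a single entry can be broken into named fields with a regular expression. This is a hedged sketch, not the intended solution: the pattern LINE_PATTERN and its group names are assumptions made here, and it covers only the fields up to the size. Note that the size can be "-" instead of a number, as in the entry with status 500.

```python
import re

# Hypothetical pattern for the leading fields of one entry:
# ip, two placeholders, [timestamp], "method path protocol", status, size.
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# One entry copied from the logfile above.
entry = (
    '157.55.39.21 - - [23/Sep/2020:12:07:00 +0200] '
    '"GET /robots.txt HTTP/1.1" 200 304 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"'
)

match = LINE_PATTERN.match(entry)
if match:
    fields = match.groupdict()
    print(fields["ip"], fields["path"], fields["status"], fields["size"])
    # 157.55.39.21 /robots.txt 200 304
```

Looping over all lines and filtering on the status field is left as the exercise.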
-
Quiz