Whether you’re sitting at a desktop PC, reading the news on a tablet, or operating a website on a server, many different processes are running in the background of these devices. If an error occurs, or if you simply want to find out which actions an operating system or program is executing, log files can help. They are recorded automatically by virtually every application, server, and database system.
Log files are rarely read and evaluated. Think of them as a virtual black box of sorts: they are only inspected in the most urgent cases. Because of the way they capture data, log files are an excellent source of information on program and system errors; they also lend themselves particularly well to gathering information on user behaviour. This makes the technology especially interesting for website operators, who can gain useful data from the log files located on their web servers.
What is a log file?
Log files, which are sometimes referred to as event files, are generally plain text files. They contain information on all processes that the program’s developers have defined as relevant. A database’s log file, for instance, records all changes made by correctly executed transactions. If part of a database is lost, e.g. in the course of an unexpected system shutdown, the log file acts as the basis for restoring the data set to its proper state.
Log files are generated automatically according to the rules programmed into the application. It’s also possible to create your own log files, provided you’re familiar enough with the technical aspects involved. Generally, a line within a log file contains the following information:
- Recorded events (e.g. program start)
- Time stamp, which assigns a date and time to the event
Normally, the time stamp comes first so that the chronological sequence of events is easy to follow.
Typical applications for log files
Operating systems generally create multiple log files by assigning the different process types to fixed categories. For example, Windows systems record information on application events, system events, security-related events, set-up events, and forwarded events. This gives administrators access to the log information they need for troubleshooting; Windows log files also show which users have logged on to and off the system. In addition to the operating system, the following programs and systems collect completely different data:
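To illustrate the line structure described above, here is a minimal sketch that writes its own log lines with Python's standard `logging` module. The logger name `demo_app`, the line layout, and the two messages are assumptions chosen for illustration; real applications define their own formats.

```python
import io
import logging


def build_logger(stream):
    """Return a logger that writes 'time stamp, level, event' lines to stream."""
    logger = logging.getLogger("demo_app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep the lines out of the root logger
    handler = logging.StreamHandler(stream)
    # The time stamp comes first so the lines sort chronologically.
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for a real log file here.
buffer = io.StringIO()
log = build_logger(buffer)
log.info("program start")
log.error("could not open settings file")

lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Replacing the in-memory buffer with `logging.FileHandler` and a file path would write the same lines to disk.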
- Background programs, like e-mail servers, databases, or proxy servers, generate log files that are primarily used to record error and event messages as well as other notices. These records help to back up data and, in the event of a crash, to restore it.
- Installed software, like office programs, games, instant messengers, firewalls, or virus scanners, saves many different types of data in log files, such as configuration settings or chat messages. Program crashes are also recorded in order to speed up troubleshooting.
- Servers (especially web servers) record relevant network activity; this information contains useful data on users and their behaviour within networks. Authorised administrators can see which users started an application or requested a file, at what time and for how long they did so, and which operating system was used. Web log analysis is one of the oldest web analytics methods and one of the best examples of the many uses of log files.
Web server log files: a textbook example of the potential of log files
Originally, the log files of web servers such as Apache or Microsoft IIS served mainly to record processing errors so that they could be fixed. It was quickly discovered, however, that web server log files contain much more valuable data: information on the usability and popularity of the websites hosted on the server as well as user data such as:
- Time of page view
- Number of page views
- Session duration
- IP address and user’s host name
- Information on the requesting client (usually the browser)
- Search engine used, including search queries
- Operating system used
A typical entry in a web server log file looks as follows:
188.8.131.52 - - [18/Mar/2003:08:04:22 +0200] "GET /images/logo.jpg HTTP/1.1" 200 512 "http://www.wikipedia.org/" "Mozilla/5.0 (X11; U; Linux i686; de-DE;rv:1.7.5)"
Detailed overview of individual parameters:
| Field | Value | Meaning |
|---|---|---|
| IP address | 188.8.131.52 | The requesting host’s IP address |
| Identity | - | RFC 1413 identity of the client; generally unknown |
| Who? | - | The user name, provided HTTP authentication has taken place; otherwise, as in this example, it remains empty |
| When? | [18/Mar/2003:08:04:22 +0200] | Time stamp consisting of date, time, and time zone offset |
| What? | "GET /images/logo.jpg HTTP/1.1" | The event that occurred, in this case an image request via HTTP |
| OK? | 200 | Confirms a successful request (HTTP status code 200) |
| How much? | 512 | If applicable: the amount of data transferred, in bytes |
| From where? | "http://www.wikipedia.org/" | The web address of the page from which the file was requested (the referrer) |
| By which means? | "Mozilla/5.0 (X11; U; Linux i686; de-DE;rv:1.7.5)" | Technical information about the client: browser, user interface, operating system and kernel, language, version |
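The fields listed above can be extracted programmatically. The following is a minimal sketch, assuming the entry follows the layout of the example line (host, identity, user, time stamp, request, status, size, referrer, user agent); the pattern and the function name `parse_line` are illustrative and not part of any web server's tooling.

```python
import re
from datetime import datetime

# One named group per field of the example entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # %b expects English month abbreviations, which holds in the default C locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means that no body was transferred.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = (
    '188.8.131.52 - - [18/Mar/2003:08:04:22 +0200] '
    '"GET /images/logo.jpg HTTP/1.1" 200 512 '
    '"http://www.wikipedia.org/" '
    '"Mozilla/5.0 (X11; U; Linux i686; de-DE;rv:1.7.5)"'
)
entry = parse_line(sample)
for field, value in entry.items():
    print(f"{field}: {value}")
```

Lines that do not match the pattern return `None`, so a caller can skip or count malformed entries instead of failing on them.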
To evaluate this flood of information effectively, tools like Webalizer have been developed. These take the collected data and transform it into informative statistics, tables, and graphics. Trends in a website’s growth, the user-friendliness of individual pages, and relevant keywords and themes can all be determined using this information.
Although web server log files are still analysed, this tried and tested method has lost some of its former sheen to increasingly popular web analysis methods such as cookies or page tagging. Two factors drive this trend: log file analysis is error-prone when it comes to assigning requests to sessions, and website operators often have no access to their web server’s log files. Log file analysis nevertheless has its advantages: all error reports are registered immediately, and the collected data stays within the company.
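The kind of statistics such tools produce can be approximated in a few lines. This is a simplified sketch and not how Webalizer itself works: the function name `summarise`, the reduced pattern, and the sample entries (which use reserved documentation IP addresses) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Pulls the requested path, status code, and size out of one log entry.
REQUEST_PATTERN = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarise(lines):
    """Count page views per path and responses per status code, and sum the bytes sent."""
    views = Counter()
    statuses = Counter()
    total_bytes = 0
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match is None:
            continue  # skip entries that do not look like requests
        views[match["path"]] += 1
        statuses[int(match["status"])] += 1
        if match["size"] != "-":
            total_bytes += int(match["size"])
    return views, statuses, total_bytes


# Hypothetical sample entries, shortened to the fields needed here.
sample_lines = [
    '192.0.2.1 - - [18/Mar/2003:08:04:22 +0200] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.1 - - [18/Mar/2003:08:04:23 +0200] "GET /images/logo.jpg HTTP/1.1" 200 512',
    '198.51.100.7 - - [18/Mar/2003:08:10:00 +0200] "GET /index.html HTTP/1.1" 200 1024',
    '198.51.100.7 - - [18/Mar/2003:08:11:30 +0200] "GET /missing.html HTTP/1.1" 404 -',
]

views, statuses, total_bytes = summarise(sample_lines)
print("Most viewed:", views.most_common(1))
print("Status codes:", dict(statuses))
print("Bytes transferred:", total_bytes)
```

Counting 404 responses per path in the same way gives a quick list of broken links, one of the error-related uses of web server logs.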
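The difficulty with assigning sessions can be made concrete with a small sketch. It assumes that a visitor is identified by the combination of IP address and user agent, and that a session ends after 30 minutes without a request; both are common conventions but assumptions here, and the function name `count_sessions` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity limit


def count_sessions(requests):
    """Count sessions per client key from (client_key, time stamp) pairs."""
    last_seen = {}
    sessions = {}
    for key, timestamp in sorted(requests, key=lambda request: request[1]):
        previous = last_seen.get(key)
        # A new session starts on the first request or after a long pause.
        if previous is None or timestamp - previous > SESSION_TIMEOUT:
            sessions[key] = sessions.get(key, 0) + 1
        last_seen[key] = timestamp
    return sessions


# Two people behind the same proxy share one IP address and user agent,
# so the log alone cannot tell them apart.
proxy_client = ("203.0.113.5", "Mozilla/5.0")
requests = [
    (proxy_client, datetime(2003, 3, 18, 8, 0)),   # person A
    (proxy_client, datetime(2003, 3, 18, 8, 5)),   # person B, merged into A's session
    (proxy_client, datetime(2003, 3, 18, 9, 0)),   # 55 minutes later: new session
]

session_counts = count_sessions(requests)
print(session_counts[proxy_client])
```

The first two requests are counted as one session although they may come from different people, which is the kind of misattribution that makes session figures from log files unreliable.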