More and more companies hold large amounts of data that are valuable resources for customer segmentation, sales management, and targeted marketing. However, if these data sets cannot be analysed and evaluated sufficiently, they are practically worthless to companies. The information is there in abundance, but only those who know how to use it can benefit from it. Trend researcher and futurologist John Naisbitt made the same point in his well-known quote:

Quotation

“We are drowning in information, but starving for knowledge.”

– Trend researcher and futurologist John Naisbitt, on growing volumes of digital data

Data mining tools help to manage this volume of data and to identify potentially decisive trends and patterns. Data mining software is becoming increasingly complex, and the selection of tools is growing. To help you keep track of the most important data mining programs, we have compiled a comparison of several of them.

Techniques, tasks, and components of data mining

Data mining is the term for algorithmic methods of data evaluation that are applied to particularly large and complex data sets. It is designed to extract hidden information from large volumes of data (especially mass data, known as Big Data) and thereby to identify correlations, trends, and patterns that would otherwise remain hidden. This is where data mining tools come in. The term 'data mining' does not refer to generating data or to the data sets themselves, but to the practice of data analysis. Many of the methods come from statistics; however, data mining is not purely statistical, but an interdisciplinary approach that combines computer science and mathematics with machine learning (especially unsupervised learning) and artificial intelligence. These methods are integrated into data mining software so that large data sets can be evaluated.

Fact

Text mining is a special form of data mining that has gained relevance with the spread of language software and language technology. Here, information retrieval refers not to data sets, but to text documents: the main points are extracted from large amounts of text, such as specialist articles or company documents. This makes text mining useful for companies when researching new projects, for example.
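
As a rough illustration of extracting the main points from text, the following Python sketch counts the most frequent terms in a document. The stop-word list, the function name `top_terms`, and the sample text are assumptions made for this example; real text mining tools use far more elaborate linguistic processing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}


def top_terms(text, n=3):
    """Return the n most frequent terms that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample document.
document = (
    "The pricing project reviews pricing data. "
    "Pricing affects sales, and sales data drives the project."
)

print(top_terms(document))
```

Term frequency is only the simplest possible signal, but it shows the basic idea: the text itself becomes the data set that is analysed.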

Nevertheless, users must have a good understanding of their data sets for data mining to be successful. Only then can they use data mining tools in a meaningful way. Programming skills are not required.

Individual data mining tasks:

  • Classification: Assigns individual data objects that have not yet been labelled to predefined classes (such as cats or bicycles); decision tree analysis is particularly helpful for classification.
  • Deviation (outlier) analysis: Identifies objects that do not follow the dependency rules that hold for related objects; this helps you to find the causes of the discrepancies.
  • Cluster analysis: Identifies similarities and forms groups of objects that are more similar to each other in certain respects than to objects in other groups; unlike classification, the groups (or clusters) are not predefined and can take different forms depending on the data analysed.
  • Association analysis: Reveals correlations between two or more items that are not directly related but frequently occur together.
  • Regression analysis: Reveals relationships between a dependent variable (e.g. product sales) and one or more independent variables (e.g. product price or customer income), and is used, among other things, to make forecasts about the dependent variable (e.g. a sales forecast).
  • Predictive analytics: This is really a superordinate task that aims to make predictions about future trends. It uses data mining, among other things, and works with a variable (predictor) that is measured for individual people or larger entities.
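
The regression task from the list above can be sketched in a few lines of Python. This is a minimal example, assuming a single independent variable (product price) and made-up sales figures; the function name `fit_line` is chosen for this illustration, and data mining tools typically offer much richer regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalised; the factor 1/n cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustration data: product price and units sold.
prices = [10.0, 12.0, 14.0, 16.0]
units_sold = [200.0, 180.0, 160.0, 140.0]

intercept, slope = fit_line(prices, units_sold)

# Sales forecast for a price that was not observed.
forecast = intercept + slope * 18.0
print(f"slope: {slope:.1f}, forecast at price 18: {forecast:.1f} units")
```

Here the price is the independent variable and the number of units sold is the dependent variable; the fitted line is then used to forecast sales at a new price.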

Fact

With the help of association analysis, informative correlations between purchasing decisions for different products could be established, which significantly improved shopping basket analysis. Online retailers use this method to determine purchase recommendations.
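
Shopping basket analysis usually rests on a few simple measures. The sketch below computes support, confidence, and lift for one rule on made-up baskets; the function names and the data are assumptions for this example and are not taken from any particular tool.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in baskets if items.issubset(basket))
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, antecedent, consequent):
    """Values above 1 mean the items occur together more often than chance."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)


# Made-up shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]

rule_confidence = confidence(baskets, {"bread"}, {"butter"})
rule_lift = lift(baskets, {"bread"}, {"butter"})
print(f"bread -> butter: confidence {rule_confidence:.2f}, lift {rule_lift:.2f}")
```

A recommendation such as "customers who bought bread also bought butter" corresponds to a rule with high confidence and a lift above 1.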

The different methods can be roughly divided into observation problems (deviation analysis, cluster analysis) and forecasting problems (regression analysis, classification). A detailed explanation of the different data mining methods can be found on Zentut.
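
As an example of an observation problem, deviation analysis can be approximated with a simple z-score check: values that lie more than a chosen number of standard deviations from the mean are flagged. The threshold of 2.0, the function name `find_outliers`, and the order figures are assumptions for this sketch; it is one basic technique among many, not the method of any specific tool.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up daily order counts with one unusual day.
daily_orders = [102, 98, 101, 99, 100, 97, 103, 250]

print(find_outliers(daily_orders))
```

The check only describes the existing data; it makes no forecast, which is what distinguishes observation problems from forecasting problems.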

A comparison of data mining tools

To compare the best data mining tools, we introduce RapidMiner, WEKA, Orange, KNIME, and SAS. In practice, users often work with several programs, because data mining tools have different strengths that can be combined, and the tools are often compatible with each other. But even with just one good all-rounder, you can do a lot as a beginner.

RapidMiner

RapidMiner (formerly known as YALE, 'Yet Another Learning Environment') is one of the most popular data mining tools. According to a 2014 survey conducted by KDnuggets, it was the most widely used data mining tool, ahead of R. It is available for free and is easy to use even without special programming skills. Nevertheless, it offers a large selection of operators. Startups, in particular, make use of this tool.

RapidMiner is written in Java and contains more than 500 operators with different approaches for revealing connections in data. Among other things, there are options for data mining, text mining, web mining, and sentiment analysis (opinion mining). The program also imports Excel tables, SPSS files, and data sets from many databases, and integrates the WEKA and R data mining tools. This makes it a comprehensive all-rounder.

RapidMiner supports all steps of the data mining process, including the presentation of results. The tool consists of three major modules: RapidMiner Studio, RapidMiner Server, and RapidMiner Radoop, each of which covers different data mining techniques. In addition, RapidMiner prepares the data prior to analysis and optimises it for faster subsequent processing. A free and a fee-based version are available for each of the three modules.

A particular strength of RapidMiner is predictive analytics, that is, predicting future developments based on collected data. In this comparison of data mining software, RapidMiner is one of the strongest tools.

WEKA

WEKA (Waikato Environment for Knowledge Analysis) is open source software developed by the University of Waikato. The data mining tool is based on Java and runs on Windows, macOS, and Linux. Known for its extensive machine learning capabilities, it supports all major data mining tasks such as clustering, association, regression, and classification. The graphical user interface makes the software easy to access. In addition, WEKA can connect to SQL databases and further process the data returned by a query.

WEKA’s strength lies in classification: the tool is known for its many classification methods, including artificial neural networks and decision trees such as the ID3 and C4.5 algorithms. However, WEKA is less powerful when it comes to other techniques such as cluster analysis, for which it offers only the most important procedures. Another disadvantage: WEKA can run into problems when the amount of data becomes too large, because it tries to load all of the data into memory. To work around this, WEKA offers a simple command line interface (CLI) that makes it easier to handle large amounts of data.
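
Decision tree algorithms such as ID3, named above among WEKA’s classification methods, choose the attribute to split on by information gain. The following Python sketch computes that measure on made-up customer records. It is not WEKA code; the attribute names and the helper functions are assumptions for this illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Entropy reduction achieved by splitting the rows on one attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return base - remainder


# Made-up training records.
customers = [
    {"newsletter": "yes", "region": "north", "bought": "yes"},
    {"newsletter": "yes", "region": "south", "bought": "yes"},
    {"newsletter": "no", "region": "north", "bought": "no"},
    {"newsletter": "no", "region": "south", "bought": "no"},
]

# Pick the attribute with the highest gain as the root of the tree.
best = max(
    ["newsletter", "region"],
    key=lambda attr: information_gain(customers, attr, "bought"),
)
print(best)
```

In this toy data set the newsletter attribute separates buyers from non-buyers perfectly, while the region carries no information, so a tree built this way would split on the newsletter attribute first.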

Fact

WEKA received the 'SIGKDD Service Award' from the Association for Computing Machinery for its contribution to research. Compared with other data mining tools, WEKA has proven particularly useful for teaching and research purposes.

Orange

The data mining tool Orange has existed for more than 20 years and is a project of the University of Ljubljana. The software’s core was written in C++, but early on the program was extended with the programming language Python, which now serves as its scripting language. The more complicated operations are still carried out in C++. Orange is comprehensive data mining software that demonstrates how much you can do with Python: it offers useful applications for data and text analysis as well as features for machine learning. For data mining, it provides operators for classification, regression, clustering, and much more. The tool also integrates visual programming.

What is striking about the tool is that users repeatedly emphasise how much fun this data mining software is to use compared with others. Both beginners and experienced users report being fascinated by Orange. Its popularity comes down to two things: firstly, the appealing data visualisation, which makes the tool more interesting to work with; secondly, the speed and ease with which the visualisation takes place. The program prepares input data visually and instantly. Understanding these graphics and taking the data analysis further is relatively easy, so quick business decisions can be made. This makes Orange an ideal tool for data mining.

A further advantage for beginners is the large number of online tutorials available for the tool. Another plus is that Orange learns the preferences of its users over time and reacts accordingly.

KNIME

KNIME was developed at the University of Konstanz and is now popular with a large international community of developers. Although KNIME was originally intended for commercial use, it is available as open source software. It is written in Java and built on Eclipse. Compared with other data mining software, its range of functions is especially impressive: with more than 1,000 modules and ready-made application packages, the tool helps to reveal hidden data structures. The modules can be extended with additional commercial features.

Among its functions, integrative data analysis is particularly appealing: KNIME is one of the most powerful tools in its field and allows numerous methods of machine learning and data mining to be integrated. It is also particularly effective at preprocessing data, i.e. extracting, transforming, and loading it. Its modular pipelining makes it a data flow-oriented data mining tool.

KNIME has been used in pharmaceutical research since 2006 and is also a powerful data mining tool for the financial sector. It is also frequently used in business intelligence (BI), where it is regarded as the tool that made predictive analytics available to inexperienced users. The tool is also interesting for beginners because, despite its many strong features, you don’t need much time to familiarise yourself with it. KNIME is available both as a free program and as a paid program.
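
The idea of modular, data flow-oriented pipelining with extract, transform, and load steps can be sketched in plain Python. This is not KNIME’s API; the node functions, the `run_pipeline` helper, and the CSV sample are hypothetical and only illustrate how independent modules pass data along a chain.

```python
import csv
import io

# Made-up raw input with one incomplete record.
RAW_CSV = "customer,amount\nanna,20.5\nben,\ncara,12.0\n"


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and convert amounts to floats."""
    return [
        {"customer": row["customer"], "amount": float(row["amount"])}
        for row in rows
        if row["amount"]
    ]


def load(rows, store):
    """Load: write the cleaned rows into a target store (here a dict)."""
    for row in rows:
        store[row["customer"]] = row["amount"]
    return store


def run_pipeline(source, nodes):
    """Pass the data through each node in order, like a simple data flow."""
    data = source
    for node in nodes:
        data = node(data)
    return data


warehouse = {}
run_pipeline(RAW_CSV, [extract, transform, lambda rows: load(rows, warehouse)])
print(warehouse)
```

Because each node only depends on the output of the previous one, steps can be swapped or added without touching the rest of the chain, which is the appeal of the modular approach.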

SAS

SAS (Statistical Analysis System) is a product of the SAS Institute, one of the world’s largest privately owned software companies. SAS is the leading data mining tool for business analysis and also the most expensive of the programs listed here. However, it is the one best suited for use in large companies. SAS is particularly strong in forecasting and in interactive data visualisation, which is ideal for large presentations. In principle, this data mining software provides a comprehensive all-round solution for successful data mining. The tool is characterised by very high scalability, so performance can be increased proportionally by adding hardware or other resources. This also makes it a powerful tool for high-quality business solutions. For technically less experienced users, it has a graphical user interface.

SAS is usually subject to a fee; it can only be used free of charge with a corresponding licence obtained through a public institution. Prices are quoted on request and depend on special conditions, e.g. it is cheaper for authorities or educational institutions. It is also possible to customise the range of functions and thereby influence the price.

SAS is mainly used in pharmaceutical companies, where it has established itself as the standard. It is also frequently used in the banking sector and offers solutions for BI and web mining; among other things, it has its own business intelligence software for this purpose. This makes it one of the most powerful data mining tools on the market.

Data mining tools at a glance

Following the detailed comparison of the data mining software above, here is an overview of the most important features of each tool:

| Tool | Characteristics | Programming language | Operating system | Price/licence |
|---|---|---|---|---|
| RapidMiner | Strong all-rounder with a special strength in predictive analytics | Java | Windows, macOS, Linux | Freeware; various fee-based versions |
| WEKA | Many methods of classification | Java | Windows, macOS, Linux | Free software (GPL) |
| Orange | Creates particularly appealing and interesting data visualisations without the need for extensive prior knowledge | Software core: C++; extensions and scripting language: Python | Windows, macOS, Linux | Free software (GPL) |
| KNIME | The leading open data mining tool that has made predictive analytics available to the general public | Java | Windows, macOS, Linux | Free software (GPL) (from version 2.1 onwards) |
| SAS | Expensive, but powerful data mining software for large enterprises | SAS language | Windows, macOS, Linux | Limited freeware available through educational institutions; price only available on request; various extensive models available |