Decision support systems have a long tradition in the world of business. Since the 1960s, companies have been using analytical methods to derive dispositive data. The aim is to support management in the strategic direction of business processes with data-based reports, models, and forecasts.

Analytic information systems, with their corresponding functions, comprise concepts like MIS (Management Information Systems), DSS (Decision Support Systems), and EIS (Executive Information Systems). These can be difficult to distinguish from one another and, in business practice as well as in marketing, have been categorised since the 1990s under the collective term business intelligence (BI).

Fact

Business intelligence is the collective term for the IT-supported preparation and analysis of a company's raw data. BI should generate knowledge that serves as a basis for decisions on the strategic direction and orientation of a company.

The database of decision support systems in the context of BI is nowadays usually provided by a central, so-called data warehouse. We guide you through the basics of data warehousing, trace the reference architecture of such an information system, and present established providers of commercial DWH solutions as well as free open source alternatives.

What is a data warehouse?

According to its definition, a data warehouse (DWH) is a database system, separate from a company's operational data processing systems, in which data from different, sometimes very heterogeneous sources is consolidated and archived for the long term. Many companies periodically carry over historical data from their operational systems into such a data warehouse to prepare it for later access and for strategic analysis in the context of business intelligence (BI). Thus, operative data becomes dispositive data:

  • Operative data: Operative data refers to transaction-oriented information that arises in companies from day to day and is generated by administrative and accounting systems. Typical data sources include operational systems like product management systems, enterprise resource planning (ERP) systems, information and ordering systems, and, of course, accounting programs.
  • Dispositive data: If operative data is brought together (aggregated) in a central location, saved for the long term, and prepared for analysis, it is referred to as dispositive data.

A DWH offers analysts a comprehensive view of heterogeneous data files and makes it possible to gather operational figures in the context of online analytical processing (OLAP). As the central point for all relevant company files, the DWH serves company-internal knowledge management. Usually, users are only granted read access. A DWH functions as the database for data mining methods and is the basis of all considerations in the areas of performance management and strategic company direction.

Structure of a DWH: data warehouse ar­chi­tec­ture

The process of managing and eval­u­at­ing a DWH is known as data ware­hous­ing and involves the following phases:

  1. Data ac­quis­i­tion and data in­teg­ra­tion
  2. Data re­pos­it­ory
  3. Data eval­u­ation and analysis

The phases of data warehousing are reflected in the typical structure, the so-called reference architecture, of data warehouse systems. Of course, the system architecture of a DWH varies depending on the specific product and provider, but fundamentally it follows a modular technical structure made up of three levels:

  • Data col­lec­tion
  • Data repository
  • Data provision

In addition, there is a central control component: a data warehouse manager that performs specific administrative functions at every level of the DWH. The individual components of a data warehouse do not necessarily have to come from one provider, as the respective services can also be rendered by different software products or individual solutions.

The following il­lus­tra­tion depicts a schematic rep­res­ent­a­tion of the DWH reference ar­chi­tec­ture.

Data recording level

Before files can be loaded into the DWH, the heterogeneous information often has to be translated into a uniform presentation form. A DWH is supplied from a company's internal data sources as well as from relevant external ones.

  • Internal data: Operating systems: Enterprise Resource Planning systems (ERP), Customer Relationship Management systems (CRM); operating databases; Content Management Systems (CMS); flat files (e.g., Excel, CSV, text files), emails, etc.
  • External data: Ap­plic­a­tions and systems from external service providers, websites, social media, cloud services, etc.

Systems on the data recording level provide interfaces for companies' operating systems and are employed in the first phase of data warehousing. Two central functions of these DWH components are data acquisition and data integration.

When it comes to data ac­quis­i­tion, the following ex­trac­tion tech­niques are used:

  • Trigger: Provided that a company's operating systems support database triggers, these can be deployed to automate data extraction. Triggers make it possible to define operations that are executed automatically when certain events occur. Usually, these events are changes in the source system's database, which then lead to an extraction of the modified data into the DWH.
  • Log files: If an operating system doesn't support trigger technology, the DWH's data handling system can instead contain a program that evaluates the source system's log files and extracts the operations logged within them.
  • Monitoring programs: If neither triggers nor log files are available for extraction, a monitoring program is usually employed. It detects changes in the database of an operating system using algorithms that periodically create snapshots of the monitored files and match them against previous ones.

If none of the above-mentioned techniques is supported and access to the operating system's database is not possible, changes must be logged independently and the relevant modifications communicated to the data warehouse.
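The trigger technique described above can be sketched with Python's built-in sqlite3 module. The table and trigger names below are illustrative assumptions; a real source system would use its own DBMS's trigger syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_changelog (
        order_id INTEGER, op TEXT, amount REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- The trigger fires on every change and logs it for later extraction
    -- into the DWH.
    CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_changelog (order_id, op, amount)
        VALUES (NEW.id, 'INSERT', NEW.amount);
    END;
""")

con.execute("INSERT INTO orders (id, amount) VALUES (1, 99.90)")
changes = con.execute(
    "SELECT order_id, op, amount FROM orders_changelog"
).fetchall()
print(changes)  # [(1, 'INSERT', 99.9)]
```

The DWH-side extractor would then periodically read and empty the changelog table instead of scanning the whole source database.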

When it comes to data integration, most DWHs provide OLAP functionalities that make it possible to present files in multi-dimensional structures. Online analytical processing (OLAP) is an analysis method that compresses company data relevant to management. Data integration itself is based on the ETL process:

  • E = Extraction: Data extraction involves the selection of relevant information from various data sources. It can be executed as either a push or a pull strategy. With a push strategy, the data sources are instructed to periodically generate extracts and transfer them to the DWH. With a pull strategy, the DWH initiates the data extraction itself.
  • T = Trans­form­a­tion: The extracted files are adjusted as part of a trans­form­a­tion and uniformly trans­lated into the format of the target database.
  • L = Loading: The loading phase involves the saving of the trans­formed files in the re­spect­ive target databases of the DWH.
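A minimal ETL run following the three steps above might look like this sketch, assuming a CSV extract with German number formatting and a SQLite database standing in for the DWH target (file, table, and column names are illustrative):

```python
import csv
import io
import sqlite3

# A raw extract as it might arrive from a source system (illustrative).
source = io.StringIO("date;amount\n2016-01-04;1.200,50\n2016-01-05;980,00\n")

# E = Extraction: read the raw records from the source.
rows = list(csv.DictReader(source, delimiter=";"))

# T = Transformation: translate the German decimal format into a float,
# i.e. into the uniform format of the target database.
def to_float(s):
    return float(s.replace(".", "").replace(",", "."))

records = [(r["date"], to_float(r["amount"])) for r in rows]

# L = Loading: save the transformed records in the target database.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE turnover (date TEXT, amount REAL)")
dwh.executemany("INSERT INTO turnover VALUES (?, ?)", records)

total = dwh.execute("SELECT SUM(amount) FROM turnover").fetchone()[0]
print(total)  # 2180.5
```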

The data handling level of a DWH can contain a so-called staging area. This refers to a temporary section of the database where the pre-pro­cessing of loaded files takes place. Such a staging process is par­tic­u­larly essential when it comes to complex ETL processes.

Given that DWH files are brought together from different sources, data integration relies on a diverse range of tools that make the transformation and adjustment of the extracted files possible. These can be organised into the following categories:

  • Data migration tools: Programs for data migration make it possible to easily define the trans­form­a­tion rules in order to transfer het­ero­gen­eous output data in a uniform target format.
  • Data scrubbing tools: When it comes to data scrubbing, the programs called into action are based on fuzzy logic as well as neural networks. The goal is to improve data quality by addressing mistakes, gaps, and repetition in datasets through predefined rules, algorithms, and lookup tables (LUT). In this context, one also speaks of quality management.
  • Data auditing tools: Data auditing tools are used in data in­teg­ra­tion to establish rules and con­nec­tions between files. Ad­di­tion­ally, programs like this make it possible to identify files that infringe on these rules, meaning that they are pre­sum­ably mistakes.
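The lookup-table approach used by data scrubbing tools can be sketched as follows; the misspellings and LUT entries are illustrative assumptions:

```python
# A lookup table (LUT) maps known inconsistent spellings to a canonical
# value; exact duplicates are then removed.
lut = {"Muenchen": "München", "MUNICH": "München"}

raw = ["München", "Muenchen", "MUNICH", "Berlin", "Berlin"]

cleaned = []
for city in raw:
    city = lut.get(city, city)  # fix known misspellings via the LUT
    if city not in cleaned:     # drop repetitions
        cleaned.append(city)

print(cleaned)  # ['München', 'Berlin']
```

Real scrubbing tools extend this idea with fuzzy matching, so that spellings not present in the LUT can still be mapped to their canonical value.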

Following data integration comes the loading of the extracted files into the central database, the so-called core data warehouse. This step is supported by programs that offer the following functions:

  • Checking integrity conditions
  • Sorting files
  • Calculating aggregations
  • Determining access structures
  • Partitioning files for efficient access

Data retention level

Central to the DWH is the data retention level, which comprises the so-called core data warehouse. Extracted files are archived there for the long term, mostly in multi-dimensional structures known as star and snowflake schemas, for future analyses. Only rarely do such analyses refer to the entire database of the DWH. In order to facilitate efficient analysis, it's normal practice to create data segments of the entire portfolio, so-called data marts.

A data mart is a copy of a part of a database that is usually im­ple­men­ted as a non-per­sist­ent in­ter­me­di­ary storage. From time to time it is also the case that so-called in­de­pend­ent data marts are used that hold a separate data extract on a long-term basis.

A star schema is a type of Entity Re­la­tion­ship Diagram (ERD), i.e., a graphical rep­res­ent­a­tion of the table structure of a database that il­lus­trates the various entities as well as their re­la­tion­ship to one another. The star schema serves, es­sen­tially, as a visu­al­isa­tion of multi-di­men­sion­al file struc­tures.

Each star schema is made up of a fact table as well as multiple dimension tables that group themselves in a star shape around the fact table.

  • The fact table contains so-called facts: the performance and result figures of the company that need to be continuously recorded (e.g., turnover).
  • The dimension tables contain attributes with which the data in the fact table can be described. They hold a collection of reference information relating to the events saved in the fact table.

In a star schema, the fact table is simply connected with all dimension tables via foreign key relationships. Connections between individual dimension tables are not established. The following illustration shows a simplified portrayal of such a data structure:

In the star schema above, the example measure Turnover is displayed in relation to a certain sales channel, a specific product, a seller, the region, and the corresponding time period. A possible query across three dimensions could be: how much turnover was achieved with a particular product via the Online Shop sales channel during 2016?
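A three-dimension query of this kind can be sketched against a miniature star schema using Python's built-in sqlite3 module. All table names and figures below are illustrative assumptions, not taken from the schema in the illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Three dimension tables plus one fact table referencing them.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_channel (channel_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_turnover (
        product_id INTEGER REFERENCES dim_product,
        channel_id INTEGER REFERENCES dim_channel,
        time_id    INTEGER REFERENCES dim_time,
        turnover   REAL
    );
    INSERT INTO dim_product VALUES (1, 'Liability insurance');
    INSERT INTO dim_channel VALUES (1, 'Online shop'), (2, 'Branch');
    INSERT INTO dim_time    VALUES (1, 2016);
    INSERT INTO fact_turnover VALUES (1, 1, 1, 500.0), (1, 2, 1, 300.0);
""")

# How much turnover did one product achieve via the online shop in 2016?
result = con.execute("""
    SELECT SUM(f.turnover)
    FROM fact_turnover f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_channel c ON f.channel_id = c.channel_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    WHERE p.name = 'Liability insurance'
      AND c.name = 'Online shop'
      AND t.year = 2016
""").fetchone()[0]
print(result)  # 500.0
```

Note that the fact table joins directly to every dimension table, while the dimension tables have no relationships with one another.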

A continuation of the star schema is the snowflake schema. While the dimension tables of a star schema exist in denormalised form, the reference information of a snowflake schema is saved in normalised form (third normal form). The data is classified hierarchically, meaning that redundant information is outsourced into new tables. This leads to a distinctive branching that takes on the appearance of a snowflake.

Snowflake schemas differ from star schemas in that they take up less storage space, a result of the normalised data retention. Normalisation here refers to the outsourcing of columns into new tables with the aim of avoiding duplicate entries. Eliminating redundant information also reduces the data-management outlay: each piece of information ideally appears only once and therefore only needs to be maintained once in the schema.

If the files are outsourced into normalised tables, this inevitably leads to more complex file structures, which are usually accompanied by longer access times. If analysts want to access files in a snowflake schema, the multi-level dimension tables first need to be linked by means of a so-called join.

Fact

A join is a database operation that merges database tables linked via a foreign key.

In practice, the data structure of a DWH is based mostly on the snowflake schema, while in­di­vidu­al data marts are im­ple­men­ted as star schemas.

With both the star and snowflake schemata we are dealing with dimension tables, each of which is seen as one dimension of a multi-dimensional OLAP cube. This makes it possible to analyse business performance indicators such as turnover multi-dimensionally, with the aid of as much reference information stored in the DWH as desired, and to examine them in whatever level of detail one requires.

The following illustration shows a schematic portrayal of a three-dimensional OLAP cube whose edges span the dimensions Product line, Sales channel, and Time period. The length of each cube edge is determined by the number of cells. Each cell of the cube contains exactly one value, e.g., the turnover for the product line Insurance liability in the year 2016 via the distribution channel Branch trade (highlighted in bright blue in the graphic).

The OLAP process is also not limited to three dimensions. A data cube like this is structured n-dimensionally and can, in theory, be made up of any number of dimensions.

Fact

Depending on which storage technology underlies the core data warehouse, different OLAP variants are distinguished. If the cube refers to data from a relational database, it is described as ROLAP (relational OLAP). Cubes based on multi-dimensional databases are known as MOLAP (multi-dimensional OLAP).

Data provision level

The data provision level functions as an interface to end applications and presentation tools. Diverse end application tools provide methods for data evaluation and analysis, making it possible to extract information from the data warehouse's database and process it in various presentation forms for end users. The spectrum ranges from reporting and query tools, collaboration tools, data mining tools, and tools for online analytical processing (OLAP) to executive information systems (EIS) and forecasting and simulation tools.

Reporting and query tools

Reporting tools provide end users with various functions for setting up predefined reports. This can happen automatically at regular time intervals or on request. To facilitate requests to the DWH, these can also be predefined with the help of query tools.

Col­lab­or­a­tion tools

Collaboration tools support the communication and cooperation of end users in data analysis. The functions offered by these tools range, for example, from saving annotations to exchanging analysis results.

Data mining tools

The collective term data mining covers all undirected, partly automated analysis methods that aim to identify relevant patterns, trends, and relationships in the database. Data mining tools rely on statistical and mathematical methods as well as on techniques from artificial intelligence (AI) and machine learning. The volume of data that companies create, process, and bring together for analysis in DWHs is growing exponentially; globally, the average data volume is doubling every two years. With this in mind, data mining methods in the context of data warehousing are growing in significance.

Tools for online ana­lyt­ic­al pro­cessing (OLAP)

Of the available data evaluation and analysis tools, OLAP applications have established themselves as the standard user interface in the area of data warehousing. Tools used in online analytical processing provide end users with various functions with which queries to the DWH can be formulated ad hoc. They serve to assist navigation through the multi-dimensional database. The OLAP display makes it possible to model the prepared data along predefined dimensions. Analysts have a choice of various basic operations with which an OLAP cube can be edited and processed:

  • Slicing: Slicing refers to a process in which one dimension of the OLAP cube is narrowed down to a certain subset: a slice is effectively cut from the data cube and viewed separately. The following illustration shows the dimension Time period with the section 2015 highlighted. The highlighted section thus shows the turnover for all insurance products sold through all distribution channels during the year 2015.
  • Dicing: This refers to cutting a smaller cube out of an OLAP cube via simultaneous slicing operations in multiple dimensions. The dicing process leads to a smaller cube that represents a subset of the entire cube. The following illustration depicts a dicing operation in which the entire cube is reduced to subsets across all three dimensions.
  • Pivoting: Rotating the data cube so that at least one other dimension becomes visible is known as pivoting.
  • Drill down/Roll-up: If the aggregations of an information object need to be broken down into detailed values, the drill-down operation is used. It makes it possible for analysts to zoom into an OLAP cube and increase the granularity of the data. As the counter operation to the drill down, the roll-up compresses data to a higher hierarchical level. Drill down and roll-up are used to navigate multi-dimensional hierarchical structures. The following illustration shows a drill down of the information object Turnover in the dimension Product lines. The granularity is increased so that the turnover figures stored in the DWH can be interpreted and analysed in relation to individual products.
  • Drill out/Split: The drill-out operator (also known as split) makes it possible for analysts to add further dimensions to an OLAP cube, resulting in more detailed data. Unlike with the drill down, the level of detail is not increased via granularity but through the information gained from the reference information of the added dimension.
  • Drill in/Merge: As the counter operation to the drill out, the drill in reduces the level of detail of the OLAP cube by removing a dimension. Unlike with a roll-up, the loss of information comes not from a change of viewing level but from the loss of dimensional information; the granularity remains the same.
  • Drill across: The drill-across operation serves to analyse the database across cubes. While the operations discussed so far relate to a single OLAP cube, drill across is deployed across several correlated data cubes in order to make global analyses possible. This means that several fact tables sharing at least one common dimension can be analysed at the same hierarchical level and granularity while maintaining the viewing level.
  • Drill through: A drill through refers to an operation by which an analyst chooses an individual cell of a data cube and views it at the highest level of detail. Unlike the drill down, the drill through accesses the source files of the selected cube cell, meaning that the result of the operation is derived from the source records underlying the selected cell.
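Two of the operations above, slicing and roll-up, can be sketched in pure Python by modelling a cube as a mapping from dimension tuples to measures. The cells and figures below are illustrative assumptions:

```python
from collections import defaultdict

# Each cell: (product, channel, year) -> turnover (illustrative values).
cube = {
    ("Liability", "Online shop", 2015): 400.0,
    ("Liability", "Online shop", 2016): 500.0,
    ("Liability", "Branch",      2016): 300.0,
    ("Household", "Branch",      2015): 250.0,
}

def slice_cube(cube, dim_index, value):
    """Slicing: restrict one dimension to a single value."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def roll_up(cube, dim_index):
    """Roll-up: aggregate one dimension away, compressing the data
    to a higher hierarchical level."""
    agg = defaultdict(float)
    for key, value in cube.items():
        agg[key[:dim_index] + key[dim_index + 1:]] += value
    return dict(agg)

year_2016 = slice_cube(cube, 2, 2016)      # the "2016" slice
by_product = roll_up(roll_up(cube, 2), 1)  # total turnover per product
print(sum(year_2016.values()))  # 800.0
print(by_product)
```

Dicing corresponds to applying several slices in succession, and drill in/drill out correspond to removing or adding a position in the key tuples.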

Executive In­form­a­tion Systems (EIS)

Similarly to OLAP tools, EIS tools offer end users various possibilities for formulating ad-hoc queries and modeling files. In contrast to OLAP, the term EIS nowadays relates primarily to complete application systems that create reports for certain operating areas like sales, marketing, and financial planning.

Fore­cast­ing and sim­u­la­tion tools

Forecasting and simulation tools offer end users the possibility of projecting figures saved in the DWH into the future in order to develop forecasting models.

Data warehouse man­age­ment

Special tools that can be summarized under the term data warehouse management are active across all levels of the DWH. These components are tasked with building, maintaining, and operating all administrative functions necessary for data warehousing. The main duties of the DWH manager are the scheduling of DWH processes, metadata management, security management, and system management.

  • Scheduling: Scheduling involves the management of DWH processes. Administrative functions in the context of scheduling can be categorized according to the levels of the data warehouse architecture:
  • Data col­lec­tion/Data entry: In terms of data col­lec­tion, the DWH manager is re­spons­ible for the design and ad­just­ment of the ETL processes. Fur­ther­more, ad­min­is­trat­ive functions are also made available with a view to mon­it­or­ing the updating process and quality man­age­ment.
  • Data retention: When it comes to data retention, the DWH manager su­per­vises the storage usage, con­structs ag­greg­a­tion tables, and carries out archiving and backup op­er­a­tions.
  • Data provision: Ad­min­is­trat­ive functions for the provision of data relate to user ad­min­is­tra­tion as well as to the mon­it­or­ing of query running times.
  • Metadata management: A central component of the DWH manager is the metadata repository. It contains all information necessary for the construction and operation of the DWH's database. Files saved in the repository may cover the definition of the underlying database schema, information on storage structures and access paths, metadata describing the source data, update times, rules for data cleansing and transformation, and indices and partition tables. Furthermore, the DWH manager looks after the exchange of metadata between individual components of the DWH and provides a homogeneous metadata base.
  • Security man­age­ment: Security man­age­ment involves diverse services in the areas of user au­then­tic­a­tion, au­thor­iz­a­tion, and en­cryp­tion.
  • System man­age­ment: When it comes to system man­age­ment, the DWH manager provides various ad­min­is­trat­ive functions for the operation of the DWH. For example, these include mon­it­or­ing (per­form­ance, capacity, etc.), data archiving, and data backup.

Data ware­hous­ing from a data pro­tec­tion point of view

The large-scale aggregation of operational, business, and customer data in a data warehouse, as well as the analysis of this mass of data with the help of data mining methods and/or OLAP applications, offers companies the chance to sustainably optimise business procedures and processes. Alongside the advantages for decision making, data protection advocates also emphasise the risks of big data analyses of this kind, specifically for the basic right to informational self-determination and the protection of one's private sphere.

According to critics, what’s par­tic­u­larly prob­lem­at­ic about such analyses is that they make the creation of personal profiles and auto­mat­ised pre­dic­tions of be­ha­vi­our­al tend­en­cies possible. At the center of the debate is the potential for ma­nip­u­la­tion of the in­form­a­tion that is gained from the data analyses. 

Under the Data Protection Act, businesses and organisations are also legally required to inform the Information Commissioner's Office of how they use personal information and to respond to any data protection inquiries made by individuals. The act also outlines other important legal responsibilities, known as 'data protection principles'. Some of these include making sure that information is:

  • used fairly and lawfully 
  • used for limited, spe­cific­ally stated purposes
  • kept for no longer than is ab­so­lutely necessary
  • kept safe and secure
  • not trans­ferred outside the European Economic Area (EEA) without adequate pro­tec­tion (this refers to the rights and freedoms of data subjects in relation to the pro­cessing of personal data)

These are otherwise known as 'common sense' principles. Overall, the Data Protection Act has the purpose of balancing individual rights against the legitimate needs of a business to process personal data.

Data warehouse software

It's been quite some time since data warehousing was considered a venture discussed solely in the boardrooms of bigger corporations. These days even small and medium-sized enterprises (SMEs) see the potential of optimising their business processes through big data analyses. Alongside upmarket BI (business intelligence) suites and integrated DWH complete solutions, there have also been reasonably priced entry-level products, flexible cloud services, and sophisticated open source applications on the market for years, many of which are aimed specifically at mid-tier businesses.

Fee-based data ware­hous­ing products

Well-es­tab­lished com­mer­cial BI software normally dis­tin­guishes itself through its high degree of re­li­ab­il­ity, the range of services provided in the context of a service level agreement (SLA), as well as the pro­fes­sion­al support on hand. However, users need to budget for their ac­quis­i­tion, or at least for the cloud-based use of these services. 

The following list provides an overview of the leading providers of fee-based data ware­hous­ing products in al­pha­bet­ic­al order.

Proprietary software providers and their data warehousing products:

  • Amazon Web Services: Amazon Redshift
  • Cloudera: Cloudera Enterprise
  • Hewlett Packard Enterprise: HP Vertica, HP ArcSight Data Platform, HP Haven OnDemand, HP IDOL, HP Key View
  • IBM: IBM Netezza, IBM PureData System, IBM InfoSphere DataStage
  • Microsoft: SQL Server, Microsoft Analytics Platform System, Azure HDInsight for Hadoop
  • Oracle: Oracle Business Intelligence, Oracle Database, Oracle Exadata Database Machine, Oracle NoSQL Database, Oracle TimesTen In-Memory Database, Oracle Big Data Appliance
  • Pivotal Software: Pivotal Greenplum, Pivotal Big Data Suite, Pivotal HDB (powered by Apache HAWQ), Pivotal HDP (OEM Hortonworks Data Platform)
  • SAP: SAP NetWeaver Business Intelligence, SAP IQ, SAP HANA Enterprise Cloud
  • SAS: SAS Data Management, SAS Access Interface to Hadoop, SAS Federation Server, SAS Data Loader for Hadoop, SAS Event Stream Processing
  • Snowflake Computing: Snowflake
  • Teradata: Teradata Active Enterprise Data Warehouse, Teradata Data Warehouse Appliance, Teradata Appliance for Hadoop, Teradata Integrated Big Data Platform, Teradata Aster Big Analytics Appliance

Open source solutions

Alongside fee-based products, the market also offers a range of open source solutions for business intelligence software that provide data warehousing functions free of charge. Below is an overview of open source tools from the area of business intelligence, showing not only the most common open source BI programs but also the scope of applications offered by each of them. More information on open source business intelligence tools can be found on the website OpenSource.com.

The most common open source BI programs, grouped by application area:

  • ETL (extracting, transforming, and loading raw data): Pentaho DI, Talend OS, Jasper ETL
  • OLAP: Pentaho Mondrian, Jedox
  • Data mining: KNIME, RapidMiner, Weka
  • Reporting and dashboards: BIRT, SQL Power Wabit
  • Integrated BI suites covering several of these areas: JasperSoft, Pentaho, SpagoBI

In ac­cord­ance with the ap­plic­a­tion areas, the listed open source programs can be assigned to the fields of ETL, OLAP, data mining and reporting. Ad­di­tion­ally, there are also in­teg­rated BI solutions that cover all listed ap­plic­a­tion areas.

ETL software

For data collection and integration in the context of an ETL process, the open source programs Pentaho DI, Talend OS, and Jasper ETL are available.

  • Pentaho DI: Also known under the name 'Kettle', the ETL tool Pentaho Data Integration (DI) is part of the larger Pentaho BI suite but can also be used independently of the other Pentaho components as a standalone application in a data warehouse architecture. The data collection and integration tool offers a graphical user interface with which even users without prior programming knowledge can manage and administer ETL processes. Pentaho DI also provides a comprehensive range of processing modules that allow individual steps of the ETL process to be defined. The tool supports all common database systems; on top of this, flat files like CSV, Excel, or text data can be used as data sources. Furthermore, it provides interface functions for proprietary BI suites from SAS or SAP as well as for analysis software like Google Analytics.
  • Talend OS: Comparable with Pentaho DI is the open source ETL tool from the software provider Talend. Talend Open Studio (OS) makes it possible for users to define data collection and integration processes with the help of parameterised modules (so-called 'jobs'). The program offers interfaces for all common data sources and various data transformation functions. A map editor allows users to transfer heterogeneous raw data into a predefined target structure. As with Pentaho DI, Talend OS users can work with graphical user interfaces without prior programming knowledge.
  • Jasper ETL: Jasper ETL is the result of a cooperation between the software producers Jaspersoft and Talend. This ETL tool is largely based on Talend OS, the market leader among data integration tools in the open source sector. Deployment of this tool is especially recommended in the context of a DWH architecture if other BI products from Jaspersoft are also being used.

OLAP ap­plic­a­tions

Pentaho Mondrian and Jedox are two es­tab­lished OLAP tools under an open source licence.

  • Pentaho Mondrian: Mondrian is a Java-based OLAP server. Originally developed as a separate open source project, Mondrian has been part of the Pentaho BI suite since 2006, but users still have the option of using it as a standalone application. Mondrian is also used by other open source BI providers, like Jaspersoft; users benefit from a bundling of open source resources that makes collective projects possible, like the Mondrian Schema Workbench or the OLAP4J interface. The Mondrian project follows a relational online analytical processing (ROLAP) approach: the data basis is formed by relational databases whose tables are organized in the form of star or snowflake schemata. Access takes the form of multi-dimensional queries (MDX) via XML for Analysis (XMLA) or via the Java interface OLAP4J. With the Mondrian Schema Workbench, users also have a graphical user interface with which Mondrian schemata can be easily and conveniently developed and tested on the desktop.
  • Jedox: With its BI suite of the same name, the software developer Jedox offers a complete solution for business intelligence and performance management applications. A central component of this software is a high-performance, in-memory OLAP server, which can be integrated by way of interfaces for Java, PHP, C/C++, or .NET. Of particular use to users in smaller companies are the Excel add-ins that connect the OLAP server to Microsoft’s well-known spreadsheet software. Office applications are very common among small and medium-sized businesses, often acting as the basis for data storage, so the Excel integration reduces the time and effort spent on employee induction and training.
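In a ROLAP system like Mondrian, a multidimensional MDX query is ultimately rewritten into SQL against a star schema: a central fact table joined to its dimension tables and aggregated. The following sketch, using Python's built-in sqlite3 module and an invented sales schema, illustrates that relational level; it is not Mondrian's actual generated SQL.

```python
import sqlite3

# Illustrative star schema: one fact table referencing two dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, revenue REAL);

    INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_time    VALUES (10, 2022), (11, 2023);
    INSERT INTO fact_sales  VALUES (1, 10, 500.0), (1, 11, 700.0), (2, 11, 300.0);
""")

# Roughly what a revenue-by-category-and-year MDX query boils down to
# on the relational level: join the star, then group and aggregate.
rows = con.execute("""
    SELECT p.category, t.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
print(rows)
```

The OLAP server's job is to perform this translation automatically, cache results, and expose the star schema as navigable dimensions and measures.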

Data mining

In the area of data mining, too, several products are available under an open source licence. Two of these are RapidMiner and Weka.

  • RapidMiner: The analysis platform RapidMiner from the software company of the same name provides users with an integrated environment for machine learning, sentiment and time series analysis, and forecast models, and also caters for data, text, and web mining. All of this takes place in an open core model. Support covers all steps of the data mining process, including data processing, visualisation, validation, and optimisation. For users for whom the free Community version, with just one logical processor and an analysis scope of 10,000 data sets, is not sufficient, there is also a fee-based Enterprise licence. The program is written in Java and provides a user interface with which an analysis workflow can be defined and carried out in just a few mouse clicks.
  • Weka: Weka (Waikato Environment for Knowledge Analysis) is an open source project from the University of Waikato in New Zealand. The analysis tool offers users various algorithms in the area of machine learning. Alongside classic data mining processes like classification, association, regression, and cluster analysis, Weka also features various components for data preprocessing and visualisation. The program, which is written in Java, offers a graphical user interface, and all software features can also be run from the command line. If required, it is also possible to integrate Weka into other software solutions via a Java interface.
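To make the cluster analysis mentioned above concrete, here is a minimal k-means implementation in plain Python. Tools like Weka and RapidMiner ship optimised versions of such algorithms for much larger data sets; the points and centroid seeds below are invented for the example.

```python
# Minimal k-means: alternate between assigning points to their nearest
# centroid and moving each centroid to its cluster's mean.
def kmeans(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[dists.index(min(dists))].append((x, y))
        # Update step: move each centroid to its cluster's mean.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters if c
        ]
    return centroids, clusters

# Two obvious groups of points, and deliberately crude starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Real data mining suites add the parts this sketch omits: sensible centroid initialisation, convergence checks, distance metrics, and evaluation of the resulting clusters.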

Reporting systems

Two recommended open source tools in the area of reporting systems are BIRT and SQL Power Wabit. Alongside classic monthly, quarterly, and annual reports, both also offer ad hoc functions, allowing you to compile relevant information in real time.

  • BIRT: BIRT (Business Intelligence and Reporting Tools) is an open source project from the non-profit Eclipse Foundation that provides BI reporting functions for rich clients and web applications. The software is suitable for Java-based applications and covers the broad sectors of data visualisation and reporting. Designs for BIRT reports are created in a graphical interface based on the open source programming tool Eclipse and then saved as XML files.
  • SQL Power Wabit: With the reporting tool SQL Power Wabit, users can compile reports based on classic database queries. OLAP cubes are only supported if a description of the data structure is present. The tool supports standard reports, ad hoc queries, user-defined overview pages, and drill-down operations in the context of online analytical processing. Functions like drag & drop controls, real-time updating of result reports, a global search function, and a WYSIWYG editor for drafting reports make SQL Power Wabit accessible even to users without SQL skills. This allows comprehensive reports to be compiled conveniently in just a few mouse clicks and, if required, personalised in terms of fonts, text colour, and layout.
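At its core, a standard report is a query result grouped, totalled, and rendered in a fixed layout. The following sketch shows that idea with invented revenue records; a reporting tool such as BIRT or SQL Power Wabit would pull the rows from the warehouse and render to PDF or HTML instead of plain text.

```python
from collections import defaultdict

# Hypothetical revenue records: (quarter, category, revenue).
records = [
    ("Q1", "Hardware", 1200.0),
    ("Q1", "Software", 800.0),
    ("Q2", "Hardware", 950.0),
]

def quarterly_report(rows):
    """Compile a simple plain-text quarterly report with group totals,
    similar in spirit to a standard report definition."""
    totals = defaultdict(float)
    for quarter, _category, revenue in rows:
        totals[quarter] += revenue
    lines = ["Quarterly revenue report", "-" * 24]
    for quarter in sorted(totals):
        lines.append(f"{quarter}: {totals[quarter]:10.2f} EUR")
    lines.append(f"Total: {sum(totals.values()):8.2f} EUR")
    return "\n".join(lines)

print(quarterly_report(records))
```

The ad hoc functions mentioned above amount to letting end users change the grouping, filters, and layout of such a report interactively rather than editing a fixed definition.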

Integrated BI solutions

Apart from the fee-based BI suites from established providers like SAP, Oracle, IBM, SAS, HP, and Microsoft, there are also software projects on the open source market that offer users data warehousing solutions as an integrated program package. Pentaho CE, Jaspersoft, and SpagoBI are recommended by this digital guide.

  • Pentaho Community Edition (CE): Alongside a selection of in-house developments, the Pentaho BI package contains a number of pre-existing open source projects that were acquired over time and integrated into the product portfolio. The main focus of the project is on data integration and the automation of reports. The following programs are featured in the package:
    • Pentaho Business Analytics Platform: The BA Platform is a web application that makes it possible for users to merge all information through a central platform.
    • Pentaho Data Integration: Pentaho DI refers to the ETL tool described above.
    • Pentaho Report Designer (PRD): PRD is the evolved version of the JFreeReport project. The open source reporting solution supports a range of output formats like PDF, Excel, HTML, text, Rich Text Format, XML, and CSV.
    • Pentaho Marketplace: The Marketplace allows users to extend the Pentaho platform with plug-ins in just a few clicks.
    • Pentaho Aggregation Designer (PAD): Through PAD, users can set up and adjust database content. Central to this tool is the OLAP server Mondrian.
    • Pentaho Schema Workbench (PSW): This is a graphical design interface that allows users to create and test schemata for Mondrian OLAP cubes.
    • Pentaho Metadata Editor (PME): PME assists with the detailed description of underlying data structures with the help of an XML file.

Pentaho Enterprise Edition (EE) is a fee-based version of the BI suite with a range of additional features and professional support.

  • Jaspersoft: Jaspersoft also offers various DWH applications as part of an integrated BI solution. The collection of programs includes:
    • JasperReports Server: This is a report server offering OLAP functions via an adapted Mondrian server.
    • JasperReports Library: Jaspersoft provides a library for generating reports.
    • Jaspersoft Studio: This is the BI suite’s editor for writing reports.
    • Jaspersoft ETL: The Talend OS-based ETL tool has already been described above.
    • Mobile BI: Mobile BI is the native app for iPhone and Android devices, allowing reports and dashboards to be accessed from mobile devices.

Jaspersoft also offers an extended range of functions through its fee-based, commercial version.

  • SpagoBI: Unlike Pentaho and Jaspersoft, which market their products under a dual licence, the IT initiative SpagoWorld offers only an open source solution. However, business users do have the option of paying for professional setup and customisation of the software. The program is made up of the following components:
    • SpagoBI Server: At the core of this open source resource is the SpagoBI server, which provides all of the various analysis tools and functions.
    • SpagoBI Studio: The program includes an integrated development environment.
    • SpagoBI Meta: SpagoBI Meta offers users an environment for metadata management.
    • SpagoBI SDK: Through the SpagoBI SDK, the SpagoBI suite has an integration layer that makes it possible to incorporate various external tools, e.g., Talend OS (ETL), Jedox or Mondrian (OLAP), Weka or R (data mining), and BIRT or the JasperReports Library (reporting systems).

Data retention

Even when it comes to data retention, users have a choice of open source alternatives to proprietary database management systems like Microsoft SQL Server, IBM DB2, or the solutions offered by Oracle and Teradata. Among the prominent data stores are the relational database systems MySQL and MariaDB as well as the object-relational database PostgreSQL. Last but not least, there is also Pivotal’s Greenplum Database, an optimised further development offered specifically for data warehouse architectures under an open source licence.

Summary: Data warehousing for SMEs

Data warehousing has now well and truly become part of small and medium-sized enterprises. Alongside the more expensive business solutions, the market for BI and data warehouse solutions offers a wide range of very useful open source projects. This means that the financial hurdle is no longer as big an issue for smaller and medium-sized businesses when it comes to big data analysis.

Businesses can achieve added value at comparatively little expense simply by combining data they already hold. If gaps in the database emerge in the course of the analysis, the next step should be to reorganise data collection with the help of the previously mentioned ETL and OLAP tools. The final step in integrating a data warehouse architecture into your IT infrastructure is carried out by data mining tools. With the help of ongoing analyses (e.g., shopping cart analyses), new trends and cross-connections can then show up, providing important information for strategic decision making.

SMEs that are contemplating the construction of a data warehouse should make sure that they remain compliant concerning data protection during the implementation of their BI strategy.
