In this age of information, organisations are constantly collecting massive amounts of data. But in most cases, collected data is stored without being analysed. Data that exists but is not used is referred to as dark data.

What is dark data?

Dark data is data that is not used by an or­gan­isa­tion to gain insight; in other words, it is hidden data. This may include data that is in­com­plete, has not been evaluated, exists in secret, or has not (yet) been recorded at all. Essential to our un­der­stand­ing of the term is that it is relative. Whether data is ‘dark’ or not depends on the re­la­tion­ship of the data to a par­tic­u­lar or­gan­isa­tion.

Dark data becomes particularly apparent in connection with the management of big data. Often, so much data is generated that it cannot be processed and analysed in a timely manner. In the words of British statistician David Hand:

Quote

‘In the era of big data, it is easy to imagine that we have all the in­form­a­tion we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg.’ - David Hand

To exemplify what dark data en­com­passes, let’s look at four scenarios:

  1. Data of unknown existence
  2. Data that is subject to un­cer­tain­ties
  3. Data that is stored unused
  4. Data not yet recorded at all

In all four scenarios, we further dif­fer­en­ti­ate two distinct cases:

  1. The or­gan­isa­tion is aware that data is missing, in­com­plete, or subject to un­cer­tainty.

This case is less prob­lem­at­ic. If there is an awareness that the available data may represent only the tip of an iceberg, the or­gan­isa­tion can take coun­ter­meas­ures. They may try to obtain more complete data or analyse available data regarding un­cer­tain­ties.

  2. The organisation is unaware that data is missing, or assumes that the available data is complete.

This case is more dangerous. If an organisation assumes that the available data gives it a complete picture of the situation, it operates contrary to reality. Conclusions drawn from incomplete data lead to suboptimal decisions.

In times of big data and data mining, or­gan­isa­tions strive to get everything they can out of data.

What is data?

Since the explosive spread of in­form­a­tion tech­no­logy, the term data has been widely used. Fre­quently mentioned by politi­cians, business rep­res­ent­at­ives and sci­ent­ists alike, the term remains nebulous. This is because data is non-physical in nature – it is an abstract concept.

Data is not syn­onym­ous with in­form­a­tion

First, let us note that data is a manifestation of information. In fact, data are the smallest units of which information is composed, much as atoms can be thought of as the building blocks of matter, or photons as the building blocks of energy.

Note

We use the term ‘in­form­a­tion’ as an abstract concept, like matter and energy.

A single datum is often meaningless on its own. Only the interpretation of several data points yields usable information. Think of data as individual letters. A single letter, for example ‘P’, has no meaning. Only when several letters are combined does a word result, e.g. ‘apple’. Here, moreover, the sequence is decisive.

In­form­a­tion is data that is sum­mar­ised in struc­tures and delimited from one another. The process of in­ter­pret­a­tion depends on the context. This means that a set of data can be in­ter­preted dif­fer­ently, possibly resulting in several different meanings. Think again of the word ‘apple’. Instead of combining the in­di­vidu­al letters into one word, we could count the letters. This would result in different in­form­a­tion based on the same data.
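The ‘apple’ example can be sketched in a few lines of Python. This is only an illustration; the variable names are our own and were chosen for this sketch. It shows the same five letters interpreted in two ways: once as a sequence, once as a set of counts.

```python
from collections import Counter

# The same raw data: five individual letters in a fixed order.
letters = ["a", "p", "p", "l", "e"]

# Interpretation 1: read the letters in sequence to form a word.
word = "".join(letters)

# Interpretation 2: ignore the sequence and count how often each letter occurs.
letter_counts = Counter(letters)

print(word)                 # apple
print(dict(letter_counts))  # {'a': 1, 'p': 2, 'l': 1, 'e': 1}
```

Both results are derived from identical data; only the context of the interpretation differs.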

Picture the totality of an organisation's data as a mountain. The challenge is to extract useful information from this mountain of data. In contrast to a physical mountain, where material that has been extracted is gone, useful information can, in principle, be extracted from a mountain of data several times. It all depends on the context and the perspective.

The hierarchy of in­form­a­tion

If in­form­a­tion is composed of data, like matter is composed of atoms, it is fair to assume that higher struc­tures exist. In fact, there is a hierarchy of in­form­a­tion: at the bottom is data, followed by in­form­a­tion, and then knowledge.

Knowledge is linked information. The individual pieces of information are weighted: some are primary, others secondary. Crucial for knowledge is the concept of reference, known today as a (hyper)link: information that links to another knowledge unit. Examples of knowledge are Wikipedia entries, recipes, and documented processes.

Building on knowledge, in­tel­li­gence follows. From learned knowledge and ac­cu­mu­lated ex­per­i­ence con­clu­sions may be drawn and patterns can be re­cog­nised. New knowledge is syn­thes­ised by creating and testing hy­po­theses. Crucial for in­tel­li­gence is ex­ecut­able in­form­a­tion, or in other words: code, which can take on the form of al­gorithms or heur­ist­ics. Whereas data, in­form­a­tion, and knowledge are inert, in­tel­li­gence requires an en­vir­on­ment in which it is executed. Cells, organisms, computers, and networks are all systems that exhibit in­tel­li­gence.

The highest level in the information hierarchy is wisdom. Wisdom is the totality of knowledge and intelligence. Wisdom allows a situation to be evaluated in different ways to find a balanced solution. The interesting questions are not so much ‘what’ (data, information) or ‘how’ (knowledge, intelligence), but ‘why’ and ‘what for’. A good example of wisdom is a library. It contains not only knowledge in the form of books and other media, but also intelligence in the form of staff and index systems.

How is dark data created?

Organisational processes, which are supported by modern methods of information processing, continuously produce data. Some proportion of this data will be dark data: either the information that the data exists is lost or was missing from the outset, or the knowledge of how the data can be analysed is not available.

Dark data comes in many forms. In the words of marketing expert Sky Cassidy:

Quote

‘So as for dark data, it’s all the information companies collect in their regular business processes, don’t use, have no plans to use, but will never throw out. It’s web logs, visitor tracking data, surveillance footage, email correspondences from past employees, and so much more.’ - Sky Cassidy

Dark data arises from forgotten or no longer ac­cess­ible data

Much dark data consists of data that is no longer accessible, either because it has been forgotten or because it can no longer be opened.

Employees con­tinu­ously store data on their private and company devices. It can happen quickly that such data is forgotten and becomes dark data. Data on USB sticks and portable hard drives, as well as internal data carriers of de­com­mis­sioned desktop and mobile devices, are as much part of this as data in email at­tach­ments and unused databases.

Near-endless scalability is one of the advantages of the cloud, but it is also a curse: with cloud storage, it is possible to keep accumulating data without hitting a fixed limit. This tempts employees to collect data without limitations. If the collection frenzy takes place outside of strictly regulated processes, the result is usually dark data.

Data security and protection must be ensured when storing data digitally. Encrypting data protects it from unauthorised access. But what happens when the login password is forgotten, or the key can no longer be found? In both cases, access to the data is blocked and information may be lost forever.

But there is another way of losing access to existing data: when it is no longer available in an accessible form. For example, if the data is stored in a proprietary file format, a special program may be necessary to read it. However, the relevant software may no longer run or may no longer be available in the required version. The data then remains trapped in a vendor lock-in.

Dark data arises due to in­com­plete or outdated data

Dark data is not just data that is no longer accessible. It also includes incomplete or outdated data. Statistician David Hand again:

Quote

‘Dark data are data you don't have. This might be because you want today's data, but all you have is yesterday's. It might be because your sample is distorted, perhaps certain types of cases are missing. It might be because the recorded values are inaccurate – after all, no measurement instrument is perfect.’ - David Hand

Remember that data is the lowest level of the in­form­a­tion hierarchy. Data in­ac­curacies and de­vi­ations manifest them­selves in the higher in­form­a­tion levels. This usually results in cascading effects: small de­vi­ations lead to large changes. Thus, in­com­plete data can have serious effects.

The situation is similar with obsolete data. Consider, for example, the geo­loca­tion of a user, which is stored as part of a data set. Since the geo­loca­tion changes as the user moves, the in­form­a­tion it contains may only be useful if the data is analysed in real time. For example, if you want to make a user a location-based offer, this must be done while the user is still on-site.
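A freshness check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the five-minute threshold and the names `MAX_AGE` and `is_fresh` are hypothetical and not taken from any particular system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: a location older than 5 minutes is treated as stale.
MAX_AGE = timedelta(minutes=5)

def is_fresh(recorded_at, now, max_age=MAX_AGE):
    """Return True if the record is recent enough to act on."""
    return now - recorded_at <= max_age

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 1, 11, 58, tzinfo=timezone.utc)  # 2 minutes old
old = datetime(2023, 12, 31, 12, 0, tzinfo=timezone.utc)    # 1 day old

print(is_fresh(recent, now))  # True
print(is_fresh(old, now))     # False
```

A location-based offer would only be sent for records that pass the check; anything older has, for this purpose, already become dark data.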

Dark data arises from un­ana­lysed data

A large class of dark data consists of data that has been collected and stored, but not analysed. A par­tic­u­larly high volume of dark data comes from sources that generate data auto­mat­ic­ally. This includes sensors, log files, and stat­ist­ics on page visits from websites. The data generated is often stored for long periods of time without being analysed.
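One rough first step towards locating such data is to list files that have not been modified for a long time. The sketch below is an assumption-laden illustration, not an established method: the function name `find_stale_files` and its parameters are hypothetical, and the modification time is only a proxy, since it says nothing about whether a file has ever been analysed.

```python
import os
import time

SECONDS_PER_DAY = 86400

def find_stale_files(root, max_age_days, now=None):
    """Yield (path, age_in_days) for files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff_seconds = max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_seconds = now - mtime
            if age_seconds > cutoff_seconds:
                yield path, age_seconds / SECONDS_PER_DAY
```

Calling `find_stale_files("logs", 365)` on a hypothetical `logs` directory would yield every file in it that has been untouched for more than a year, which gives a candidate list for review rather than a verdict.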

Some data is available in formats that require complex procedures for analysis. This includes texts contained in image files and spoken text in audio files. In general, digital images contain a wealth of information that can only be extracted automatically using modern artificial intelligence methods. Pattern recognition and classification are used to identify and assign objects depicted in image data. Since these are still relatively new approaches, the majority of image material stored worldwide likely contains dark data.

Dark data also arises when data is stored only to satisfy audit requirements, without any need to evaluate it. Statistician David Hand sums up the problem:

Quote

‘It might even be that the data are available, but un­ex­amined, gently decaying in a giant data warehouse, unlooked at because they were collected purely for com­pli­ance reasons.’ - David Hand

Dark data arises from data not yet recorded

There is one more scenario from which dark data arises. This is of a more the­or­et­ic­al nature, because it involves data that has not yet been collected. Of course, this data (which does not yet exist) is outside the view of the or­gan­isa­tion. Therefore, it also counts as dark data.

Stat­ist­i­cian David Hand draws an analogy to ‘dark matter’:

Quote

‘Just as much of the universe is composed of dark matter, invisible to us but nonetheless present, the universe of information is full of dark data that we overlook at our peril.’ - David Hand

Why dark data is a problem

There are various reasons why dark data is a problem for busi­nesses and other or­gan­isa­tions. Below we discuss cases where data actually exists. We exclude cases where data does not yet exist.

Storing dark data is in­ef­fi­cient

Storage of data requires resources. These include, in par­tic­u­lar, storage space and energy on the part of the storage operator. This causes costs for the or­gan­isa­tion that claims the data as its own. Effort is expended in order to store the data.

Efficiency is defined as the quotient of benefit and effort. If a high benefit is achieved with little effort, this is referred to as high efficiency. On the other hand, a low benefit with a high effort means that efficiency is low.

Efficiency = benefit / effort

Data is supposed to be useful. With dark data, utility is limited. Nev­er­the­less, a con­tinu­ous effort must be expended to store the data. Con­sequently, the storage of dark data is in­ef­fi­cient.
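The formula can be made concrete with a short worked example. The figures below are hypothetical and chosen only for illustration; benefit and effort are assumed to be expressed in the same unit, for example currency per year.

```python
def efficiency(benefit, effort):
    """Efficiency as the quotient of benefit and effort (same units for both)."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return benefit / effort

# Hypothetical figures: both data sets cost the same to store.
analysed = efficiency(benefit=5000, effort=1000)  # data that is actively used
dark = efficiency(benefit=0, effort=1000)         # stored, but never used

print(analysed)  # 5.0
print(dark)      # 0.0
```

The storage effort is identical in both cases; only the benefit differs, and with it the efficiency.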

Finding the in­form­a­tion needle in the dark data haystack

Let's imagine the entirety of an organisation's data as an iceberg. The majority of the data is dark data. Unfortunately, useful data does not sit neatly at the surface. Rather, it is mixed in with dark data and cannot be easily separated. To find useful data, you have to search the entire iceberg.

Because of the sheer mass of dark data, information that is useful remains hidden. Often, it is unclear whether data is of any value at all. Missing or incorrect data leads to incorrect information. Thus, dark data influences what conclusions are drawn from the information at hand. This limits how intelligently the organisation can behave.

No one knows what dark data contains

Dark data is opaque by defin­i­tion. You can never be sure whether it contains anything useful. It also cannot be ruled out that the data contains sensitive in­form­a­tion that must not fall into the wrong hands.

Data is usually stored for long periods of time. At the same time, dark data has little benefit for the organisation, so there is often a lack of motivation to secure it. Unused data is easily forgotten, which makes it more likely to be inadequately protected.

In principle, data can always include in­form­a­tion that is subject to special pro­tec­tion. In most cases, in­di­vidu­al data is harmless; on the other hand, sensitive in­form­a­tion can be extracted from data volumes. For example, movement profiles can be created from location data collected over long periods of time. The loss of dark data, therefore, poses a high risk.

One other risk as­so­ci­ated with dark data arises during disaster recovery because data may not be recovered after failure. Let's imagine a system that ran cleanly and in which seemingly all com­pon­ents were known and cloud backups were made. But what if one of the com­pon­ents consisted of dark data? When the system is restored, a critical part is missing. In the worst case, the failure of important systems is the con­sequence.

Dark data is hard to get rid of

A mountain of data is hard to keep track of. Dark data could contain useful or sensitive in­form­a­tion. If ap­plic­able, certain storage periods are pre­scribed for the retention of the data. This means that it is not possible to dispose of the data without further ado.

This situation can be compared to hazardous waste, which is hard or im­possible to separate. If a ton of waste contains one gram of highly toxic material, the entire ton is treated as hazardous waste. So data continues to be stored, and the mountain of data continues to grow. This also increases the costs incurred to store it.
