Data reduction methods lessen the amount of data that is physically stored, saving storage space and costs.

What does data reduction mean?

The term data reduction covers various methods used to optimise capacity. Such methods aim to reduce the amount of data being stored. With data volumes increasing worldwide, data reduction is necessary to ensure resource- and cost-efficiency when storing data.

Data reduction can be carried out through data compression and deduplication. While lossless compression uses redundancies within a file to compress data, deduplication algorithms match data across files to avoid repetition.

What is deduplication?

Deduplication is a process of data reduction that is essentially based on preventing data redundancies in the storage system. It can be implemented either at the storage target or at the data source. A deduplication engine uses special algorithms to identify and eliminate redundant files or data blocks. The main area of application for deduplication is data backup.

The aim of data reduction using deduplication is to write only as much information to non-volatile storage media as is necessary to reconstruct a file without losses. The more duplicates are deleted, the smaller the data volume that needs to be stored or transferred.

Duplicates can be identified at file level, as in Git or Dropbox, for example. A more efficient method, however, is to use deduplication algorithms that work at sub-file level. To do this, files are first broken down into data blocks (chunks) and assigned unique checksums, or hash values. The tracking database, which contains every checksum, acts as a central supervisory entity.
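
The chunk-and-checksum principle can be sketched in a few lines of Python. This is a minimal illustration, not a real deduplication engine: the 4 KB chunk size matches the typical value mentioned below, the in-memory `dict` stands in for the tracking database, and the function names are invented for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed block size (the typical 4 KB)

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, store each unique chunk once,
    and return the list of checksums needed to reconstruct the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # the chunk's unique checksum
        if digest not in store:       # new chunk: write it to the store
            store[digest] = chunk
        recipe.append(digest)         # a duplicate chunk adds only a reference
    return recipe

def reconstruct(recipe: list, store: dict) -> bytes:
    """Losslessly rebuild the original data from its chunk references."""
    return b"".join(store[d] for d in recipe)

# Two files sharing most of their content: only unique chunks are stored.
store = {}
file_a = b"A" * 8192 + b"B" * 4096
file_b = b"A" * 8192 + b"C" * 4096   # first 8 KB identical to file_a
recipe_a = deduplicate(file_a, store)
recipe_b = deduplicate(file_b, store)
assert reconstruct(recipe_a, store) == file_a
assert reconstruct(recipe_b, store) == file_b
print(len(store))  # 3 unique chunks stored, although 6 were written
```

Both files reconstruct losslessly, yet the shared leading chunks are stored only once.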

Block-based deduplication methods can be broken down into two variations:

  • Fixed block length: Files are divided into sections of exactly the same length, based on the cluster size of the file or RAID system (typically 4 KB).
  • Variable block length: The algorithm divides the data into different blocks, the length of which varies depending on the type of data to be processed.

The way blocks are divided has a massive influence on the efficiency of the deduplication. This is especially noticeable when deduplicated files are subsequently modified. With fixed block sizes, a change to a file shifts the block boundaries, so the deduplication algorithm classifies all subsequent segments as new. This increases the computing effort and the use of bandwidth.
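
The boundary-shift effect is easy to demonstrate. The sketch below uses a deliberately tiny block size of 4 bytes (real systems use e.g. 4 KB) and inserts a single byte at the front of a file: every boundary shifts, and no checksum matches the original.

```python
import hashlib

BLOCK = 4  # toy block size for illustration only

def fixed_chunks(data: bytes) -> list:
    """Divide data into fixed-length blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def digests(chunks: list) -> list:
    return [hashlib.sha256(c).hexdigest() for c in chunks]

original = b"ABCDEFGHIJKL"
modified = b"XABCDEFGHIJKL"  # one byte inserted at the front

before = set(digests(fixed_chunks(original)))   # ABCD, EFGH, IJKL
after = set(digests(fixed_chunks(modified)))    # XABC, DEFG, HIJK, L

# All boundaries have shifted, so every block looks "new" to the engine.
print(len(before & after))  # 0 blocks in common despite near-identical content
```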

If, on the other hand, an algorithm uses variable block boundaries, a modification to an individual data block has no effect on the following segments. Instead, the modified data block is simply extended and stored with the new bytes. This relieves the burden on the network. However, this flexibility is more computing-intensive, as the algorithm must first work out how the chunks are split up.
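
The resilience of variable (content-defined) boundaries can be shown with a crude sketch: here a boundary is simply cut after every space byte, so boundaries follow the content rather than fixed offsets. Real systems derive boundaries from a rolling hash (e.g. Rabin fingerprints) instead, but the principle is the same: after an edit, the boundaries resynchronise and the unchanged chunks keep their checksums.

```python
import hashlib

def content_defined_chunks(data: bytes, delimiter: int = ord(" ")) -> list:
    """Crude content-defined chunking: cut after every delimiter byte.
    Illustrative only; real systems use a rolling hash to pick boundaries."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == delimiter:         # boundary depends on content, not offset
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = b"the quick brown fox " * 3
modified = b"XX " + original          # bytes inserted at the front

before = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(original)}
after = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(modified)}

# Unlike the fixed-block case, all original chunks are still recognised.
print(len(before & after))  # 4: every unique chunk of the original is shared
```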

What is data compression?

In data compression, files are converted into an alternative format that is more efficient than the original. The aim of this type of data reduction is to reduce both the required storage space and the transfer time. A coding gain like this can be achieved with two different approaches:

  • Redundancy compression: With lossless data compression, data can be decompressed precisely after compression. Input and output data are therefore identical. This kind of compression is only possible when a file contains redundant information.
  • Irrelevance compression: With lossy compression, irrelevant information is deleted to compress a file. This always entails a loss of data: the original can only be recovered approximately after irrelevance compression. Which data counts as irrelevant is a matter of judgement. In MP3 audio compression, for example, the frequency patterns removed are those assumed to be barely or not at all audible to humans.
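
Redundancy compression is easy to observe with Python's standard `zlib` module, which implements the lossless DEFLATE algorithm. Highly redundant input shrinks dramatically, and decompression reproduces the input exactly:

```python
import zlib

# Highly redundant input: the same 12-byte pattern repeated 1,000 times.
original = b"AAAABBBBCCCC" * 1000
compressed = zlib.compress(original)

assert len(compressed) < len(original)          # coding gain from redundancy
assert zlib.decompress(compressed) == original  # input and output identical
print(len(original), len(compressed))
```

Run on random (redundancy-free) data instead, the same call would yield little or no gain, which is exactly the limitation the bullet point above describes.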

While compression at the storage system level is essentially loss-free, data losses in other areas, such as image, video and audio transfers, are deliberately accepted to reduce file size.

Both the encoding and decoding of a file require computational effort, which primarily depends on the compression method used. While some techniques aim for the most compact representation of the original data, others focus on reducing the required computation time. The choice of compression method therefore always depends on the requirements of the project or task it is being used for.
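
This compactness-versus-speed trade-off is visible even within a single algorithm: `zlib` exposes a compression level, where level 1 favours speed and level 9 spends more computation searching for a compact encoding. A small sketch:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 5000

fast = zlib.compress(data, 1)   # level 1: least effort, quickest
best = zlib.compress(data, 9)   # level 9: most effort, smallest output

# Both are lossless; level 9 trades extra computation for a tighter result.
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
print(len(fast), len(best))
```

For a latency-sensitive transfer one might pick a low level; for cold archival storage, a high one.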

Which data reduction method is better?

To implement backup procedures or optimise storage in standard file systems, companies generally rely on deduplication. This is mainly because deduplication systems are extremely efficient when identical files need to be stored.

Data compression methods, on the other hand, generally incur higher computing costs and therefore require more complex platforms. Storage systems that combine both data reduction methods can be used most effectively: first, redundancies are removed from the files to be stored using deduplication, and then the remaining data is compressed.
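
The combined pipeline can be sketched by extending the chunk-store idea: deduplicate first, then compress only the chunks that are actually new. As before, the chunk size, helper names and in-memory store are illustrative, not a real product's API.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # illustrative fixed block size

def store_file(data: bytes, store: dict) -> list:
    """Deduplicate first, then losslessly compress only the new chunks."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)  # compression after dedup
        recipe.append(digest)
    return recipe

def load_file(recipe: list, store: dict) -> bytes:
    """Decompress and reassemble the referenced chunks."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

store = {}
backup = b"log entry 42\n" * 2000        # redundant data, typical of backups
recipe = store_file(backup, store)

assert load_file(recipe, store) == backup  # fully lossless round trip
stored_bytes = sum(len(v) for v in store.values())
print(stored_bytes < len(backup))  # True: dedup plus compression shrink the backup
```

Deduplication removes repeated chunks across files; compression then squeezes the redundancy remaining inside each unique chunk, mirroring the two-stage order described above.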
