Deduplication is a data reduction technique that is essentially based on eliminating redundancies in the storage system. A deduplication engine uses specialised algorithms to identify and remove redundant files and data blocks. The engine can be integrated into backup software or into the storage hardware itself, or it can be deployed as an intermediary appliance.
The aim of deduplication in storage is to write only as much information to non-volatile storage media as is necessary to reconstruct each file without loss. The more duplicates are removed, the smaller the data volume that has to be stored or transferred. Duplicates can be identified at file level, as Git or Dropbox do, for example. A more efficient method, however, is to use deduplication algorithms that work at sub-file level: files are first broken down into data blocks (chunks), and each chunk is assigned a checksum, or hash value. The tracking database, which contains every checksum, acts as the central supervisory entity.
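A minimal sketch of this identification step in Python, assuming SHA-256 checksums and fixed 4 KB chunks; `tracking_db`, the function names, and the parameters are illustrative choices rather than part of any particular product:

```python
import hashlib

def file_checksum(data: bytes) -> str:
    """File-level identification: one checksum for the whole file."""
    return hashlib.sha256(data).hexdigest()

def chunk_checksums(data: bytes, block_size: int = 4096) -> list[str]:
    """Sub-file identification: one checksum per fixed-size chunk."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

tracking_db: set[str] = set()   # hypothetical tracking database of known checksums

def count_duplicates(data: bytes) -> tuple[int, int]:
    """Return (duplicate chunks, total chunks) for a newly seen file."""
    hashes = chunk_checksums(data)
    duplicates = sum(1 for h in hashes if h in tracking_db)
    tracking_db.update(hashes)
    return duplicates, len(hashes)
```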
Block-based deduplication methods come in two variants:
- Deduplication with a fixed block size: the algorithm divides files into sections of exactly the same length. This length is generally based on the cluster size of the file system or RAID system (typically 4 KB), but it can also be configured manually. In that case, the individually chosen block size is adopted as the standard size for all data blocks.
- Deduplication with a variable block size: here, no standard block size is defined. Instead, the algorithm divides the data into blocks whose lengths vary depending on the type of data being processed. Both approaches are sketched in the code example that follows this list.
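The two variants can be contrasted in a short sketch. The content-defined chunker below fingerprints a small sliding window with SHA-256 purely for illustration; production systems typically use a fast rolling hash (e.g. Rabin fingerprints), and the window, mask, and size limits are arbitrary assumptions:

```python
import hashlib
import os

def fixed_chunks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Fixed block size: cut every block_size bytes, regardless of content."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
                    min_size: int = 512, max_size: int = 8192) -> list[bytes]:
    """Content-defined chunking: cut wherever a fingerprint of the last
    `window` bytes matches a bit pattern, so boundaries depend on the data
    itself rather than on absolute positions. All parameters here are
    illustrative defaults, not values mandated by any standard."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start + 1 < min_size:          # enforce a minimum chunk size
            continue
        fingerprint = int.from_bytes(
            hashlib.sha256(data[i - window + 1:i + 1]).digest()[:4], "big")
        if (fingerprint & mask) == 0 or i - start + 1 >= max_size:
            chunks.append(data[start:i + 1])  # cut point found (or max size reached)
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing remainder
    return chunks

sample = os.urandom(64 * 1024)                # arbitrary sample data
print(len(fixed_chunks(sample)), "fixed chunks,",
      len(variable_chunks(sample)), "variable chunks")
```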
The way blocks are divided has a major influence on the efficiency of deduplication. This becomes especially noticeable when deduplicated files are subsequently modified.
If a data block with fixed block boundaries is extended with additional information, the content of all subsequent blocks shifts relative to those fixed boundaries. Although only one data block has actually changed, the deduplication algorithm classifies every subsequent segment of the file as new, unless the number of inserted bytes happens to be an exact multiple of the fixed block size. Since these segments are stored again as newly classified data blocks, a backup with fixed-block deduplication requires more computing effort and more bandwidth.
If, on the other hand, an algorithm uses variable block boundaries, the modification of an individual data block has no effect on the following segments. Instead, only the modified data block is extended with the new bytes and stored again. This relieves the burden on the network, as less data is transferred during a backup. However, this flexibility is more computing-intensive, as the algorithm must first determine where the chunk boundaries lie.
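A small experiment illustrates the boundary-shift effect described above; the sample data, block size, and insertion offset are arbitrary choices made for the demonstration:

```python
import hashlib
import os

def fixed_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Checksums of every fixed-size block of `data`."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

original = os.urandom(256 * 1024)                        # arbitrary sample file
modified = original[:10] + b"INSERTED" + original[10:]   # insert 8 bytes near the start

known = set(fixed_hashes(original))                      # tracking database after the first backup
changed = fixed_hashes(modified)
new_blocks = sum(1 for h in changed if h not in known)
print(f"{new_blocks} of {len(changed)} blocks must be stored again")

# Because every block after the insertion point shifts relative to the fixed
# boundaries, virtually all checksums change. A content-defined chunker
# (variable block boundaries) re-synchronises after the modified region, so
# only the chunk(s) around the insertion would have to be stored again.
```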
Identifying redundant chunks is based on the assumption that data blocks with identical hash values contain identical information. To filter out redundant chunks, the deduplication algorithm only needs to compare newly calculated hash values with the tracking database. If it finds identical checksums, the redundant chunk is replaced by a pointer to the storage location of the identical data block. Such a pointer takes up far less space than the data block itself. The more chunks in a file that can be replaced by placeholders, the less storage space is required. However, deduplication algorithms cannot predict how effective the data reduction will be, as this depends heavily on the source file and its data structure. In addition, deduplication is only suitable for unencrypted data: encryption systems intentionally avoid redundancies, which makes this kind of pattern recognition impossible.
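Putting the pieces together, a hash-based replacement scheme might be sketched as follows; the in-memory chunk store, the "recipe" of pointers, and the fixed 4 KB blocks are deliberate simplifications, not the on-disk layout of any real deduplication engine:

```python
import hashlib

BLOCK_SIZE = 4096                       # illustrative fixed block size
chunk_store: dict[str, bytes] = {}      # in-memory stand-in for the tracking database

def store_file(data: bytes) -> list[str]:
    """Store a file as a list of chunk hashes (pointers into chunk_store)."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        chunk = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:   # only previously unseen chunks consume space
            chunk_store[digest] = chunk
        recipe.append(digest)           # duplicates are reduced to a pointer
    return recipe

def restore_file(recipe: list[str]) -> bytes:
    """Losslessly reconstruct the file by following the pointers."""
    return b"".join(chunk_store[digest] for digest in recipe)

# Two files that share most of their content.
file_a = b"A" * 20_000 + b"B" * 20_000
file_b = b"A" * 20_000 + b"C" * 20_000
recipes = [store_file(file_a), store_file(file_b)]

logical = len(file_a) + len(file_b)
physical = sum(len(chunk) for chunk in chunk_store.values())
print(f"logical size: {logical} bytes, physically stored: {physical} bytes")
assert restore_file(recipes[0]) == file_a and restore_file(recipes[1]) == file_b
```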
Deduplication can be carried out either at the storage destination or at the data source.