Data Deduplication in Backup and Recovery

In recent years, big data has become a ubiquitous topic, and with it, the protection of data has drawn increasing attention.

Years ago, data was backed up to tape: a storage medium with high capacity and good sequential throughput, but poor random access. Administrators typically ran weekly full backups and daily incremental backups to balance backup speed, backup set size, and mean time to recovery.

It helps to understand how backup storage differs from transactional storage. The most significant difference is that backup storage is written frequently but rarely read, and it is almost always accessed sequentially rather than randomly. Given that, we want to reduce the size of the backup set so that more logical data can be protected without increasing physical storage. Compression is one way to reduce data size; in practice it shrinks data by roughly 50%. However, that is still far from the ideal space savings. This is why deduplication is introduced: in essence, store only one copy of identical data.
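As a rough illustration, the short Python sketch below measures how much a general-purpose compressor shrinks a sample byte stream. The sample payload is hypothetical and highly repetitive, so it compresses far better than typical mixed backup data, where a reduction around 50% is a more realistic rule of thumb.

```python
import zlib

# Hypothetical sample payload: a repetitive, log-like byte stream.
# Such data compresses far better than typical mixed backup data,
# where a ~50% reduction is a more realistic rule of thumb.
payload = b"2024-01-01 INFO service started\n" * 10_000

compressed = zlib.compress(payload)
ratio = len(compressed) / len(payload)
print(f"original={len(payload)} B, compressed={len(compressed)} B, ratio={ratio:.1%}")
```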

There are several ways to split logical data into parts: file-based, block-based, or segment-based. A segment here is a chunk of data whose size is not fixed.

In file-based deduplication, the backup system detects identical files in the storage system, for example the same file stored under two or more different logical directories. Obviously, once a file changes even slightly, deduplication no longer applies to it.
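As a minimal sketch of the idea (not any particular product's implementation), the following Python snippet groups files by a hash of their full contents; any two files sharing a digest are candidates for storing only one physical copy.

```python
import hashlib
from pathlib import Path

def find_duplicate_files(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by the SHA-256 of their full contents.

    File-based deduplication in miniature: two files count as identical
    only if every byte matches, so a one-byte change defeats it.
    """
    by_digest: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest.setdefault(digest, []).append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```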

In block-based deduplication, the backup system detects identical fixed-size blocks. Because this method is fine-grained, duplicate detection is applied to every block of every file, so it generally saves more space than the file-based approach. However, when data is inserted into a block, all subsequent blocks shift, so deduplication no longer applies to them, as the sketch below illustrates.
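The sketch below demonstrates the boundary-shift problem with a hypothetical fixed block size: after a one-byte insertion at the start of a file, none of the positional block fingerprints match their previous values.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size

def block_digests(data: bytes) -> list[str]:
    """Split data into fixed-size blocks and fingerprint each one."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]

# Eight distinct blocks of sample data.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
# Inserting a single byte at the front shifts every block boundary...
modified = b"X" + original
# ...so none of the positional block fingerprints match any longer.
matching = sum(a == b for a, b in zip(block_digests(original), block_digests(modified)))
print(f"blocks still matching after a 1-byte insertion: {matching} of 8")
```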

Segment-based deduplication is introduced to solve this problem. Each file is split into segments by an algorithm that tries to detect logical boundaries in the file while keeping segment sizes within a reasonable range. For example, in a plain-text file the algorithm might split on paragraphs, that is, detect newline characters. Different patterns are applied to different data formats to achieve a better logical separation. Because the file is separated along logical boundaries, a small edit changes only the affected segments, so we are far more likely to benefit from deduplication.
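As a toy illustration of this idea, the snippet below splits plain text at blank lines and fingerprints each segment; editing one paragraph leaves the other segments' digests unchanged. Real systems more commonly use rolling-hash (content-defined) chunking, so treat this as a simplified sketch.

```python
import hashlib

def segment_by_paragraph(text: bytes) -> list[bytes]:
    """Toy content-aware splitter: cut a plain-text file at blank lines."""
    return [p for p in text.split(b"\n\n") if p]

def segment_digests(text: bytes) -> set[str]:
    return {hashlib.sha256(seg).hexdigest() for seg in segment_by_paragraph(text)}

original = b"para one\n\npara two\n\npara three\n"
modified = b"para one\n\npara two, edited\n\npara three\n"

# Only the edited paragraph produces a new digest; the others are reused.
shared = segment_digests(original) & segment_digests(modified)
print(f"segments reusable after editing one paragraph: {len(shared)} of 3")
```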

Since no single algorithm can handle so many different file formats well, the data producer can supply the most suitable segmentation algorithm as a plugin to the deduplication feature, helping the files it produces to be split along more meaningful logical boundaries.
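One possible shape for such a plugin interface is sketched below. The registry, the `register_splitter` decorator, and the `"text/plain"` key are all hypothetical names used for illustration, not part of any existing system.

```python
from typing import Callable, Dict, Iterable

# Hypothetical plugin registry mapping a format identifier to a splitter.
Splitter = Callable[[bytes], Iterable[bytes]]
SPLITTERS: Dict[str, Splitter] = {}

def register_splitter(fmt: str):
    """Let a data producer register a format-aware segmentation algorithm."""
    def decorator(fn: Splitter) -> Splitter:
        SPLITTERS[fmt] = fn
        return fn
    return decorator

@register_splitter("text/plain")
def split_plain_text(data: bytes) -> Iterable[bytes]:
    # Cut at paragraph boundaries, as in the earlier sketch.
    return (p for p in data.split(b"\n\n") if p)

def split(data: bytes, fmt: str) -> Iterable[bytes]:
    # Fall back to fixed-size blocks when no plugin claims the format.
    default: Splitter = lambda d: (d[i:i + 4096] for i in range(0, len(d), 4096))
    return SPLITTERS.get(fmt, default)(data)
```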

To implement this, we need to build an index that stores metadata for every segment, so that we can tell whether data has been added, removed, or modified.
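A minimal in-memory version of such an index might look like the sketch below, where each segment is stored once under its fingerprint and a backed-up file is represented as a list ("recipe") of fingerprints. Persistence, reference counting, and collision handling are deliberately left out; all names here are illustrative.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class SegmentIndex:
    """Map a segment's fingerprint to its single physical copy.

    Real systems persist this index and track reference counts so that
    segments no longer referenced by any backup can be garbage-collected.
    """
    store: dict[str, bytes] = field(default_factory=dict)

    def put(self, segment: bytes) -> str:
        digest = hashlib.sha256(segment).hexdigest()
        # Store the bytes only if this fingerprint has not been seen before.
        self.store.setdefault(digest, segment)
        return digest

    def get(self, digest: str) -> bytes:
        return self.store[digest]

# Backing up a file becomes: split it, index each segment, and keep
# only the list of digests as the file's recipe.
index = SegmentIndex()
recipe = [index.put(seg) for seg in (b"para one", b"para two", b"para one")]
restored = b"".join(index.get(d) for d in recipe)
print(f"unique segments stored: {len(index.store)}, recipe length: {len(recipe)}")
```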