Data Deduplication in Backup and Recovery

deltamaster posted @ Apr 24, 2012 05:12:22 PM in Storage with tags backup deduplication, 14910 views

In recent years, big data has been mentioned everywhere, and data protection has become a major concern.

Years ago, data was backed up to tape, a high-capacity storage medium that is good at sequential access but poor at random access. Administrators used to make weekly full backups and daily incremental backups in order to balance backup speed, backup set size, and mean time to recover.

We need to understand the difference between backup storage and transactional storage. The most significant difference is that backup storage is written frequently but rarely read, and it is almost always accessed sequentially rather than randomly. Given that, we want to reduce the size of the backup set so that more logical data can be protected without adding physical storage. Compression is a good way to reduce data size, and in practice it can shrink data by approximately 50%. However, this is still far from the ideal space saving. That is why we introduce deduplication, which, generally speaking, means saving only one copy of identical data.
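The idea of "saving only one copy of identical data" can be sketched as a tiny content-addressed store. This is only an illustration: the class name `DedupStore`, the choice of SHA-256 as the fingerprint, and the in-memory dictionary are all assumptions, not something a particular backup product is known to use.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}        # digest -> chunk bytes (the "physical" storage)
        self.logical_bytes = 0  # total bytes clients asked us to protect

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self.logical_bytes += len(chunk)
        # Keep the bytes only the first time this digest is seen.
        self.chunks.setdefault(digest, chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]

    @property
    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
monday = store.put(b"A" * 1000)
tuesday = store.put(b"A" * 1000)  # the same content backed up again
other = store.put(b"B" * 500)
```

Here 2500 logical bytes are protected with 1500 physical bytes, because the second copy of the first chunk only adds a reference.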

There are several ways to split logical data into parts: by file, by block, or by segment. A segment here means a chunk of data whose size is not fixed.

In file-based deduplication, the backup system detects identical files in the storage system, such as two or more copies of the same file stored in different logical directories. Obviously, once a file is changed even slightly, it no longer matches its copies and deduplication does not apply.
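A minimal sketch of file-based detection, assuming files are compared by a SHA-256 digest of their whole content. The function name `find_identical_files` and the example paths are hypothetical, and the files are held in a dictionary so the example does not touch a real file system.

```python
import hashlib
from collections import defaultdict


def find_identical_files(files: dict[str, bytes]) -> list[list[str]]:
    """Group paths whose contents hash to the same SHA-256 digest."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    # Only groups with more than one path are duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


files = {
    "/home/alice/report.txt": b"quarterly numbers\n",
    "/home/bob/copy-of-report.txt": b"quarterly numbers\n",
    "/home/bob/report-v2.txt": b"quarterly numbers!\n",  # one extra character
}
duplicates = find_identical_files(files)
```

The two identical files form one group, while `report-v2.txt`, which differs by a single character, is treated as completely new data.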

In block-based deduplication, the backup system detects identical blocks in the storage system. This method is finer-grained: detection is applied to every single block in each file. Generally, it saves more space than the file-based approach. However, when data is inserted into a block, all the data after the insertion point shifts across the block boundaries, so deduplication no longer applies to those subsequent blocks.
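The shifting problem is easy to reproduce. The sketch below assumes fixed-size blocks fingerprinted with SHA-256; the 8-byte `BLOCK_SIZE` is deliberately tiny and purely hypothetical, chosen so the effect is visible on a short byte string.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical; real systems use much larger blocks


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Fingerprint every fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def shared_blocks(old: bytes, new: bytes) -> int:
    """Count blocks of `new` whose digest already exists in `old`."""
    known = set(block_digests(old))
    return sum(d in known for d in block_digests(new))


original = b"A" * 8 + b"B" * 8 + b"C" * 8 + b"D" * 8   # four blocks
overwritten = original[:12] + b"x" + original[13:]      # one byte changed in place
inserted = original[:12] + b"x" + original[12:]         # one byte inserted
```

With an in-place overwrite, `shared_blocks(original, overwritten)` still finds 3 of the 4 blocks. With a one-byte insertion, `shared_blocks(original, inserted)` finds only the first block, because every later block now starts one byte off.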

Segment-based deduplication is introduced to solve this problem. Each file is split into segments by an algorithm that tries to detect the logical segments of the file while keeping the segment size within a reasonable range. For example, in a plain text file, the algorithm may split the file by paragraph, that is, at newline characters. Of course, different patterns are applied to different data formats to ensure better logical separation of the data. Because the file is split along logical boundaries, a local change affects only the segment that contains it, so we are more likely to benefit from deduplication this way.
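For the plain text case, a rough sketch could cut a segment after every newline and force a cut when a segment grows too long. The names `split_segments` and `shared_segments` and the 64-byte `MAX_SEGMENT` limit are assumptions made for this illustration; this is not the algorithm of any specific product.

```python
import hashlib

MAX_SEGMENT = 64  # hypothetical upper bound on segment size, in bytes


def split_segments(data: bytes, max_size: int = MAX_SEGMENT) -> list[bytes]:
    """Cut after every newline, or after max_size bytes if none appears."""
    segments, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 >= max_size:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing data without a newline
    return segments


def shared_segments(old: bytes, new: bytes) -> int:
    """Count segments of `new` whose digest already exists in `old`."""
    known = {hashlib.sha256(s).digest() for s in split_segments(old)}
    return sum(hashlib.sha256(s).digest() in known for s in split_segments(new))


original = b"first paragraph\nsecond paragraph\nthird paragraph\n"
edited = b"first paragraph\nsecond, longer paragraph\nthird paragraph\n"
```

Although the edit inserts bytes in the middle of the file, the first and third paragraphs still produce the same segments, so only the changed paragraph has to be stored again.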

A single algorithm can hardly handle so many different file formats, so a data producer can provide the most suitable separation algorithm as a plugin to the deduplication feature, helping the files it produces to be split along their logical structure.
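One possible shape for such a plugin mechanism is a registry that maps a file extension to a splitter function, with a fixed-size fallback for unknown formats. Everything here (`register_splitter`, `split_file`, keying on the extension, the 4096-byte fallback) is a hypothetical design for illustration only.

```python
from typing import Callable, Dict, List

Splitter = Callable[[bytes], List[bytes]]

# Hypothetical registry: file extension -> splitter supplied by the data producer.
_SPLITTERS: Dict[str, Splitter] = {}


def register_splitter(extension: str):
    """Decorator that registers a format-specific splitter plugin."""
    def decorator(func: Splitter) -> Splitter:
        _SPLITTERS[extension.lower()] = func
        return func
    return decorator


def fixed_size_splitter(data: bytes, size: int = 4096) -> List[bytes]:
    """Fallback used when no format-specific plugin is registered."""
    return [data[i:i + size] for i in range(0, len(data), size)]


@register_splitter(".txt")
def split_text_lines(data: bytes) -> List[bytes]:
    # Plain text: one segment per line, keeping the line ending.
    return data.splitlines(keepends=True)


def split_file(name: str, data: bytes) -> List[bytes]:
    """Pick the plugin by extension, falling back to fixed-size blocks."""
    dot = name.rfind(".")
    extension = name[dot:].lower() if dot != -1 else ""
    return _SPLITTERS.get(extension, fixed_size_splitter)(data)
```

For example, `split_file("notes.txt", b"a\nb\n")` goes through the text plugin, while a name such as `image.bin` falls back to fixed-size blocks.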

In order to implement this, we need to build an index that stores the metadata for every segment, so that we can tell whether data has been added, removed, or modified.
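As a sketch of what such an index might hold, the example below keeps a size and a reference count per segment digest, plus an ordered list of digests (a "recipe") for each backup. The names `SegmentMeta`, `SegmentIndex`, and the choice of metadata fields are assumptions; a real index would also need persistence and a record of where each segment is stored.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SegmentMeta:
    size: int       # segment length in bytes
    ref_count: int  # how many times backup recipes reference this segment


class SegmentIndex:
    def __init__(self):
        self.meta = {}     # digest -> SegmentMeta
        self.recipes = {}  # backup name -> ordered list of digests

    def add_backup(self, name, segments):
        recipe = []
        for seg in segments:
            digest = hashlib.sha256(seg).hexdigest()
            entry = self.meta.setdefault(digest, SegmentMeta(len(seg), 0))
            entry.ref_count += 1
            recipe.append(digest)
        self.recipes[name] = recipe

    def diff(self, old_name, new_name):
        """Compare two backups by the set of segment digests they reference."""
        old, new = set(self.recipes[old_name]), set(self.recipes[new_name])
        return {"added": new - old, "removed": old - new, "unchanged": old & new}


index = SegmentIndex()
index.add_backup("mon", [b"alpha\n", b"beta\n", b"gamma\n"])
index.add_backup("tue", [b"alpha\n", b"BETA\n", b"gamma\n"])
changes = index.diff("mon", "tue")
```

In this sketch a modified segment shows up as one removed digest plus one added digest, while the two untouched segments are simply referenced a second time.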

* This article is published under the CC BY-SA (Attribution-ShareAlike) license.

mily said:
May 11, 2024 12:57:42 PM

Data deduplication is a method used in backup and recovery processes to reduce the amount of storage space required by eliminating duplicate copies of data. It's important to consider potential challenges and limitations of data deduplication, such as increased processing overhead during deduplication operations and potential performance impacts on certain workloads.

