
Deduplication

Definition

Data deduplication is a specialized technology that breaks data into segments, identifies the unique segments, and stores each of them only once, eliminating redundant data and improving storage utilization.

Subsequent iterations of the data are replaced with a pointer to the original.

In short, deduplication means “store only one copy of the data”.

User Benefits

Data deduplication increases the speed of service and reduces costs.

Data deduplication lets users reduce the amount of disk/tape they need for backup by 90 percent or more, and with reduced acquisition costs—and reduced power, space, and cooling requirements—disk becomes suitable for first stage backup and restore and for retention that can easily extend to months. With data on disk, restore service levels are higher, media handling errors are reduced, and more recovery points are available on fast recovery media.

Data deduplication solutions use different compression techniques to compress data before storage. These techniques reduce data size and storage needs, and they also simplify offsite replication, backup, and disaster recovery, because much less WAN bandwidth is needed to transport the data.

Business Impact

Data deduplication can operate at the file, block or bit level. Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned with the integrity of their data. The various methods of deduplicating data all employ slightly different techniques. The integrity of the data will ultimately depend upon the design of the deduplicating system. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.

Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may have a great business impact: if the deduplicated data is stored in a single location with no disaster recovery or backup system in place, all data can be lost if something unexpected happens.

To summarize, the benefits an organization gains from using data deduplication technologies are:

  • Reduced back-up costs
  • Reduced costs for disaster recovery
  • Reduced costs for hardware
  • Increased storage efficiency


 


Products supporting this technology

Deduplication technology offers storage and backup administrators a number of benefits, including lower storage space requirements, more efficient disk space use, and less data sent across a WAN for remote backups, replication, and disaster recovery.

How does it work?

  • Data is divided into blocks
  • For each block, a hash value is calculated and stored in an index
  • By comparing the hash value of a new block with the hash values already in the index, the deduplication solution determines whether to store the block or to deduplicate it
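The steps above can be sketched as a minimal block-level deduplication index. This is an illustrative sketch, not any vendor's implementation; the fixed 4 KB block size, function names, and data structures are assumptions chosen for clarity:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def deduplicate(data: bytes):
    """Split data into fixed-size blocks and store each unique block once.

    Returns the block store (hash -> unique block) and the list of hashes
    that acts as the chain of 'pointers' reconstructing the original data.
    """
    store = {}     # hash -> unique block, stored only once
    pointers = []  # sequence of hashes referencing stored blocks
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:   # unseen block: store it
            store[digest] = block
        pointers.append(digest)   # for duplicates, only the pointer is kept
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[h] for h in pointers)
```

With highly redundant input, `store` holds far fewer bytes than the original data, while `restore` still reproduces it exactly; that gap is the storage saving deduplication delivers.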

How does data get deduplicated?

One of the most common data deduplication implementations works by comparing blocks of data to detect duplicates. Each block of data is assigned an identifier, calculated by the software, typically using a cryptographic hash function. Once the software confirms that two blocks of data are identical, it replaces the duplicate block with a link to the stored copy.

Below are a few methods used to deduplicate data:

  • File-based compare and compression, where two files are compared to see whether they are the same
  • Hashing at the file level, where a mathematical representation of the file (the hash string) is analyzed
  • Hashing at the block level, the same as file hashing, but here the hash is created for each block of data. This method works for structured and unstructured data.
  • Hashing at the sub-block level. In this case, the blocks of data are sliced into sub-blocks of a specific size and a hash value is created for each of those sub-blocks.
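As a sketch of the file-level hashing method from the list above, the snippet below flags duplicate files by comparing their SHA-256 digests; the function names and the 64 KB read size are illustrative assumptions, not part of any specific product:

```python
import hashlib


def file_digest(path: str) -> str:
    """Hash a file in 64 KB chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Group file paths by content hash; any group with >1 entry is a set
    of identical files that could be reduced to a single stored copy."""
    seen = {}
    for p in paths:
        seen.setdefault(file_digest(p), []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```

Block-level and sub-block-level hashing follow the same pattern, only the unit being hashed shrinks, which catches redundancy that whole-file comparison misses.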

To create the hash string, deduplication solutions use standard cryptographic hash functions such as SHA-1, SHA-256, or MD5. (DES and AES, sometimes mentioned in this context, are encryption ciphers rather than hash functions.)

Deduplication can be done at the source or at the target that stores the data, and there are pros and cons to each method. If deduplication occurs at the source, only a small amount of data is sent across the wire for backup.

Target deduplication is the removal of redundant data at the backup target. It requires hardware at the remote site and is suitable for companies with few or no bandwidth constraints.

In target-based implementations, data can either be backed up first and then deduplicated (post-process deduplication), or deduplicated during the backup process itself (inline deduplication). Each method has pros and cons: post-process deduplication may yield a faster backup, while inline-deduplicated data can be replicated immediately after the backup concludes.

Of course, there are situations in which implementing deduplication technology makes no sense. For example, when you store information in a database, a hash value would have to be created for each write operation to the database disks; every write must wait until the hash is computed, which introduces performance problems. On the other hand, using deduplication to back up the database does make sense.