As discussed in earlier chapters, data deduplication is a hot technology that is used to reduce data storage capacity requirements. If you make smart choices in your backup and data management processes, you might not need data deduplication. But if you keep all of your inactive and unimportant data on your production storage systems, and use backup software that forces you to perform repetitive full backups of all that static data, then data deduplication can provide you with a huge benefit.
The basic idea behind data deduplication is to store just one copy of any data object, and place pointers to the single copy wherever duplicates are eliminated. Some solutions do this at a file level, so that the files have to be exactly the same to be deduplicated. This is often called single-instance storage (SIS). Other solutions deduplicate data at a fixed or variable block length. IBM’s solutions use a blended approach based on the size of the data—file-based for smaller files, and variable block for larger files.
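To make the difference between fixed and variable block lengths concrete, here is a minimal Python sketch. It is not how any IBM product chunks data; the function names, the 64-byte block size, the 16-byte window, and the use of CRC32 to pick boundaries are all assumptions chosen for illustration. It inserts a single byte at the front of some data and counts how many blocks are still identical afterwards.

```python
import random
import zlib


def fixed_chunks(data, size=64):
    """Split data into blocks of a fixed length (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F):
    """Split data where the content itself says so (illustrative only).

    A boundary is placed after any position where the CRC32 of the previous
    `window` bytes has its low six bits clear, i.e. roughly once every 64
    bytes on random data. Boundaries depend only on nearby content, not on
    the absolute offset in the file.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last boundary
    return chunks


# Hypothetical test data: 4 KB of pseudo-random bytes, then the same data
# with one extra byte inserted at the very beginning.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + data

fixed_shared = len(set(fixed_chunks(data)) & set(fixed_chunks(shifted)))
variable_shared = len(set(variable_chunks(data)) & set(variable_chunks(shifted)))

print("blocks still shared, fixed length:   ", fixed_shared)
print("blocks still shared, variable length:", variable_shared)
```

With fixed-length blocks, the one-byte insert shifts every block boundary, so almost nothing matches any more. With content-defined boundaries, only the block containing the insert changes, and the rest still deduplicate.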
Most deduplication solutions run a checksum algorithm against the selected data to create a hash signature, then check to see if that signature has ever been seen before. If it has, the data is discarded and a pointer to the already stored data is put in its place. A small number of high-end solutions perform a complete byte-level differential comparison of the data to remove all potential for “data collisions,” where two distinct data blocks may share the same hash signature.
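The hash-and-pointer approach described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of any particular product: the `DedupStore` class, its method names, and the `verify_bytes` switch are all hypothetical. SHA-256 is used as the default signature, and a deliberately weak signature can be passed in to show what a collision does when no byte-level comparison is performed.

```python
import hashlib


def sha256_signature(block):
    """Default hash signature for a block of bytes."""
    return hashlib.sha256(block).hexdigest()


class DedupStore:
    """Toy block store: one copy per unique block, pointers for duplicates."""

    def __init__(self, signature=sha256_signature, verify_bytes=False):
        self.signature = signature
        self.verify_bytes = verify_bytes
        self.blocks = {}  # signature -> list of distinct blocks stored under it

    def put(self, block):
        """Store a block if needed and return a pointer (signature, index)."""
        sig = self.signature(block)
        candidates = self.blocks.setdefault(sig, [])
        if candidates and not self.verify_bytes:
            # Signature seen before: trust the hash and discard the new data.
            return (sig, 0)
        for i, stored in enumerate(candidates):
            if stored == block:  # byte-level comparison confirms a true duplicate
                return (sig, i)
        candidates.append(block)  # new data (or a collision caught by the comparison)
        return (sig, len(candidates) - 1)

    def get(self, pointer):
        """Follow a pointer back to the stored block."""
        sig, index = pointer
        return self.blocks[sig][index]

    def stored_bytes(self):
        """Total bytes physically kept in the store."""
        return sum(len(b) for group in self.blocks.values() for b in group)


def weak_signature(block):
    """Deliberately poor signature (assumption for the demo) that collides easily."""
    return sum(block) % 256


store = DedupStore()
first = store.put(b"same backup data")
second = store.put(b"same backup data")
print(first == second, store.stored_bytes())

trusting = DedupStore(signature=weak_signature)
trusting.put(b"ab")
print(trusting.get(trusting.put(b"ba")))

careful = DedupStore(signature=weak_signature, verify_bytes=True)
careful.put(b"ab")
print(careful.get(careful.put(b"ba")))
```

In the `trusting` store, `b"ba"` collides with `b"ab"` under the weak signature and is silently replaced by a pointer to the wrong data. The `careful` store compares the bytes, notices they differ, and keeps both blocks. A strong hash such as SHA-256 makes collisions extremely unlikely in practice, which is why most solutions rely on the signature alone.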
Data deduplication can and does occur at many points in the data creation and management life cycle. In general, these points of deduplication can be broken into source-side, where the data is created, and target-side, where it is stored and managed. Backup applications, for example, can perform source-side deduplication by not transferring data that has previously been backed up over the LAN or WAN, saving on bandwidth.
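As a rough illustration of the source-side case, the Python sketch below has the backup client compute signatures locally, ask the server which ones it has never seen, and send only those blocks. The `BackupServer` class, the `backup` function, and the two-step "ask, then upload" exchange are assumptions made for this example; real backup products use their own protocols.

```python
import hashlib


class BackupServer:
    """Toy backup target that remembers every block it has received."""

    def __init__(self):
        self.blocks = {}  # signature -> block

    def missing(self, signatures):
        """Return the signatures the server has not stored yet."""
        return [s for s in signatures if s not in self.blocks]

    def upload(self, signature, block):
        self.blocks[signature] = block


def backup(server, blocks):
    """Back up a list of blocks, sending only data the server lacks.

    Returns the list of signatures (the "recipe" for this backup) and the
    number of payload bytes that actually crossed the network.
    """
    signatures = [hashlib.sha256(b).hexdigest() for b in blocks]
    by_signature = dict(zip(signatures, blocks))
    bytes_sent = 0
    # set() so a block repeated within this backup is only sent once
    for sig in set(server.missing(signatures)):
        server.upload(sig, by_signature[sig])
        bytes_sent += len(by_signature[sig])
    return signatures, bytes_sent


server = BackupServer()
monday = [b"a" * 100, b"b" * 100]
tuesday = [b"a" * 100, b"b" * 100, b"c" * 100]  # one new block since Monday

_, sent_monday = backup(server, monday)
_, sent_tuesday = backup(server, tuesday)
print("bytes sent Monday: ", sent_monday)
print("bytes sent Tuesday:", sent_tuesday)
```

Only the signatures travel for data the server already holds, which is where the LAN or WAN bandwidth saving comes from.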
On the target side, the most popular use of deduplication is in virtual tape libraries, or VTLs. These disk-based systems emulate tape libraries and drives, but apply deduplication to store equivalent amounts of data on disk very cost-effectively while providing performance advantages over tape. Deduplicating data on tape itself is generally considered a bad idea: tapes are portable and need to be recycled over time, so it would be very difficult to guarantee that the original data behind every pointer is still available.
Today, IBM offers two compelling data deduplication solutions. The Extended Edition of Tivoli Storage Manager 6 includes deduplication capabilities to eliminate duplicate data that has been backed up from multiple production systems. Again, TSM’s progressive-incremental backup methodology does not create massive amounts of duplicate data, so the deduplication is only effective when the same data exists on different systems.
The other solution is the IBM System Storage ProtecTIER® family of deduplication systems for reducing data coming from multiple sources, including Tivoli Storage Manager servers, backups from other backup systems, or archive software solutions.
A lot of customers ask when they should use TSM deduplication and when they should use ProtecTIER. I’ll cover this question in detail in my next blog post, but the simple answer is:
- Use TSM deduplication when: you have a single TSM server; you want to improve TSM recovery times by storing more backup data on disk; or there isn’t a large amount of duplicate data across the systems protected by multiple TSM servers.
- Use the IBM System Storage ProtecTIER TS7650 Deduplication solutions when: you have multiple TSM servers; you have other sources of backup and archive data; or you are using other (non-IBM) backup products that perform periodic full backups.
"The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions."