Data De-duplication and Disk-to-Disk Backup Systems
ESG Report
disk, and disk seeks are notoriously slow (and not getting better). The following questions allow you to dig deeper with regard to performance:
What is the single stream backup and restore throughput? This is how fast a given file/DB can be backed up, restored, or copied to tape for archiving. The numbers may be different: read speed and write speed may have separate issues. Because of backup windows for critical data, backup throughput is what most people ask about, though restore time is more significant for most SLAs. LTO-4 tapes need to receive data at >60 MB/sec or they will operate well below their rated speed for streaming, so restore stream speed matters significantly if tape will stay in your plans.
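To make the throughput questions concrete, here is a minimal Python sketch of the window arithmetic. The 60 MB/sec LTO-4 streaming floor comes from the text above; the 2 TB database and 90 MB/sec single-stream rate in the example are illustrative assumptions, not figures for any particular product.

```python
# Back-of-the-envelope check of single-stream throughput against a
# backup window and LTO-4 streaming needs. Figures are illustrative.

LTO4_MIN_STREAM_MB_S = 60  # below this, the drive drops out of streaming

def backup_window_hours(data_gb: float, stream_mb_s: float) -> float:
    """Hours to move data_gb at a sustained single-stream rate."""
    return (data_gb * 1024) / stream_mb_s / 3600

def can_feed_lto4(restore_stream_mb_s: float) -> bool:
    """True if a restore stream can keep an LTO-4 drive streaming."""
    return restore_stream_mb_s >= LTO4_MIN_STREAM_MB_S

# Example: a hypothetical 2 TB database at 90 MB/s single-stream
hours = backup_window_hours(2048, 90)
print(f"2 TB at 90 MB/s: {hours:.1f} h")   # ~6.5 h
print("Feeds LTO-4:", can_feed_lto4(90))   # True
```

The same two functions answer both sides of the question: whether the backup fits the window, and whether the restore stream is fast enough to keep tape in the picture.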
What is the aggregate backup/restore throughput per system? With many streams, how fast can a given controller perform? This will help gauge the number of controllers/systems needed for your deployment. It is mostly a measure of system management (number of systems) and cost; single stream speed is more important for getting the job done.
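The aggregate-throughput question reduces to simple sizing arithmetic. A hedged sketch follows; the 20 TB nightly volume, 8-hour window, and 400 MB/sec per-controller rate are all hypothetical inputs you would replace with your own measurements.

```python
import math

def controllers_needed(total_gb_per_night: float, window_hours: float,
                       aggregate_mb_s_per_controller: float) -> int:
    """Minimum number of controllers so the nightly volume fits the window."""
    required_mb_s = (total_gb_per_night * 1024) / (window_hours * 3600)
    return math.ceil(required_mb_s / aggregate_mb_s_per_controller)

# Example: 20 TB nightly, 8-hour window, 400 MB/s aggregate per controller
print(controllers_needed(20 * 1024, 8, 400))  # 2
```

As the text notes, this mostly sizes cost and management overhead; it says nothing about whether any single stream finishes in time.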
Is the 30th backup different from the 1st? If you back up images and delete them over time, does the performance of the system change? Since de-duplication creates so many references across the store for new documents, do the recovery characteristics for a recent backup (what you'll mostly be recovering) change a month or two into deployment versus the first pilot? Talk to existing users of the vendor to find out what others have seen.
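One way to answer the "30th vs. 1st backup" question empirically is to record the measured throughput of each backup generation and compare a recent average against the early-deployment baseline. A minimal sketch; the five-sample averaging window and the 20% degradation threshold are arbitrary assumptions, not vendor guidance.

```python
def throughput_drift(samples_mb_s, baseline_n=5, threshold=0.8):
    """Compare recent backup throughput to the early-deployment baseline.

    samples_mb_s: one measured throughput (MB/s) per backup generation.
    Returns (degraded, baseline_mean, recent_mean); degraded is True when
    the recent mean falls below threshold * baseline mean.
    """
    baseline = sum(samples_mb_s[:baseline_n]) / baseline_n
    recent = sum(samples_mb_s[-baseline_n:]) / baseline_n
    return recent < threshold * baseline, baseline, recent

# Example: a pilot that ran at 200 MB/s, drifting to 150 MB/s by month two
degraded, base, recent = throughput_drift([200] * 5 + [150] * 25)
print(degraded, base, recent)  # True 200.0 150.0
```

Running this against restore times as well as backup times covers both halves of the question.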
In many cases, performance in your deployment will depend on many factors, including the backup software and the systems and networks supporting it. Understand your current performance and bottlenecks before challenging the vendor of a particular component to fix it all.
3. Is the data de-duplication in-line or post-process?
As with any new technology, there is a lot of confusion in the industry about the differences between in-line and post-process approaches, as well as abuse/misuse of the terms used to differentiate the two. In ESG's view, it comes down to one simple "yes or no" question: when the backup data is written to disk, is the data de-duplicated or not? If the answer is yes, then it has been de-duplicated in-line. If at that point the answer is no, then the de-duplication is done post-process.
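The distinction can be sketched in a few lines: an in-line system fingerprints each chunk before it is written, so only unseen chunks ever reach disk, while a post-process system lands the raw data first and de-duplicates it afterwards. This toy model (fixed 4 KB chunks, SHA-256 fingerprints, an in-memory dict standing in for the store) is purely illustrative and not any vendor's design.

```python
import hashlib

CHUNK = 4096  # illustrative fixed-size granularity

def inline_write(stream: bytes, store: dict) -> list:
    """In-line: hash each chunk before it hits disk; store only unseen
    chunks. Returns the recipe of chunk hashes needed for restore."""
    recipe = []
    for i in range(0, len(stream), CHUNK):
        chunk = stream[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # write only if new
        recipe.append(h)
    return recipe

def post_process(raw_landing: bytes, store: dict) -> list:
    """Post-process: the data already landed on disk in full; de-duplicate
    it afterwards, then release the raw landing space."""
    return inline_write(raw_landing, store)  # same dedup, different timing

store = {}
recipe = inline_write(b"A" * 8192 + b"B" * 4096, store)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

The two functions do identical de-duplication; the only difference is *when* it runs relative to the write, which is exactly the "yes or no" question above.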
What is the significance of one approach versus the other? The two areas you need to research are the performance impact of the in-line approach and the capacity issues of the post-process approach. Understand the trade-offs of each approach based on the vendor's specific solutions.
Since in-line de-duplication is an intelligent process performed during the backup itself, there can be some performance degradation during data ingest. However, the performance impact depends on a number of variables, including the de-duplication technology itself, the size of the backup volume, the granularity of the de-duplication process, the aggregate throughput of the architecture, and the scalability of the solution.
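Of those variables, granularity is the easiest to demonstrate: finer chunks find more duplicates but mean more fingerprints to compute and index. A toy measurement on a deliberately repetitive byte pattern; fixed-size chunking is an assumption here, and real products may use variable-size chunks with very different behavior.

```python
import hashlib

def dedup_ratio(data: bytes, chunk: int) -> float:
    """Logical size / stored size for fixed-size chunking at one granularity."""
    seen = set()
    stored = 0
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        h = hashlib.sha256(piece).digest()
        if h not in seen:
            seen.add(h)
            stored += len(piece)
    return len(data) / stored

# A highly repetitive sample: alternating 4 KB runs of X and Y
data = (b"X" * 4096 + b"Y" * 4096) * 100
for chunk in (1024, 4096, 16384):
    print(chunk, round(dedup_ratio(data, chunk), 1))
```

On this contrived input the ratio falls as the chunk grows (400:1 at 1 KB down to 50:1 at 16 KB), which is the granularity trade-off in miniature; real data de-duplicates far less dramatically.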
Post-process data de-duplication does require more disk capacity to be allocated up front. But the size of the "capacity reserve" needed depends on a number of variables, including the amount of backup data and how long the de-duplication technology holds onto the capacity before releasing it.
Some post-process de-duplication technologies wait for the entire backup job to be completed, while others start de-duplication as backup data is stored.
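The resulting capacity-reserve difference can be estimated with rough arithmetic. In this hypothetical sketch, a wait-for-completion design holds a full raw copy of the job until it finishes, while a concurrent design is assumed (arbitrarily) to hold about half the job as raw data at its peak; the 1 TB job and 10:1 ratio in the example are likewise illustrative.

```python
def capacity_reserve_gb(backup_gb: float, dedup_ratio: float,
                        wait_for_completion: bool) -> float:
    """Rough peak landing-space estimate for post-process de-duplication.

    wait_for_completion=True: the full raw copy is held until the job ends.
    Otherwise, assume (illustratively) half the job is raw at any moment.
    """
    raw = backup_gb if wait_for_completion else backup_gb / 2
    return raw + backup_gb / dedup_ratio  # raw landing + deduped store

# Example: a 1 TB job at an assumed 10:1 de-duplication ratio
print(capacity_reserve_gb(1000, 10, True))   # 1100.0
print(capacity_reserve_gb(1000, 10, False))  # 600.0
```

The gap between the two numbers is the extra up-front allocation that wait-for-completion designs carry, as the next paragraph describes.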
Solutions that wait for the backup process to complete before de-duplicating the data have a greater initial "capacity overhead" than solutions that start the de-duplication process earlier. These solutions have to allocate enough capacity to store the entire backup volume. The capacity is released when the backup job is complete and re-allocated before the next backup job begins. Beginning the de-