Data De-duplication and Disk-to-Disk Backup Systems
ESG Report
disk, and disk seeks are notoriously slow (and not getting better). The following questions allow you to dig deeper with regard to performance:
What is the single stream backup and restore throughput? This is how fast a given file/DB can be backed up, restored, or copied to tape for archiving. The numbers may be different: read speed and write speed may have separate issues. Because of backup windows for critical data, backup throughput is what most people ask about, though restore time is more significant for most SLAs. LTO-4 tapes need to receive data at >60 MB/sec or they will operate well below their rated speed for streaming, so restore stream speed matters significantly if tape will stay in your plans.
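To make the throughput questions concrete, here is a minimal Python sketch of the window arithmetic. The 60 MB/sec LTO-4 streaming floor comes from the text above; the 2 TB database and 90 MB/sec single-stream rate in the example are illustrative assumptions, not figures for any particular product.

```python
# Back-of-the-envelope check of single-stream throughput against a
# backup window and LTO-4 streaming needs. Figures are illustrative.

LTO4_MIN_STREAM_MB_S = 60  # below this, the drive drops out of streaming

def backup_window_hours(data_gb: float, stream_mb_s: float) -> float:
    """Hours to move data_gb at a sustained single-stream rate."""
    return (data_gb * 1024) / stream_mb_s / 3600

def can_feed_lto4(restore_stream_mb_s: float) -> bool:
    """True if a restore stream can keep an LTO-4 drive streaming."""
    return restore_stream_mb_s >= LTO4_MIN_STREAM_MB_S

# Example: a hypothetical 2 TB database at 90 MB/s single-stream
hours = backup_window_hours(2048, 90)
print(f"2 TB at 90 MB/s: {hours:.1f} h")   # ~6.5 h
print("Feeds LTO-4:", can_feed_lto4(90))   # True
```

The same two functions answer both sides of the question: whether the backup fits the window, and whether the restore stream is fast enough to keep tape in the picture.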
What is the aggregate backup/restore throughput per system? With many streams, how fast can a given controller perform? This will help gauge the number of controllers/systems needed for your deployment. It is mostly a measure of system management (number of systems) and cost; single stream speed is more important for getting the job done.
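The aggregate-throughput question reduces to simple sizing arithmetic. A hedged sketch follows; the 20 TB nightly volume, 8-hour window, and 400 MB/sec per-controller rate are all hypothetical inputs you would replace with your own measurements.

```python
import math

def controllers_needed(total_gb_per_night: float, window_hours: float,
                       aggregate_mb_s_per_controller: float) -> int:
    """Minimum number of controllers so the nightly volume fits the window."""
    required_mb_s = (total_gb_per_night * 1024) / (window_hours * 3600)
    return math.ceil(required_mb_s / aggregate_mb_s_per_controller)

# Example: 20 TB nightly, 8-hour window, 400 MB/s aggregate per controller
print(controllers_needed(20 * 1024, 8, 400))  # 2
```

As the text notes, this mostly sizes cost and management overhead; it says nothing about whether any single stream finishes in time.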
Is the 30th backup different from the 1st? If you back up images and delete them over time, does the performance of the system change? Since de-duplication creates so many references across the store for new documents, do the recovery characteristics for a recent backup (what you'll mostly be recovering) change a month or two into deployment versus the first pilot? Talk to existing users of the vendor to find out what others have seen.
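One way to answer the "30th vs. 1st backup" question empirically is to record the measured throughput of each backup generation and compare a recent average against the early-deployment baseline. A minimal sketch; the five-sample averaging window and the 20% degradation threshold are arbitrary assumptions, not vendor guidance.

```python
def throughput_drift(samples_mb_s, baseline_n=5, threshold=0.8):
    """Compare recent backup throughput to the early-deployment baseline.

    samples_mb_s: one measured throughput (MB/s) per backup generation.
    Returns (degraded, baseline_mean, recent_mean); degraded is True when
    the recent mean falls below threshold * baseline mean.
    """
    baseline = sum(samples_mb_s[:baseline_n]) / baseline_n
    recent = sum(samples_mb_s[-baseline_n:]) / baseline_n
    return recent < threshold * baseline, baseline, recent

# Example: a pilot that ran at 200 MB/s, drifting to 150 MB/s by month two
degraded, base, recent = throughput_drift([200] * 5 + [150] * 25)
print(degraded, base, recent)  # True 200.0 150.0
```

Running this against restore times as well as backup times covers both halves of the question.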
In many cases, performance in your deployment will depend on many factors, including the backup software and the systems and networks supporting it. Understand your current performance and bottlenecks before challenging the vendor of a particular component to fix it all.
3. Is the data de-duplication in-line or post-process?
As with any new technology, there is a lot of confusion in the industry about the differences between in-line and post-process approaches, as well as abuse/misuse of the terms used to differentiate the two. In ESG's view, it comes down to one simple "yes or no" question: when the backup data is written to disk, is the data de-duplicated or not? If the answer is yes, then it has been de-duplicated in-line. If at that point the answer is no, then the de-duplication is done post-process.
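The distinction can be sketched in a few lines: an in-line system fingerprints each chunk before it is written, so only unseen chunks ever reach disk, while a post-process system lands the raw data first and de-duplicates it afterwards. This toy model (fixed 4 KB chunks, SHA-256 fingerprints, an in-memory dict standing in for the store) is purely illustrative and not any vendor's design.

```python
import hashlib

CHUNK = 4096  # illustrative fixed-size granularity

def inline_write(stream: bytes, store: dict) -> list:
    """In-line: hash each chunk before it hits disk; store only unseen
    chunks. Returns the recipe of chunk hashes needed for restore."""
    recipe = []
    for i in range(0, len(stream), CHUNK):
        chunk = stream[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # write only if new
        recipe.append(h)
    return recipe

def post_process(raw_landing: bytes, store: dict) -> list:
    """Post-process: the data already landed on disk in full; de-duplicate
    it afterwards, then release the raw landing space."""
    return inline_write(raw_landing, store)  # same dedup, different timing

store = {}
recipe = inline_write(b"A" * 8192 + b"B" * 4096, store)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

The two functions do identical de-duplication; the only difference is *when* it runs relative to the write, which is exactly the "yes or no" question above.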
What is the significance of one approach versus the other? The two areas you need to research are the performance impact of the in-line approach and the capacity issues of the post-process approach. Understand the trade-offs of each approach based on the vendor's specific solutions.
Since in-line de-duplication is an intelligent process performed during the backup itself, there can be some performance degradation during data ingest. However, the performance impact depends on a number of variables, including the de-duplication technology itself, the size of the backup volume, the granularity of the de-duplication process, the aggregate throughput of the architecture, and the scalability of the solution.
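Of those variables, granularity is the easiest to demonstrate: finer chunks find more duplicates but mean more fingerprints to compute and index. A toy measurement on a deliberately repetitive byte pattern; fixed-size chunking is an assumption here, and real products may use variable-size chunks with very different behavior.

```python
import hashlib

def dedup_ratio(data: bytes, chunk: int) -> float:
    """Logical size / stored size for fixed-size chunking at one granularity."""
    seen = set()
    stored = 0
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        h = hashlib.sha256(piece).digest()
        if h not in seen:
            seen.add(h)
            stored += len(piece)
    return len(data) / stored

# A highly repetitive sample: alternating 4 KB runs of X and Y
data = (b"X" * 4096 + b"Y" * 4096) * 100
for chunk in (1024, 4096, 16384):
    print(chunk, round(dedup_ratio(data, chunk), 1))
```

On this contrived input the ratio falls as the chunk grows (400:1 at 1 KB down to 50:1 at 16 KB), which is the granularity trade-off in miniature; real data de-duplicates far less dramatically.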
Post-process data de-duplication does require more disk capacity to be allocated up front. But the size of the "capacity reserve" needed depends on a number of variables, including the amount of backup data and how long the de-duplication technology holds onto the capacity before releasing it.
Some post-process de-duplication technologies wait for the entire backup job to be completed, while others start de-duplication as backup data is stored.
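The resulting capacity-reserve difference can be estimated with rough arithmetic. In this hypothetical sketch, a wait-for-completion design holds a full raw copy of the job until it finishes, while a concurrent design is assumed (arbitrarily) to hold about half the job as raw data at its peak; the 1 TB job and 10:1 ratio in the example are likewise illustrative.

```python
def capacity_reserve_gb(backup_gb: float, dedup_ratio: float,
                        wait_for_completion: bool) -> float:
    """Rough peak landing-space estimate for post-process de-duplication.

    wait_for_completion=True: the full raw copy is held until the job ends.
    Otherwise, assume (illustratively) half the job is raw at any moment.
    """
    raw = backup_gb if wait_for_completion else backup_gb / 2
    return raw + backup_gb / dedup_ratio  # raw landing + deduped store

# Example: a 1 TB job at an assumed 10:1 de-duplication ratio
print(capacity_reserve_gb(1000, 10, True))   # 1100.0
print(capacity_reserve_gb(1000, 10, False))  # 600.0
```

The gap between the two numbers is the extra up-front allocation that wait-for-completion designs carry, as the next paragraph describes.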
Solutions that wait for the backup process to complete before de-duplicating the data have a greater initial "capacity overhead" than solutions that start the de-duplication process earlier. These solutions have to allocate enough capacity to store the entire backup volume. The capacity is released when the backup job is complete and re-allocated before the next backup job begins. Beginning the de-