

disk, and disk seeks are notoriously slow (and not getting better). The following questions allow you to dig deeper with regard to performance:

What is the single stream backup and restore throughput? This is how fast a given file/DB can be backed up, restored, or copied to tape for archiving. The numbers may differ, since read speed and write speed may have separate issues. Because of backup windows for critical data, backup throughput is what most people ask about, though restore time is more significant for most SLAs. LTO4 tapes need to receive data at >60 MB/sec. or they will operate well below their rated streaming speed, so restore stream speed matters significantly if tape will stay in your plans.

What is the aggregate backup/restore throughput per system? With many streams, how fast can a given controller perform? This will help gauge the number of controllers/systems needed for your deployment. It is mostly a measure of system management (number of systems) and cost; single stream speed is more important for getting the job done.

Is the 30th backup different from the 1st? If you back up images and delete them over time, does the performance of the system change? Since de-duplication scatters so many references around the store for new documents, do the recovery characteristics for a recent backup (what you’ll mostly be recovering) change a month or two into deployment versus the first pilot? Talk to existing users of the vendor to find out what others have seen.

In many cases, performance in your deployment will depend on many factors, including the backup software and the systems and networks supporting it. Understand your current performance and bottlenecks before challenging the vendor of a particular component to fix it all.
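As a rough illustration of how these throughput questions interact, here is a back-of-envelope sizing sketch. The data sizes, window lengths, stream counts, and restore rate in it are hypothetical placeholders, not figures from any vendor or from this report.

```python
# Back-of-envelope throughput math for sizing a D2D backup target.
# All inputs are hypothetical; substitute your own measurements.

LTO4_MIN_STREAM_MB_S = 60  # LTO4 drives fall well below rated speed under roughly this rate

def required_mb_per_sec(data_gb: float, window_hours: float) -> float:
    """Throughput needed to move data_gb within window_hours."""
    return (data_gb * 1024) / (window_hours * 3600)

# Single-stream question: a hypothetical 2 TB database that must be
# backed up in a 4-hour window over one stream.
single_stream = required_mb_per_sec(data_gb=2048, window_hours=4)
print(f"Single-stream requirement: {single_stream:.0f} MB/s")   # ~146 MB/s

# Aggregate question: a hypothetical 30 TB nightly backup set spread
# across 16 concurrent streams in an 8-hour window.
aggregate = required_mb_per_sec(data_gb=30 * 1024, window_hours=8)
print(f"Aggregate requirement:     {aggregate:.0f} MB/s")        # ~1092 MB/s
print(f"Per stream (16 streams):   {aggregate / 16:.0f} MB/s")   # ~68 MB/s

# Restore-to-tape check: can one restore stream keep an LTO4 drive streaming?
restore_stream_mb_s = 55  # hypothetical measured restore rate from the dedup store
if restore_stream_mb_s < LTO4_MIN_STREAM_MB_S:
    print("Restore stream too slow to keep LTO4 streaming; expect shoe-shining.")
```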

3. Is the data de-duplication in-line or post-process?

As with any new technology, there is a lot of confusion in the industry about the differences between in-line and post-processing approaches, as well as abuse/misuse of the terms used to differentiate the two. In ESG’s view, it comes down to one simple “yes or no” question: when the backup data is written to disk, is the data de-duplicated or not? If the answer is yes, then it has been de-duplicated in-line. If at that point the answer is no, then the de-duplication is done post-process.

What is the significance of one approach versus the other? The two areas that you need to research are the performance impact of the in-line approach and the capacity issues of the post-process approach. Understand the trade-offs for each approach based on the vendor’s specific solutions.
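The toy sketch below is one way to picture the distinction ESG draws. It uses simple fixed-size blocks and SHA-256 fingerprints, which are illustrative choices rather than a description of any particular product; the function and variable names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed-size chunking; real products use varied schemes

def inline_backup(data: bytes, index: dict, store: list) -> None:
    """In-line: each block is fingerprinted before it is written to disk,
    so only previously unseen blocks are ever stored."""
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:        # lookup sits in the ingest path
            index[digest] = len(store)
            store.append(block)        # write only unique blocks

def post_process_backup(data: bytes, staging: list) -> None:
    """Post-process: raw backup data lands in a staging area at full speed;
    de-duplication happens later."""
    for i in range(0, len(data), BLOCK_SIZE):
        staging.append(data[i:i + BLOCK_SIZE])   # fast, unintelligent write

def deduplicate_staging(index: dict, store: list, staging: list) -> None:
    """Runs after (or behind) the backup job, then frees the staged capacity."""
    for block in staging:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:
            index[digest] = len(store)
            store.append(block)
    staging.clear()                              # the "capacity reserve" is released here
```

The trade-off described below falls directly out of these two paths: the in-line version pays a hash and index lookup on every block during ingest, while the post-process version pays by holding the raw job in staging until deduplicate_staging has run.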

Since in-line de-duplication is an intelligent process performed during the backup itself, there can be some performance degradation during data ingest. However, the performance impact depends on a number of variables, including the de-duplication technology itself, the size of the backup volume, the granularity of the de-duplication process, the aggregate throughput of the architecture, and the scalability of the solution.

Post-process data de-duplication does require more disk capacity to be allocated upfront. But the size of the “capacity reserve” needed depends on a number of variables, including the amount of backup data and how long the de-duplication technology “holds” onto the capacity before releasing it.

Some post-process de-duplication technologies wait for the entire backup job to be completed, while others start de-duplication as backup data is stored.
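To make the capacity trade-off concrete, here is a simplified back-of-envelope estimate under assumed figures: a hypothetical 10 TB nightly backup, a 20:1 reduction ratio, and a guessed staging lag. The real reserve depends on the variables described above.

```python
# Simplified capacity-reserve comparison for post-process de-duplication.
# Figures are hypothetical; plug in your own backup volume and reduction ratio.

nightly_backup_tb = 10.0    # size of the raw nightly backup set
reduction_ratio   = 20.0    # assumed de-duplication ratio (20:1)

deduped_tb = nightly_backup_tb / reduction_ratio

# Wait-until-complete: the whole raw backup must be staged before dedup starts.
reserve_wait_tb = nightly_backup_tb

# Dedup-as-data-is-stored: only a fraction of the job is staged at any moment.
staged_fraction = 0.25      # hypothetical: dedup trails ingest by ~25% of the job
reserve_concurrent_tb = nightly_backup_tb * staged_fraction

print(f"Long-term deduplicated footprint: {deduped_tb:.1f} TB")
print(f"Reserve, wait-until-complete:     {reserve_wait_tb:.1f} TB")
print(f"Reserve, dedup-as-stored:         {reserve_concurrent_tb:.1f} TB")
```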

Solutions that wait for the backup process to complete before de-duplicating the data have a greater initial “capacity overhead” than solutions that start the de-duplication process earlier. These solutions have to allocate enough capacity to store the entire backup volume. The capacity is released when the backup job is complete and re-allocated before the next backup job begins. Beginning the de-

