Extreme Binning: Scalable, Parallel Deduplication for File Backup

    Data deduplication is an essential and critical component of backup
systems: essential because it reduces storage space requirements, and
critical because the performance of the entire backup operation depends
on its throughput.  Traditional backup workloads consist of large data
streams with high locality, which existing deduplication techniques
require to provide reasonable throughput.

    We present Extreme Binning, a scalable deduplication technique for
non-traditional backup workloads that are made up of individual files
with no locality among consecutive files in a given window of time.
Because they rely on locality, existing techniques perform poorly on
these workloads.  Extreme Binning exploits _file similarity_ instead of
locality, and makes only one disk access for chunk lookup _per file_,
which gives reasonable throughput.  Multi-node backup systems built with
Extreme Binning scale gracefully with the amount of input data; more
backup nodes can be added to boost throughput.  A stateless routing
algorithm allocates each file to exactly one node, allowing for maximum
parallelization.  Each backup node is autonomous, with no dependency
across nodes, which makes data management tasks robust and keeps their
overhead low.
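
    The following is a minimal sketch of how similarity-based binning
and stateless routing could fit together; it is an illustration under
stated assumptions, not the system's implementation.  It assumes that a
file's representative is the minimum of its chunk hashes (a min-hash
style similarity heuristic), that chunks are fixed-size, and that a node
is chosen by taking the representative modulo the node count.  All
names (chunk_ids, representative_id, route, BackupNode) are
hypothetical, and the "disk" is simulated with an in-memory dictionary.

```python
import hashlib
from typing import Dict, List, Set


def chunk_ids(data: bytes, chunk_size: int = 4096) -> List[bytes]:
    """Hash each fixed-size chunk of a file (assumption: real systems
    typically use content-defined chunking instead)."""
    # max(..., 1) ensures an empty file still yields one (empty) chunk.
    return [hashlib.sha1(data[i:i + chunk_size]).digest()
            for i in range(0, max(len(data), 1), chunk_size)]


def representative_id(ids: List[bytes]) -> bytes:
    """Assumed similarity key: the minimum chunk hash of the file.
    Files sharing many chunks are likely to share this minimum."""
    return min(ids)


def route(rep: bytes, num_nodes: int) -> int:
    """Stateless routing: the node depends only on the file's own
    representative, so no shared state is consulted."""
    return int.from_bytes(rep, "big") % num_nodes


class BackupNode:
    """One autonomous node; bins are keyed by representative ID."""

    def __init__(self) -> None:
        # Stand-in for on-disk bins: representative -> chunk hashes.
        self.bins: Dict[bytes, Set[bytes]] = {}
        self.disk_reads = 0  # counts simulated bin loads

    def _load_bin(self, rep: bytes) -> Set[bytes]:
        self.disk_reads += 1  # exactly one bin load per file
        return self.bins.setdefault(rep, set())

    def backup(self, data: bytes) -> int:
        """Deduplicate one file against a single bin and return the
        number of new unique chunks that had to be stored."""
        ids = chunk_ids(data)
        bin_ = self._load_bin(representative_id(ids))
        new = {c for c in ids if c not in bin_}
        bin_ |= new
        return len(new)
```

    In this sketch, backing up the same file twice on one node stores
its chunks once and finds every chunk already present the second time,
with one simulated bin load per file.  Deduplication happens only
within the selected bin, so a chunk that also exists in a different bin
would be stored again; the sketch trades that possible loss for a
single lookup per file.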