Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

We present _sparse indexing_, a technique that uses sampling and exploits the inherent locality within backup streams to solve the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face at large scale (e.g., hundreds of terabytes of backup data). The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM-to-disk ratio needed for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.
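
The abstract does not give code, but the pipeline it describes (segment the stream, sample chunk hashes, look up samples in a sparse index, deduplicate against the most similar prior segments) can be sketched concretely. The following is a minimal Python illustration under assumptions of our own choosing, not the paper's implementation: we sample a chunk when the low-order bits of its hash are zero, the sampling rate (1/64) and champion count (2) are arbitrary illustrative parameters, in-memory dicts stand in for the on-disk segment manifests, and all names (`sparse_index`, `deduplicate_segment`, etc.) are hypothetical.

```python
import hashlib

SAMPLE_MASK = (1 << 6) - 1  # sample ~1/64 of chunks (illustrative rate)
MAX_CHAMPIONS = 2           # deduplicate against at most this many similar segments

sparse_index = {}   # RAM-resident: sample hash -> IDs of segments containing it
segment_store = {}  # stands in for on-disk manifests: segment ID -> all chunk hashes

def chunk_hash(chunk: bytes) -> int:
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")

def is_sample(h: int) -> bool:
    # A chunk is a sample if the low-order bits of its hash are zero;
    # since hashes are uniform, this gives a predictable, content-based rate.
    return (h & SAMPLE_MASK) == 0

def deduplicate_segment(seg_id: int, chunks: list[bytes]) -> list[bytes]:
    """Return only the chunks of this segment that must be newly stored."""
    hashes = [chunk_hash(c) for c in chunks]
    samples = [h for h in hashes if is_sample(h)]

    # Look up only the samples (a few RAM hits, not one per chunk) and rank
    # prior segments ("champions") by how many samples they share with us.
    votes = {}
    for s in samples:
        for prior in sparse_index.get(s, []):
            votes[prior] = votes.get(prior, 0) + 1
    champions = sorted(votes, key=votes.get, reverse=True)[:MAX_CHAMPIONS]

    # A chunk is treated as a duplicate only if some champion already holds
    # it, so the scheme is near-exact rather than exact deduplication.
    known = set().union(*(segment_store[c] for c in champions)) if champions else set()
    new_chunks = [c for c, h in zip(chunks, hashes) if h not in known]

    # Record this segment: its full manifest goes to the (modeled) store,
    # but only its samples enter the sparse index kept in RAM.
    segment_store[seg_id] = set(hashes)
    for s in samples:
        sparse_index.setdefault(s, []).append(seg_id)
    return new_chunks
```

Sampling by hash bits is deterministic in the chunk contents, so identical chunks always yield identical samples; the locality of backup streams then makes a segment similar to an earlier one likely to share samples with it, which is why consulting only a handful of champion segments recovers most of the deduplication a full chunk index would find.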