Author:       2014-05-27

The explosive growth of digital data has raised new challenges for existing storage systems. Data deduplication is an effective technology to remove redundant information and optimize storage efficiency.

To achieve high-performance in-line deduplication, a space-efficient fast index that can be hosted in-memory is critical important for quickly identifying possible duplicate data elements. Fixed-capacity indexing schemes, such as Bloom Filters (BFs), face two challenges while representing variable large datasets. First, the fast index is in risk of total reconstruction if the elements exceed its capacity, which can consume significant time overhead. Second, memory space is wasted if it takes a long time for the elements to fill the fast index.

The research group of Prof. Ke Zhou in Information Storage and Optical Display Division of Wuhan National Laboratory for Optoelectronics proposes Dynamic Bloom filter Array (DBA) to index variable large datasets and efficiently identify possible duplicate elements.

DBA consists of groups of space-efficient BFs, and its capacity can be extended by dynamically adding new BF groups. Within a group, homogeneous BFs are used and their structure is optimized at the bit level, so that dozens of BFs can be accessed in parallel to achieve high query performance. Using the compact BF structure introduces false positives, but the query accuracy can be effectively controlled by partially adjusting the error rate of the building-block BFs. Each BF is only responsible for representing an independent subset, which helps locate existing elements and confirm membership. Further, DBA supports element deletion by introducing a lazy update policy.

Experimental results on fingerprint sequences of backup datasets show that the multi-group-cardinality DBA approach can maintain a much higher query performance than the existing dynamic Bloom filter (DBF) approach. For example, while scaling up to 160 BFs with an error rate threshold of 1/214, a 64-group-cardinality DBA is able to query 4.5´105 items per second, a query performance that is about three times higher than DBF. DBA is also shown to excel in scalability, query accuracy, and space efficiency by theoretical analysis and experimental evaluation.

The research paper entitled “Efficiently Representing Membership for Variable Large Data Sets” has been published in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 4, pp. 960-970, April 2014.

This research is partially supported by the National Natural Science Foundation of China under Grant No. 61232004, and the National Basic Research Program (973 Program) of China under Grant No. 2011CB302305.