posted on 28 Sep 2015 in data-structures
Approximate Data Structures maintain approximations about incoming streams of data, like:
The problem is that they’re all parameterized by the sizes of the internal bit-arrays, hashes, and whatever else; it’s good that we can tune the parameters, but without a deep understanding of how they work, it’s hard to tell what to use. From an interface perspective, it’d probably be better to insulate our users from the algorithmic details.
So what do we do?
The first obvious thing to do is to look at some block of entries, then make a decision based on the characteristics of those entries: if we assume that we get a constant amount of unique items per batch, , and we know our total number of items has some upper bound, , then we should be able to choose a size appropriate for about items. Z But this doesn’t work without a number upper bound (), which is another parameter our user would have to specify, so what we’d like to do is assert that one exists then behave accordingly. We’d like to batch our input with size , then join the resulting data structures in some way.
Here’s the leap
Most of these data structures are linear:
In the case of “About how many did we see?” (CountMinSketch), we can just sum the counts from multiple instances of the data structure.
In the case of “Could we have seen before?” (BloomFilter), we can just or the outputs of multiple instances of the data structure.
In the case of “About how many unique elements have we seen” (HyperLogLog), we can have each data structure keep a sufficiently-large Bloom Filter and exclude elements from the first batch from the input to the next batches.
As an immediate consequence of being able to combine data structures, we should be able to change . We can control it based on the way the current set of data looks: if we’re seeing a lot of new elements, we can increase the size parameter a lot.
The simple case of this kind of control is the dynamic dictionary, as used in python, but this can be done in much more sophisticated ways: proportional control on the log-(# elements) is the easiest adjustment (Note that most of our savings here come from the fact that the storage is much smaller than the number of operations, which is why operating in log-space can be more efficient)
The result of using this kind of scalably self-adjusting data structure is a manageable infrastructure, one that maybe not our users, but definitely our ops people, can actually use.
☐