#### Cheating the Monte-Carlo speed-bump.

posted on 29 Sep 2015 in theory

The idea of Monte-Carlo is really simple: generate a lot of random inputs for your algorithm, and collect them together to get numbers that describe its performance, like average runtime. For most people, this means somewhere they’ll insert a line like:

for x in genRandoms(500):


But if the inside of that loop is slow, we don’t want to be overestimating the number of times we run it. How do we know that we need exactly 500 random numbers? Which random numbers should we be choosing?

Should we be choosing random numbers at all?

It’s not a crazy question. Ordinary sampling will take a long time to converge to an approximate empirical distribution: if we think of $N$ histogram bins as Poisson processes, then we need on the order of $10^4 N$ samples to get 1% standard deviation.

So for uniform distributions, maybe we can just use range a lot of the time. Laying down tracks like that is called Quasi-Monte-Carlo, and is – so I’m told – used a lot in finance.

But let’s see if we can’t come up with a proper theory, without the heuristic of “yeah, just spread them out evenly”, for a general distribution. The problem is: which numbers do you give the function we want to randomize, so that the result of the composite procedure looks like that of the randomized one.

Well really what having more samples gets us is that we get more information at the output. This is exactly how much different the actual distribution is from what we would predict:

$$\phi \equiv p(x) - interpolate(x)$$

The logical nuance is that the interpolation is arbitrary: if our function is about linear at some input resolution $\lambda$, we should be linearly interpolating at resolution $\lambda$. The same thing goes if we know that the distribution looks like a mixture of Gaussians: we can use a gaussian kernel for the same purpose.

$\phi$ is really just our error from the distribution itself, up to some interpolation function.

Once we have a sample-list with sufficiently low $\phi$, we can use it in place of our distribution, and it will compose the same way ordinary distributions compose: we can add them by concatenating the lists; we can multiply them by taking the cross-product; and so on.

Note that I haven’t even touched sample weighting: in reality, we would weight each sample by its probability, which prevents us having to deal with the fact that without that, probabilities decrease as we add samples.

Doing things this way, especially in the case of easily-analyzable distributions, can multiply the performance of a normal Monte-Carlo procedure many times, keeping the easy parallelizability, without introducing the complexity of something like the Method of Characteristics.

A really interesting recent idea with a similar motivation is Machine Teaching, which does an analogous optimization in the case of ML training data, as opposed to the input to some function to be Monte-Carloed.

#### Making Approximate Data-structures Hierarchical.

posted on 28 Sep 2015 in data-structures

Approximate Data Structures maintain approximations about incoming streams of data, like:

• Could we have we seen $X$ before? (Bloom filters)
• About how many unique things have we seen? (HyperLogLog)
• About how many of $X$ did we see? (CountMinSketch)

The problem is that they’re all parameterized by the sizes of the internal bit-arrays, hashes, and whatever else; it’s good that we can tune the parameters, but without a deep understanding of how they work, it’s hard to tell what to use. From an interface perspective, it’d probably be better to insulate our users from the algorithmic details.

So what do we do?

The first obvious thing to do is to look at some block of entries, then make a decision based on the characteristics of those entries: if we assume that we get a constant amount of unique items per batch, $n$, and we know our total number of items has some upper bound, $T$, then we should be able to choose a size appropriate for about $T/n$ items.

But this doesn’t work without a number upper bound ($T$), which is another parameter our user would have to specify, so what we’d like to do is assert that one exists then behave accordingly. We’d like to batch our input with size $T$, then join the resulting data structures in some way.

Here’s the leap

Most of these data structures are linear:

• In the case of “About how many $X$ did we see?” (CountMinSketch), we can just sum the counts from multiple instances of the data structure.

• In the case of “Could we have seen $X$ before?” (BloomFilter), we can just or the outputs of multiple instances of the data structure.

• In the case of “About how many unique elements have we seen” (HyperLogLog), we can have each data structure keep a sufficiently-large Bloom Filter and exclude elements from the first batch from the input to the next batches.

As an immediate consequence of being able to combine data structures, we should be able to change $T$. We can control it based on the way the current set of data looks: if we’re seeing a lot of new elements, we can increase the size parameter a lot.

The simple case of this kind of control is the dynamic dictionary, as used in python, but this can be done in much more sophisticated ways: proportional control on the log-(# elements) is the easiest adjustment (Note that most of our savings here come from the fact that the storage is much smaller than the number of operations, which is why operating in log-space can be more efficient)

The result of using this kind of scalably self-adjusting data structure is a manageable infrastructure, one that maybe not our users, but definitely our ops people, can actually use.