posted on 29 Sep 2015 in theory

The idea of Monte-Carlo is really simple: generate a lot of random inputs for your algorithm, and collect them together to get numbers that describe its performance, like average runtime. For most people, this means somewhere they’ll insert a line like:

for x in genRandoms(500):


But if the inside of that loop is slow, we don’t want to be overestimating the number of times we run it. How do we know that we need exactly 500 random numbers? Which random numbers should we be choosing?

Should we be choosing random numbers at all?

It’s not a crazy question. Ordinary sampling will take a long time to converge to an approximate empirical distribution: if we think of $N$ histogram bins as Poisson processes, then we need on the order of $10^4 N$ samples to get 1% standard deviation.

So for uniform distributions, maybe we can just use range a lot of the time. Laying down tracks like that is called Quasi-Monte-Carlo, and is – so I’m told – used a lot in finance.

But let’s see if we can’t come up with a proper theory, without the heuristic of “yeah, just spread them out evenly”, for a general distribution. The problem is: which numbers do you give the function we want to randomize, so that the result of the composite procedure looks like that of the randomized one.

Well really what having more samples gets us is that we get more information at the output. This is exactly how much different the actual distribution is from what we would predict:

$$\phi \equiv p(x) - interpolate(x)$$

The logical nuance is that the interpolation is arbitrary: if our function is about linear at some input resolution $\lambda$, we should be linearly interpolating at resolution $\lambda$. The same thing goes if we know that the distribution looks like a mixture of Gaussians: we can use a gaussian kernel for the same purpose.

$\phi$ is really just our error from the distribution itself, up to some interpolation function.

Once we have a sample-list with sufficiently low $\phi$, we can use it in place of our distribution, and it will compose the same way ordinary distributions compose: we can add them by concatenating the lists; we can multiply them by taking the cross-product; and so on.

Note that I haven’t even touched sample weighting: in reality, we would weight each sample by its probability, which prevents us having to deal with the fact that without that, probabilities decrease as we add samples.

Doing things this way, especially in the case of easily-analyzable distributions, can multiply the performance of a normal Monte-Carlo procedure many times, keeping the easy parallelizability, without introducing the complexity of something like the Method of Characteristics.

A really interesting recent idea with a similar motivation is Machine Teaching, which does an analogous optimization in the case of ML training data, as opposed to the input to some function to be Monte-Carloed.