Have you ever wondered how to sample (pseudo)random numbers on a computer from a nonuniform distribution? I have. I’ll try to explain how it’s done, and illustrate the method using Python.

If you are familiar with the excellent NumPy package, you’ve most likely also discovered the numpy.random module, which offers a broad variety of functions for obtaining random numbers from a large number of distributions. But what if your favourite distribution is not in that list? And what if you only know your distribution at discrete points between which you have to interpolate?

Well, one thing at a time.

## The inverse transform sampling method

A method to sample random numbers from any distribution using uniformly distributed random numbers is called inverse transform sampling, and it relies on knowing the cumulative distribution function of our distribution $f(x)$, which we may call $F(x)$. The inverse transform method works by sampling a uniformly distributed random number $u \in [0, 1)$ and then finding the value $x$ for which $F(x) = u$. Sounds easy enough, so let’s look at an example:

Let’s say our distribution is the exponential distribution,

$f(x) = \lambda \exp(-\lambda x)$ for $x \geq 0$.

We can analytically calculate the cumulative distribution function to be

$F(x) = \int_0^x f(x')\,\mathrm{d}x' = 1 - \exp(-\lambda x)$.

Here’s a plot of our distribution and its cumulative distribution for different values of $\lambda$:

Now, if we have a random number $u$, and we want to find the $x$ for which

$F(x) = 1 - \exp(-\lambda x) = u,$

we have to solve the above expression for $x$, which yields

$x = -\frac{\ln(1-u)}{\lambda},$

the *inverse* of the cumulative distribution function. In order to sample random numbers from $f(x)$ we have to generate some uniformly distributed random numbers $u \in [0, 1)$ and just plug them into the above equation. The resulting numbers will be distributed according to $f(x)$. We can easily convince ourselves of this using the following Python script:

```python
import numpy as np
import matplotlib.pyplot as plt

def pdf(x, l):
    """
    Exponential probability distribution function
    @param x x-value
    @param l lambda
    """
    return l*np.exp(-l*x)

def inv_cpdf(x, l):
    """
    Inverse of the cumulative distribution function
    of the exponential distribution
    @param x x-value
    @param l lambda
    """
    return -np.log(1-x)/l

# let's use lambda=1 for now
l = 1.

# generate 10000 random numbers in [0,1)
r = np.random.random(10000)

# put r into inv_cpdf
r_exp = inv_cpdf(r, l)

# x axis for plot
x = np.linspace(0, 5, 20)

# create figure
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, pdf(x, l), label=r"$f(x)=\lambda\cdot\exp(-\lambda x)$")
# 'density' replaces the old 'normed' keyword, which was
# removed in Matplotlib 3
ax.hist(r_exp, bins=200, range=(x[0], x[-1]), histtype='step',
        density=True, label="Histogram")
ax.set_xlabel(r"$x$")
ax.set_ylabel("PDF")
ax.grid(True)
ax.legend()
fig.savefig("hist_exponential.png")
plt.show()
```

First, the distribution `pdf` and the inverse of its cumulative distribution `inv_cpdf` are defined. We then generate 10000 uniformly distributed random numbers, which we plug into our inverse cumulative distribution. The rest of the code is basically just for plotting: we make a normalized histogram of our transformed random numbers and plot the original distribution for comparison:

Nice! The histogram traces out the original distribution, which means that our transformed random numbers are actually distributed according to $f(x)$!
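If we want a quantitative check on top of the visual one (not part of the original script), a Kolmogorov–Smirnov test from `scipy.stats` can compare the sample against the exponential distribution directly; here's a sketch:

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # fixed seed for reproducibility
l = 1.

# inverse-CDF samples, exactly as in the script above
r_exp = -np.log(1. - np.random.random(10000))/l

# two-sided KS test against the exponential distribution
# with loc=0 and scale=1/lambda
result = stats.kstest(r_exp, 'expon', args=(0., 1./l))
print(result.statistic, result.pvalue)
```

A small KS statistic (and a p-value that isn't tiny) means the test finds no evidence against the sample coming from the exponential distribution.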

## Inverting means flipping your axes

But what do we do if we cannot analytically calculate the cumulative distribution function? Or if we cannot analytically solve $F(x) = u$ for $x$? Well, we can do so numerically. In the world of Python there’s the SciPy package, which offers a function for evaluating cumulative integrals (`scipy.integrate.cumulative_trapezoid`), and the equation $F(x) = u$ can be rearranged to

$F(x) - u = 0$

and then solved by e.g. Newton’s method, of which there’s also an implementation available in SciPy (`scipy.optimize.newton`).
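As a sketch of this root-finding approach, here's the exponential distribution from above again, but sampled by solving $F(x) - u = 0$ with Newton's method instead of using the analytic inverse (the helper names `cdf` and `sample_one` are my own, not from the original post):

```python
import numpy as np
from scipy.optimize import newton

def pdf(x, l=1.):
    """Exponential probability distribution function."""
    return l*np.exp(-l*x)

def cdf(x, l=1.):
    """Its cumulative distribution function; in general this
    could also come from a numerical cumulative integral."""
    return 1. - np.exp(-l*x)

def sample_one(u, l=1.):
    """Solve F(x) - u = 0 for x with Newton's method.
    The derivative of the CDF is just the PDF, so we can
    hand it to the solver as fprime."""
    return newton(lambda x: cdf(x, l) - u, x0=1.,
                  fprime=lambda x: pdf(x, l))

u = np.random.random()
x = sample_one(u)
# compare against the analytic inverse -ln(1-u)/lambda
print(abs(x - (-np.log(1.-u))))
```

For a distribution known only numerically, `cdf` would wrap a cumulative integral instead of the closed-form expression, and the same root-finding call goes through unchanged, though at the cost of one root search per sample.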

But there’s an easier way, especially if you only know your distribution at a few points between which you have to interpolate.

What we did when we solved

$F(x) = u$

was *inverting* $F$. That’s why the method outlined above is called inverse transform sampling. The inverse of a function $F$ (if it has one) is usually denoted by $F^{-1}$ (not to be confused with $1/F$!), and its graph looks just like the one of $F$, just with $x$ and $y$ axes swapped, or mirrored at the line defined by $y = x$, here illustrated using the cumulative distribution function from above:

Now, as we’re using a computer, we can easily flip the axes of a graph, and as such also easily generate an approximation of the inverse of $F$: we evaluate $F$ at a number of points $x_i$ in the region of interest and calculate an interpolation (e.g. a linear one), where we use the values $F(x_i)$ as the $x$-values and the points $x_i$ as the $y$-values.

So let’s test this method. Here’s the Python code:

```python
import numpy as np
# cumulative_trapezoid was called cumtrapz in older SciPy versions
from scipy.integrate import cumulative_trapezoid
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 50)  # x axis

# some funky function
y = np.abs((1 - .1*np.sin(2.*np.pi*x))*(np.exp(-1.*np.abs(x))))

# cumulative distribution function, with both the distribution
# and its cumulative integral normalized so the latter ends at 1
cy = cumulative_trapezoid(y, x, initial=0.)
y = y/cy[-1]
cy = cy/cy[-1]

# plots
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.set_title("Probability distribution")
ax1.plot(x, y, label="Probability distribution")
ax1.plot(x, cy, label="Cumulative PDF")
ax1.legend()
ax1.set_xlabel(r"$x$")
ax1.set_ylabel(r"$y$")
ax1.grid(True)
fig1.savefig("funky_distribution_function.png")

# swap x and y axis and interpolate
inv_cpdf = interp1d(cy, x)

# generate 100000 random numbers
r = np.random.random(100000)

# map random numbers onto our funky distribution
r_dist = inv_cpdf(r)

# make histogram ('density' replaces the old 'normed' keyword)
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)
ax2.plot(x, y, label="Probability distribution")
ax2.hist(r_dist, bins=200, range=(x[0], x[-1]),
         density=True, label="Histogram")
ax2.set_xlabel(r"$x$")
ax2.set_ylabel("PDF")
ax2.legend()
ax2.grid(True)
fig2.savefig("hist_funky.png")
plt.show()
```

First we define our probability distribution as some funny function, which looks like this (blue curve):

Then we calculate the cumulative distribution (orange curve) and create the above plot. The call to `scipy.interpolate.interp1d` is where the magic takes place: we calculate an interpolation with the cumulative distribution as the interpolation’s $x$-values and the $x$-axis on which we defined the distribution as the interpolation’s $y$-values. Then we generate a few random numbers and put them into our interpolated inverse of the cumulative distribution, just as we did above, only there we had an analytic expression for $F^{-1}$. The rest is just plotting again, generating a histogram to illustrate that the method actually works:

Cool, isn’t it? Of course, instead of generating our points using some funny function, we could just as well have used data points from somewhere else.
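For instance, with tabulated data (the values below are made up for illustration) the same recipe applies: build the cumulative integral, normalize it, swap the axes, interpolate:

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.interpolate import interp1d

# hypothetical tabulated data, e.g. from a measurement
x_data = np.array([0., .5, 1., 1.5, 2., 2.5, 3.])
y_data = np.array([.1, .7, 1., .8, .4, .2, .05])

# cumulative integral, normalized so it runs from 0 to 1
cy = cumulative_trapezoid(y_data, x_data, initial=0.)
cy = cy/cy[-1]

# swap axes: the interpolation approximates the inverse CDF
inv_cpdf = interp1d(cy, x_data)

# draw samples distributed according to the tabulated data
samples = inv_cpdf(np.random.random(10000))
```

Note that this only works if the tabulated values are nonnegative, so that the cumulative distribution is monotonically increasing and thus invertible.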

I hope you enjoyed reading this.

Have a nice day!

*During the writing of this post I found this and this article, two of probably many on the same subject, so I’m not under the illusion that the methods discussed here are in any way original ideas of mine. However, as I’ve heard of these methods somewhere else before stumbling across these articles, I’ve not cited them as sources here.*