Reservoir Sampling

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory.

Algorithm to randomly select k samples from n elements:

  1. Take an array reservoir[] of length k and initialize it with the first k elements from the given array input[] of size n.
  2. for index i := k to (n - 1) repeat:
    • Generate a random number. Let's say the random number is m.
      If m > k : do nothing
      else: swap reservoir[m] with input[i].
  3. return reservoir array.

Login to Access Content

We could also implement the algorithm recursively as shown below.

Suppose we have an algorithm that can pull a random set of k elements from an array of size (n - 1). How can you use this algorithm to pull a random set of k elements from an array of size n ?

We can first pull a random set of size k, say reservoir[0...(k - 1)], from the first (n - 1) elements. Then, we just need to decide if array[n] should be inserted into our subset (which would require pulling out a random element from it, to keep the size k). An easy way to do this is to pick a random number m from 0 through n. If m < k, then replace reservoir[m] with array[n].

Login to Access Content

Related Chapter:


If you have any feedback, please use this form:

Help Your Friends save 25% on our products