The code for stratified sampling with MapReduce is on GitHub.
Why would I sample my data?
There are many reasons to sample your data. For example, let’s say you have a big pile of data and you would like to train a machine learning algorithm on it. The thing is, training your model on all the data would take too long. Besides, you don’t need all of it while you are still in the prototyping phase. What you want is to generate a smaller dataset that is still representative of the entire population. The resulting dataset has the same distribution as the original dataset, only smaller. That way you can feed it to your model more easily.
Your data’s statistical distribution matters
A good way to sample your original dataset while preserving its statistical distribution is Stratified Sampling.
In this post, I compare two different sampling techniques: Random Sampling and Stratified Sampling. I hope to highlight how useful it can be to use Stratified Sampling instead of Random Sampling when dealing with large datasets with an imbalanced statistical distribution.
The content of this post is based on a presentation I gave at Hacker Dojo on this subject.
Random Sampling VS Stratified Sampling
Let’s assume we want to sample a population of N elements with a sampling rate of R. We’ll end up with S = R * N samples.
- Simple Random Sampling: Randomly pick your S elements.
- Stratified Sampling: Pick your S elements according to the statistical distribution of the population.
Random Sampling is fast and simple, but it tends to miss infrequent elements: a class that makes up 1% of the population may get only a handful of samples, or none at all. In our example, where the samples will be used as a training set to build a classifier, it is likely that the model will be poorly trained. It will overfit on the most represented class and fail to learn the least frequent one. To keep the original data distribution, we need to use Stratified Sampling.
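To make the difference concrete, here is a minimal sketch in plain Python (no MapReduce yet). The dataset, the class names `common` and `rare`, and the helper names are all hypothetical, and keeping at least one element per class is my own assumption, not something the technique requires.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(items, rate, rng):
    """Pick round(rate * N) items uniformly at random, ignoring classes."""
    size = round(rate * len(items))
    return rng.sample(items, size)


def stratified_sample(items, rate, rng, key):
    """Sample each class (stratum) separately, all at the same rate."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one element so a rare class is never lost.
        size = max(1, round(rate * len(members)))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical imbalanced dataset: 9,900 "common" records vs 100 "rare" ones.
population = [("common", i) for i in range(9900)] + [("rare", i) for i in range(100)]
rng = random.Random(0)

random_counts = Counter(
    label for label, _ in simple_random_sample(population, 0.01, rng)
)
stratified_counts = Counter(
    label
    for label, _ in stratified_sample(population, 0.01, rng, key=lambda item: item[0])
)
print("random:    ", dict(random_counts))
print("stratified:", dict(stratified_counts))
```

With a 1% sampling rate, the stratified sample always contains 99 `common` records and 1 `rare` record. The random sample also has 100 records, but how many of them are `rare` depends on the draw and can be zero.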
So in practice, what’s the difference between Random and Stratified Sampling? Have a look at the following videos.
Stratified Sampling with MapReduce
In Stratified Sampling, each subpopulation is sampled independently. Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. A homogeneous subgroup of members of the population is called a stratum (plural: strata).
The strata should be:
- Mutually exclusive. Every element in the population must be assigned to only one stratum.
- Collectively exhaustive. No population element can be excluded.
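These two properties are easy to verify in code. The sketch below assumes each stratum is stored as a list of record indices; the function name `check_strata` and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def check_strata(n_records, strata):
    """Return (mutually_exclusive, collectively_exhaustive) for a stratification.

    strata maps a stratum name to the list of record indices it contains.
    """
    seen = Counter(index for members in strata.values() for index in members)
    # Mutually exclusive: no record index appears more than once overall.
    mutually_exclusive = all(count == 1 for count in seen.values())
    # Collectively exhaustive: every record index appears somewhere.
    collectively_exhaustive = set(seen) == set(range(n_records))
    return mutually_exclusive, collectively_exhaustive


# Grouping by class label yields a valid stratification by construction.
labels = ["spam", "ham", "ham", "spam", "ham"]
by_label = defaultdict(list)
for index, label in enumerate(labels):
    by_label[label].append(index)

valid = check_strata(len(labels), by_label)

# A broken stratification: record 1 is in two strata and record 3 is in none.
broken = check_strata(4, {"a": [0, 1], "b": [1, 2]})
```

Here `valid` is `(True, True)` and `broken` is `(False, False)`.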
Stratified Sampling in 3 steps
- Create the strata
- A stratum is made of elements belonging to the same class
- Apply Simple Random sampling to each stratum
That’s it! We need a first MapReduce step to construct all the strata, and then a second MapReduce step where we apply random sampling to each stratum. Note that the two steps can be combined for more efficiency.
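The combined version can be sketched as a single map/reduce pass. This is a plain-Python simulation written for illustration, not the code from the repository and not MRJob's API: the `mapper`, `reducer`, and `run_job` names are assumptions, and the shuffle phase is faked with an in-memory dictionary.

```python
import random
from collections import defaultdict


def mapper(record):
    """Build the strata: emit (stratum key, record), keyed by class label."""
    label, _features = record
    yield label, record


def reducer(label, records, rate, rng):
    """Apply simple random sampling inside one stratum."""
    records = list(records)
    size = round(rate * len(records))
    for record in rng.sample(records, size):
        yield label, record


def run_job(dataset, rate, seed=0):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in dataset:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key], rate, rng))
    return output


# Hypothetical dataset: 800 records of class "A" and 200 of class "B".
dataset = [("A", i) for i in range(800)] + [("B", i) for i in range(200)]
sample = run_job(dataset, rate=0.1)
```

The mapper creates the strata by using the class as the key, and the reducer receives one stratum at a time and samples it. With a rate of 0.1, `sample` holds 80 records from class `A` and 20 from class `B`, the same 80/20 split as the full dataset.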
Check out the Stratified Sampling code on GitHub. Bonus: in there, you’ll find plenty of other Machine Learning algorithms that have been broken down to work on top of MapReduce, using the MRJob framework.