Bootstrapping — What is the goal?
What is Bootstrapping?
To understand Bootstrapping, let us start with a simple problem:
We are given a bunch of prices of houses, and we want to know the median price of a house.
It is easy to compute the median directly, but how can we compute the error bars?
If we wanted the mean instead, we could make some assumptions (independent samples, an approximately normal sampling distribution) and use the standard error of the mean to get error bars.
However, no equally simple formula exists for the median.
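To make the contrast concrete, here is a minimal sketch of the standard technique for the mean: the standard error, computed as the sample standard deviation divided by the square root of the sample size. The prices are made-up values for illustration only, and the 95% interval relies on the normal approximation, which is an assumption and not something the data guarantees.

```python
import math
import statistics

# Hypothetical house prices (in thousands); illustrative values only.
prices = [210, 250, 275, 300, 320, 410, 950]

n = len(prices)
mean = statistics.mean(prices)

# Sample standard deviation (n - 1 in the denominator).
s = statistics.stdev(prices)

# Standard error of the mean, assuming independent, identically distributed draws.
sem = s / math.sqrt(n)

# Approximate 95% interval under the normal approximation (1.96 standard errors).
low, high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"mean = {mean:.1f}, standard error = {sem:.1f}")
print(f"approx. 95% interval: ({low:.1f}, {high:.1f})")
```

There is no comparably short closed-form recipe for the median, which is what motivates the rest of this discussion.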
In general, when there is no explicit formula for the distribution of errors, there is no simple way to assess the accuracy of a measured value.
However, if we had infinite data, it would be easy to solve this problem:
Measure the quantity on many independent datasets of the same fixed size.
Use the empirical distribution of those measurements as the distribution of the quantity.
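The two steps above can be sketched in code if we pretend, for a moment, that we can draw as many fresh datasets as we like. The log-normal population, its parameters, and the name draw_dataset are all hypothetical choices made for illustration; in practice the true population is unknown, which is exactly the difficulty.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

MU, SIGMA = 12.5, 0.5   # hypothetical log-normal parameters (an assumption)

def draw_dataset(n):
    """Draw one fresh dataset of n prices from the assumed 'true' population."""
    return [rng.lognormvariate(MU, SIGMA) for _ in range(n)]

n = 1000             # fixed dataset size
num_datasets = 1000  # number of independent datasets

# Step 1: measure the quantity (here, the median) on each independent dataset.
medians = [statistics.median(draw_dataset(n)) for _ in range(num_datasets)]

# Step 2: use the empirical distribution of those medians.
# quantiles(..., n=40) returns 39 cut points; the first and last are the
# 2.5% and 97.5% points, i.e. the central 95% of the medians.
cuts = statistics.quantiles(medians, n=40)
low, high = cuts[0], cuts[-1]

true_median = math.exp(MU)  # median of a log-normal distribution is exp(mu)
print(f"true median:           {true_median:,.0f}")
print(f"central 95% of medians: ({low:,.0f}, {high:,.0f})")
```

The width of that central range is the error bar we are after. The catch is that it needed 1000 datasets, not one.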
The problem here is that we will never have infinite data!
We might get our 1000 data points once, and then have to work from that.
The question becomes:
“How can we expand a single fixed dataset to treat it like 1000 independent ones?”
There is a solution!!
Sampling with Replacement
What happens if we treat our data as the true distribution, and draw synthetic datasets from it?
To create synthetic datasets, we sample with replacement from our dataset, drawing as many points as the original contains:
Given dataset:
[1, 2, 4, 5, 7, 9, 10]
Median: 5
Potential samples with associated medians:
[1, 1, 2, 4, 9, 10, 10], median: 4
[2, 4, 5, 5, 7, 7, 7], median: 5
[1, 1, 1, 1, 1, 1, 1], median: 1
[1, 2, 4, 5, 7, 9, 10], median: 5
… and so on
The distribution of these medians gives us an estimate of the true sampling distribution of the median for a dataset of this size, and its spread gives us the error bars we were looking for.
[Figure: true distribution of the medians]
[Figure: bootstrapped distribution of the medians]
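The resampling procedure described above can be sketched as follows, using the seven-point dataset from the example. The function name bootstrap_medians, the number of resamples, and the choice of a simple percentile interval are assumptions made for illustration; other bootstrap interval constructions exist.

```python
import random
import statistics
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

data = [1, 2, 4, 5, 7, 9, 10]
num_resamples = 10_000  # illustrative choice

def bootstrap_medians(values, num_resamples, rng):
    """Median of each synthetic dataset drawn with replacement from values."""
    n = len(values)
    # rng.choices samples with replacement; k=n keeps the original size.
    return [statistics.median(rng.choices(values, k=n))
            for _ in range(num_resamples)]

medians = bootstrap_medians(data, num_resamples, rng)

# Empirical distribution of the bootstrapped medians.
counts = Counter(medians)
for value in sorted(counts):
    print(f"median {value:>2}: {counts[value] / num_resamples:.3f}")

# Percentile interval: the central 95% of the bootstrapped medians.
cuts = statistics.quantiles(medians, n=40)
print(f"observed median: {statistics.median(data)}")
print(f"approx. 95% percentile interval: ({cuts[0]}, {cuts[-1]})")
```

Because the dataset has an odd number of points, every bootstrapped median is one of the original values, so the resulting distribution is quite coarse. With a larger dataset it becomes much smoother.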