A rough way to determine the number of buckets/classes to split a dataset into, for frequency distribution analysis / histograms:
2ⁿ⁻¹ < α < 2ⁿ
where α is the size of the dataset.
In other words, find the smallest power of 2 that is larger than the size of your dataset; decrementing that exponent yields a number smaller than it. The exponent n is your bucket count.
So:
- 2ⁿ ≈ α
- n ln 2 ≈ ln α
- n ≈ ln α / ln 2
For example, with a dataset of 345,000 data points:
- n ≈ ln 345000 / ln 2
- n ≈ 18.4
Rounding up, the number of buckets should therefore be 19: 2¹⁹ = 524288 is larger than the dataset, while 2¹⁸ = 262144 is smaller.
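The rule above can be sketched in a few lines of Python (the function name `bucket_count` is my own; the computation is just the ceiling of log₂ of the dataset size):

```python
import math

def bucket_count(n_points: int) -> int:
    """Smallest n such that 2**n >= n_points, i.e. ceil(log2(n_points))."""
    return max(1, math.ceil(math.log2(n_points)))

print(bucket_count(345_000))  # 19, matching the worked example
```

For a dataset that is exactly a power of 2 (say 262,144 points), `ceil(log2(...))` returns the exponent itself rather than one more, which still satisfies 2ⁿ ≥ α.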
- Wish I could remember where I found this.
- It's a good rule of thumb for ballparking, but I've always ended up adding one or two buckets to it for a tighter fit.
- Somewhere there's a more 'correct' algorithm for buckets of uneven widths
- Maybe based on clustering/density of points within each bucket?
- e.g. keep each bar's 'area' the same, so bars in sparse parts of the dataset don't get very tall
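One way to make that last idea concrete (this is my own sketch, not the 'correct' algorithm the note alludes to) is quantile-based bucketing: place bucket edges at equal quantiles of the sorted data, so every bucket holds roughly the same number of points. Buckets widen in sparse regions and narrow in dense ones, and plotted as a density histogram every bar then has roughly equal area.

```python
def quantile_edges(data, k):
    """Bucket edges at the i/k quantiles of the data.

    Each bucket holds roughly len(data) / k points, so buckets are wide
    where the data is sparse and narrow where it is dense.
    """
    s = sorted(data)
    n = len(s)
    edges = [s[0]]
    for i in range(1, k):
        edges.append(s[(i * n) // k])  # approximate i/k quantile
    edges.append(s[-1])
    return edges

print(quantile_edges(list(range(100)), 4))
```

A ready-made alternative, if NumPy is available, is `numpy.quantile(data, np.linspace(0, 1, k + 1))`.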