A rough way to determine the number of buckets/classes to split a dataset into, for frequency distribution analysis / histograms:
2ⁿ⁻¹ < α < 2ⁿ
where α is the size of the dataset.
In other words, find the smallest power of 2 that is larger than the size of your dataset; decrementing that exponent yields a number smaller than it. The exponent n is your bucket count.
So:
- 2ⁿ ≈ α
- n ln 2 ≈ ln α
- n ≈ ln α / ln 2
For example, with a dataset of 345,000 data points:
- n ≈ ln 345000 / ln 2
- n ≈ 18.4
Rounding up, the number of buckets should therefore be 19: 2¹⁹ = 524288 is larger than the dataset, while 2¹⁸ = 262144 is smaller.
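The rule above can be sketched in a few lines of Python (the function name `bucket_count` is my own; the computation is just the ceiling of log₂ of the dataset size):

```python
import math

def bucket_count(n_points: int) -> int:
    """Smallest n such that 2**n >= n_points, i.e. ceil(log2(n_points))."""
    return max(1, math.ceil(math.log2(n_points)))

print(bucket_count(345_000))  # 19, matching the worked example
```

For a dataset that is exactly a power of 2 (say 262,144 points), `ceil(log2(...))` returns the exponent itself rather than one more, which still satisfies 2ⁿ ≥ α.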
- Wish I could remember where I found this.
- It's a good rule of thumb for ballparking, but I've always ended up adding one or two buckets to it for a tighter fit.
- Somewhere there's a more 'correct' algorithm for buckets of uneven widths
- Maybe based on clustering/density of points within each bucket?
- e.g. keep each bar's 'area' the same, so bars in sparse parts of the dataset don't get very tall
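One way to make that last idea concrete (this is my own sketch, not the 'correct' algorithm the note alludes to) is quantile-based bucketing: place bucket edges at equal quantiles of the sorted data, so every bucket holds roughly the same number of points. Buckets widen in sparse regions and narrow in dense ones, and plotted as a density histogram every bar then has roughly equal area.

```python
def quantile_edges(data, k):
    """Bucket edges at the i/k quantiles of the data.

    Each bucket holds roughly len(data) / k points, so buckets are wide
    where the data is sparse and narrow where it is dense.
    """
    s = sorted(data)
    n = len(s)
    edges = [s[0]]
    for i in range(1, k):
        edges.append(s[(i * n) // k])  # approximate i/k quantile
    edges.append(s[-1])
    return edges

print(quantile_edges(list(range(100)), 4))
```

A ready-made alternative, if NumPy is available, is `numpy.quantile(data, np.linspace(0, 1, k + 1))`.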