use "MathJax Plugin for Github" Chrome extension for Equation support
author: Evan Haas
2009.12.09
Source
In this three part series I’m going to talk about statistics in the context of A/B Testing. Part I discusses how to analyze experiments using traditional techniques from the frequentist school. Part II will discuss the Bayesian approach, and Part III will provide an implementation of the Bayesian method. Much of the information is adapted from the excellent Information Theory, Inference, and Learning Algorithms by David MacKay, chapter 37.
For simplicity, I’m going to talk about the simplest case, which is when only one test is run at a time, the two alternatives are assumed not to interact with each other, and the outcome is binary – mathematicians call this a Bernoulli process. Working with these assumptions, we can model each of the alternatives as a binomial distribution with an unknown rate of success. Alternative A has its own unknown success rate, and so does alternative B.
To review, the traditional way to answer this question would be to assume a Gaussian distribution, since it is a good approximation of the binomial distribution with a large enough sample, and use the null hypothesis that the two conversion rates are equal. From the observed counts we then compute a z statistic measuring how many standard errors apart the two observed rates are.
Finally, using this value we look in our handy Standard normal table and find the p-value, to determine the probability that we got this result by chance, assuming the truth of the null hypothesis. For example, let’s say we look up the value in the table and see that our p-value is 0.04. This means that if we were to run our experiment a large number of times and the null hypothesis were true, about 4% of those times we would see results at least as extreme as the ones we recorded. Since a p-value of 0.05 is typically accepted as the threshold for statistical significance, we would conclude that the null hypothesis can be rejected – there is a statistically significant difference between the two alternatives.
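For concreteness, here is a small Python sketch of that textbook procedure (the pooled two-proportion z-test); the variable names and counts are mine, not from the original post:

from __future__ import division
import math

def z_test(s1, n1, s2, n2):
    # Two-sided z-test for the difference between two conversion rates.
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)                 # common rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF, written via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

print(z_test(50, 1000, 70, 1000))    # e.g. 5% vs 7% observed conversion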
Overall it’s a pretty simple procedure, so what are the downsides? The first is that approximating a binomial distribution with a normal distribution is not exact. There are a number of rules that can be used to determine when the approximation is valid – see Wikipedia for a discussion of these rules. A second problem is that this simple test doesn’t tell us anything about the magnitude of the difference between the two alternatives.
To solve these problems we will turn to Bayesian methods in part II.
author: Evan Haas
2010.01.28
Source
In part one we learned how to determine whether one alternative is better than another using classical statistical methods. While these methods are easy to perform, they unfortunately don’t answer the questions that we intuitively want to see answered – “What is the probability that A is better than B?” or “How much better is A than B?”. Remember that the z-test from part one only tells us how confident we can be in rejecting the null hypothesis that the two alternatives are equal.
Enter Bayes’ theorem. Using this magical piece of mathematics, we can actually give an exact answer to the question, “What is the probability, given these results, that A has a higher conversion frequency than B?” In the following equations, for convenience, I will refer to the probability of conversion for alternative A as $p_A$ and for alternative B as $p_B$.
The value we are interested in is the posterior probability distribution of $p_A$ and $p_B$ given the observed data.
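Spelled out for this problem (writing $D$ for the observed data, and $p_A$, $p_B$ as above), Bayes’ theorem gives

$$P(p_A, p_B \mid D) = \frac{P(D \mid p_A, p_B)\, P(p_A, p_B)}{P(D)}.$$

The three pieces on the right-hand side are exactly the distributions written down next.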
Ok, so how do we actually use this? Let’s start by writing down the probability distributions that we know.
The easiest one to write is the joint prior distribution for $p_A$ and $p_B$, which captures what we believe about the two conversion rates before seeing any data.
Next we have the probability of the observed data – the evidence, which acts as the normalizing constant in Bayes’ theorem.
The final component is called the likelihood: the probability of observing the data we recorded, given particular values of $p_A$ and $p_B$.
Putting it all together with Bayes’ theorem gives us the joint posterior distribution of $p_A$ and $p_B$. Now that we have the joint posterior, the probability that A beats B is found by integrating it over the region where $p_A > p_B$.
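In symbols, with $D$ again standing for the observed data, the quantity we want is

$$P(p_A > p_B \mid D) = \int_0^1 \int_0^{p_A} P(p_A, p_B \mid D)\, dp_B\, dp_A,$$

the joint posterior integrated over the region where A’s rate exceeds B’s.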
So now we’ve got a very difficult-looking integral instead of a straightforward computation like in part 1. Where do we go from here? Normal people would throw this equation into a numeric solver and get a very close approximation to the answer. However, since I can’t afford Mathematica, we’re going to have to solve this one exactly. Stay tuned for part 3 to see how.
author: Antti Rasinen
2011.12.29
Source
At my last employer I looked briefly into A/B tests. Many of the proposed solutions rely on classical statistical tests, such as the z-test or G-test. These tests are not very appealing to me. They solve an approximate problem exactly; I prefer the converse.
To my delight, I found a three-part series by Sir Evan Haas, which discussed using Bayesian inference on A/B tests. Sadly, only parts I and II are available. The blog seems to have died out before part III ever came out. Part II ended with a cliffhanger: a hairy integral and a promise that it would be solved in Part III.
For the last year and a half I've thought about the problem on and off. WolframAlpha could not solve the integral in a meaningful fashion. I plotted the joint probability distribution with R, but this left me wanting more.
A month ago I bumped into Bayesian networks (although I prefer "probabilistic graphical models"). This triggered a problem solving cascade in my mind. I finally managed to crack the problem! Both approximately and exactly! Oh happy times! Oh groovy times!
In his articles, Evan considers the whole joint probability distribution at once. That approach leads quickly into a mathematical briar patch with eye-poking indices and double integrals. However, the thorny mess becomes simpler with some theory.
So let us look at a simpler case: estimating the conversion rate for only one alternative. More precisely, what do we know about the conversion rate, if we've seen s conversions out of n trials? How is the probability distributed?
This is incidentally the oldest trick in the book. I mean this literally. The essay, in which Thomas Bayes debuted his formula, was dedicated to finding out the answer to this very question. His "An Essay towards solving a Problem in the Doctrine of Chances" is available on the Internet.
I will use different symbols than Evan does. Sorry about that. The unknown conversion rate is denoted by p.
Let's use the same binomial model as Evan for our likelihood. That is, we assume the number of successes s to be distributed according to the binomial distribution.
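Written out, with p for the unknown conversion rate and n for the number of trials, this is the standard binomial probability

$$P(s \mid p, n) = \binom{n}{s}\, p^{s} (1 - p)^{\,n - s}.$$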
At this point I found the missing ingredient: conjugate priors. A conjugate prior means that our posterior "looks the same" as the prior itself. Both the posterior and the prior belong to the same family; their difference is usually in the parameters of the distribution. This allows algebraically nice updating rules for the posterior. In this case, the rules are very very pleasant.
The conjugate prior for the binomial distribution is the beta distribution Be(α, β).
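For reference, its density on $0 \le p \le 1$ is

$$\mathrm{Be}(p;\, \alpha, \beta) = \frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)},$$

where $B(\alpha, \beta)$ is the beta function acting as the normalizing constant. The density has the same shape in p as the binomial likelihood above, which is exactly why the conjugacy works.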
The updating rule for the posterior is now exceedingly simple: add the observed successes to α and the observed failures to β.
In general, if our prior is distributed according to Be(α, β) and we then observe s successes in n trials, the posterior is Be(α + s, β + n − s).
This is a wonderful result. We can mostly omit the Bayesian mechanism that led us here. Instead, we can simply use the beta distribution! (It turns out that people in medicine and other fields have known this result forever.)
Let's assume we've seen 5 conversions out of 100 trials. Starting from a uniform Be(1, 1) prior, this means that the posterior is Be(6, 96):
http://www.wolframalpha.com/input/?i=BetaDistribution%5B6%2C96%5D
If we have more data, our estimate improves. For 50 conversions and 1000 trials, we get the following PDF:
https://www.wolframalpha.com/input/?i=BetaDistribution%5B51%2C951%5D
The probability mass is now very sharply centered around 0.05.
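As a quick numerical check of those two plots, the mean and standard deviation of a Be(a, b) distribution have simple closed forms. A few lines of Python (my own sketch, not code from the original post) show how much the bigger sample tightens the posterior:

from __future__ import division
import math

def beta_summary(a, b):
    # Closed-form mean and standard deviation of a Be(a, b) distribution.
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

print(beta_summary(6, 96))    # posterior after 5 conversions in 100 trials
print(beta_summary(51, 951))  # posterior after 50 conversions in 1000 trials

The standard deviation drops from roughly 0.023 to roughly 0.007, which is what "very sharply centered around 0.05" looks like in numbers.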
Since we now have a handle on what happens with a single variant, let's crack the full problem. Given the successes and failures of both alternatives, what is the probability that the first alternative's conversion rate is higher than the second's?
Originally I solved the question by integrating the joint distribution numerically over a 100 x 100 grid. This gives rather good results. Sadly, I cannot find the code I used. After discovering that the distribution is a joint distribution of two independent beta distributions, I realized I can also use simulation. This is conceptually super simple.
First, generate a large number of pairs of random values, one drawn from each alternative's posterior beta distribution.
Count how many of the pairs have the first number greater than the second. Let us call this number the count.
Then you can get the estimate for the probability that the first alternative beats the second by dividing the count by the total number of pairs.
Below is a simple implementation in Python:
from __future__ import division
from random import betavariate
import math
import sys

def sample(s1, f1, s2, f2):
    # Draw one rate from each posterior (the +1s correspond to a uniform
    # Be(1, 1) prior) and record whether the first alternative came out ahead.
    p1 = betavariate(s1 + 1, f1 + 1)
    p2 = betavariate(s2 + 1, f2 + 1)
    return 1 if p1 > p2 else 0

def odds(p):
    return p / (1 - p)

def main():
    if len(sys.argv) != 6:
        sys.exit("Usage: %s n_iter succ1 fail1 succ2 fail2" % sys.argv[0])
    total, s1, f1, s2, f2 = map(int, sys.argv[1:])
    count = sum(sample(s1, f1, s2, f2) for k in range(total))
    o = odds(count / total)
    b = 10 * math.log(o, 10)
    print o, "to 1 or", b, "dB"

if __name__ == "__main__":
    main()
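Assuming the listing is saved as, say, ab_sim.py (the file name and counts here are mine), a run looks like python ab_sim.py 100000 50 950 60 940 and prints the Monte Carlo estimate of the odds that the first alternative beats the second, followed by the same figure in decibels.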
First of all, don't thank me. Thank John D. Cook. His article about random inequalities contained the hint that when one of the parameters is an integer, we can compute a closed-form solution. One? All of our parameters are integers!
My solution follows the mathematical form given in the linked PDF article "Exact Calculation of Beta Inequalities" by John Cook. The code is below.
from __future__ import division
import math
import sys

def gamma(n):
    # Gamma function for positive integers: gamma(n) = (n - 1)!
    return math.factorial(n - 1)

def h(a, b, c, d):
    num = gamma(a + c) * gamma(b + d) * gamma(a + b) * gamma(c + d)
    den = gamma(a) * gamma(b) * gamma(c) * gamma(d) * gamma(a + b + c + d)
    return num / den

def g0(a, b, c):
    # Base case of the recurrence: P(X > Y) for X ~ Be(a, b), Y ~ Be(c, 1).
    return gamma(a + b) * gamma(a + c) / (gamma(a + b + c) * gamma(a))

def hiter(a, b, c, d):
    while d > 1:
        d -= 1
        yield h(a, b, c, d) / d

def g(a, b, c, d):
    # P(X > Y) for X ~ Be(a, b), Y ~ Be(c, d), built up from the base case.
    return g0(a, b, c) + sum(hiter(a, b, c, d))

def print_odds(p):
    o = p / (1 - p)
    b = 10 * math.log(o, 10)
    if o > 1:
        s = "%.4f to 1" % o
    else:
        s = "1 to %.4f" % (1 / o)
    print s, "or %.4f dB" % b

def main():
    if len(sys.argv) != 5:
        sys.exit("Usage: %s succ1 fail1 succ2 fail2" % sys.argv[0])
    s1, f1, s2, f2 = map(int, sys.argv[1:])
    print_odds(g(s1 + 1, f1 + 1, s2 + 1, f2 + 1))

if __name__ == "__main__":
    main()
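With a similarly hypothetical file name, python ab_exact.py 50 950 60 940 prints the exact odds that the first alternative beats the second. The + 1 added to every count in main corresponds to starting each alternative from a uniform Be(1, 1) prior.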
The code is a rather straightforward translation. I'd rather have used recursion, but the Python call stack blows up when the counts get large.
Also note that my code is numerically insane (and slow). If you wanted to do this for a living, you'd better use the lgamma function to compute the logarithm of the gamma function. Gamma functions are in effect factorials. They grow very very fast; a float or even a double will overflow quickly. For example, the g0 function would be
exp(lgamma(a+b) + lgamma(a+c) - lgamma(a+b+c) - lgamma(a))
You'd also want to use the permutation tricks mentioned in the article to recurse on the smallest parameter.
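For the record, here is roughly what the log-space rewrite of the two helpers could look like; it is a sketch of the fix described above, not part of the original listing (the rest of the algorithm stays the same):

import math

def g0(a, b, c):
    # Same quantity as g0 in the listing, computed in log space so large
    # counts do not overflow floating point.
    return math.exp(math.lgamma(a + b) + math.lgamma(a + c)
                    - math.lgamma(a + b + c) - math.lgamma(a))

def h(a, b, c, d):
    # Log-space version of h; hiter and g can stay exactly as they are.
    num = (math.lgamma(a + c) + math.lgamma(b + d)
           + math.lgamma(a + b) + math.lgamma(c + d))
    den = (math.lgamma(a) + math.lgamma(b) + math.lgamma(c)
           + math.lgamma(d) + math.lgamma(a + b + c + d))
    return math.exp(num - den)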
- The distribution of the conversion rate for one variant is the beta distribution. (This applies when our model is the binomial model. More complex approaches are of course possible.)
- You can solve $P(A \gt B)$ numerically (in two ways) or exactly.
- These are all known results.
Hi,
I've tried to implement your code (D. Cook's, really) and got the following output –
1 to 917105171.3252 or -89.6242 dB
I'm trying to find the probability for an A/B test but those numbers are telling me nothing, to be honest.
Any suggestions?
Thanks