Probability is hard
For more than a month, my colleague Sanjoy Mahajan and I have been banging our heads on a series of problems related to conditional probability and Bayesian statistics. We knew when we started that this material is tricky, as demonstrated by veridical paradoxes like the Monty Hall problem, the Girl Named Florida, and so on. But even though we were prepared, we have been surprised, continually, by how long it is taking and how effectively we have confused ourselves and each other.
A few times we have hit a brick wall on a hard problem and made a strategic retreat by working on a simpler problem. At this point, we have retreated all the way to what I'll call the Red Dice problem, which goes like this:
Suppose I have a six-sided die that is red on 2 sides and blue on 4 sides, and another die that's the other way around, red on 4 sides and blue on 2.
I choose a die at random and roll it, and I tell you it came up red. What is the probability that I rolled the second die (red on 4 sides)? And if I do it again, what's the probability that I get red again?
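Under one reading of the question, both parts can be worked out with a direct application of Bayes's theorem. The sketch below is only an illustration, not necessarily the approach taken in the notebook. It assumes that "do it again" means rolling the same die a second time; other readings give different answers. The names `p_red`, `prior`, `posterior`, and `p_red_again` are made up for this example.

```python
from fractions import Fraction

# Probability of rolling red with each die (red sides / total sides).
p_red = {"die1": Fraction(2, 6), "die2": Fraction(4, 6)}

# Prior: each die is equally likely to be chosen.
prior = {"die1": Fraction(1, 2), "die2": Fraction(1, 2)}

# Bayes's theorem: multiply prior by likelihood, then normalize.
unnorm = {die: prior[die] * p_red[die] for die in prior}
total = sum(unnorm.values())
posterior = {die: unnorm[die] / total for die in unnorm}

# ASSUMPTION: the same die is rolled again, so the chance of a second red
# is the likelihood of red averaged over the posterior.
p_red_again = sum(posterior[die] * p_red[die] for die in posterior)

print(posterior["die2"])  # 2/3
print(p_red_again)        # 5/9
```

Using `Fraction` keeps the arithmetic exact, which makes it easier to compare answers across different interpretations of the problem.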
There are several variations on this problem, with answers that are subtly different. I explain the variations, and my solution, in a Jupyter notebook:
You can read a static version of the notebook here, or you can run the notebook on Binder.
If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog. Click on red_dice.ipynb to load the notebook for this article.
Once we have settled the Red Dice problem, we will get back to the original problem, which relates to interpreting medical tests (a classic example of Bayesian inference that, again, turns out to be more complicated than we thought).
UPDATE: After reading the notebook, some readers are annoyed with me because Scenarios C and D are not consistent with the way I posed the question. I'm sorry if you feel tricked -- that was not the point! To clarify, Scenarios A and B are legitimate interpretations of the question as posed, which is deliberately ambiguous. Scenarios C and D explore a different version of the problem, which will be useful when we get to the next problem.