Frequentism and Bayesianism: A Practical Introduction (jakevdp.github.io)
93 points by atakan_gurkan on Aug 30, 2014 | 57 comments



It is perhaps worth drawing attention to this sentence in the article: "Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the Bayesian interpretation of probability implied by the term P(F_true | D)." A widespread misunderstanding is that there is something fundamentally Bayesian about Bayes' theorem, or even that frequentists don't believe in it. It is rarely pointed out that this is not the case, and we should thank the authors for doing so.



In case it's not clear from the beginning where he really stands on the matter, in Part 3 he offers his opinion on the relative merits of both approaches:

<spoiler>

The moral of the story is that frequentism and Science do not mix. Let me say it directly: you should be suspicious of the use of frequentist confidence intervals and p-values in science. In a scientific setting, confidence intervals, and closely-related p-values, provide the correct answer to the wrong question. In particular, if you ever find someone stating or implying that a 95% confidence interval is 95% certain to contain a parameter of interest, do not trust their interpretation or their results. If you happen to be peer-reviewing the paper, reject it. Their data do not back-up their conclusion.

</spoiler>
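
(To make the complaint concrete, here is a minimal Python sketch, with numbers I made up, of what a 95% confidence interval actually guarantees: roughly 95% of intervals constructed this way over repeated experiments cover the true parameter. Any single interval either covers it or it doesn't; there is no "95% certain" statement about that one interval.)

  # Sketch (not from the article): coverage of a textbook 95% CI for a mean.
  # Assumptions are mine: known-variance normal data, true mean mu = 10, n = 30.
  import numpy as np
  np.random.seed(0)
  mu, sigma, n, trials = 10.0, 2.0, 30, 10000
  covered = 0
  for _ in range(trials):
      x = np.random.normal(mu, sigma, n)
      half = 1.96 * sigma / np.sqrt(n)      # 95% half-width with sigma known
      covered += (x.mean() - half <= mu <= x.mean() + half)
  print(covered / float(trials))            # ~0.95: coverage over repetitions,
                                            # not "95% certainty" for one interval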


Um, the folks at CERN don't agree with your assessment of frequentism in science. Their papers are explicitly frequentist; they celebrate results based on how many "sigmas" they get. P-values all the way.


I'm not a physicist, but as far as I know, CERN has access to tons of data with every intention of drowning any prior belief. In this setting, I would expect frequentist methods to shine and Bayesian methods to be intractable.

However, I would not expect CERN papers to make the kind of terminological/theoretical lapses about confidence intervals that the parent comment was talking about. Papers should be rejected for that kind of error, even if you are not a Bayesian.


Exactly this. At CERN (and all of HEP elsewhere) the Bayes vs Freq wars were fought many years ago and are long over. The conclusion: when you have lots of data they converge (as they should!).

In the cases where they differ, you almost always find that you have very few observations. I would argue that this 'difference' is not that exciting, because it must be dominated by your assumptions rather than your observations; after all, once you accumulate enough observations the two methods tend to converge.

Personal conclusion: if the methods disagree, work on getting more data instead of fighting over which method is better.
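
(A rough illustration of that convergence, assuming a simple coin-flip model of my own: the frequentist point estimate is the sample proportion, and a Bayesian posterior mean under a Beta(2, 2) prior starts out pulled toward the prior but approaches the same value as observations accumulate.)

  # Sketch: frequentist estimate vs. Bayesian posterior mean as data grows.
  # The true p and the Beta(2, 2) prior are my choices, purely for illustration.
  import numpy as np
  np.random.seed(1)
  p_true, a, b = 0.3, 2.0, 2.0
  flips = np.random.random(100000) < p_true     # True = heads
  for n in (10, 100, 1000, 100000):
      heads = flips[:n].sum()
      freq = heads / float(n)                   # maximum-likelihood estimate
      bayes = (a + heads) / (a + b + n)         # posterior mean, Beta(a, b) prior
      print(n, round(freq, 4), round(bayes, 4))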


My interpretation? If not clear, I'm quoting the penultimate paragraph of the last post in the series, right before he thanks the reader for making it to the end: http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bay...


Yes frequentism is common in science and the author strongly objects to it. Read the whole post:

https://jakevdp.github.io/blog/2014/06/12/frequentism-and-ba...


That first followup seems patently insane. "We don't know the value of p̂, so we'll assume it's 5/8." Why would you ever make an assumption like that? The Bayesian approach appears to me to be exactly like the frequentist one, but without the part where they assume convenient values for unknown parameters.


Readers should be aware that the linked article was composed entirely in the IPython notebook environment, which means Python code blocks, LaTeX renderings, and graphics can all be freely mixed in a (to me) very nice, readable article format.

http://ipython.org/


And it literally can't be said enough times how awesome it is!


Here's how I relax, avoid both frequentism and bayesianism, and just love probability:

I assume that there is a non-empty set, commonly called Omega, which I regard as the set of all experimental 'trials' that I might observe. But, actually, in all the history of everything in the universe, we see only one trial, only one element of this set Omega.

Next, there is a non-empty collection, usually denoted by script F, of subsets of Omega. I assume that set script F contains Omega as an element and is closed under relative complements and countable unions. By relative complements: suppose A is an element of script F. Then the relative complement of set A, maybe written A^c, is essentially set Omega - A, that is, the set of all trials in Omega and not in A. Then set script F is a sigma-algebra. Each set A in script F is an event. If our trial is in set A, then we say that event A has occurred.

Next there is a function P: script F --> [0, 1]. P assigns 0 to the empty set (event) and is countably additive. Then function P is a probability measure. So for each event A in script F, P(A) is a number in [0, 1] and is the probability of event A.

Now we can define what it means for two events to be independent, and we can generalize to two sigma-algebras being independent.

Next, on the set R of real numbers, I consider the usual topology, that is, the collection T of open subsets of R. Then I let set B, the Borel sets, be the smallest sigma algebra such that T is a subset of B.

Next I consider a function X: Omega --> R such that for each Borel set A, X^{-1}(A) is an element of script F. Then X is a random variable.

Essentially anything that can have a numerical value we can regard as a random variable.
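
(For concreteness, here is a tiny finite example of this machinery -- my own toy, not from any text: Omega for two coin flips, script F taken as the power set, P uniform, and a random variable X counting heads.)

  # Toy finite probability space: Omega, sigma-algebra (the power set), P, and
  # a random variable X: Omega --> R. Purely illustrative.
  from itertools import chain, combinations

  Omega = [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]         # trials
  F = [frozenset(s) for s in chain.from_iterable(
          combinations(Omega, r) for r in range(len(Omega) + 1))]  # all events

  def P(event):                     # probability measure: uniform on Omega
      return len(event) / float(len(Omega))

  def X(omega):                     # random variable: number of heads
      return omega.count('H')

  A = frozenset(w for w in Omega if X(w) >= 1)    # event "at least one head"
  print(P(A))                                     # 0.75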

Then we can state and prove the classic limit theorems -- central limit theorem, weak and strong laws of large numbers, martingale convergence theorem, law of the iterated logarithm, etc.

Now we are ready to do applied probability and statistics. And we have never mentioned either frequentism or Bayesianism.

For more details, in an elegant presentation, see J. Neveu, Mathematical Foundations of the Calculus of Probability.


Sure there's this thing called mathematics, but what reason do you have to give that the universe comes equipped with F? I'm not trying to be difficult, but mathematicians often try and live in their happy axiomatic universe and forget the axioms come from somewhere. Naïve mathematicians go further and forget that the axioms are aesthetic objects of mathematical culture and can and do change. I'm not a statistician, but like physics is to mathematics, statistics is to probability theory, and questioning the nature of the model as it corresponds to reality is part of the discipline. You can prove all the consequences you want from the axioms, but it'll never get you closer to appreciating the relationship of those axioms and their results to an extra-theoretic "reality".

Take Buffon's needle (only chosen because it is interesting, and continuous, as opposed to discrete). Consider this in the real world. X = needle crosses line. You haven't dropped it yet. Is X real, how so? How does it differ from before you dropped the needle to afterwards? Deterministically X is something except that you don't know it, and suddenly you know it. In so far as that is the case, what is a reasonable way to speak about knowledge of X? For the frequentist the model is reality, i.e. you construct your sigma algebra, apply geometric arguments, crunch away, and you start glowing when you see π. What happens when you're not so sure about the model? In other words, the mathematics is always fine, but the reasoning behind the use of the mathematics is what is up for debate.


In case I understand your concern, here's how I address it: We like something like probability theory because in practice it works great for saying that some roulette wheel is crooked, for confidence intervals on measurements, ..., for various stochastic processes and their power spectra, e.g., for the 3 degree K background radiation. Roughly we know quite well what we want in our probability theory.

From what we know we want in our probability theory, we're essentially pushed into the axioms whether we like it or not. Or, the axioms are basically what the heck we need in order to talk about probability. E.g., we want to talk about events, say, event A, so that we can consider the probability of events, say, P(A), the probability of A. And given event B, we want to be able to consider the new event A or B.

So, if in our foundations we have that events A and B are subsets of the set of trials Omega, then the new event A or B is just the set union of sets A and B. Since we want to be able to consider the new event A or B, we go ahead and accept the set theory because it lets us get what we want. So, basically, we got pushed into the set theory.

For more, if event A is 'the next flip of our fair coin comes up heads' and event B is 'a magnitude 1 quake in SF 10 days from now', then we jump at saying that events A and B have nothing to do with each other and are independent, so that we know that P(A and B) = P(A)P(B).

So, in practice, how do we justify independence? Sadly, in my view, mostly just intuitively as in the little example of coins and quakes I gave.

The axioms ask for a little more than we might, first cut, have believed we need to specify. E.g., so that we will have plenty of ability to take old events and create new ones, we want the events actually to form a sigma algebra. Next, so that we can talk about the event 'the coin never comes up heads', again we want a sigma algebra, and we want our probability P to be countably additive.

What about uncountable additivity? Quickly we see that it causes serious problems, so we don't ask for it.

Net, we really do like probability and its applications, say, to statistics, and the axioms I gave (thank you A. Kolmogorov and H. Lebesgue) are basically about the minimum we want. Really, we just get pushed into those axioms; if we like probability, and we do, then we don't have much choice but to accept those axioms. Curious universe we live in.


OK, now try to make some claims about the real world. You will have to make frequentist confidence or Bayesian credibility claims.


No, commonly the key to "some claims about the real world" is independence. Then we can apply, with meager assumptions, say, the weak law of large numbers (which we do prove), that is, take an average. Also, sometimes we can make some "claims" based on conditional independence (which my axioms give us the ability to define) and apply, say, the martingale convergence theorem.


Interpreting frequentist confidence in terms of the real world is actually quite difficult.


Of course we haven't mentioned Bayesianism or Frequentism yet... you haven't said anything about estimating aspects of the probability space from observed data.


> estimating aspects

Use the classic limit theorems, especially just the weak law of large numbers, i.e., take an average.
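
(That advice in a few lines of Python, under an i.i.d. assumption I'm supplying and with an arbitrary distribution: the sample average settles near E[X] as n grows, which is all the weak law promises.)

  # Sketch: sample averages converging to E[X] (weak law of large numbers).
  # Assumes i.i.d. draws; X ~ Exponential with mean 2 is my arbitrary choice.
  import numpy as np
  np.random.seed(2)
  x = np.random.exponential(scale=2.0, size=1000000)
  for n in (10, 1000, 1000000):
      print(n, x[:n].mean())        # approaches 2.0 as n grows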


...and now you have a naive version of frequentist estimation.


Well, we need an independence assumption. And then we have some rock solid theorems and not just something "naive".

Maybe we have to pull the rabbit out of the hat someplace, and with the axioms our place is an independence assumption.

Indeed, the heartburn when trying to swallow frequentism is that we don't have the crucial assumptions that make an average have valuable properties. The sufficient and standard assumption is independence; then the laws of large numbers, which we can prove as theorems from the axioms plus an independence assumption, tell us that an average does give us what we want to make the whole subject go, in statistics, etc.

So, net, where we pull the rabbit out of the hat is the independence assumption. If we are willing to assume independence, then we don't have to address the philosophical points, or we move the consideration of the philosophical points to considering the independence assumption. But, we are forced: We know that we want averages to be good estimators and know that in practice they often are; since the frequentist approach does not have enough to let us be clear on the properties of an average, we need more; a way to get what we need is the axioms and an independence assumption.

To me the key to probability, and why it is really not the same as just Lebesgue's measure theory, is the role of independence. We can also add in conditional independence. Net, I don't think that we can get very far in probability or its foundations without something as powerful as independence (or at least uncorrelated, which independence implies). The axioms and independence make a nice package to give us a rock solid version of what we want. So, that's why I go for it.


> Well, we need an independence assumption. And then we have some rock solid theorems and not just something "naive".

We have rock solid theorems that the average converges under some conditions. (As an aside, do you rule out the Cauchy distribution in your axiomatic approach?) We don't have rock solid theorems that the average is of any interest. If we're interested in knowing whether a dam is likely to overflow based on historical water levels, the average is of limited use -- for heights well below the highest historical level, an average is something that we can at least calculate, but not precisely. If we want to leave room for error (build the dam above the highest historical water level, and do it conservatively so that we'll be above future maxima for the next x years w/ some probability), we can't even calculate an average.

Now what? That's what I mean by "naive." If we're going to try to attack problems that aren't transparently easy (something like stationary weakly-dependent data with finite moments, enough observations that not only the CLT is a good approximation, but that we don't even need to worry about efficiency, loosely defined) then we need to start thinking about the properties we want our estimation procedure to have.

One set of properties leads to something like frequentist statistics, another set of properties leads to Bayesian statistics. Other sets of properties lead to some of the more idiosyncratic branches of statistics that I don't know much about.

> since the frequentist approach does not have enough to let us be clear on the properties of an average, we need more

This makes no sense. Every branch of statistics puts the estimator on top of a probability space, just usually not explicitly. (Because it would be boring.) The t-test has certain properties if the probability space is in a class with....

> The axioms and independence make a nice package to give us a rock solid version of what we want. So, that's why I go for it.

I want to get decent answers for hard problems and have a measure of their reliability. Both frequentist and Bayesian stats are often good for this, and certainly better than "no stats."


> (As an aside, do you rule out the Cauchy distribution in your axiomatic approach?).

Of course not. Essentially the only assumption about a random variable is measurability, and I essentially gave that.

Each of the limit theorems has assumptions, and in some cases minimal assumptions are difficult to consider, e.g., for the Lindeberg-Feller version of the central limit theorem.

For a random variable with the Cauchy distribution, as I recall, the integral of the positive part is infinity and that of the negative part is infinity so that to define its expectation we would have to subtract one infinity from another which, with the measure theory approach to 'improper' integrals, we are not willing to do. So, that random variable does not have an expectation, and the law of large numbers does not apply to an independent, identically distributed sequence of such random variables. Again,

> We don't have rock solid theorems that the average is of any interest.

Sure, we do: Maybe for real valued random variables X and Y and real number a, we are told that

X = a + Y

and want to find a.

Suppose E[X] exists and is finite, and for i = 1, 2, ..., we have real-valued random variables X(i), independent and distributed like X. Then we also have independent Y(i) distributed like Y. Suppose we are told that E[Y] = 0.

Then we can use the law of large numbers to estimate a.

> If we're interested in knowing whether a dam is likely to overflow based on historical water levels, the average is of limited use -- for heights well below the highest historical depth, an average will be something that we can at least calculate, but not precisely.

If we have random variable X that is 1 when the dam overflows and 0 otherwise, then the probability the dam overflows is just E[X]. If we have independent, identically distributed 'samples' of X, that is, X(i) for i = 1, 2, ..., then we can use the weak law of large numbers to estimate E[X].

The role of "historical water levels" here is not trivial to evaluate; the main reason is that we are not sure we have the values of a sequence of independent, identically distributed random variables.

Generally from experience in applied probability and statistics, we should know well that saying what the probability of some particular water level is in the next 100 years is a challenging problem with no royal road to a solution and not a weakness of the axioms, expectation, or the laws of large numbers.

> This makes no sense. Every branch of statistics puts the estimator on top of a probability space, just usually not explicitly. (Because it would be boring.) The t-test has certain properties if the probability space is in a class with....

With the axioms I gave, independence, etc., the Student's t distribution and its applications work just fine, as long practiced, without mention of either frequentism or Bayesianism.

The reference I gave, Neveu, uses the axiomatic approach I outlined. So do the other famous texts in 'graduate probability' by Loeve, Breiman, and Chung.

So, let's see: I just got out my copy of Neveu and checked the index. 'Frequentist' is never mentioned, and Bayes is mentioned only on one page and there only as a 'strategy' in the "formalism" of statistical decision theory. Net, beyond the simple Bayes rule

P(A|B) = P(A) P(B|A)/P(B)

which is immediate from the definition

P(A|B) = P(A and B)/P(B)

really, Bayes has next to no role in the whole book.

Net, with the axiomatic foundation I outlined, we don't have to struggle with, consider, or even mention either frequentism or Bayesianism. That should be welcome good news.


> Net, with the axiomatic foundation I outlined, we don't have to struggle with, consider, or even mention either frequentism or Bayesianism. That should be welcome good news.

People have been building on top of the axiomatic foundation that you outlined for decades. This is old news.

You mention Breiman's book on Probability. He has also a book on Statistics. Maybe there is something to it after all?


> People have been building on top of the axiomatic foundation that you outlined for decades. This is old news.

Yup, essentially since Kolmogorov's 1933 paper.

The axiomatic approach I gave is just what is used in Breiman's Probability, as published by SIAM. It's a super nicely written book. I have never seen his book on statistics.


I can't resist quoting Jaynes (from the closing remarks in Chapter 2 of "Probability Theory: The Logic of Science"):

In 1933, A. N. Kolmogorov presented an approach to probability theory phrased in the language of set theory and measure theory. This language was just then becoming so fashionable that today many mathematical results are named, not for the discoverer, but for the one who first restated them in that language. For example, in the theory of continuous groups the term “Hurwitz invariant integral” disappeared, to be replaced by “Haar measure.” Because of this custom, some modern works—particularly by mathematicians—can give one the impression that probability theory started with Kolmogorov. [...]

However, our system of probability differs conceptually from that of Kolmogorov in that we do not interpret propositions in terms of sets, but we do interpret probability distributions as carriers of incomplete information. Partly as a result, our system has analytical resources not present at all in the Kolmogorov system. This enables us to formulate and solve many problems—particularly the so-called “ill posed” problems and “generalized inverse” problems—that would be considered outside the scope of probability theory according to the Kolmogorov system. These problems are just the ones of greatest interest in current applications.


I found a PDF of Jaynes and read the part at the end of chapter 2. There it said to read Appendix A, and I did that, too.

Sorry, but in Appendix A, quickly Jaynes doesn't understand the Kolmogorov foundations. E.g., he confuses a trial and an event. Not good.

As for what he thought about "ill posed", etc., problems, I couldn't find it.

To me, Jaynes does not look good.


> For what he thought about "ill posed", etc., problems, I couldn't find.

Jaynes' book has a full chapter on "Paradoxes of Probability Theory", including Borel's paradox (or Borel-Kolmogorov paradox).

See also http://www.cims.nyu.edu/~csplash/data/notes/2010-14.pdf


Oh, that looks interesting. Thanks for posting it.


Nothing about this reply makes me reconsider labeling it "naive frequentism." I think you'll find that it's hard to do even bread-and-butter statistical tasks like deciding how many subjects to include in a randomized trial without moving beyond axiomatic probability.


Sorry, it appears that we just are a long way from communicating clearly. I'll try again, a little:

What I outlined are the axioms of probability as started by Kolmogorov in 1933 and based on Lebesgue's measure theory (of near 1900). So, Kolmogorov finds something in Lebesgue's theory, that is, a particular measure space that has everything we want for probability theory. In this way Kolmogorov shows that we can regard probability theory as just another measure space in measure theory.

Why do that? To get a different probability theory? Not really: Kolmogorov's start just gives essentially the same probability theory we had in 1932. It is just that in 1932, we had to talk about trials, events, probabilities, and random variables without being able to say, mathematically, what the heck they were. To see the importance here, back to near 1900 with B. Russell, etc., there was an effort to redefine everything in mathematics starting with just sets. Then everything was constructed starting with just the low level ideas of sets. The serious result was axiomatic set theory, e.g., as in P. Suppes, Axiomatic Set Theory. So, numbers, functions, calculus, lines, planes, spheres, groups, rings, fields, vector spaces, etc., were all defined based on sets. And similarly for Lebesgue's measure theory. Then, after Kolmogorov, probability theory was also defined based just on sets. Whew!

For what you want to do with probability theory in statistics, etc. as you mentioned, Kolmogorov's axioms should be of zero concern to you except you might feel a little better knowing that the probability theory you have been using all along does have a solid foundation just on sets. So, you can go right along with applied probability as you have been doing, and Kolmogorov's axioms will essentially never get involved, never help you and never hurt you.

For more, via the axioms, we can define trials, events, probabilities, random variables, conditional probability, stochastic processes, Markov processes, Gaussian random variables, Chi squared random variables, sufficient statistics, distributions, random vectors, and on and on that you have already been knowing, loving, and using.

But, maybe I spoke too soon: For Markov processes, martingales, and sufficient statistics, we very much want the Radon-Nikodym theorem of measure theory for random variables, and we use it. So, without Kolmogorov's axioms, we would be somewhat stuck-o for Markov processes, etc.

> I think you'll find that it's hard to do even bread-and-butter statistical tasks like deciding how many subjects to include in a randomized trial without moving beyond axiomatic probability.

Not at all. There is no "moving beyond". Instead we proceed essentially as we might have in 1932. That is, in such work we rarely or never think about measurable spaces, measure spaces, sigma algebras, measurability, random variables as functions on the set of trials, etc. Again, we just continue to do applied probability and statistics essentially as we might have in 1932.

Or, the Kolmogorov axioms are down in the sub, sub basement, and we rarely go there. For the rest of the structure, it is just the same or nearly so.

Okay?


I've studied probability theory, maybe more than you. I like it. But nothing from probability theory suggests why "size" and "power" might be useful properties in test statistics. Without other ideas like those, doing useful statistics would be pretty tough.

I notice you didn't propose a way to choose the sample size for a randomized trial in your reply. I'd love to see it -- using only probability theory and nothing from the stats literature. ;)


> I notice you didn't propose a way to choose the sample size for a randomized trial in your reply.

We're not communicating well.

To respond to your question about sample size, I'd have to look into some of the details of your question. And I can say now, that question, those details, and any answer all have essentially nothing to do with the axiomatic foundation of probability I gave; the answer is the same or essentially so independent of those foundations down in the deep sub basement of the subject.

Whether something is "only probability" or also "stats" can be important in practice -- e.g., you will find relatively little about a lot of important work in statistics in Neveu's book on probability. And there is a lot of practical knowledge in applied statistics, e.g., how well principal components analysis tends to work in practice (quite well). E.g., there is a lot in survey and sampling techniques. And a lot that is in mathematical statistics has yet to be derived as fully clean applied mathematics from only something like Neveu. Still, basically mathematical statistics is applied probability, which in principle can all be done back to Neveu and only a little more, e.g., matrix theory, some combinatorics, maybe some group theory for bootstrap and resampling plans, etc.

Again, my post was on the foundations of probability and showing that there we did not need to mention either frequentism or Bayesianism. That is, maybe my post would be helpful for people struggling with frequentism or Bayesianism -- I'm saying, since 1933, for just the foundations, get to f'get about both of them.

I studied probability from a star student of Cinlar at Princeton and from books by Neveu, Chung, Breiman, Loeve, and others. I did a lot in applied probability and applied statistics. I've published in mathematical statistics. My Ph.D. dissertation was in applied probability. So? I have background enough to post.

But here I only wanted to explain the foundations of probability and not compete with anyone or compare my expertise with anyone. Instead, I'm just reporting some news current as of 1933.

It does appear to me that now for nearly all serious work in probability and stochastic processes, the Kolmogorov foundations are nearly universally accepted; thus, readers are safe in taking seriously the news I reported.

Maybe someday I will return to pure and applied probability, but for now my interests are in my startup. There, now, mostly the work is in software and other parts of business. At the core, my startup is some work in applied probability, but I did that months ago and long since have had the corresponding computations in solid software. So, for now, more background in probability is not on my TODO list.

Again, here I'm just giving some HN readers the news that as of 1933 get to f'get about the frequentism and Bayesianism foundations of probability.


Leaving aside that Kolmogorov's measure-theoretic approach is not the only axiomatic definition of probability (Cox's axioms yield a quite similar foundation, though with finite additivity only), you won't find anyone here that says there is a problem with the mathematical construction.

The frequentist/Bayesian debate is related to the INTERPRETATION of probability. Kolmogorov won't help you to map the real world to the probability space.

Let's say we have a loaded coin, and we want to estimate the probability of getting tails (assume the throws are i.i.d.).

Alice decides to keep throwing until she gets a tail: she gets the sequence HHT

Bob decides to throw the coin three times: he gets the sequence HHT

Alice takes her event A, her sigma-algebra, the whole shebang, and produces an interval estimate for p.

Bob takes his event A (which happened to be the same) and his probability space (which is different because the experimental design is different), and produces a different estimate for p.

If you think that getting different results from the same data makes sense, you might be a frequentist.

If you think that it doesn't, you might be a Bayesian.

If you think that the question is not relevant because you can't derive the answer from your axioms, you might at least understand what we're talking about.
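
(A concrete version of the point, with a test I'm adding to the thread's setup: take H0: P(tails) = 0.5 against the coin being biased toward heads. Both see HHT, but Bob's "as or more extreme" outcomes live in the fixed-three-flips sample space while Alice's live in the flip-until-tails sample space, so the one-sided p-values differ even though the data and the likelihood are the same.)

  # Sketch: same data (HHT), same null H0: P(tails) = 0.5, different designs,
  # hence different one-sided p-values. The test setup is my own illustration.
  from math import factorial

  def comb(n, k):                   # binomial coefficient
      return factorial(n) // (factorial(k) * factorial(n - k))

  p0 = 0.5
  # Bob: fixed n = 3 flips, observed 1 tail; "as or more extreme" = 0 or 1 tails.
  p_bob = sum(comb(3, k) * p0**k * (1 - p0)**(3 - k) for k in (0, 1))
  # Alice: flip until the first tail, observed N = 3 flips; "as or more extreme"
  # means N >= 3, i.e. the first two flips came up heads.
  p_alice = (1 - p0)**2
  print(p_bob, p_alice)             # 0.5 vs. 0.25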


This is all mixed up. Alice has real random variables X_1 (borrowing TeX notation for a subscript), X_2, X_3. We assume that {X_i|i = 1, 2, 3} is independent. We assume that for some number p in [0,1], P(X_i = 0) = p -- Alice uses 0 for H and 1 for T. Alice observes that X_1 = 0, X_2 = 0, and X_3 = 1.

Now, for the set of real numbers, R, Alice has a Borel measurable function f: R^3 --> [0,1] and lets f(X_1, X_2, X_3) be her estimate of p.

Okay, if that is what you meant.

Bob does much the same with real random variables Y_i, i = 1, 2, 3 and function g: R^3 --> [0,1] and lets his estimate of p be g(Y_1, Y_2, Y_3). And Bob observes Y_1 = 0, Y_2 = 0, and Y_3 = 1.

If functions f and g are the same, then, in this case, that is, with the data Alice and Bob observed, Alice and Bob get the same estimate for p. Else if f and g are different, then, even if the data they observe is the same, they might get different estimates. Even if f = g, since each of Alice and Bob is flipping the coin for themselves, they need not get the same estimate for p.

Note: Since f is Borel measurable, we can set real random variable Z = f(X_1, X_2, X_3) and ask for E[Z], etc. Sometimes this step is useful. That is, our estimator of p is also a random variable.

We might like to have E[Z] = p; in this case, Z is an 'unbiased' estimator of p.

Also we can ask for the variance of Z, say, Var(Z), and maybe we want Var(Z) to be small. If we can show that Var(Z) is the smallest among all Borel measurable f, then the choice Alice made for f is a 'minimum variance' estimator.

If Alice gets charged money for being wrong, then we can try to minimize the expected value of what Alice gets charged, and here we have a case of 'statistical decision theory'.

Since order statistics are always sufficient, with the assumptions we have, Alice need only be told 2 zeros and one 1 and can f'get about the rest.

I see no surprises or difficulties here.

But the sample space Omega is the same for both Alice and Bob. And, for both Alice and Bob, there is only one trial, that is, only one point little omega, in Omega involved. That is, in more detail, X_1 is a function, the function X_1: Omega --> R, and in our case for our trial little omega we have X_1(little omega) = 0. That is, usually in the notation we suppress little omega. Alice and Bob are both using the same trial little omega and the same sigma algebra script F on the same sample space Omega.

We have no reason to believe that random variables X_1: Omega --> R and Y_1: Omega --> R are equal. Thus, given a Borel subset K of R, the events X_1^{-1}(K) in the sigma algebra script F on Omega and Y_1^{-1}(K) also in the sigma algebra script F on Omega need not be the same.

That's a little of how applied probability based on 'modern probability' works.

I see no problems and no need to consider frequentism or Bayesianism.

Questions?


"We might like to have E[Z]=p","maybe we want Var[Z] to be small"... How do you select f and g? That's the difficulty. You see no problems because you're happy playing with your Borel measurable thingies. Try to go further in the 'statistical decision' field and see how longer can you avoid frequentist considerations.

What do you think of the "likelihood principle"? Alice and Bob get the same data and the same likelihood function. Should they make the same inference?

Do you think unbiasedness is an important property for an estimator? What makes an estimator admissible? Do you see a problem with an estimator that provides negative values for non-negative variables?

By the way, I don't think your analysis is correct: Alice doesn't have random variables X_1, X_2, X_3. The possible events for her are T, HT, HHT, HHHT, HHHHT, ... (she stops when she gets tails, but not before).

EDIT: In case it's not yet clear: saying that "our estimator of p is also a random variable" and looking at its sampling distribution is frequentism.


> How do you select f and g? That's the difficulty. You see no problems because you're happy playing with your Borel measurable thingies.

Borel measurability is important work; it sounds like you have contempt for it. There is no need or justification for contempt.

We should mention Borel measurability, as I did, if we are carefully considering the Kolmogorov foundations, but in practice Borel measurability means essentially nothing since cooking up a function, such as f or g in what I wrote, that is not Borel measurable is so tricky that essentially any function anyone would select for f or g will be Borel measurable. The usual example of a function not Borel or Lebesgue measurable uses the axiom of choice -- we're talking tricky stuff you won't see in SPSS, SAS, R, Mathematica, Matlab, big data, machine learning, etc.

For picking f or g, let's see, from our definition of p above,

p = P( X_i = 0 ) = P( heads )

and

1 - p = P( X_i = 1 )

So, for Bob's case, we can just set

Z = g(X_1, X_2, X_3) = 1 - (1/3) (X_1 + X_2 + X_3)

which is what anyone would guess anyway.

Or, for more, for i = 1, 2, 3,

E[X_i] = 0 P(X_i = 0) + 1 P(X_i = 1)

= 1 - p

so that

E[Z] = E[ 1 - (1/3) (X_1 + X_2 + X_3) ]

= 1 - (1/3) E[ X_1 + X_2 + X_3 ]

= 1 - (1/3) ( 3(1 - p) )

= p

so that our estimator

Z = g(X_1, X_2, X_3) = 1 - (1/3) (X_1 + X_2 + X_3)

is an unbiased estimator of p. We expected something else? This was difficult?

Here we specified a function g and showed that it gives an unbiased estimator of p and did this without mentioning the trial little omega, the sample space big Omega, or the sigma algebra of events script F. So, what we did is just an elementary part of standard junior level introductory mathematical statistics, and so would be responses to your other questions.
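
(For anyone who prefers to check that numerically rather than algebraically, here is a quick simulation with an arbitrary p I picked: the average of Z over many repetitions of the three-flip experiment sits right at p.)

  # Sketch: Monte Carlo check that Z = 1 - (1/3)(X_1 + X_2 + X_3) is unbiased
  # for p = P(heads), with 0 coding heads and 1 coding tails; p = 0.7 is arbitrary.
  import numpy as np
  np.random.seed(3)
  p = 0.7
  X = (np.random.random((1000000, 3)) >= p).astype(float)  # 1 (tails) w.p. 1 - p
  Z = 1.0 - X.mean(axis=1)
  print(Z.mean())                   # close to 0.7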

And, again, we never mentioned frequentism or Bayesianism. Yet again, when considering the mathematical foundations of probability, I see no need to consider either frequentism or Bayesianism. Again, the Kolmogorov foundations work just fine for long-standard probability, applied probability, mathematical statistics, and applied statistics. No worries.

For Alice, a different derivation is required.

My point is that from Kolmogorov we have some rock solid foundations for probability and, then, do not have to consider either frequentism or Bayesianism. So, students struggling over frequentism or Bayesianism can relax and just f'get about these two.


> So, for Bob's case, (....) so that our estimator Z = g(X_1, X_2, X_3) = 1 - (1/3) (X_1 + X_2 + X_3) is an unbiased estimator of p. We expected something else? This was difficult? (....) For Alice, a different derivation is required.

A different derivation that you tried (I saw your comment appear briefly). You proposed Z=1/N, which is (as before) the maximum-likelihood estimator. And the likelihood function is the same for Alice and for Bob, so it is not surprising that we get the same estimate p=1/3 in both cases. But after "proving" that "again our estimator Z is unbiased" I guess you noticed that this was not in fact correct. Did you expect something else? What was the difficulty?

> My point is that from Kolmogorov we have some rock solid foundations for probability and, then, do not have to consider either frequentism or Bayesianism.

Both are based on probability; how is probability going to replace them?

> So, students struggling over frequentism or Bayesianism can relax and just f'get about these two.

Sure, they can avoid thinking about the different approaches to inference... and just do it in the frequentist way. Assuming that the unknown parameter is fixed, that the estimator is a random variable, that the criteria to select an estimator are the unbiasedness, consistency or asymptotic distribution... all of these are frequentist considerations (even if you say that you "do not have to consider either frequentism or Bayesianism").

You never answered my questions about Alice and Bob getting different interval estimates (even though the outcome of their experiments is identical and their model for the loaded coin as well). We've seen that they agree on their (MLE) point estimate, but their confidence intervals will be different. I imagine you accept the frequentist idea of confidence intervals, and agree that they will be different because the distribution of potential experiment outcomes is different.

Do you think that Alice and Bob can get different conclusions from the same model and the same data?

Carol performs another experiment. She starts by rolling a die to see if she'll do it like Alice (even) or like Bob (odd). So with 50% probability she will throw the loaded coin 3 times and with 50% probability she will do it until she gets a tail. She gets 5 on the die, so she throws the loaded coin three times and gets 'HTH'. How will you select your estimator for p? Do you use the results you obtained for Bob? Do you repeat your analysis considering the mixture of both experiments?


This is some seriously dedicated trolling.


> trolling

That's a contribution to this discussion about the foundations of probability, pure math, applied math, probability theory, statistics, econometrics, or just an insult?


It's an estimate.


If your point is that all branches of statistics are built on the same axiomatic foundations of probability, of course I agree.

> To respond to your question about sample size, I'd have to look into some of the details of your question. And I can say now, that question, those details, and any answer all have essentially nothing to do with the axiomatic foundation of probability I gave; the answer is the same or essentially so independent of those foundations down in the deep sub basement of the subject.

This has been my point all along.


Frequentists and Bayesians care about two different likelihood functions:

* Frequentists care about p(evidence | parameters), and interpret probability as a measure over subsets of the counterfactual set of repeated trials (usually independently identically distributed) produced by their model.

* Bayesians care about p(parameters | evidence), and interpret probability as "belief" or "propensity to bet". This is, of course, philosophically ridiculous, since they proceed to ground rational belief in Bayesian statistics. What they are really doing is exactly what their likelihood function says: taking a measure over subsets of the counterfactual set of possible worlds which could have produced their evidence.

The frequentists have the advantage of their methods being more computationally tractable. The Bayesians have the advantages of intuitiveness and of yielding more accurate inferences from the same limited data-sets. Pick the tool you need and remember what you're taking a measure over!


Nice article. In this direction my group has been trying to help teach that you tend to need to be familiar with both frequentist and Bayesian thought (you can't always choose one or the other: http://www.win-vector.com/blog/2013/05/bayesian-and-frequent... ) and that Bayesianism only appears to be the more complicated of the two ( http://www.win-vector.com/blog/2014/07/frequenstist-inferenc... ).


I don't understand why Bayesian statistics needs to be an "-ism", and still less why other statistics needs to be an "-ism" too. I don't understand why people feel the need to line up on one side or the other or get so worked up about it. Other branches of mathematics seem to avoid this kind of thing; they have no problem with the idea that there are different ways of doing the same thing.

It actually discourages me from learning more about Bayesian statistics, because the whole thing sometimes comes off as a cult.


> I don't understand why Bayesian statistics needs to be an "-ism", and still less why other statistics needs to be an "-ism" too. I don't understand why people feel the need to line up on one side or the other or get so worked up about it.

Don't get hung up on the terminology, instead pay attention to the ideas behind the terms. The Frequentist and Bayesian approaches are very different, and produce different outcomes, so they deserve to be understood and their differences sorted out.

For example, and without providing all the technical details, using a Frequentist analysis a decision was made to recommend breast cancer screening x-rays for women over a given age. Later, after serious problems arose, a Bayesian approach showed that the false positive ratio was 5 to 1 (5 false positives to 1 real cancer detection), which meant many more people were given the false news that they had cancer than those who actually did.

https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Bayes_...

This is not to exalt the Bayesian approach over the Frequentist, because both have their place; it is only to show how dramatic the difference can be.
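
(The arithmetic behind that kind of result, with illustrative numbers of my own rather than the actual screening figures: even a fairly accurate test applied to a low-prevalence population produces several false positives for every true detection.)

  # Sketch of the base-rate effect via Bayes' rule; the prevalence, sensitivity,
  # and specificity below are invented for illustration, not the real figures.
  prevalence = 0.008                # P(cancer)
  sensitivity = 0.90                # P(positive | cancer)
  specificity = 0.965               # P(negative | no cancer)

  p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
  p_cancer_given_pos = sensitivity * prevalence / p_pos
  print(p_cancer_given_pos)         # ~0.17: about 5 false positives per detection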


It seems like you're confusing Bayes' Theorem with Bayesian statistics. Bayes probably wasn't a Bayesian, and everybody uses Bayes' theorem.


> It seems like you're confusing Bayes' Theorem with Bayesian statistics.

Yes, a tempest in a teapot. Anyone caught using Bayes' name without qualifying the use is placed in the same position as someone referring to the Victorian Era without mentioning that Victoria wasn't a Victorian. In most cases, it's not worth the digression.


What is the theoretical justification for taking a completely flat prior? "If we set the prior P(Ftrue)∝1 (a flat prior),"

There's no probability distribution which is constant over the whole real line. Is the idea that we can pick a distribution which is constant over an arbitrarily large (but finite) interval around the observed data, and so in practice, we may get results arbitrarily close to those given?


I'd say the justification has two parts. First is the Bernstein–von Mises theorem (priors don't matter once you have enough data, as long as you didn't violate Cromwell's rule by using zeros). The second part is that improper priors are considered okay, as long as you check that the posterior corresponds to a sensible distribution.
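
(A small check of that "improper prior, proper posterior" point in a setting of my choosing: a single observation from a unit-variance normal with unknown mean. Under a flat prior the posterior is just the normalized likelihood, a proper Normal(x, 1); a broad proper prior gives nearly the same answer, which is the Bernstein–von Mises flavor of the prior washing out.)

  # Sketch: flat (improper) prior vs. broad Normal prior for a normal mean.
  # Model and numbers are mine: x | mu ~ Normal(mu, 1), one observation x = 2.5.
  x, tau = 2.5, 10.0                # tau = std of the broad Normal(0, tau^2) prior

  # Flat prior: posterior proportional to the likelihood => Normal(x, 1), proper.
  flat_mean, flat_var = x, 1.0
  # Broad conjugate prior: the standard Gaussian-Gaussian update.
  post_var = 1.0 / (1.0 / tau**2 + 1.0)
  post_mean = post_var * x
  print(flat_mean, flat_var)        # 2.5, 1.0
  print(post_mean, post_var)        # ~2.475, ~0.990 -- nearly the same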


If I understand correctly, assuming all real numbers have equal prior probability, then no matter how much data you gather, it's still infinitely unlikely that the true value will exist within any finite range. E.g. the probability of the true value being between -100! and +100! is 0. If you draw from the distribution you will always draw "infinities".


That is why you have to check. If P(data|param) is concentrated, then even for a uniform prior on the real line you can have P(param|data) proportional to a sensible distribution. But you have to check for the specific model and data (say P(data|param) = c e^{-(data-param)^2}); it isn't enough to verbally work through infinities.


Probabilists often think of a uniform distribution over the whole real line as giving an infinitesimal (but non-zero) probability at each point. The intuition is very similar to the way physicists understand the Dirac delta function. Like the Dirac delta, an improper prior doesn't have any formally acceptable definition as a stand-alone mathematical object [0], but it does have a good definition that covers the way it's used.

Now, why does a uniform distribution correspond to an "uninformative" or "unbiased" prior in the first place? Because the uniform is the unconstrained maximum-entropy distribution. If you don't have rock-solid intuition about the concept of entropy, I recommend starting with Arieh Ben-Naim's book [1].

0. But with nonstandard analysis, it's a simple thing to give a rigorous, intuitive definition of a uniform distribution whose support includes as much of the real line as we could possibly care about.

1. A Farewell to Entropy, http://www.amazon.com/FAREWELL-ENTROPY-Statistical-Thermodyn.... It's a probability book masquerading as a physics book.
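
(A quick numerical nudge for the maximum-entropy point, on a finite set where everything is well defined: among a few distributions on four outcomes (my examples), the uniform one has the largest Shannon entropy, log 4.)

  # Sketch: on a finite outcome set, the uniform distribution maximizes entropy.
  import numpy as np

  def entropy(p):                   # Shannon entropy in nats
      p = np.asarray(p, dtype=float)
      p = p[p > 0]
      return -np.sum(p * np.log(p))

  for p in ([0.25, 0.25, 0.25, 0.25],     # uniform
            [0.4, 0.3, 0.2, 0.1],
            [0.7, 0.1, 0.1, 0.1],
            [1.0, 0.0, 0.0, 0.0]):
      print(p, round(entropy(p), 4))      # uniform gives ~1.3863 = log(4)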


Actually, kind of the opposite idea. So-called "uninformative priors" can be considered a form of model regularization in that they prevent the model from fitting the data too well... they spread the posterior distribution out. Weakly informative priors like a broad normal distribution are also popular and may yield better results.


"37 Ways to More Accurately Read the Bones You're Casting to Predict the Harvest"


Does anyone know how to make the formulas render properly? Even using the iPython notebook viewer hasn't helped.


In your JavaScript console, run:

  var s = document.createElement('script');
  s.src = 'https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML';
  document.body.appendChild(s);
The page is pointing to the old MathJax CDN, which was decommissioned on July 31: http://www.mathjax.org/changes-to-the-mathjax-cdn/



