I said I’d comment on Birnbaum’s letter to Nature (linked in my last post), which is relevant to today’s Seminar session at the LSE*, as well as to Normal Deviate’s recent discussion of frequentist inference in terms of constructing procedures with good long-run “coverage”. (It is also relevant to the current U-Phil.)

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref. 2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice. (emphasis mine)

While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum

New York University

Courant Institute of Mathematical Sciences,

251 Mercer Street,

New York, NY 10012

Possibly Birnbaum’s *confidence concept,* sometimes written (Conf), is what Normal Deviate has in mind (as a key requirement of frequentist inference?). In Birnbaum 1977 (24), he states it more fully as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.

Birnbaum says N-P methods do not have “concepts of evidence” (a term that he seems to have invented), essentially because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of such and such founder (not to mention that so many of their statements are tied up with personality disputes). That appears to be what Birnbaum is doing in erecting Conf to capture how he thinks the methods are used for “informative” scientific inference in practice.

Still, since Birnbaum’s (Conf) appears to allude to pre-trial error probabilities, I regard (Conf) as too “behavioristic”. Some of his papers hint at the possibility that he would have wanted to use it in a (post-data) assessment of how well (or poorly) various claims were actually tested. (Aside from that, he also leans toward a focus on simple statistical hypotheses, though Conf need not be so restricted.)

I think that Fisher (1955) is essentially correct in maintaining that “When, therefore, Neyman denies the existence of inductive reasoning he is merely expressing a verbal preference”. It is a verbal preference one can also find in Popper’s view of corroboration. (He, and current day critical rationalists, also hold that all reasoning is deductive, and that probability arises to evaluate degrees of severity, well-testedness or corroboration, not inductive confirmation.) This blog may be searched for more on Popper and the rest….

*Thanks to all who attended! Feel free to write with questions: error@vt.edu

[1] Edwards, A. W. F., Nature, 222, 1233 (1969).

[2] Birnbaum, A., J. Amer. Stat. Assoc., 57, 269 (1962). [Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.]

[3] Birnbaum, A., in Philosophy, Science and Method: Essays in Honor of Ernest Nagel (edited by Morgenbesser, S., Suppes, P., and White, M.) (St. Martin’s Press, NY, 1969).

[4] Likelihood, in International Encyclopedia of the Social Sciences (Crowell-Collier, NY, 1968).

Birnbaum, A. (1977). “The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory“. *Synthese* 36 (1) : 19-49.

In my view, any attempt to frame and formalize Birnbaum’s confidence concept in terms of pre-data error probabilities α and (1 – β) is likely to raise numerous problems, including the nature of J and H (simple, composite, whether they constitute a partition of the parameter space, etc.), the appropriateness of pre-data error probabilities for an evidential account, etc., etc.

Mayo’s severity assessment, however, can provide such an evidential account by circumventing these difficulties using post-data error probabilities and framing the N-P hypotheses as a partition of the parameter space.

Let me illustrate what I have in mind using a very simple empirical example from next week’s assignment for a course on “Empirical Research Methods in Economics”. Data on newborns (n=9869) for a certain locality and time period, divided into male (5152) and female (4717), are modeled in the context of a simple (IID) Bernoulli model with (X=1)={male} and mean “theta”.

The substantive hypotheses of interest are:

(H) Arbuthnot: theta=.5 and (J) Bernoulli (Nicholas): theta=(18/35).

For statistical inference purposes, however, all possible values of theta are of interest, and thus the best way to avoid fallacious results is to take one of the two substantive values as the basis for defining the null, say H, and define the alternative to include the other value (J) with the difference g*=((18/35) – .5) as the discrepancy of substantive interest:

H0: theta (less than or equal to) .5 vs. H1: theta > .5

Assuming my calculations are correct, the relevant test statistic yields d(x0)= 4.379>0 with a p-value=.000006; a clear rejection of H0.
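The reported figures can be checked with a short sketch. It assumes the usual normal approximation to the Bernoulli sample mean, with the standard error evaluated at the null boundary theta=.5; the comment does not say which formula was used, so the variable names and that choice are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n, males = 9869, 5152            # counts quoted in the example
theta0 = 0.5                     # boundary of H0: theta <= .5

xbar = males / n                 # observed proportion of male births
se0 = sqrt(theta0 * (1 - theta0) / n)    # standard error at theta0 (assumed choice)
d_x0 = (xbar - theta0) / se0     # standardized distance from the null boundary
p_value = 1 - NormalDist().cdf(d_x0)     # one-sided, upper-tail p-value

print(f"d(x0) = {d_x0:.3f}, p-value = {p_value:.6f}")
```

Under these assumptions the output agrees with the d(x0) and p-value stated above, up to rounding.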

This indicates that the inferential claim that “passed” is of the generic form: theta > .5 + g1, for g1>0.

The post-data severity assessment of this inferential claim for g1=g* is: SEV(theta > .5 + g*)=.938,

which indicates strong evidence for the Bernoulli substantive hypothesis, and clear evidence against Arbuthnot’s hypothesis.
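One way to arrive at the severity figure is sketched here. It assumes SEV(theta > theta1) is computed as P(d(X) ≤ d(x0); theta = theta1) under a normal approximation, with the standard error evaluated at theta1 = 18/35; both the formula and the variable names are my reading of the example rather than something the comment spells out.

```python
from math import sqrt
from statistics import NormalDist

n, males = 9869, 5152
xbar = males / n                 # observed proportion of male births
theta1 = 18 / 35                 # .5 + g*, the Bernoulli value

# Probability of a result no larger than the one observed, were theta = theta1.
se1 = sqrt(theta1 * (1 - theta1) / n)    # standard error at theta1 (assumed choice)
sev = NormalDist().cdf((xbar - theta1) / se1)

print(f"SEV(theta > {theta1:.4f}) = {sev:.3f}")
```

Evaluating the standard error at .5 or at the observed proportion instead changes the result only beyond the third decimal place for these data.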

Note that recasting the N-P hypotheses as:

H0: theta (greater or equal to) (18/35) vs. H1: theta < (18/35)

will yield a different value for d(x0) and an acceptance of H0, but identical severity assessments; i.e., the evidence for or against H and J will not change.
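A rough check of this invariance, under the same normal approximation: the sketch recomputes d(x0) against the recast boundary 18/35 and then evaluates the severity of the claim theta > 18/35. The 5% significance level is an assumption made only to decide accept or reject; the comment gives no level.

```python
from math import sqrt
from statistics import NormalDist

n, males = 9869, 5152
xbar = males / n
theta_b = 18 / 35                        # boundary of the recast H0: theta >= 18/35

se_b = sqrt(theta_b * (1 - theta_b) / n)
d_recast = (xbar - theta_b) / se_b       # distance from the recast boundary
c_alpha = NormalDist().inv_cdf(0.05)     # assumed 5% level: reject only if d < c_alpha
reject = d_recast < c_alpha

# The claim being assessed is still theta > 18/35, so its severity is again
# P(d(X) <= d(x0); theta = 18/35).
sev = NormalDist().cdf(d_recast)

print(f"d(x0) = {d_recast:.3f}, reject H0: {reject}, SEV = {sev:.3f}")
```

The test statistic and the accept/reject outcome change with the framing, while the severity of theta > 18/35 comes out the same as before.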

Aris: Thanks for a numerical example. So, for instance, SEV(mu > .514) is ~ .93. .514 would also be the .93 lower confidence bound*. As you know, Birnbaum also tried his hand (as does Cox) at a series of confidence bounds long before people like Kempthorne and Poole. But I don’t know if he linked these directly with Conf, although it would make sense if he did.

*In some cases this limit would differ.
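The correspondence with a lower confidence bound can be sketched numerically. This assumes a one-sided Wald bound, xbar minus a normal quantile times the standard error estimated at xbar, taken at a confidence level equal to the severity; these are illustrative choices, not something stated in the thread.

```python
from math import sqrt
from statistics import NormalDist

n, males = 9869, 5152
xbar = males / n
theta1 = 18 / 35

# Severity of theta > 18/35, used here as the confidence level.
level = NormalDist().cdf((xbar - theta1) / sqrt(theta1 * (1 - theta1) / n))

# One-sided lower Wald bound at that level (standard error estimated at xbar).
se_hat = sqrt(xbar * (1 - xbar) / n)
lower = xbar - NormalDist().inv_cdf(level) * se_hat

print(f"level = {level:.3f}, lower bound = {lower:.4f}, 18/35 = {theta1:.4f}")
```

The bound and 18/35 agree to about four decimal places here; the small gap comes from evaluating the standard error at xbar rather than at 18/35, which is one way the two limits can differ.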

Dr. Spanos,

SEV(theta > .5 + g*)=.938 but what would SEV(.5000000001+g*>theta>.5+g*) be?

Why do you say you “regard (Conf) as too ‘behavioristic’.”?

BLOG ADMIN POSTING FOR D. MAYO:

Yes, I wrote that too quickly (relying on past discussions of “behavioristic vs evidential” construals of Neyman-Pearson (N-P) statistics). I meant basically that if only these pre-data error probabilities are required (even though they are an improvement over merely saying we require good long-run error rates), untoward inferences can still be licensed. Moreover, there are distinctions we would want to make based on sample size and actual outcomes.

EGEK (p. 377), Note 4.

“Birnbaum’s approach, incomplete at the time of his death, sought to make explicit the correspondence between an NP result and a statement about the strength of evidence (e.g., conclusive, very strong, weak, or worthless). For example, he interprets reject H against J with error probabilities a, b equal to .01 and .2, respectively, as very strong statistical evidence for H as against J. A main shortcoming, as I see it, is that it interprets a test output—say, reject H—from two tests with the same a, b as finding equally strong evidence for J. Depending upon the particular outcome and the test’s sample size, the two rejections may constitute very unequal tests of J.”
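To make the point about outcomes concrete, here is a small sketch built on the birth data from the comments above. It applies one and the same test of theta ≤ .5 to two outcomes and compares the severity of the claim theta > 18/35. The second count (5060 males) is invented purely for contrast, and the .01 level and the helper `assess` are assumptions of the sketch.

```python
from math import sqrt
from statistics import NormalDist

n, theta0, theta1 = 9869, 0.5, 18 / 35
alpha = 0.01                                  # assumed level, shared by both rejections
c_alpha = NormalDist().inv_cdf(1 - alpha)     # reject H0: theta <= .5 when d(x0) > c_alpha

def assess(males):
    """Return (d(x0), rejects H0?, SEV(theta > theta1)) for a count of male births."""
    xbar = males / n
    d = (xbar - theta0) / sqrt(theta0 * (1 - theta0) / n)
    sev = NormalDist().cdf((xbar - theta1) / sqrt(theta1 * (1 - theta1) / n))
    return d, d > c_alpha, sev

for males in (5152, 5060):                    # 5060 is a made-up outcome for contrast
    d, rejects, sev = assess(males)
    print(f"males={males}: d(x0)={d:.3f}, reject={rejects}, SEV={sev:.3f}")
```

Both outcomes reject H0 at the same pre-data error probabilities, yet the severity for theta > 18/35 is high for the observed count and below one half for the hypothetical one.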