The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory
Author(s): Allan Birnbaum
Source: Synthese, Vol. 36, No. 1, Foundations of Probability and Statistics, Part I (Sep., 1977), pp. 19-49
Published by: Springer
Stable URL: http://www.jstor.org/stable/20115212
ALLAN BIRNBAUM*

THE NEYMAN-PEARSON THEORY AS DECISION THEORY, AND AS INFERENCE THEORY; WITH A CRITICISM OF THE LINDLEY-SAVAGE ARGUMENT FOR BAYESIAN THEORY

1. INTRODUCTION AND SUMMARY
The concept of a decision, which is basic in the theories of Neyman and Pearson, Wald, and Savage, has been judged obscure or inappropriate when applied to interpretations of data in scientific research, by Fisher, Cox, Tukey, and other writers. This point is basic for most statistical practice, which is based on applications of methods derived in the Neyman-Pearson theory or analogous applications of such methods as least squares and maximum likelihood.
Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to 'decisions' in a concrete literal sense as in acceptance sampling; and evidential, applicable to 'decisions' such as 'reject H1' in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest. Typical standard practice is characterized as based on the confidence concept of statistical evidence, which is defined in terms of evidential interpretations of the 'decisions' of decision theory. These concepts are illustrated by simple formal examples with interpretations in genetic research, and are traced in the writings of Neyman, Pearson, and other writers. The Lindley-Savage argument for Bayesian theory is shown to have no direct cogency as a criticism of typical standard practice, since it is based on a behavioral, not an evidential, interpretation of decisions.

2. TWO INTERPRETATIONS OF 'DECISIONS'
Statistical decision problems are the subject of major theories of modern statistics, and have been developed with great precision and generality on the mathematical side. But in the view of many applied and theoretical statisticians, the scope and interpretation of decision theories has remained obscure or doubtful in connection with interpretations of data in typical scientific research situations. The reason for concern here is that most statistical methods applied to research data have been given their most systematic mathematical justification within the Neyman-Pearson theory; and that theory in turn has been given its most systematic mathematical development within the (non-Bayesian) statistical decision theory initiated by Wald.

Synthese 36 (1977) 19-49. All Rights Reserved. Copyright © 1977 by D. Reidel Publishing Company, Dordrecht, Holland.

In this development the alternative statistical hypotheses which may be 'rejected' or 'accepted' on the basis of a testing procedure are identified with the respective 'decisions' appearing in the formal model of a decision problem. Similarly, each confidence interval which may be determined by an estimation procedure is identified with one of the 'decisions' of a model. This leads to questions about the scope and interpretation of the 'decision' concept which have been discussed by a number of writers: In what sense, if any, is it appropriate to regard the results of typical scientific data analysis based on standard methods of testing and estimation as statistical decisions?
We shall treat this question in a way which is self-contained, and more systematic in some respects than previous discussions. Our intention is to complement and clarify previous discussions in certain respects, without attempting to review or summarize them. The interested reader is urged to read or re-read such earlier discussions, particularly those of Tukey (1960), Cox (1958, p. 354), and others cited below.
The terms 'decide' and 'decision' were used heavily by Neyman and Pearson in the series of joint papers which initiated their theory, notably in the preliminary exploratory paper of 1928, and in the 1933 paper in which problems of testing statistical hypotheses were first formulated in a way which can be regarded as a case of statistical decision problems.
A frequently cited ('paradigm') type of application of statistical decision theories and of the Neyman-Pearson theory is that of industrial acceptance sampling (Neyman and Pearson, 1936, p. 204; Wald, 1950, pp. 2-3): A lamp manufacturer must decide whether or not to place a batch of lamps on the market, on the basis of tests on a sample from the batch.
The simplest models of decision problems are characterized fully, for our present purposes of discussion, by schemas of the following form:

Simple hypotheses: H1, H2
Possible decisions: d1, d2
Error probabilities: α = Prob[d1|H1], β = Prob[d2|H2]

A simple hypothesis is any probability distribution which may be defined over the range of possible outcomes (the sample space) of an experiment or observational procedure. For example, the lamp manufacturer may be interested in the simple hypothesis H1 that a batch of lamps contains exactly 4% defective lamps, and in the alternative simple hypothesis H2 that the batch contains exactly 10% defectives, possibly because a batch is considered definitely good if it has 4% or fewer defectives, and is considered definitely bad if it has 10% or more defectives.
For a given batch, his possible decisions are:
d1: withhold the batch from the market; and
d2: place the batch on the market.
The performance of any decision function (that is any rule for using data on a sample of lamps from the batch to arrive at a decision d1 or d2) is characterized fully, under H1 and H2, by the respective error probabilities α and β defined in the schema. (An example of a decision function here is the rule: Place the batch on the market if and only if fewer than 3 defectives are found in a random sample of 25 lamps.)
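As a rough numerical illustration of the example rule just quoted, the following sketch computes α and β. It assumes, as the text does not state, that the number of defectives in the sample follows a binomial distribution (that is, that the batch is large relative to the sample of 25). The function names are hypothetical and serve only this illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def error_probabilities(n=25, cutoff=3, p1=0.04, p2=0.10):
    """Error probabilities of the rule: place the batch on the market (d2)
    if and only if fewer than `cutoff` defectives are found among n lamps.

    Assumption (not from the text): binomial sampling model.
    alpha = Prob[d1 | H1], the chance of withholding a batch with defect rate p1.
    beta  = Prob[d2 | H2], the chance of marketing a batch with defect rate p2.
    """
    prob_d2_under_h1 = binom_cdf(cutoff - 1, n, p1)
    prob_d2_under_h2 = binom_cdf(cutoff - 1, n, p2)
    alpha = 1.0 - prob_d2_under_h1
    beta = prob_d2_under_h2
    return alpha, beta


alpha, beta = error_probabilities()
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Under this binomial assumption the rule gives α of roughly 0.08 and β of roughly 0.54, so the pair (α, β), rather than either value alone, describes how the rule performs under the two hypotheses.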
Consider the interpretation of the decisions d1 and d2 which appear in the schema, in its application to the problem of the lamp manufacturer. When the manufacturer places a batch of lamps on the market, he performs an action. If he does so after considering one or more alternative actions as possible, then he has also taken a decision in favor of that action. Here the terms 'decision' and 'action' refer to the behavior of the manufacturer in our example, in a simple, direct and literal way. We shall use the term behavioral interpretation of the decision concept to refer to any comparably simple, direct, and literal interpretation of a 'decision' appearing in a formal model of a decision problem.1
The behavioral interpretation must be criticized and rejected, in the view of many statisticians and investigators, when such a schema and model are applied in a typical context of scientific research in connection with standard methods of data analysis. Convenient examples may be drawn from genetic linkage studies, which have the general scientific goal of extending knowledge of the 'chromosome map' which largely characterizes a species or strain in classical Mendelian genetics.2
Consider an investigator who judges that his linkage studies provide very strong evidence that two genetic loci lie on the same chromosome (with the usual appreciation that future studies could conceivably reverse his judgement); and who reports his conclusion, together with a summary of his data and his interpretation of it, based in part on use of a test determined by applying the Neyman-Pearson theory (as in Morton, 1955, or Smith, 1953, pp. 180-3), in a research journal.
His conclusion favoring the scientific hypothesis of linkage corresponds in some way to a 'decision' d1 in a schema like that above, where now H1 is the statistical hypothesis characterizing no linkage. It is the nature of this correspondence which we wish to examine carefully.

3. STATISTICAL EVIDENCE, AND ITS INADEQUATE REPRESENTATION BY THE 'DECISIONS' OF DECISION THEORY

The problem of testing statistical hypotheses is often described (in the Neyman-Pearson papers and elsewhere) as a problem of deciding whether or not to 'reject a statistical hypothesis' such as H1 (e.g. Neyman and Pearson, 1928, p. 1; 1933, p. 291). This suggests the interpretation given by most writers who formulate problems of testing as decision problems:
d1: reject H1
d2: do not reject H1.
But this interpretation leads immediately to the question: What is the interpretation of 'reject H1' in, for example, the situation of the investigator of our example who concluded that linkage was present?
Even if the geneticist uses typical terminology such as 'reject H1, the hypothesis of no linkage,' neither he nor his colleagues understand that
he is making a decision in any literal and unqualified sense which could be given a behavioral interpretation closely comparable with that of the lamp manufacturer's decision in the example above.
Rather, the decision-like term 'reject' expresses here an interpretation of the statistical evidence, as giving appreciable but limited support to one of the alternative statistical hypotheses. This evidential interpretation of the experimental results is in principle based on a complete schema of the kind indicated above, even when this is only implicit. In this essential respect, the identification suggested above between 'reject H1' and the single element d1 of the schema is inadequate, and is misleading when taken out of the context of the schema. Such cases of statistical evidence are adequately represented by symbols like
d1*: (reject H1 for H2, α, β)
and
d2*: (reject H2 for H1, α, β),
each of which carries an indication of the complete schema which serves as the conceptual frame of reference for the interpretation of statistical evidence here.
The symbols d1* and d2* represent in prototype typical interpretations and reports of data treated by standard statistical methods in scientific research contexts.
We shall use the term evidential interpretation of the decision concept to refer to such applications of models of decision problems; and we shall use the term confidence concept of statistical evidence to refer to such interpretations of statistical evidence.
In the view of this writer and some others, although typical applications of standard statistical methods in research are of the kind we have illustrated, the central concepts guiding such applications and interpretations (for which we have introduced the terms in italics above) have not been defined within any precise systematic theory of statistical inference. Rather, these concepts exist and play their basic roles largely implicitly and unsystematically, in guiding applications and interpretations of standard methods, and in guiding the development of new statistical methods.
We shall not offer any precise theoretical account of these concepts, nor even claim that such an account can be given. Our aims are limited to
illustrating the existence and wide scope of the confidence concept, and clarifying some of its features.
The confidence concept seems to be in part a primitive intuitive concept of statistical evidence associated with schemas of the above kind, which may be expressed in the following prototypic formulation:
(Conf): A concept of statistical evidence is not plausible unless it finds 'strong evidence for H2 as against H1' with small probability (α) when H1 is true, and with much larger probability (1-β) when H2 is true.
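The formulation (Conf) can be made concrete with a small sketch. (Conf) itself leaves 'small' and 'much larger' unquantified, so the thresholds `max_alpha` and `min_ratio` below are purely hypothetical choices made for illustration, as are the function names; none of them comes from the text.

```python
def conf_profile(alpha, beta):
    """Probabilities with which 'strong evidence for H2 as against H1'
    is reported: alpha when H1 is true, 1 - beta when H2 is true."""
    return {"under_H1": alpha, "under_H2": 1.0 - beta}


def plausible_under_conf(alpha, beta, max_alpha=0.1, min_ratio=5.0):
    """Hypothetical check of (Conf).

    The thresholds are illustrative assumptions; the text does not
    quantify 'small' or 'much larger'.
    """
    profile = conf_profile(alpha, beta)
    if profile["under_H1"] > max_alpha:
        return False
    if profile["under_H1"] == 0.0:
        # Never reported when H1 is true; require only that it can occur under H2.
        return profile["under_H2"] > 0.0
    return profile["under_H2"] / profile["under_H1"] >= min_ratio


print(plausible_under_conf(0.05, 0.05))  # small alpha, large 1 - beta
print(plausible_under_conf(0.5, 0.5))    # same report probability under H1 and H2
```

The second call shows the degenerate case in which the report is equally probable under both hypotheses, so that it cannot discriminate between them.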
Examples. The following are simple examples of the confidence concept of statistical evidence. They may be thought of in the context of the interpretations of the investigation of genetic linkage described above. The interpretations of statistical evidence are expressed in the first person because they illustrate in simple cases the writer's own practice and thinking concerning statistical evidence, based in part on some experience as an independent interpreter of genetic data and developer of some new methods of Mendelian theory and data analysis (Birnbaum, 1972), as well as on extensive observation and analysis of general statistical practice and thinking.
In my view these examples, and their interpretations in following sections, are typical of widespread statistical thought and practice, with the qualification that they are given here with a degree and style of explicit expression which is unusual. The interested reader will of course make an independent judgment about this.
The first person form is somewhat analogous to the usage of Savage (1954) whose Bayesian decision theory is developed from the standpoint of a generic rational person 'you'. In a following section these examples will be referred to in the course of a critical discussion of some assumptions of Savage's and Wald's decision theories.
Symbols of the form d1* and d2* introduced above are used to present the examples.
(1) I interpret
(reject H1 for H2, 0.06, 0.08)
as strong statistical evidence for H2 as against H1. Similarly I interpret
(reject H2 for H1, 0.06, 0.08)
as strong statistical evidence for H1 as against H2.
(2) I interpret
(reject H1 for H2, 0, 0.2)
as conclusive evidence for H2 as against H1. Here the zero value of the error probability of the first kind indicates that the observational results are incompatible with H1.
(3) I interpret
(reject H1 for H2, 0.01, 0.2)
as very strong statistical evidence for H2 as against H1.
(4) I interpret
(reject H2 for H1, 0, 0.2)
as weak statistical evidence for H1 as against H2. Here the relatively large value 0.2 of the error probability of the second kind suggests relative skepticism concerning this evidence against H2.
(5) I interpret
(reject H1 for H2, 0.5, 0.5)
as worthless statistical evidence. It is no more relevant to the statistical hypotheses considered than is the toss of a fair coin, since the error probabilities (0.5, 0.5) also represent a model of a toss of a fair coin, with one side labeled 'reject H1' and the other 'reject H2'. If such a case arose in practice, our comments would lead us to judge the experiment, or at least the adopted test, to be worthless.
The distinction between the two interpretations of 'decision' may be epitomized (as Bernard Norton has pointed out) by contrasting the ordinary usages:3
behavioral: 'decide to' act in a certain way,
and
evidential: 'decide that' a certain hypothesis is true or is supported by strong evidence.
Concerning the different (pragmatist) identification of 'decide that A is true or well supported' with 'decide to act as if A is true or well supported', it will be clear from discussion above and below that we reject any such simple identification, and regard conclusions and statistical evidence as having autonomous status and value.
The preceding considerations were emphasized clearly, though less formally, by Cox (1958, p. 354) as follows:
It might be argued that in making an inference we are 'deciding' to make a statement of a certain type about the populations and that, therefore, provided the word decision is not interpreted too narrowly, the study of statistical decisions embraces that of inferences. The point here is that one of the main general problems of statistical inference consists in deciding what types of statement can usefully be made and exactly what they mean. In statistical decision theory, on the other hand, the possible decisions are considered as already specified.
Further analysis of the distinctions between the two interpretations of the 'decisions' of decision theory is provided in those sections below which treat certain assumptions underlying Savage's and Wald's decision theories. In particular, it is shown that if one wishes to regard evidential statements represented, for example, by
d1*: (reject H1 for H2, 0.05, 0.05)
as 'decisions' in a formal model of a decision problem, then certain basic assumptions of statistical decision theories are incompatible with certain basic properties and meanings of those evidential statements.

4. STATISTICAL EVIDENCE AS ONE AMONG SEVERAL CONSIDERATIONS REGARDING SUPPORT OF SCIENTIFIC CONCLUSIONS
As Tukey (1960) has emphasized, a conclusion reached in a scientific investigation, such as the conclusion of our geneticist that two loci are linked, requires not only
(a) statistical evidence of sufficient strength concerning the statistical hypotheses of interest.
In addition the investigator (or community of investigators) must judge
(b) the adequacy of the mathematical-statistical model, which serves as the conceptual frame of reference for the interpretation of the statistical evidence, to represent the research situation in relevant respects; and
(c) the compatibility with other knowledge and evidence of a conclusion that may be supported by statistical evidence provided by the current investigation (for example, strong statistical evidence against the statistical hypotheses representing no linkage).4
These important considerations prevent us from regarding a scientific conclusion as being determined in any simple or exclusive way by the statistical evidence which may support it.
The Neyman-Pearson theory introduced a kind of formal symmetry into the formulation of problems of testing statistical hypotheses, by requiring explicit specification of alternative statistical hypotheses and error probabilities of the second kind (e.g. H2 and β in our schema) to complement the traditional specification (e.g. just H1 and α in our schema). But in many early and modern applications of statistical tests, there is a definite lack of symmetry in the status of the alternative statistical hypotheses considered, related to a lack of symmetry in the status or significance of corresponding scientific hypotheses or possible conclusions. For example in many cases one scientific hypothesis is regarded as established on the basis of current knowledge, or at least as acceptable or plausible, unless and until sufficiently clear and strong evidence against it appears. Clearly such considerations lie outside the scope of mathematical statistical models and statistical evidence in the sense discussed above, but rather in the scope of the scientific background knowledge and judgment referred to in (b) and (c) above.
In traditional formulations of testing problems which preceded the Neyman-Pearson theory and which continue to appear prominently in various applications in applied and theoretical statistics, it may be more or less plausible to suppose that there is implicit, though not explicit, reference to alternative statistical hypotheses and corresponding error probabilities, as an implicit part of the basis for choice and reasonable interpretation of a test statistic; and possibly to suppose also that there is implicit if not explicit reference to possible alternative scientific hypotheses or possible conclusions corresponding to such implicit statistical hypotheses. The scope of the present paper does not extend to tests in such traditional formulations except to the extent that they may be regarded in an application as being interpreted, at least in principle, with plausible implicit, if not explicit, reference to some alternative statistical hypotheses. Such terms as 'standard statistical methods' and 'standard methods of testing statistical hypotheses', as used throughout this paper, must be understood with this important qualification to avoid confusion.

5. THE THEORETICAL AMBIGUITY OF THE NEYMAN-PEARSON THEORY

The Neyman-Pearson theory is interpretable in its mathematical form as a special restricted part of general statistical decision theory, as we have indicated above and will elaborate further below. As to the extra-mathematical interpretations, which relate that mathematical theory and form to applications, one may say that there are two Neyman-Pearson theories:
One is based on behavioral interpretations of the decision concept, and has been elaborated by Neyman in terms of his concept of inductive behavior as mentioned above. It is difficult or (in the view of the present writer and some others) impossible to discover or devise clear plausible examples of this interpretation in typical scientific research situations where standard methods are applied. (The interested reader will make an independent judgement about this, and may wish to consider the extensive and important contributions of Neyman himself to the interpretation of scientific data in several research areas.)
The second theory which makes use of the mathematical structure of the Neyman-Pearson theory is based on evidential interpretations of the 'decisions' in that theory, and has as its central concept what we have called the confidence concept of statistical evidence, a concept whose essential role is recognizable throughout typical research applications and interpretations of standard methods, but a concept which has not been elaborated in any systematic theory of statistical inference.
Since even the existence of this important distinction between two interpretations of the mathematical structure of the Neyman-Pearson theory is not very widely nor clearly appreciated, much of the theoretical obscurity and misunderstanding found in the statistical literature is not surprising. A simple step toward limiting this confusion and obscurity would be to make consistent use of terms which keep the distinction in view whenever necessary, such as 'confidence concept' and 'evidential' or 'behavioral' interpretation; and to avoid unqualified use, when ambiguity and confusion could result, of such standard terms as: the Neyman-Pearson theory (or approach, or school); and 'objectivist', 'frequentist',
'orthodox', 'classical', 'standard', and the like.
In the many applications where each interpretation seems to have some role, a sharp theoretical distinction between the two interpretations may have particular value in helping to clarify the purpose or purposes of the application and guide the adoption of appropriate methods. For example, new knowledge about a genetic linkage may have immediate value as a basis for the genetic counseling of a particular family. Here one can in principle consider two models of decision problems as having some scope in the same situation, one having 'decisions' interpreted in the literal behavioral sense (for example 'do not have another child' or 'do'); and the other model having 'decisions' with evidential interpretations (for example concerning statistical hypotheses related to possible scientific conclusions about genetic linkage).
Even if various details of the two models should correspond (for example the two decision functions adopted might, though they need not, be identical in form though different in kind of interpretation), the purposes and problems considered would be distinct, and hence properly characterized and treated by distinct theoretical concepts.
In other applications where there is a problem of decisions in the behavioral sense, one may seek conclusions (or strong statistical evidence) as a basis for making decisions judiciously. In such cases, if some formal model of a decision problem is considered to be an accurate model of the real situation in the relevant respects, one may argue that to consider conclusions (or statistical evidence) as such is at best superfluous, and at worst may distract from clear appreciation of the actual decision problem and accurate model. On the other hand, if it is not clear that any formal model of the decision problem has sufficient realism to be applied, then development of new knowledge (conclusions or statistical evidence) may naturally be sought as a basis for making decisions.5
The second example of the 1936 paper of Neyman and Pearson involves explicit consideration of both conclusions and related decisions, but is discussed so briefly and incompletely that I am unable to interpret it from the standpoint of the preceding paragraphs. No other examples of applications were discussed in the joint papers. Thus the joint papers contain no discussion of an application in which a scientific conclusion was the sole or primary object of an investigation. Various writings of E. S. Pearson (notably 1937, 1947, 1962) discuss applications in which both conclusions and decisions (in the behavioral sense) are of interest, with conclusions sought as a basis for making decisions.

6. THE CONCEPTS OF TESTS AND DECISIONS IN THE 1933 PAPER OF NEYMAN AND PEARSON

The 1933 paper of Neyman and Pearson begins (pp. 141-2) with explicit concern about the meanings of concepts and methods of testing. The authors discuss "What is the precise meaning of the words 'an efficient test of a hypothesis?' There may be several meanings."
No concept of an 'efficient test' had appeared in the preceding literature of testing, but the term 'efficient' had been introduced into mathematical statistics by Fisher in connection with his theory of estimation in the early 1920's.
Fisher's theory, with its striking mathematical power and conceptual depths and obscurities, stood in the background of the efforts of Neyman and Pearson to initiate a comparably systematic theory of tests, as they indicated in the introduction to their exploratory paper of 1928. Their plan to treat testing problems in an exact form (rather than by asymptotic approximations for the case of large samples, as Fisher had done) would eliminate some purely technical complications and thereby facilitate clarity concerning concepts such as 'efficient' or its analogues in a theory of tests.
On the side of applications, there was as much need for a systematic theory of tests as there had been for a more systematic theory of estimation, to guide investigators in choosing among possible alternative tests in problems of increasing complexity, where the common sense which had guided traditional testing practice faltered. (Neyman and Pearson began their 1930 paper with discussion of Romanovsky's 1928 paper which had given new distribution theory for several statistics for a standard testing problem, pointing out the open basic problem of "determining which is the most appropriate one to use in any given case.")
The 1933 paper supplied a definition of 'an efficient test' which is clear on the mathematical side, and is neutral in relation to the contrasting behavioral and evidential interpretations of 'decision' discussed above. An efficient test is defined as one in which the error probabilities (such as α and β in our schema) are minimized (jointly in some appropriate sense). Whether evidential or behavioral interpretations of 'decisions' are in view, such minimization of error probabilities would seem to be a clearly appropriate goal. No concept of 'an efficient test' has, even now, been proposed in terms of the earlier tradition of formulating testing problems (without reference to error probabilities under alternative hypotheses). In this sense one may say that it appears to have been 'necessary' to make some change in the traditional mathematical formulation of testing problems, as a basis for introducing a concept of an 'efficient test' which might guide applications and theoretical developments.
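The phrase 'minimized (jointly in some appropriate sense)' is not spelled out in the passage above. One conventional reading, adopted here only as an assumption, is to bound α and then make β as small as possible. The sketch below applies that reading to a family of cutoff rules for the lamp example (defect rates 4% and 10%, sample of 25), under an assumed binomial sampling model. The bound `max_alpha = 0.10` and all function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def error_probabilities(cutoff, n, p1, p2):
    """Rule: take decision d1 iff at least `cutoff` defectives are observed.
    alpha = Prob[d1 | H1], beta = Prob[d2 | H2] (binomial model assumed)."""
    alpha = 1.0 - binom_cdf(cutoff - 1, n, p1)
    beta = binom_cdf(cutoff - 1, n, p2)
    return alpha, beta


def best_cutoff(n=25, p1=0.04, p2=0.10, max_alpha=0.10):
    """Among cutoff rules with alpha <= max_alpha, return the one with the
    smallest beta, as a tuple (cutoff, alpha, beta).

    The bound max_alpha is an illustrative assumption, not a value from the text.
    """
    candidates = []
    for cutoff in range(0, n + 2):
        alpha, beta = error_probabilities(cutoff, n, p1, p2)
        if alpha <= max_alpha:
            candidates.append((beta, cutoff, alpha))
    beta, cutoff, alpha = min(candidates)
    return cutoff, alpha, beta


cutoff, alpha, beta = best_cutoff()
print(f"cutoff = {cutoff}, alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Raising the cutoff lowers α and raises β, so some convention of this kind is needed before 'efficient' picks out a single rule; a different convention could select a different cutoff.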
In any case, Neyman and Pearson met a problem of broad theoretical and practical scope by changing some of the terms of the problem, as original investigators have frequently done in all problem areas.6
Although some change in the mathematical formulation of testing problems seems to have been necessary, in the sense just indicated, the theoretical innovation of the Neyman-Pearson theory, the behavioral interpretation of tests, was not necessary in the following sense: An evidential interpretation has been associated with typical applications of tests in scientific research investigations in all periods of their use (which dates from 1710), without apparent discontinuity during the years following 1933 when the mathematical structure of the Neyman-Pearson theory became widely accepted as the new or improved mathematical basis for the theory of tests.
This observation suggests the questions: 'What roles or functions was the behavioral interpretation intended to serve?' and 'What functions has it served?' The joint papers suggest less than clear answers, while later
papers written separately by Neyman and Pearson suggest clearer answers which are different for the respective authors.
Although the 1933 paper begins, as we have noted, with concern about the meanings of concepts of testing, it discusses only a mathematical aspect of the meaning of an 'efficient test'; and the meaning of 'a test' (or a 'decision' such as 'reject H1') is not discussed systematically with regard to extra-mathematical interpretations. Brief but clear and contrasting behavioral and evidential interpretations appear:
Behavioral: "Such a rule tells us nothing as to whether in a particular case H is true when" ... "accepted" ... "or false when" ... "rejected." ... "But ... if we behave according to such a rule, then in the long run we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have" analogous assurance concerning the frequency of rejections of H when it is false. (p. 142.)
Evidential: 1. In the "method of attack ... in common use ... If P were very small, this would generally be considered as an indication that the hypothesis, H, was probably false, and vice versa." (p. 141.)
2. "Let us now for a moment consider the form in which judgements are made in practical experience. We may accept or we may reject a hypothesis with varying degrees of confidence; or we may decide to remain in doubt. But whatever conclusion is reached the following position must be recognized. If we reject H0, we may reject it when it is true; if we accept H0, we may be accepting it when it is false, that is to say, when really some alternative Ht is true." (p. 146.)
The authors' attitude toward evidential interpretations is not made quite clear. The preceding quotation from p. 142 gives approvingly the behavioral interpretation of a test in the new mathematical formulation, as against the traditional "method of attack ... in common use" (traditional mathematical formulation, with evidential interpretation). But the quotation from p. 146 (in a discussion not linked by the authors with that of pp. 141-2) describes approvingly the evidential interpretation of a test in the new mathematical formulation.
An interpretation which would reconcile this apparent discrepancy is to regard the behavioral interpretation as not intended to apply in any direct, literal, or concrete sense in a situation of scientific research, which would be incompatible with an evidential interpretation of the 'decisions' in question; but rather intended to apply in such a situation in a way which is heuristic or hypothetical, serving to explain the inevitably abstract theoretical meanings associated with the error probabilities, and evidential interpretations of formal 'decisions' such as 'reject H1', based on a formal model of a decision problem (test). Thus hypothetical behavioral interpretations may be regarded as playing a role in the inner theoretical core of the confidence concept.7

This interpretation of the relation between behavioral and evidential interpretations seems close to that expressed by E. S. Pearson in various writings (in particular 1937, 1947, 1955, 1962). Professor Pearson has kindly permitted the following quotations from unpublished notes which he wrote in April 1974, as comments on an earlier draft of the present paper. (The terms 'behavioral' and 'evidential' do not appear in the original notes; in their respective places there appear the terms 'literal' and 'elliptical', which were used in the earlier version of the present paper.)

[In the 1920's and 1930's]... my outlook as a practising statistician would have been what you term evidential. But to build such a structure one had to set out a mathematical theory ... which led to rules which, on the face of things, suggested a behavioral interpretation. I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations ...

It must happen frequently that a reader interested in an application where an evidential interpretation is appropriate, when he encounters a behavioral interpretation of a statistical method such as appears in many expository and theoretical works, supplies his own evidential reinterpretation of the given behavioral interpretation, if the writer has not supplied one, in order to relate the method cogently to his intended application and interpretation.

The 1920's and 1930's were a period of much critical concern with the meanings and possible meaninglessness of terms and concepts in the philosophy of science, psychology, and various other disciplines as well as in statistics. These concerns were usually pursued in terms of such doctrines as behaviorism, operationalism, or verificationism. Various writers applied these criteria with varying degrees of stringency, greater stringency entailing smaller scope and importance for theoretical and hypothetical concepts. Perhaps the widest and most lasting influences of these doctrines have been heightened appreciation of both the values and the limitations of such criteria for the analysis and development of a discipline, along with a balancing appreciation of the roles of essential theoretical, hypothetical, and perhaps even metaphysical concepts.
7. THE STATUS OF THE CONFIDENCE CONCEPT IN THEORY AND APPLICATIONS
As mentioned above, there is no precise mathematical and theoretical system which guides closely the wide use of the confidence concept in standard practice. (It is clear that further theoretical developments can not alter this situation. Cf. Birnbaum, 1969.) Rival approaches to the interpretation of research data (notably the likelihood and Bayesian approaches) offer attractive features of systematic precision and generality; but their basic concepts fail to satisfy those who prefer the confidence concept for the kind of theoretical control it provides over the objective error probabilities of interest (appearing in schemas like that above).8

The ad hoc aspects of the confidence concept are encountered in all applications, including that of testing genetic linkage discussed above. These aspects are related to its mathematical basis in the Neyman-Pearson theory as follows. In a given problem of two simple hypotheses, the problem of minimization of error probabilities α and β (solved by Neyman and Pearson in 1933) leads not to a unique best test or decision function but to a family of best tests, each of which has the smallest possible value of β among all tests with the same (or smaller) value of α, including for example the following points (α, β) representing respective best tests:

(0.01, 0.05), (0.02, 0.02), and (0.05, 0.01).
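As a minimal sketch of how such a family of best tests arises, the following enumerates the non-randomized likelihood-ratio tests for two simple binomial hypotheses. The model (n = 10 trials, success probability 0.3 under H1 and 0.7 under H2) and the function names are hypothetical choices made only for illustration; they are not the linkage example of the paper and do not reproduce the points listed above. Randomized tests, which fill in the gaps between these points, are omitted.

```python
from math import comb


def binom_pmf(n, p, x):
    """Probability of exactly x successes in n Bernoulli(p) trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def threshold_test_errors(n, p1, p2, c):
    """Error probabilities (alpha, beta) of the test 'reject H1 when X >= c'.

    Because the likelihood ratio is increasing in x when p2 > p1,
    each such threshold test is a best test at its own level alpha.
    """
    alpha = sum(binom_pmf(n, p1, x) for x in range(c, n + 1))  # reject H1 although H1 is true
    beta = sum(binom_pmf(n, p2, x) for x in range(0, c))       # accept H1 although H2 is true
    return alpha, beta


def best_test_family(n=10, p1=0.3, p2=0.7):
    """One (c, alpha, beta) triple per threshold c = 0, ..., n + 1 (hypothetical model)."""
    return [(c, *threshold_test_errors(n, p1, p2, c)) for c in range(0, n + 2)]


if __name__ == "__main__":
    for c, alpha, beta in best_test_family():
        print(f"reject H1 when X >= {c:2d}: alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Lowering α always raises β along this family, so the mathematics alone singles out no member; the choice among the points has to come from outside the theory.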
For a given application such as our linkage investigation, nothing in the Neyman-Pearson theory nor the confidence concept leads to a particular choice among these best tests, yet choices of this kind are always made, implicitly if not explicitly, whenever the confidence concept is applied.

Another aspect of the ad hoc character of the confidence concept is its great potential flexibility in applications, which has not been very widely exploited. We may illustrate this in the preceding problem of two simple hypotheses, where three possible tests were considered. We may define a generalized kind of test of statistical hypotheses in terms of a formal decision function taking three (rather than the usual two) possible values, as follows: The decision function takes the possible values:

d1: strong evidence for H2 as against H1
d2: neutral or weak evidence
d3: strong evidence for H1 as against H2.

It takes the value d1 on those sample points where the test characterized by (0.01, 0.05) would reject H1; it takes the value d3 on those points where the test (0.05, 0.01) would accept H1; and it takes the value d2 on the remaining sample points. Such a 'three-decision' test requires a schema of a new form to represent its more numerous error probabilities, which
are defined as follows:

α1 = Prob(d1|H1) = probability of a major error of Type I
α2 = Prob(d2|H1) = probability of a minor error of Type I
β1 = Prob(d3|H2) = probability of a major error of Type II
β2 = Prob(d2|H2) = probability of a minor error of Type II
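The four error probabilities can be worked out from the two defining tests. The sketch below assumes, as holds for likelihood-ratio tests but is not stated explicitly in the text, that the rejection region of the test (0.01, 0.05) is contained in that of the test (0.05, 0.01); the function and argument names are hypothetical.

```python
def three_decision_errors(strict_test, lenient_test):
    """Error probabilities of the three-decision test built from two nested tests.

    strict_test and lenient_test are (alpha, beta) pairs. The strict test
    rejects H1 on the smaller region (value d1); the lenient test accepts H1
    on the region given the value d3; d2 is what remains in between.
    Assumes the strict rejection region lies inside the lenient one.
    """
    a_strict, b_strict = strict_test
    a_lenient, b_lenient = lenient_test
    alpha1 = a_strict               # Prob(d1 | H1): major error of Type I
    alpha2 = a_lenient - a_strict   # Prob(d2 | H1): minor error of Type I
    beta1 = b_lenient               # Prob(d3 | H2): major error of Type II
    beta2 = b_strict - b_lenient    # Prob(d2 | H2): minor error of Type II
    return alpha1, alpha2, beta1, beta2


if __name__ == "__main__":
    print(three_decision_errors((0.01, 0.05), (0.05, 0.01)))
```

With the two tests named in the text this gives, up to floating-point rounding, α1 = 0.01, α2 = 0.04, β1 = 0.01 and β2 = 0.04.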
(It follows from the assumption that the original tests were best, that these error probabilities are minimized jointly in the usual sense. The ad hoc character of two-decision tests has not been eliminated, but reappears in such three-decision tests; and is illustrated once more by considering the alternative possible four-decision test which could be determined similarly by using also the test characterized by (0.02, 0.02) above. In contrast the likelihood approach, and the technically related Bayesian approaches, are formally elegant, allowing intuitively plausible direct interpretations of all possible numerical values of the likelihood ratio statistic as indicating strength of statistical evidence in this problem.)

As other examples of methods for implementation of the confidence concept, outside the familiar categories of testing and of estimation by confidence regions, we may mention nested confidence regions and related tests (e.g. Birnbaum, 1961; Dempster and Schatzoff, 1965; Stone, 1969); and methods for 'generalized testing' among three or more alternative statistical hypotheses and for classification (e.g. Birnbaum and Maxwell, 1960).

Among theoretical contributions specifically concerned with apparent difficulties or impossibilities in the way of giving a precise general theoretical treatment of the confidence concept and associated concepts, we may mention Barndorff-Nielsen (1971, 1973), Birnbaum (1969, 1970, 1972b), Buehler (1959), Buehler and Fedderson (1963), Cox (1971), and Durbin (1970).

The
confidence concept in principle depends upon an extra-mathematical interpretation of the error probabilities which appear in schemas like that above, and this interpretation is usually described as a 'frequentist' or 'objectivist' interpretation; and the same terms are often used to describe the whole approach based on the confidence concept.

The two theoretical interpretations of the 'decision' concept discussed above have analogues in interpretations of probabilities. The term propensity interpretation has become widely used among philosophers in recent years to denote the kinds of 'objective' interpretation which seem appropriate and accurate for many theoretical terms in science, including probability. (See for example Mellor, 1971; Hacking, 1965; Braithwaite, 1954.) The confidence concept seems to call for this kind of interpretation of error probabilities, rather than any more directly frequency (literal, operationalist, perhaps behavioristic) interpretation, as we have indicated in earlier discussion of the confidence concept. On this view, criticisms of frequency interpretations of probability, as against propensity interpretations, are not relevant to the confidence concept. (Presumably any rounded interpretation of probability in a scientific discipline would specify a role for concepts of statistical evidence, and perhaps also for the notion of 'practical certainty' associated with some applications, among the aspects of meaning associated with probability and related theoretical terms, such as 'genetic factor' in Mendelian genetics.)

We shall not attempt to survey the current status of the confidence concept in theory and applications. This would be a formidable task, since it would call for an account of the largely implicit interpretations of standard statistical methods in a great variety of scientific research disciplines, and in a large and growing statistical literature including theoretical and expository works. It is hoped that the present paper will prove helpful to the interested reader as he makes his own observations and judgements concerning the nature of standard theoretical and applied work in various statistical disciplines.
8. OBJECTIONS TO A BASIC ASSUMPTION OF THE LINDLEY-SAVAGE ARGUMENT FOR BAYESIAN THEORY
One of the important and influential theoretical arguments for Bayesian theory is the Lindley-Savage argument. We shall show here that this argument has no direct relevance nor persuasive force, as an argument for Bayesian methods as against typical standard statistical practice with scientific research data, by showing that an assumption of the argument holds only for 'decisions' under behavioral interpretations, but not under the evidential interpretations which constitute standard statistical practice.
The argument is elementary, being formulated in terms of simple examples of tests (decision functions) like those above. The original somewhat informal accounts of the argument by Savage (1962, pp. 173-5) and Lindley (1971, pp. 13-14) should be read by the interested reader. They are complemented by a formalized version of the argument, with additional discussion, in an appendix below.

The Lindley-Savage argument concerns judgements of preference or else indifference (equivalence) between alternative decision functions (tests) in problems of two simple hypotheses, with each decision function represented by its point P = (α, β) in the unit square, determined by its error probabilities α and β. Our discussion will be based on some simple examples of statistical evidence given above, which we continue to express in the first person usage.
Examples. In some research situations I would strongly prefer to use a decision function (test) characterized by (0.05, 0.05) rather than one characterized by (0.1, 0). In such situations I particularly value the guarantee, which is provided by use of (0.05, 0.05), that strong evidence will be obtained (either supporting H1 against H2, or supporting H2 against H1). The use of (0.1, 0) allows the possibility that merely weak evidence, represented by (reject H1 for H2, 0.1, 0), will be obtained. For example, knowledge in the background of a linkage investigation may include strong (though not conclusive) statistical evidence for the locations of all but one of the genetic factors which control a certain system of immune reactions; and the current investigation may have as its object just to determine whether the remaining factor lies on chromosome No. 1 or No. 2. Let H1 now stand for the hypothesis of linkage with another factor known to lie on No. 1, and H2 the alternative hypothesis. In this situation I would avoid the risk of getting merely weak evidence by choosing (0.05, 0.05) rather than (0.1, 0); and would be able to complete the pattern of knowledge (chromosome map) of the system consistently on a basis of strong evidence.

Similarly, in some situations (including the same linkage investigation), I would prefer (0.05, 0.05) to (0, 0.1), for similar reasons. In some situations (including the same linkage investigation) I would be indifferent as between (0.1, 0) and (0, 0.1), on grounds of their symmetry and of judgements of symmetry concerning the investigation in question.

This pattern of preferences may be summarized by

(0.05, 0.05) > (0.1, 0) ~ (0, 0.1),

where > stands for 'is preferred to' and ~ for 'is equivalent to.' This pattern of preferences is incompatible with Assumption (II) of the Lindley-Savage argument as formulated in the appendix. (It is also incompatible with a basic premise underlying Wald's theory, as will be indicated in the next section.) This example suffices to illustrate that that assumption is not satisfied generally by the 'decision' concept associated with statistical tests as interpreted evidentially (not behaviorally) in typical research applications.

A different but analogous example incompatible with Assumption (II) is the preference pattern

(0.1, 0) ~ (0, 0.1) > (0.05, 0.05).

In some research situations I would have this preference pattern. In particular, if the background knowledge in a linkage investigation includes conclusive statistical evidence for the locations of all but one of the factors which control certain immune reactions, then with certain scientific goals in view I would strongly prefer, rather than the guarantee of strong (but inconclusive) evidence provided by (0.05, 0.05), the uncertain possibility of completing with conclusive evidence the pattern of knowledge in question which is provided by either (0.1, 0) or (0, 0.1); and I would be indifferent as between them.
Assumption (II) expresses in one important way the concept of rationality (or consistency, or coherence) which is central to all statistical decision theories. Our criticism of this assumption and the concept it expresses may serve as a warning against oversimplified judgements of 'irrationality' (or 'inconsistency', or 'incoherence').

9. COMMENTS ON A BASIC PREMISE OF WALD'S DECISION THEORY
'Mixtures' of decision functions play important technical and theoretical roles in the development of Wald's (1950) statistical decision theory. An example of a mixture is symbolized by

M = ½(0, 0.1) + ½(0.1, 0).

Here as before (0, 0.1) and (0.1, 0) represent two decision functions (tests), characterized by their respective pairs of error probabilities. The whole expression M stands for another decision function defined in terms of those two decision functions and an auxiliary randomization variable, say a toss of a fair coin, as follows: If the coin shows heads, the decision function (0, 0.1) is applied to the observed sample point; otherwise (0.1, 0) is applied.

To determine the error probabilities which characterize the decision function M, we find readily

(α, β) = ½(0, 0.1) + ½(0.1, 0) = (0.05, 0.05).

(For example, under H1 the respective error probabilities are 0, if (0, 0.1) is applied; and 0.1, if (0.1, 0) is applied; and each will be applied with probability ½.)

The preceding discussion is based on a tacit assumption of a behavioral, and not an evidential, interpretation of the decision functions considered.
One way of illustrating this is by reference to an example of the preceding section: Suppose my preference pattern includes

(0, 0.1) ~ (0.1, 0) > (0.05, 0.05).

Then it is plausible that I may be indifferent also as between (0, 0.1) and M, since the latter will provide me with an application of (0, 0.1) or else an application of (0.1, 0) which I regard as equally satisfactory. But this implies that my preference pattern includes

M > (0.05, 0.05),

or, representing M now by its pair of error probabilities as determined above,

(0.05, 0.05) > (0.05, 0.05)

which is absurd.

The fallacy in the preceding discussion is that the preference pattern first assumed above arose in an example of evidential interpretations of 'decisions', while the calculation of the preceding paragraph was based on a behavioral interpretation. In particular, the preference for (0, 0.1) as against (0.05, 0.05) was based on a particularly high value ascribed to the possibility of statistical evidence symbolized by (reject H1 for H2, 0, 0.1), in which the 'decision' ('reject H1 for H2') appears within the symbol for an evidential interpretation.

On the other hand, in the calculation of the error probability

α = ½(0) + ½(0.1) = 0.05

above, we considered just the 'decision' ('reject H1 for H2'), without the qualifications concerning the error probabilities which characterize the respective different decision functions (schemas) from which that 'decision' can result; that is, we tacitly interpreted that 'decision' behaviorally.

The general point illustrated is that while behavioral interpretations of 'decisions' may play a very valuable heuristic role in the mathematical development of the Neyman-Pearson and Wald statistical theories, methods developed within those theories can and must be interpreted (or reinterpreted) with care when considered for possible use with evidential interpretations.
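The behavioral calculation of the error probabilities of a mixture in this section can be sketched as follows. The function name `mix` and its `weight` argument are hypothetical; the sketch only averages the error-probability pairs and, like the behavioral calculation itself, keeps no record of which component test produced a given 'decision'.

```python
def mix(p, q, weight=0.5):
    """Error-probability pair of the randomized test that applies the test p
    with probability `weight` and the test q otherwise.

    p and q are (alpha, beta) pairs. Averaging is exactly the behavioral
    calculation: it ignores which component test was actually applied.
    """
    return (
        weight * p[0] + (1 - weight) * q[0],
        weight * p[1] + (1 - weight) * q[1],
    )


# The mixture M of (0, 0.1) and (0.1, 0) decided by a fair coin.
M = mix((0.0, 0.1), (0.1, 0.0))

if __name__ == "__main__":
    print(M)
```

Under an evidential interpretation the two components remain distinguishable after the coin toss, which is why the averaged pair need not summarize what the investigator values.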
APPENDIX. ON THE LINDLEY-SAVAGE ARGUMENT FOR BAYESIAN THEORY9
The Lindley-Savage argument takes as its point of departure a recognized problem encountered whenever the (non-Bayesian) theories of Neyman-Pearson and Wald are to be applied, the problem illustrated above as one source of the ad hoc character of the confidence concept: that of choosing among the various best tests (decision functions) (α, β) available for a given application. The argument shows that if this problem of choice is treated 'rationally' (or 'consistently', or 'coherently') in a sense discussed above in Section 8, then 'you' are "a Bayesian, whether you thought you wanted to be or not.... Thus, the Bayesian position can be viewed as a natural completion, an overlooked step in the classical theory." (Savage, 1962, p. 175.)

The last comments refer to the final step of the argument, which may be illustrated in prototype as follows: Suppose you judge as equivalent, for a given application, three decision functions characterized respectively by

(0, 0.1), (0.05, 0.05), and (0.1, 0).

Then ... "you" are "a Bayesian, whether you thought you wanted to be or not..." in the sense that your preference behavior, in this context, is indistinguishable from that of a Bayesian; for example, a Bayesian who ascribes prior probabilities g1 and g2 respectively to H1 and H2, and losses L1 and L2 respectively to the errors of the first and second types, will also be indifferent as between those three decision functions, provided that g1L1 = g2L2. Such 'indistinguishability' represents an aspect of the behaviorist point of view which is basic to Savage's Bayesian decision theory. But clear and important distinctions of viewpoints are evident here from the standpoint of a non-Bayesian who may have a decision problem in the behavioral sense but who may wish to reach a conclusion (in a sense discussed above) as a basis for making a decision, perhaps because he regards no complete model of a decision problem, including loss functions, as clearly accurate. Important distinctions are clear also from the standpoint of an investigator who has no decision problem except in the sense of evidential interpretations under the confidence concept, and finds no place in his thinking for loss functions nor Bayesian probabilities of statistical hypotheses, even if he may be indifferent in a given research context between three tests represented by the three points above.

The final step of the argument, just discussed in prototype, follows a more formalized argument whose conclusion is that 'you' have a preference pattern among tests (decision functions) characterized by indifference sets consisting of parallel line segments which cover the unit square of points (α, β) (and thus coinciding with certain Bayesian preference patterns), including for example PP' and QQ' in Figure 1. We discuss the assumptions of this argument before presenting the derivation itself.

Fig. 1.

The assumptions and derivation are formulated in terms of the mathematical concept of equivalence classes among points of the unit square; the interpretation of interest is that a person's indifference between two tests (decision functions) may be represented by stating that the points characterizing the tests are equivalent.
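The Bayesian indifference in the prototype above can be checked with a small calculation. The sketch assumes the usual expected-loss formula g1·L1·α + g2·L2·β, with zero loss for correct decisions; that formula and the function name are assumptions made for illustration, while g1, g2, L1, L2 are the symbols of the text.

```python
def bayes_risk(point, g1, L1, g2, L2):
    """Expected loss of the test with error probabilities point = (alpha, beta).

    g1, g2: prior probabilities of H1 and H2.
    L1, L2: losses for errors of the first and second types.
    Assumes correct decisions carry zero loss.
    """
    alpha, beta = point
    return g1 * L1 * alpha + g2 * L2 * beta


POINTS = [(0.0, 0.1), (0.05, 0.05), (0.1, 0.0)]

if __name__ == "__main__":
    # g1*L1 == g2*L2: the three tests have the same expected loss.
    print([bayes_risk(p, 0.5, 1.0, 0.5, 1.0) for p in POINTS])
    # g1*L1 != g2*L2: the expected losses differ, so the indifference is lost.
    print([bayes_risk(p, 0.2, 1.0, 0.8, 1.0) for p in POINTS])
```

The three points lie on a line of slope -1 in the (α, β) square, which is a line of constant expected loss exactly when g1L1 = g2L2.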
ASSUMPTION (I). There exist two distinct points P and P' which are equivalent.

Possible examples of (I) are P and P' in the figure; and the points (0, 0.1) and (0.1, 0) considered in examples above. This assumption seems free from possible plausible objections, for the following reason. The point (0, 0) is preferred to the point (0, 0.1), and the latter is preferred to (0.1, 0.1), on the basis of the non-controversial principle of inadmissibility (regardless of possible evidential or behavioral interpretations which may be of interest). Consider the respective points (α, α) of the line segment from (0, 0) to (0.1, 0.1), and suppose that you judge that no such point is equivalent to (0, 0.1). Then as α increases continuously from 0, your preferences show an implausible discontinuity at some value of α, jumping from 'prefer (α, α) to (0, 0.1)' to 'prefer (0, 0.1) to (α, α)' without anywhere assuming the intermediate value 'indifferent between (α, α) and (0, 0.1)'.

Our comments on the second assumption are stated conveniently with reference to a simpler restricted case:

ASSUMPTION (II*). If P and P' are equivalent, then P and P" are also equivalent, where P" = kP + (1 - k)P' and k is any number between 0 and 1.

For example if k = ½, P = (0, 0.1), and P' = (0.1, 0), then P" = (0.05, 0.05), representing the example of a mixture discussed in Section 9 above, where we found the equivalence of (0.05, 0.05) with (0, 0.1) to be plausible under a behavioral interpretation of 'decisions' but not under an evidential interpretation. In particular we rejected that equivalence in the context of the examples of Section 8.

Assumption (II*) is the special case of the following general assumption in which R = P = Q.

ASSUMPTION (II). If P and P' are equivalent, then Q and Q' are also equivalent, where Q = kP + (1 - k)R, Q' = kP' + (1 - k)R, R may be any point, and k may be any number in the unit interval.

LINDLEY-SAVAGE LEMMA. Assumptions (I) and (II) imply that the unit square is partitioned into equivalence sets, each consisting of a line segment parallel to PP'.
Proof:

(1) By (I) there exist two distinct equivalent points P and P'.

(2) Let R be any point on the perimeter of the unit square. Let k be any number satisfying 0 < k < 1, and let Q = kP + (1 - k)R, and let Q' = kP' + (1 - k)R. (See Figure 1. The case of R collinear with P and P' is mentioned below.) By (II), Q and Q' are equivalent, since P and P' are equivalent.

(3) The line segment QQ' is parallel with the segment PP', since the triangles RQQ' and RPP' are similar and have the common vertex R.

(4) Let c be any number satisfying 0 < c < 1, and let P" = cP + (1 - c)P'. P and P" are equivalent, by (II). (The special case (II*) of (II) applies here.)

(5) Since c is arbitrary, it follows that all points of the line segment PP' are equivalent. Similarly all points of the segment QQ' are equivalent.

(6) Since k is arbitrary, it follows that the triangle RPP' is covered by a family of line segments each parallel to PP', each of which is an equivalence class.

(7) As R sweeps out the circumference of the unit square, the square is covered by such triangles; and each triangle is again covered by segments parallel to PP', each segment consisting of equivalent points. (The case of R collinear with PP' is seen at this point not to be special.)

(8) The union of all such segments collinear with QQ' is a single segment between perimeter points of the square; since equivalence is transitive, this interval is an equivalence set. Similarly for other segments mentioned. Thus the unit square is partitioned into equivalence sets, each consisting of a line segment parallel to PP'. This completes the proof of the Lemma.†
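Step (3) of the proof rests on the identity Q' - Q = k(P' - P), which follows directly from the definitions of Q and Q' in Assumption (II). A minimal numerical check, with hypothetical helper names and an arbitrarily chosen R and k, might look like this:

```python
def convex(k, a, b):
    """The point k*a + (1 - k)*b for points a, b in the plane."""
    return (k * a[0] + (1 - k) * b[0], k * a[1] + (1 - k) * b[1])


def is_parallel(p, p_prime, q, q_prime, tol=1e-12):
    """True when segment q q' is parallel to segment p p' (zero cross product)."""
    dx1, dy1 = p_prime[0] - p[0], p_prime[1] - p[1]
    dx2, dy2 = q_prime[0] - q[0], q_prime[1] - q[1]
    return abs(dx1 * dy2 - dy1 * dx2) <= tol


if __name__ == "__main__":
    P, P_prime = (0.0, 0.1), (0.1, 0.0)  # the equivalent pair from the examples
    R, k = (1.0, 1.0), 0.4               # arbitrary perimeter point and weight
    Q, Q_prime = convex(k, P, R), convex(k, P_prime, R)
    print(Q, Q_prime, is_parallel(P, P_prime, Q, Q_prime))
```

Because the term (1 - k)R cancels in Q' - Q, the segment QQ' is parallel to PP' for every choice of R and k, which is all that step (3) needs.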
University College, London
† Editors' Note. The proofs of the present paper were ready only after the death of Professor Allan Birnbaum. The proofs were kindly checked by the staff of The City University, London and the University College, London. It was found that the bibliography was incomplete, and even though several corrections and additions were made, there still remain gaps in the bibliographical data.
NOTES

* The writer is grateful for helpful discussions of earlier versions of parts of this material with many colleagues, particularly A. P. Dawid, D. V. Lindley, B. Norton, E. S. Pearson, J. Pratt, G. Robinson, C. A. B. Smith, and M. Stone.

1 The term 'rule of behavior' made its appearance, linked with the term 'decide', in the 1933 paper, in the discussion introducing the formulation of the problem of testing statistical hypotheses (p. 291, original; p. 142, reprint). Subsequently the concept of 'inductive behavior' was elaborated and supported, in opposition to various other concepts of statistical inference ('inductive reasoning'), by Neyman (1947, 1957, 1962, 1971).

2 Among geneticists who are also prominent theoretical statisticians, the decision concept (at least in its behavioral interpretation) has been rejected as inappropriate in scientific data analysis, from different standpoints in statistical theory, by:

1. O. Kempthorne, from the standpoint of standard methods interpreted in a non-behavioral way similar to that discussed below (for example, 1971, pp. 471-3, 489);
2. C. A. B. Smith, who has developed a version of Bayesian theory, and has led in the use of Bayesian methods in genetics in scientific publications (1959, p. 297);
3. A. W. F. Edwards, an exponent of the likelihood approach, who has applied that approach in genetics in his scientific publications (1972); and
4. R. A. Fisher (for example, 1956, pp. 100-103).

The case of two simple hypotheses is unrealistic for problems of testing linkage, where a composite statistical hypothesis is generally adopted to represent the scientific hypothesis of linkage. However the simplified model of two simple hypotheses entails no sacrifice of realism with respect to the questions of interpretation considered in this paper. On the contrary, typical formulations of linkage tests in practice often make use of a simple hypothesis, for technical reasons, to represent effectively a more realistic composite hypothesis (Morton, 1955; Smith, 1953, pp. 180-183).

Analogous comments apply to the limited realism of our discussion of the example of the lamp manufacturer: It turns out that the realistic composite hypothesis representing good lot quality (at most 4% defective) is, for technical reasons, represented effectively by the simple hypothesis (exactly 4% defective), in the sense that the value α characterizing any ('admissible') decision function for the simplified problem is also an upper bound of error probabilities over the realistic composite hypothesis. Similar comments apply to the alternative hypothesis.

3 The essential point epitomized here is that there is a distinction of levels of language, the first phrase occurring in the 'object language' of things and behavioral acts, the second in the 'metalanguage' in which we discuss a certain statement (hypothesis). Apparent exceptions to the epitomization require explanation in the preceding terms. For example, in a scientific research context 'to decide that a certain hypothesis is supported by strong evidence' is tantamount to 'to decide to make the statement that the hypothesis is supported by strong evidence.' The apparently exceptional occurrence here of 'decide to' with an evidential reference is explained by pointing out that 'to make a statement' occurs here in the metalanguage (where all evidential considerations are expressed), and so is not a case of 'to act' when that phrase occurs in the object language, where it has behavioral interpretations.

4 Examples of joint consideration of these aspects of simple genetics research problems will be found, for example, in Smith (1968) and Mendel (1866). The present writer will offer an extended discussion of such considerations in another paper, 'Mendelian genetics: a case study in the structure of science.'
5
a behavioral of 'decisions' the where Even in applications interpretation clearly applies, has had a slow and of decision formal models of complete scope of applications problems limited development Brown, 1970); possibly due in part to considerations (see, for example, above. discussed 6 of the error probability of testing problems the counterpart formulation In the traditional a was the 'probability theoretical level' statistic P = P(x). The aspect of the traditional with that statistic, under which of statistical evidence associated is a concept formulation as an index of strength of evidence the hypothesis is interpreted Hx, with P(x) against is traditional evidence. Thus the smaller values of P{x) stronger interpretation indicating was an and not behavioral evidential (in any direct sense), and the behavioral interpretation of the Neyman-Pearson innovation theory. In many dichotomy Here 0.05
in terms of a the statistic P(x) was (and is) interpreted schematically, applications if if is and evidence such as: the statistical F(jc)^0.05. strong only against H1 a in our schema; and the schematized to the error probability form corresponds
can be represented takes function which formulation by a formal decision if and only if the observed the value sample point x gives P(x)^0.05. 'reject H{ 7 of certain relative that there is any behavioral is not to deny This (literal) realization a in the schema the error probabilities of errors, approximating representing frequencies or same of tests of form. of the in certain series test, conceivable) (actual applications long in a somewhat is related is that such a behavioral What is suggested abstract, interpretation or to of a single the evidential indirect theoretical) way interpretation (hypothetical of the traditional
relation of the evidential situation. This theoretical of a test in a given research application to a certain behavioral of the in such an application, of a 'decision' interpretation meaning same formal context does not reduce or in another 'decision' (a series of applications), ones. On the contrary, in favor of behavioral apprecia interpretations of the hypothetical with appreciation a behavioral coupled interpretation, in the given research it bears to an evidential theoretical relation situation, interpretation as an important part of appreciation of the meaning of statistical evidence may be regarded evidential
eliminate tion of
such
as interpreted under the confidence concept. 8 of statistical The likelihood 1972) is based on a primitive (Edwards, concept approach our to of the formulation confidence evidence which (Conf) appears analogous closely the kind of theoretical does not satisfy the latter nor provide but which nevertheless concept, in It was rejected by Neyman and Pearson mentioned above. of error probabilities in their 1933 paper, after they had used it as the basis of their favor of the confidence concept the two of incompatibilities between 1928 paper. A detailed discussion exploratory
control
concepts The {L')\
is given likelihood
in Birnbaum concept
may
(1969). be formulated
thus:
If an observed
sample point has very small probability then it provides to its probability (density) under H2, H2 as against H\.
relative (density) under Hu for statistical evidence
strong
The likelihood and confidence concepts were taken up successively by Neyman and Pearson as plausible successors to the simpler primitive concept of statistical evidence which has been associated (usually implicitly) with tests in their traditional formulation, which has been represented in applications since 1710. Both (Conf) and (L') may be considered as assimilating, in analogous ways, that traditional concept, which may be formulated thus:
(P): A concept of statistical evidence is not plausible unless it finds 'strong evidence against H1' with very small probability when H1 is true.
In traditional practice this concept had been complemented by unformalized judgement exercised in the devising and selection of test statistics, which were then interpreted as indices of strength of statistical evidence against a hypothesis H1, without explicit reference to alternative hypotheses.
Each of the concepts of evidence mentioned may be regarded as a refined version of that simpler familiar intuitive concept which moves us, when something observed seems 'improbable' or 'unlikely' (in any sense, often not specified explicitly), toward reconsideration of some hypothesis, perhaps only tacitly held.
9 The reader is urged to compare this discussion with the original versions of the argument by Savage and Lindley cited in Section 8.
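As an illustration of the notes above, the following is a minimal sketch of a formal decision function that takes the value 'reject H1' if and only if P(x) ≤ 0.05, together with an exact check of the requirement in (P) that 'strong evidence against H1' be found only with small probability when H1 is true. The binomial model, the sample size N = 20, the value 0.5 under H1, and the upper-tail definition of P(x) are assumptions chosen for illustration; none of them comes from the paper.

```python
from math import comb

N = 20        # hypothetical sample size (assumption, not from the paper)
P_H1 = 0.5    # hypothetical success probability under H1 (assumption)
ALPHA = 0.05  # threshold used in the notes: reject iff P(x) <= 0.05


def pmf(k, n=N, p=P_H1):
    """Binomial probability of exactly k successes in n trials under H1."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_value(x, n=N, p=P_H1):
    """Upper-tail significance level P(x) = Pr(X >= x | H1)."""
    return sum(pmf(k, n, p) for k in range(x, n + 1))


def decision(x):
    """Formal decision function: 'reject H1' iff P(x) <= ALPHA."""
    return "reject H1" if p_value(x) <= ALPHA else "do not reject H1"


# Sample points at which the decision function takes the value 'reject H1'.
rejection_region = [x for x in range(N + 1) if decision(x) == "reject H1"]

# Exact probability, computed under H1, of the 'decision' to reject H1.
size = sum(pmf(x) for x in rejection_region)
```

In this hypothetical example the probability `size` cannot exceed `ALPHA`, which is the sense in which such a test satisfies (P). A rule based only on a likelihood ratio, as in (L'), carries no such guarantee unless its error probabilities are examined separately.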
BIBLIOGRAPHY
Barndorff-Nielsen, O., 1971, On Conditional Statistical Inference (mimeographed), Aarhus.
Birnbaum, A., 1961, 'Confidence Curves: An Omnibus Technique for Estimation and Testing Statistical Hypotheses', Journal of the American Statistical Association 56 (1961), 246-249.
Birnbaum, A., 1962, 'On the Foundations of Statistical Inference', Journal of the American Statistical Association 57 (1962), 269-326 (with discussion).
Birnbaum, A., 1969, 'Concepts of Statistical Evidence', in Philosophy, Science, and Method: Essays in Honor of Ernest Nagel (ed. by Sidney Morgenbesser, Patrick Suppes, and Morton White), St. Martin's Press, New York.
Birnbaum, A., 1970, 'On Durbin's Modified Principle of Conditionality', Journal of the American Statistical Association 65 (1970), 402-403.
Birnbaum, A., 1972a, 'The Random Phenotype Concept, with Applications', Genetics 72 (1972), 739-758.
Birnbaum, A., 1972b, 'More on Concepts of Statistical Evidence', Journal of the American Statistical Association 67 (1972), 858-861.
Birnbaum, A. and Maxwell, A. E., 1960, 'Classification Procedures Based on Bayes Formula', Applied Statistics 9 (1960), 152-159.
Braithwaite, R. B., 1954, Scientific Explanation, Cambridge University Press.
Brown, R. V., 1970, 'Do Managers Find Decision Theory Useful?', Harvard Business Review, May-June.
Buehler, R. J., 1959, 'Some Validity Criteria for Statistical Inference', Annals of Mathematical Statistics 30 (1959), 845-863.
Buehler, R. J. and Fedderson, A. P., 1963, 'Note on a Conditional Property of Student's t', Annals of Mathematical Statistics 34 (1963), 1098-1100.
Cox, D. R., 1958, 'Some Problems Connected with Statistical Inference', Annals of Mathematical Statistics 29 (1958), 357-372.
Cox, D. R., 1971, 'The Choice Between Alternative Ancillary Statistics', Journal of the Royal Statistical Society 33 (B) (1971), 251-255.
Dempster, A. P. and Schatzoff, M., 1965, 'Expected Significance Level as a Sensitivity Index for Test Statistics', Journal of the American Statistical Association 60 (1965), 420-436.
Durbin, J., 1970, 'On Birnbaum's Theorem of the Relation Between Sufficiency, Conditionality, and Likelihood', Journal of the American Statistical Association 65 (1970), 395-398 (followed by two discussion notes).
Edwards, A. W. F., 1972, Likelihood, Cambridge University Press.
Fisher, R. A., 1956, Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh.
Hacking, Ian, 1965, The Logic of Statistical Inference, Cambridge University Press.
Jeffreys, H., 1961, Theory of Probability, 3rd ed., Oxford University Press, London.
Lindley, D. V., 1971, 'Bayesian Statistics, A Review', Society for Industrial and Applied Mathematics, Philadelphia.
Mellor, Hugh, 1971, The Matter of Chance, Cambridge University Press.
Mendel, G., 1866, 'Versuche über Pflanzenhybriden', Verhandlungen der Naturforschenden Vereins in Brunn 4 (1865), 3-44. (English translation in Experiments in Plant Hybridization, ed. by J. H. Bennett, 1965, Oliver and Boyd.)
Morton, N. E., 1955, 'Sequential Tests for the Detection of Linkage', American Journal of Human Genetics 7 (1955), 277-318.
Nagel, Ernest, 1961, The Structure of Science, Harcourt-Brace, New York.
Neyman, J., 1938, Lectures and Conferences on Mathematical Statistics (2nd ed., 1952), Graduate School, U.S. Department of Agriculture, Washington D.C.
Neyman, J., 1947, 'Raisonnement inductif ou comportement inductif', Proceedings of the International Statistical Conference 3, 423-433.
Neyman, J., 1957, 'Inductive Behavior as a Basic Concept of Philosophy of Science', Review of the International Statistical Institute 25 (1957), 7-22.
Neyman, J., 1962, 'Two Breakthroughs in the Theory of Statistical Decision-Making', Review of the International Statistical Institute 30 (1962), 11-27.
Neyman, J. and Pearson, E. S., 1928, 'On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I', Biometrika 20A (1928), 175-240.
Neyman, J. and Pearson, E. S., 1933, 'On the Problem of the Most Efficient Tests of Statistical Hypotheses', Philosophical Transactions of the Royal Society of London 231 (A), 289-337 (pp. 140-185 in 1967 reprinting).
Neyman, J. and Pearson, E. S., 1936, 'Contributions to the Theory of Testing Statistical Hypothesis', Statistical Research Memoirs vol. I, pp. 113-137 (pp. 203-239 in 1967 reprinting).
Neyman, J. and Pearson, E. S., 1967, Joint Statistical Papers, Cambridge University Press.
Pearson, E. S., 1966, The Selected Papers of E. S. Pearson, University of California Press, Berkeley, California.
Renwick, J., 1971, 'The Mapping of Human Chromosomes', Annual Review of Genetics 5 (1971), 81-120.
Robinson, G. K., 1974, 'Conditional Confidence Properties of Student's t and of the Behrens-Fisher Solution to the Two Means Problem', unpublished.
Savage, Leonard J., 1954, The Foundations of Statistics, Wiley, New York.
Savage, Leonard J., 1962, 'Bayesian Statistics', in Recent Developments in Information and Decision Processes (ed. by R. E. Machol and P. Gray), Macmillan, N.Y. and London.
Smith, Cedric A. B., 1953, 'The Detection of Linkage in Human Genetics', Journal of the Royal Statistical Society 15 (B) (1953), 155-192.
Smith, Cedric A. B., 1959, 'Some Comments on the Statistical Methods Used in Linkage Investigations', American Journal of Human Genetics 11 (1959), 289-403.
Smith, Cedric A. B., 1965, 'Personal Probability and Statistical Analysis, with Discussion', Journal of the Royal Statistical Society 128 (A) (1965), 469-499.
Smith, Cedric A. B., 1968, 'Linkage Scores and Corrections in Simple Two- and Three-Generation Families', Annals of Human Genetics 33 (1968), 127-150.
Stone, M., 1969, 'The Role of Significance Testing: Some Data with a Message', Biometrika 56 (1969), 485-493.
Tukey, J. W., 1960, 'Conclusions v. Decisions', Technometrics 2 (1960), 423-433.
Wald, A., 1950, Statistical Decision Functions, Wiley, New York.