Basis of fisher.test

Basis of fisher.test

Ted.Harding
I want to ascertain the basis of the table ranking,
i.e. the meaning of "extreme", in Fisher's Exact Test
as implemented in 'fisher.test', when applied to RxC
tables which are larger than 2x2.

One can summarise a strategy for the test as

1) For each table compatible with the margins
   of the observed table, compute the probability
   of this table conditional on the marginal totals.

2) Rank the possible tables in order of a measure
   of discrepancy between the table and the null
   hypothesis of "no association".

3) Locate the observed table, and compute the sum
   of the probabilities, computed in (1), for this
   table and more "extreme" tables in the sense of
   the ranking in (2).

The question is: what "measure of discrepancy" is
used in 'fisher.test' corresponding to stage (2)?

(There are in principle several possibilities, e.g.
value of a Pearson chi-squared, large values being
discrepant; the probability calculated in (1),
small values being discrepant; ... )

"?fisher.test" says only:

     In the one-sided 2 by 2 cases, p-values are obtained
     directly using the hypergeometric distribution.
     Otherwise, computations are based on a C version of
     the FORTRAN subroutine FEXACT which implements the
     network developed by Mehta and Patel (1986) and
     improved by Clarkson, Fan & Joe (1993). The FORTRAN
     code can be obtained from
     <URL: http://www.netlib.org/toms/643>.

I have had a look at this FORTRAN code, and cannot ascertain
it from the code itself. However, there is a Comment to the
effect:

c     PRE    - Table p-value.  (Output)
c              PRE is the probability of a more extreme table, where
c              'extreme' is in a probabilistic sense.

which suggests that the tables are ranked in order of their
probabilities as computed in (1).

Can anyone confirm definitively what goes on?

With thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jan-06                                       Time: 20:19:02
------------------------------ XFMail ------------------------------

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: Basis of fisher.test

Peter Dalgaard
(Ted Harding) <[hidden email]> writes:

> I want to ascertain the basis of the table ranking,
> i.e. the meaning of "extreme", in Fisher's Exact Test
> as implemented in 'fisher.test', when applied to RxC
> tables which are larger than 2x2.
>
> One can summarise a strategy for the test as
>
> 1) For each table compatible with the margins
>    of the observed table, compute the probability
>    of this table conditional on the marginal totals.
>
> 2) Rank the possible tables in order of a measure
>    of discrepancy between the table and the null
>    hypothesis of "no association".
>
> 3) Locate the observed table, and compute the sum
>    of the probabilities, computed in (1), for this
>    table and more "extreme" tables in the sense of
>    the ranking in (2).
>
> The question is: what "measure of discrepancy" is
> used in 'fisher.test' corresponding to stage (2)?
>
> (There are in principle several possibilities, e.g.
> value of a Pearson chi-squared, large values being
> discrepant; the probability calculated in (1),
> small values being discrepant; ... )
>
> "?fisher.test" says only:
>
>      In the one-sided 2 by 2 cases, p-values are obtained
>      directly using the hypergeometric distribution.
>      Otherwise, computations are based on a C version of
>      the FORTRAN subroutine FEXACT which implements the
>      network developed by Mehta and Patel (1986) and
>      improved by Clarkson, Fan & Joe (1993). The FORTRAN
>      code can be obtained from
>      <URL: http://www.netlib.org/toms/643>.
>
> I have had a look at this FORTRAN code, and cannot ascertain
> it from the code itself. However, there is a Comment to the
> effect:
>
> c     PRE    - Table p-value.  (Output)
> c              PRE is the probability of a more extreme table, where
> c              'extreme' is in a probabilistic sense.
>
> which suggests that the tables are ranked in order of their
> probabilities as computed in (1).
>
> Can anyone confirm definitively what goes on?

To my knowledge, it is the "table probability", according to the
hypergeometric distribution, i.e. the probability of the table given
the marginals, which can be translated to sampling a+b balls without
replacement from a box with a+c white and b+d black balls.

Playing around with dhyper should be instructive.

(You're right that the "two-sided" p values are obtained by summing
all smaller or equal table probabilities. This is the traditional way,
but there are alternatives, e.g. tail balancing.)
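
A minimal sketch of the dhyper suggestion for the 2x2 case. The counts and variable names are made up for illustration and do not come from the thread, and the small tolerance for ties is an assumption added here to guard against floating-point noise. It computes the conditional probability of every table compatible with the margins, then sums those that are no larger than the probability of the observed table.

```r
# Hypothetical 2x2 table with cells a, b (first row) and c, d (second row).
tab <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE)
a  <- tab[1, 1]
k  <- sum(tab[1, ])   # a + b: balls drawn without replacement
m  <- sum(tab[, 1])   # a + c: white balls in the box
nb <- sum(tab[, 2])   # b + d: black balls in the box

# Every value of the top-left cell that is compatible with the margins.
support <- max(0, k - nb):min(k, m)

# Probability of each table given the margins.
probs <- dhyper(support, m, nb, k)
p_obs <- dhyper(a, m, nb, k)

# Sum all table probabilities smaller than or equal to the observed one.
tol <- 1e-7   # assumed tolerance for ties
p_two_sided <- sum(probs[probs <= p_obs * (1 + tol)])

print(p_two_sided)
print(fisher.test(tab)$p.value)
```

The two printed values should agree, which is consistent with the "sum of smaller or equal table probabilities" rule for the two-sided case.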

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907


Re: Basis of fisher.test

Brian Ripley
In reply to this post by Ted.Harding
On Thu, 12 Jan 2006 [hidden email] wrote:

> I want to ascertain the basis of the table ranking,
> i.e. the meaning of "extreme", in Fisher's Exact Test
> as implemented in 'fisher.test', when applied to RxC
> tables which are larger than 2x2.
>
> One can summarise a strategy for the test as
>
> 1) For each table compatible with the margins
>   of the observed table, compute the probability
>   of this table conditional on the marginal totals.
>
> 2) Rank the possible tables in order of a measure
>   of discrepancy between the table and the null
>   hypothesis of "no association".
>
> 3) Locate the observed table, and compute the sum
>   of the probabilities, computed in (1), for this
>   table and more "extreme" tables in the sense of
>   the ranking in (2).
>
> The question is: what "measure of discrepancy" is
> used in 'fisher.test' corresponding to stage (2)?
>
> (There are in principle several possibilities, e.g.
> value of a Pearson chi-squared, large values being
> discrepant; the probability calculated in (1),
> small values being discrepant; ... )
>
> "?fisher.test" says only:

[The following is not a quote from a current version of R.]

>     In the one-sided 2 by 2 cases, p-values are obtained
>     directly using the hypergeometric distribution.
>     Otherwise, computations are based on a C version of
>     the FORTRAN subroutine FEXACT which implements the
>     network developed by Mehta and Patel (1986) and
>     improved by Clarkson, Fan & Joe (1993). The FORTRAN
>     code can be obtained from
>     <URL: http://www.netlib.org/toms/643>.

No, it *also* says

      Two-sided tests are based on the probabilities of the tables, and
      take as 'more extreme' all tables with probabilities less than or
      equal to that of the observed table, the p-value being the sum of
      such probabilities.

which answers the question (there are only two-sided tests for such
tables).
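
For a table larger than 2x2, the rule quoted above can be checked by brute force. What follows is only a sketch under stated assumptions: the 2x3 counts are invented, log_prob is a hypothetical helper, the tolerance for ties is an assumption, and the enumeration is naive rather than the network algorithm used by FEXACT.

```r
# Hypothetical 2x3 table (made-up counts).
obs <- matrix(c(3, 1, 0,
                1, 2, 3), nrow = 2, byrow = TRUE)
rs <- rowSums(obs)
cs <- colSums(obs)

# Log-probability of a table conditional on its margins:
# prod(row totals!) * prod(column totals!) / (n! * prod(cells!)).
log_prob <- function(tab) {
  sum(lfactorial(rowSums(tab))) + sum(lfactorial(colSums(tab))) -
    lfactorial(sum(tab)) - sum(lfactorial(tab))
}

# Enumerate every first row compatible with the margins;
# the second row is then fixed by the column totals.
grid <- expand.grid(x1 = 0:cs[1], x2 = 0:cs[2])
grid$x3 <- rs[1] - grid$x1 - grid$x2
grid <- grid[grid$x3 >= 0 & grid$x3 <= cs[3], ]

probs <- apply(grid, 1, function(row1) {
  exp(log_prob(rbind(row1, cs - row1)))
})

# 'More extreme' = probability less than or equal to that of the observed table.
p_obs   <- exp(log_prob(obs))
tol     <- 1e-7   # assumed tolerance for ties
p_value <- sum(probs[probs <= p_obs * (1 + tol)])

print(p_value)
print(fisher.test(obs)$p.value)
```

The two printed values should agree up to rounding, provided the enumeration covers all tables, which the check sum(probs) == 1 (up to floating-point error) helps confirm.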

Now, what does the posting guide say about stating the R version and
updating before posting?

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: Basis of fisher.test

Ted.Harding
On 13-Jan-06 Prof Brian Ripley wrote:

> On Thu, 12 Jan 2006 [hidden email] wrote:
>>[...]
>> "?fisher.test" says only:
>
> [The following is not a quote from a current version of R.]
>
>>     In the one-sided 2 by 2 cases, p-values are obtained
>>     directly using the hypergeometric distribution.
>>     Otherwise, computations are based on a C version of
>>     the FORTRAN subroutine FEXACT which implements the
>>     network developed by Mehta and Patel (1986) and
>>     improved by Clarkson, Fan & Joe (1993). The FORTRAN
>>     code can be obtained from
>>     <URL: http://www.netlib.org/toms/643>.
>
> No, it *also* says
>
>       Two-sided tests are based on the probabilities of the tables, and
>       take as 'more extreme' all tables with probabilities less than or
>       equal to that of the observed table, the p-value being the sum of
>       such probabilities.
>
> which answers the question (there are only two-sided tests for such
> tables).

Thanks for the above information, which is indeed the definitive
straightforward answer to my question!

(Not sure that I quite agree with the "two-sided" terminology, though,
since the ranking is unidirectional based on decreasing probability,
and the P-value is that of the least-probability tail -- i.e. analogous
to the "large (-2*loglik)" tail of a likelihood-ratio test -- which
I've always visualised as a 1-tailed test (despite the fact that
the "other tail" can on occasion be indicative of a fit "too good to
be true")).

> Now, what does the posting guide say about stating the R version and
> updating before posting?

Well, I plead that in practice there is necessarily a grey area
here! My quotation was from "?fisher.test" in R-2.1.0beta of
2004/04/08, the most recent version installed on any of my machines.
Admittedly a bit behind the times, but not grossly; and that help
page has not changed in this respect since the earliest version I
have installed, which is R-1.2.3 of 2001/04/26.

Contents of help pages can change overnight as R evolves.
While it is better to be up-to-date than behind the times (even
slightly), there is a compromise to be struck between upgrading
to the latest R every time one has a question which might be
answered thereby, or going on-line to read the latest PDF
documentation from CRAN, on the one hand, and on the other asking
a straightforward question to the list.

Thanks again, and best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jan-06                                       Time: 08:55:11
------------------------------ XFMail ------------------------------


Re: Basis of fisher.test

Brian Ripley
On Fri, 13 Jan 2006 [hidden email] wrote:

> On 13-Jan-06 Prof Brian Ripley wrote:
>> On Thu, 12 Jan 2006 [hidden email] wrote:
>>> [...]
>>> "?fisher.test" says only:
>>
>> [The following is not a quote from a current version of R.]
>>
>>>     In the one-sided 2 by 2 cases, p-values are obtained
>>>     directly using the hypergeometric distribution.
>>>     Otherwise, computations are based on a C version of
>>>     the FORTRAN subroutine FEXACT which implements the
>>>     network developed by Mehta and Patel (1986) and
>>>     improved by Clarkson, Fan & Joe (1993). The FORTRAN
>>>     code can be obtained from
>>>     <URL: http://www.netlib.org/toms/643>.
>>
>> No, it *also* says
>>
>>       Two-sided tests are based on the probabilities of the tables, and
>>       take as 'more extreme' all tables with probabilities less than or
>>       equal to that of the observed table, the p-value being the sum of
>>       such probabilities.
>>
>> which answers the question (there are only two-sided tests for such
>> tables).
>
> Thanks for the above information, which is indeed the definitive
> straightforward answer to my question!
>
> (Not sure that I quite agree with the "two-sided" terminology, though,
> since the ranking is unidirectional based on decreasing probability,
> and the P-value is that of the least-probability tail -- i.e. analogous
> to the "large (-2*loglik)" tail of a likelihood-ratio test -- which
> I've always visualised as a 1-tailed test (despite the fact that
> the "other tail" can on occasion be indicative of a fit "too good to
> be true")).

As statistics is usually taught, significance tests are always one-tailed.
The two-sided t-test is one-tailed, the test statistic being |T|.
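
The point about |T| can be illustrated numerically. This is a sketch with simulated data (the sample, the seed, and the variable names are assumptions made for illustration): the two-sided p-value reported by t.test coincides with the upper-tail probability of the single statistic |T|.

```r
set.seed(1)
x <- rnorm(12, mean = 0.4)   # made-up sample

tt    <- t.test(x, mu = 0)   # default alternative is "two.sided"
t_abs <- abs(unname(tt$statistic))
df    <- unname(tt$parameter)

# P(|T| >= |t|): one tail of the distribution of |T|.
p_one_tail_abs <- 1 - (pt(t_abs, df) - pt(-t_abs, df))

print(p_one_tail_abs)
print(tt$p.value)
```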

In any case, the `two-sided' is part of the arguments given to the
function, so this para is just using the already-established terminology.

>> Now, what does the posting guide say about stating the R version and
>> updating before posting?
>
> Well, I plead that in practice there is necessarily a grey area
> here! My quotation was from "?fisher.test" in R-2.1.0beta of
> 2004/04/08, the most recent version installed on any of my machines.
> Admittedly a bit behind the times, but not grossly; and that help
> page has not changed in this respect since the earliest version I
> have installed, which is R-1.2.3 of 2001/04/26.
>
> Contents of help pages can change overnight as R evolves.
> While it is better to be up-to-date than behind the times (even
> slightly), there is a compromise to be struck between upgrading
> to the latest R every time one has a question which might be
> answered thereby, or going on-line to read the latest PDF
> documentation from CRAN, on the one hand, and on the other asking
> a straightforward question to the list.

Well, if you had given the R version number the problem would have been
much more obvious.

> Thanks again, and best wishes,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <[hidden email]>
> Fax-to-email: +44 (0)870 094 0861
> Date: 13-Jan-06                                       Time: 08:55:11
> ------------------------------ XFMail ------------------------------

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
