

Hi all,
In a data frame, I have two columns of data that are categorical.
How do I form some sort of measure of correlation between these two columns?
For numerical data, I would just regress one on the other, or do
a pairs plot.
But for categorical data, how do I find and/or visualize correlation
between the two columns of data?
Thanks!
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Not an expert, but I would try some of the following:
# tabulate joint frequencies
?table
?xtabs
# plotting
mosaicplot(Titanic, main = "Survival on the Titanic", color = TRUE, shade = TRUE)
# log-linear models
?loglin
check the library for more ideas.
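A minimal sketch of the first two suggestions, on a small made-up data frame (the column values here are hypothetical):

```r
# Two hypothetical categorical columns in a data frame
df <- data.frame(a = factor(c("x", "x", "y", "y", "x")),
                 b = factor(c("u", "v", "u", "v", "v")))
tab  <- table(df$a, df$b)          # joint frequency table
tab2 <- xtabs(~ a + b, data = df)  # the same table via a formula interface
mosaicplot(tab)                    # visualize the association
```

xtabs() is handy when the factors live in a data frame; table() takes the vectors directly.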
Cheers,
Dylan
On Fri, Jun 19, 2009 at 2:04 PM, Michael <[hidden email]> wrote:
> [...]


On 2009.06.19 14:04:59, Michael wrote:
> [...]
As Dylan mentioned, using crosstabs may be the easiest way. Also, a
simple correlation between the two variables may be informative. If
each variable is ordinal, you can use Kendall's tau-b (square table)
or tau-c (rectangular table). The former you can calculate with ?cor
(set method="kendall"); for the latter you may have to hack something
together yourself, though there is code on the Internet to do this. If
the data are nominal, then a simple chi-squared test (large n) or
Fisher's exact test (small n) may be more appropriate. There are rules
about which to use when one variable is ordinal and one is nominal, but
I don't have my notes in front of me. Maybe someone else can provide
more assistance (and correct me if I'm wrong :).
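For example (made-up ordinal data; see ?cor and ?cor.test for the details of tie handling):

```r
# Hypothetical ordinal ratings coded as integers
x <- c(1, 1, 2, 2, 3, 3, 3)
y <- c(1, 2, 2, 3, 2, 3, 3)
tau <- cor(x, y, method = "kendall")                # Kendall's tau
suppressWarnings(cor.test(x, y, method = "kendall"))  # adds a significance test
tau
```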
Cheers,
~Jason

Jason W. Morgan
Graduate Student
Department of Political Science
*The Ohio State University*
154 North Oval Mall
Columbus, Ohio 43210


On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:
> On 2009.06.19 14:04:59, Michael wrote:
>> [...]
> As Dylan mentioned, using crosstabs may be the easiest way. Also, a
> simple correlation between the two variables may be informative. If
> each variable is ordinal, you can use Kendall's tau-b (square table)
> or tau-c (rectangular table). The former you can calculate with ?cor
> (set method="kendall"); for the latter you may have to hack something
> together yourself, though there is code on the Internet to do this. If
> the data are nominal, then a simple chi-squared test (large n) or
> Fisher's exact test (small n) may be more appropriate. There are rules
> about which to use when one variable is ordinal and one is nominal, but
> I don't have my notes in front of me. Maybe someone else can provide
> more assistance (and correct me if I'm wrong :).
I would be cautious in recommending the Fisher Exact Test based upon
small sample sizes, as the FET has been shown to be overly
conservative. This also applies to the use of the continuity
correction for the chi-square test (which replicates the behavior of
the FET).
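To illustrate the point (the counts below are invented, not from the original posts): on a 2x2 table, the Yates-corrected chi-square p-value is never smaller than the uncorrected one, and tracks the FET:

```r
# Hypothetical 2x2 counts
m <- matrix(c(8, 2, 3, 7), nrow = 2)
p.corrected   <- suppressWarnings(chisq.test(m))$p.value   # Yates correction (default)
p.uncorrected <- suppressWarnings(chisq.test(m, correct = FALSE))$p.value
p.fisher      <- fisher.test(m)$p.value
c(p.corrected, p.uncorrected, p.fisher)
```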
For more information see:

Chi-squared and Fisher-Irwin tests of two-by-two tables with small
sample recommendations
Ian Campbell
Stat in Med 26:3661-3675; 2007
http://www3.interscience.wiley.com/journal/114125487/abstract

and:

How conservative is Fisher's exact test?
A quantitative evaluation of the two-sample comparative binomial trial
Gerald G. Crans, Jonathan J. Shuster
Stat Med. 2008 Aug 15;27(18):3598-3611.
http://www3.interscience.wiley.com/journal/117929459/abstract

Frank also has some comments here (bottom of the page):
http://biostat.mc.vanderbilt.edu/wiki/Main/DataAnalysisDisc#Some_Important_Points_about_Cont

More generally, Agresti's Categorical Data Analysis is typically the
first reference in this domain to reach for. There is also a document
written by Laura Thompson which provides a nice R companion to
Agresti. It is available from:
https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf

HTH,
Marc Schwartz


On Saturday 20 June 2009 04:36:55 pm Marc Schwartz wrote:
> On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:
> > [...]
>
> I would be cautious in recommending the Fisher Exact Test based upon
> small sample sizes, as the FET has been shown to be overly
> conservative.
>
> . . .
There are other ways of regarding the FET. Since it is precisely what it
says, an exact test, you can argue that you should avoid carrying over any
conclusions drawn about the small population the test was applied to and
employing them in a broader context. Insofar as the test is concerned, the
"sample" data and the contingency table it is arrayed in are the entire
universe. In that sense, the FET can't be "conservative" or "liberal." It
isn't actually a hypothesis test and should not be thought of as one or used
in the place of one.
>
JDougherty


For measures of association between two variables with two values each,
Cramer's V and Yule's Q are useful statistics. Look into this thread, for
example: http://markmail.org/message/sjd53z2dv2pb5nd6
To get a grasp from plotting (sometimes), you may use the jitter function in
the plot...
n <- 1000            # n and x were not defined in the archived message;
x <- rnorm(n, 0, 1)  # these are plausible reconstructions
e <- rnorm(n, 0, 1)
y <- x + e
xprob <- exp(x) / (1 + exp(x))
yprob <- exp(y) / (1 + exp(y))
xcat <- rbinom(n, 1, xprob)
ycat <- rbinom(n, 1, yprob)
plot(ycat~xcat) #totally useless
plot(jitter(ycat)~jitter(xcat)) #can be somewhat useful
table(ycat,xcat) # interesting
# A measure of correlation between nominal variables (Yule's Q)
yule.Q <- function(x, y) {
  tab <- table(x, y)
  (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
    (tab[1, 1] * tab[2, 2] + tab[1, 2] * tab[2, 1])
}
yule.Q(ycat,xcat)
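Cramer's V, mentioned above, is easy to compute by hand from the chi-squared statistic (a sketch, not from the original post; unlike Yule's Q it works for a general r x c table):

```r
# Cramer's V: sqrt(chi^2 / (n * (min(r, c) - 1)))
cramers.v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}
g <- rep(c("a", "b"), each = 5)
cramers.v(g, g)  # identical variables give V = 1
```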
Best,
Daniel

cuncta stricte discussurus

-----Original Message-----
From: [hidden email] [mailto: [hidden email]] On Behalf
Of Marc Schwartz
Sent: Saturday, June 20, 2009 7:37 PM
To: Jason Morgan
Cc: r-help
Subject: Re: [R] correlation between categorical data


At 07:40 21.06.2009, J Dougherty wrote:
[...]
>There are other ways of regarding the FET. Since it is precisely
>what it says, an exact test, you can argue that you should avoid carrying
>over any conclusions drawn about the small population the test was applied
>to and employing them in a broader context. Insofar as the test is
>concerned, the "sample" data and the contingency table it is arrayed in are
>the entire universe. In that sense, the FET can't be "conservative" or
>"liberal." It isn't actually a hypothesis test and should not be thought of
>as one or used in the place of one.
> >
>JDougherty
Could you give some reference, supporting this, for me surprising,
view? I don't see a necessary connection between an exact test and
the idea that it does not test a hypothesis.
Thanks,
Heinz


Heinz Tuechler wrote
> [...]
> Could you give some reference, supporting this, for me surprising,
> view? I don't see a necessary connection between an exact test and
> the idea that it does not test a hypothesis.
Fisher's Exact Test is a non-parametric "test." It tests the distribution in
the contingency table against the total possible arrangements and gives you
the precise likelihood of that many items being arranged in that manner. No
more and no less. You could argue about the greater population from which
your sample is drawn, but FET makes no assumptions at all about any greater
sample universe.

Also, since the "population" being used in FET is strictly limited to the
members of the contingency table, the results are a subset of a finite group
of possible results that are relevant to that specific arrangement of data.
You are not "estimating" parameters of a parent population or making any
assumptions about the parent distribution. You can designate a "p" value
such as 0.05 as a level of significance, but there is no "error" term in the
FET result.

Fisher stated that the test DOES assume a null hypothesis of independence to
a hypergeometric distribution of the cell members. But that creates other
issues if you are attempting to use the results in conjunction with
assumptions about a broader sample universe than that in the test. For
instance, you have to carry the assumption of a hypergeometric distribution
over into the land of reality your sample is drawn from, and you then have
to justify that.


On Jan 23, 2015, at 5:54 PM, JohnDee wrote:
> Heinz Tuechler wrote
>> [...]
> Fisher's Exact Test is a non-parametric "test." It tests the distribution in
> the contingency table against the total possible arrangements and gives you
> the precise likelihood of that many items being arranged in that manner.
That's not the way I understand the construction of the result. The statistic is rather the number of permutations as extreme or more extreme (as measured by the odds ratio), holding the marginals constant, divided by the total number of possible permutations of the data.
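That construction can be checked numerically: the one-sided FET p-value is a hypergeometric tail probability with the margins held fixed (the counts below are invented for illustration):

```r
# Hypothetical 2x2 table with all margins equal to 4
n11 <- 3; n12 <- 1; n21 <- 1; n22 <- 3
tab <- matrix(c(n11, n21, n12, n22), nrow = 2)

# Probability of tables as extreme or more extreme than the observed
# one, conditional on the margins (upper tail in the [1,1] cell)
p.hyper <- sum(dhyper(n11:min(n11 + n12, n11 + n21),
                      n11 + n12, n21 + n22, n11 + n21))
p.fet <- fisher.test(tab, alternative = "greater")$p.value
c(p.hyper, p.fet)
```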
> No
> more and no less. You could argue about the greater population from which
> your sample is drawn, but FET makes no assumptions at all about any greater
> sample universe.
It is conditional on the margins, so that is the description of the "universe".
> Also, since the "population" being used in FET is strictly
> limited to the members of the contingency table, the results are a subset of
> a finite group of possible results that are relevant to that specific
> arrangement of data. You are not "estimating" parameters of a parent
> population or making any assumptions about the parent distribution. You can
> designate a "p" value such as 0.05 as a level of significance, but there is
> no "error" term in the FET result. Fisher stated that the test DOES assume
> a null hypothesis of independence to a hypergeometric distribution of the
> cell members. But that creates other issues if you are attempting to use
> the results in conjunction with assumptions about a broader sample universe
> than that in the test. For instance you have to carry the assumption of a
> hypergeometric distribution over in to the land of reality your sample is
> drawn from and you then have to justify that.
>
And this is off-topic on R-help .....
David Winsemius
Alameda, CA, USA


comment inline
David Winsemius wrote on 24.01.2015 21:08:
>
> On Jan 23, 2015, at 5:54 PM, JohnDee wrote:
>> [...]
>
>> Also, since the "population" being used in FET is strictly
>> limited to the members of the contingency table, the results are a subset of
>> a finite group of possible results that are relevant to that specific
>> arrangement of data. You are not "estimating" parameters of a parent
>> population or making any assumptions about the parent distribution. You can
>> designate a "p" value such as 0.05 as a level of significance, but there is
>> no "error" term in the FET result. Fisher stated that the test DOES assume
>> a null hypothesis of independence to a hypergeometric distribution of the
>> cell members. But that creates other issues if you are attempting to use
>> the results in conjunction with assumptions about a broader sample universe
>> than that in the test. For instance you have to carry the assumption of a
>> hypergeometric distribution over in to the land of reality your sample is
>> drawn from and you then have to justify that.
In this respect I agree. A real-world situation with a universe of fixed
margins seems unusual to me.
>>
>
> And this is off-topic on R-help .....
Sorry for asking a question off-topic more than five years ago. A nice
surprise to get an answer.
Thanks,
Heinz

