[R] How to Get Categorical Correlation Coefficient

Kum-Hoe Hwang
Howdy Gurus!

I get different correlation results from what should be the same data. The
string variable "corridor1" is coded as a number in the numeric
variable "corridor2".
--------------------------------------------------------------------------
> levels(corridor1)
[1] "A"   "B"   "C"   "D"     "E"   "F"
> levels(as.factor(corridor2))
[1] "0" "1" "2" "3" "4"
>
------------------------------------------------------------------------------------------
I get the following correlation results using the cor() function.
------------------------------------------------------------------------------------------
> cor(jh1_1, as.factor(corridor1))
[1] 0.01528538
> cor(jh1_1, as.factor(corridor2))
[1] -0.4972571
------------------------------------------------------------------------------------------
I do not know why the above correlation coefficients, computed from the
same data, are different.
They are 0.015 from as.factor(corridor1) and -0.497 from as.factor(corridor2).
The string variable "corridor1" holds the same category data as the
variable corridor2.
The only difference is that "A" is replaced with "0", "B" with "1", "C"
with "2", and so on.

Could you tell me why they are different, and which correlation
coefficient is correct?

Thanks in advance,

--
Kum-Hoe Hwang, Ph.D. Phone: 82-31-250-3516 Email: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] How to Get Categorical Correlation Coefficient

Peter Dalgaard
One thing that strikes me is that corridor1 has 6 levels and corridor2
has 5...

In general correlations are not expected to work on factors so I'd be
explicit about taking as.numeric(). A glance at
table(corridor1,corridor2) should be informative too, as would a
summary(as.numeric(as.factor(corridor1))-as.numeric(as.factor(corridor1)))
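
To illustrate the point about factors, here is a minimal sketch on made-up data. The vectors y, f1 and f2 are hypothetical stand-ins for jh1_1, corridor1 and corridor2, not the poster's data. It shows that as.numeric() on a factor returns the internal level codes, and that a cross-table plus a difference of the two codings tells you whether they line up.

```r
# Toy data: hypothetical stand-ins for jh1_1, corridor1 and corridor2.
set.seed(1)
f1 <- factor(rep(c("A", "B", "C"), times = 10))
# Numeric recoding of the same categories: A -> 0, B -> 1, C -> 2
f2 <- unname(c(A = 0, B = 1, C = 2)[as.character(f1)])
y  <- rnorm(length(f1))

# as.numeric() on a factor gives the level codes 1, 2, 3, ...
codes1 <- as.numeric(f1)

# Cross-tabulate the two codings: one non-zero cell per row means
# the categories map one-to-one.
print(table(f1, f2))

# When the level order matches the numeric order, the codes differ
# by a constant, so the correlations with y agree.
print(summary(codes1 - f2))
print(cor(y, codes1))
print(cor(y, f2))
```

If the difference is not constant, the two codings are different numeric variables and there is no reason for the correlations to match.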

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907


Re: [R] How to Get Categorical Correlation Coefficient

Kum-Hoe Hwang
I made a mistake in my earlier email.
I have corrected the error by dropping "ns.omit" from data.frame().

The corrected correlations and output follow:

------------------------------------------------------------------------------
#
> nrow(sdi)
[1] 65613

> print(corridor1[65600:65613])
 [1] C C C C F F F F B B F F B B
Levels: B C D E A F

> print(corridor2[65600:65613])
 [1] 4 4 4 4 2 2 2 2 1 1 2 2 1 1

> summary(corridor1)
    B     C     D     E     A     F
15092 13456  6652  1611  1796 27006
> summary(corridor2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     1.0     2.0     2.3     3.0     5.0

> summary(as.numeric(as.factor(corridor1))-as.numeric(as.factor(corridor1)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0
> table(corridor1,corridor2)
         corridor2
corridor1     0     1     2     3     4     5
  B           0 15092     0     0     0     0
  C           0     0     0     0 13456     0
  D           0     0     0  6652     0     0
  E           0     0     0     0     0  1611
  A        1796     0     0     0     0     0
  F           0     0 27006     0     0     0
>
---------------------------------------------------------------------------------------
The following results still give different correlation coefficients.
Are there any functions or packages for categorical correlation?

> cor(jh1_1, corridor1)
[1] 0.02753303
> cor(jh1_1, as.factor(corridor2))
[1] -0.3682788


Thanks for your kindness,

Kum


On 12 Oct 2006 10:25:33 +0200, Peter Dalgaard <[hidden email]> wrote:

> One thing that strikes me is that corridor1 has 6 levels and corridor2
> has 5...
>
> In general correlations are not expected to work on factors so I'd be
> explicit about taking as.numeric(). A glance at
> table(corridor1,corridor2) should be informative too, as would a
> summary(as.numeric(as.factor(corridor1))-as.numeric(as.factor(corridor1)))

Re: [R] How to Get Categorical Correlation Coefficient

Peter Dalgaard
"Kum-Hoe Hwang" <[hidden email]> writes:

> > summary(as.numeric(as.factor(corridor1))-as.numeric(as.factor(corridor1)))
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>       0       0       0       0       0       0

One of the two terms should of course be corridor2. (That's my typo, but...)



> > table(corridor1,corridor2)
>          corridor2
> corridor1     0     1     2     3     4     5
>   B           0 15092     0     0     0     0
>   C           0     0     0     0 13456     0
>   D           0     0     0  6652     0     0
>   E           0     0     0     0     0  1611
>   A        1796     0     0     0     0     0
>   F           0     0 27006     0     0     0


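Notice that the levels are not in the same order! as.numeric(corridor1) will
have 1 for B, ..., 5 for A, 6 for F, whereas the table shows corridor2 has
1 for B, 4 for C, 3 for D, 5 for E, 0 for A and 2 for F. The two codings
are therefore different numeric variables, and their correlations with
jh1_1 need not agree.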
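
A small sketch of the ordering issue, using made-up data with the level order and the mapping read off the table above. The vector y is a hypothetical numeric variable, not the poster's jh1_1. It rebuilds the factor with its levels in the order of the corridor2 codes, after which the two codings agree up to a constant. Since the categories have no natural order, the last lines compute the R-squared of a one-way linear model as one possible order-free measure of association; that choice is an assumption made here, not something proposed in the thread.

```r
# Made-up data; level order B C D E A F as in the output above.
lev <- c("B", "C", "D", "E", "A", "F")
corridor1 <- factor(rep(lev, times = 5), levels = lev)
# Mapping read off the cross-table: A=0, B=1, F=2, D=3, C=4, E=5
map <- c(A = 0, B = 1, F = 2, D = 3, C = 4, E = 5)
corridor2 <- unname(map[as.character(corridor1)])
set.seed(42)
y <- rnorm(length(corridor1))  # hypothetical numeric variable

# Default codes follow the level order: B=1, C=2, D=3, E=4, A=5, F=6
codes_default <- as.numeric(corridor1)

# Rebuild the factor with levels in the order of the corridor2 codes
corridor1_sorted <- factor(corridor1, levels = names(sort(map)))
codes_sorted <- as.numeric(corridor1_sorted)  # A=1, B=2, F=3, D=4, C=5, E=6

# Now the codes differ from corridor2 by exactly 1, so cor() agrees
print(summary(codes_sorted - corridor2))
print(cor(y, codes_sorted))
print(cor(y, corridor2))

# Order-free alternative: R-squared from a one-way linear model.
# It does not depend on how the levels are ordered or coded.
r2 <- summary(lm(y ~ corridor1))$r.squared
print(r2)
```

The R-squared is the same for corridor1 and corridor1_sorted, which is the property the numeric codes lack.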


