Difficulty with 'merge'

Michael Kubovy
Dear R-helpers,

Happy New Year to all the helpful members of the list.

Here is the behavior I'm looking for:
 > v1 <- c("a","b","c")
 > n1 <- c(0, 1, 2)
 > v2 <- c("c", "a", "b")
 > n2 <- c(0, 1 , 2)
 > (f1  <- data.frame(v1, n1))
   v1 n1
1  a  0
2  b  1
3  c  2
 > (f2 <- data.frame(v2, n2))
   v2 n2
1  c  0
2  a  1
3  b  2
 > (m12 <- merge(f1, f2, by.x = "v1", by.y = "v2", sort = F))
   v1 n1 n2
1  c  2  0
2  a  0  1
3  b  1  2

Now to my data:
 > summary(pL)
         pairL
a fondo   :  41
alto      :  41
ampio     :  41
angoloso  :  41
aperto    :  41
appoggiato:  41
(Other)   :1271

 > pL$pairL[c(1,42)]
[1] appoggiato dentro
37 Levels: a fondo alto ampio angoloso aperto appoggiato asimmetrico  
complicato convesso davanti dentro destra ... verticale

 > summary(oppN)
         pairL              pairR         subject           L
a fondo   :  41   a galla    :  41   S1     :  37   Min.   :0.3646
alto      :  41   acuto      :  41   S10    :  37   1st Qu.:0.5521
ampio     :  41   arrotondato:  41   S11    :  37   Median :0.6354
angoloso  :  41   basso      :  41   S12    :  37   Mean   :0.6403
aperto    :  41   chiuso     :  41   S13    :  37   3rd Qu.:0.7188
appoggiato:  41   compl      :  41   S14    :  37   Max.   :0.9375
(Other)   :1271   (Other)    :1271   (Other):1295
       LL                RR               M
Min.   :0.02083   Min.   :0.0010   Min.   :0.0000
1st Qu.:0.37500   1st Qu.:0.1771   1st Qu.:0.1042
Median :0.47917   Median :0.2708   Median :0.2292
Mean   :0.46452   Mean   :0.2760   Mean   :0.2598
3rd Qu.:0.55208   3rd Qu.:0.3750   3rd Qu.:0.3854
Max.   :0.92708   Max.   :0.6042   Max.   :0.7812
                  NA's   :3.0000   NA's   :3.0000
       asym             polar            polar_a1          clust
Min.   :-0.5555   Min.   :-1.2410   Min.   :-2.949e+00   c1:492
1st Qu.: 0.2091   1st Qu.: 0.4571   1st Qu.:-1.902e-01   c2:287
Median : 0.5555   Median : 1.1832   Median :-1.110e-16   c3: 82
Mean   : 0.6265   Mean   : 1.3428   Mean   :-5.745e-02   c4:246
3rd Qu.: 0.9383   3rd Qu.: 2.0712   3rd Qu.: 1.168e-01   c5: 82
Max.   : 2.7081   Max.   : 4.6151   Max.   : 4.218e+00   c6:328
                    NA's   : 3.0000   NA's   : 3.000e+00

 > oppN$pairL[c(1,42)]
[1] spesso fine
37 Levels: a fondo alto ampio angoloso aperto appoggiato asimmetrico  
complicato convesso davanti dentro destra ... verticale

 > unique(sort(oppM$pairL)) == unique(sort(pL$pairL))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[26] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

In other words, I think that pL$pairL and oppN$pairL each consist of 37
blocks of 41 repetitions of a name, and that the two columns are
permutations of each other.

However:

 > summary(m1 <- merge(oppM, pairL, by.x = "pairL", by.y = "pairL", sort = F))
         pairL               pairR          subject            L
a fondo   : 1681   a galla    : 1681   S1     : 1517   Min.   :0.3646
alto      : 1681   acuto      : 1681   S10    : 1517   1st Qu.:0.5521
ampio     : 1681   arrotondato: 1681   S11    : 1517   Median :0.6354
angoloso  : 1681   basso      : 1681   S12    : 1517   Mean   :0.6398
aperto    : 1681   chiuso     : 1681   S13    : 1517   3rd Qu.:0.7188
appoggiato: 1681   compl      : 1681   S14    : 1517   Max.   :0.9375
(Other)   :51988   (Other)    :51988   (Other):52972
       LL                RR               M
Min.   :0.02083   Min.   :0.0010   Min.   :0.0000
1st Qu.:0.37500   1st Qu.:0.1771   1st Qu.:0.1042
Median :0.47917   Median :0.2708   Median :0.2292
Mean   :0.46402   Mean   :0.2760   Mean   :0.2598
3rd Qu.:0.55208   3rd Qu.:0.3750   3rd Qu.:0.3854
Max.   :0.92708   Max.   :0.6042   Max.   :0.7812
       asym             polar            polar_a1          clust
Min.   :-0.5555   Min.   :-1.2410   Min.   :-2.949e+00   c1:20172
1st Qu.: 0.2091   1st Qu.: 0.4571   1st Qu.:-1.904e-01   c2:11644
Median : 0.5555   Median : 1.1832   Median :-1.110e-16   c3: 3362
Mean   : 0.6234   Mean   : 1.3428   Mean   :-5.745e-02   c4:10086
3rd Qu.: 0.9383   3rd Qu.: 2.0712   3rd Qu.: 1.169e-01   c5: 3362
Max.   : 2.7081   Max.   : 4.6151   Max.   : 4.218e+00   c6:13448

I was expecting each level of pairL to appear 41 times in the merged
result, not 1681 = 41^2 times.
_____________________________
Professor Michael Kubovy
University of Virginia
Department of Psychology
USPS:     P.O.Box 400400    Charlottesville, VA 22904-4400
Parcels:    Room 102        Gilmer Hall
         McCormick Road    Charlottesville, VA 22903
Office:    B011    +1-434-982-4729
Lab:        B019    +1-434-982-4751
Fax:        +1-434-982-4766
WWW:    http://www.people.virginia.edu/~mk9y/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: Difficulty with 'merge'

Christoph Buser
Dear Michael

Please note that merge returns all possible combinations of matching
rows when the key has repeated elements, as you can see in the example
below.

?merge

"... If there is more than one match, all possible matches
contribute one row each. ..."
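
The quoted rule implies that each key contributes (number of matches in
x) times (number of matches in y) rows. A minimal sketch on made-up data
(the names x, y, key, vx and vy are hypothetical, not taken from the
original post) that checks this count:

```r
# Hypothetical toy data: "a" occurs 2 times in x and 4 times in y,
# "b" occurs 3 times in x and once in y.
x <- data.frame(key = c("a", "a", "b", "b", "b"), vx = 1:5)
y <- data.frame(key = c("a", "a", "a", "a", "b"), vy = 6:10)

m <- merge(x, y, by = "key")

# Occurrences of each key on either side.
cx <- table(x$key)
cy <- table(y$key)

# Keys present in both frames (merge() keeps only these by default).
keys <- intersect(names(cx), names(cy))

# Rows contributed by each key: (count in x) * (count in y).
per_key <- setNames(as.vector(cx[keys]) * as.vector(cy[keys]), keys)

per_key        # a = 2 * 4 = 8, b = 3 * 1 = 3
nrow(m)        # 11, the same as sum(per_key)
```

By the same arithmetic, a key that occurs 41 times on both sides
contributes 41 * 41 = 1681 rows, which is the count reported in the
question.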

Maybe you can first apply "aggregate" to your data.frame in a
reasonable way, to collapse the repeated values to unique ones, and
then proceed with merge, but that depends on your problem.
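
As a rough sketch of that idea, assuming that one summary row per key is
acceptable (the data, the column names and the choice of mean as the
summary are all hypothetical):

```r
# Hypothetical toy data: each key is repeated in both frames.
x <- data.frame(key = c("a", "a", "b", "b"), val_x = c(1, 3, 10, 30))
y <- data.frame(key = c("b", "a", "b", "a"), val_y = c(2, 4, 6, 8))

# Collapse each frame to one row per key. The mean is only an
# assumption here; the right summary depends on the problem.
x1 <- aggregate(val_x ~ key, data = x, FUN = mean)
y1 <- aggregate(val_y ~ key, data = y, FUN = mean)

# With unique keys on both sides, merge() returns one row per key.
m <- merge(x1, y1, by = "key")
m
```

Whether averaging (or any other summary) is appropriate is a modelling
decision; the point is only that merge stops multiplying rows once the
key is unique in both data frames.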

Regards,

Christoph

--------------------------------------------------------------
Christoph Buser <[hidden email]>
Seminar fuer Statistik, LEO C13
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-44-632-4673 fax: 632-1228
http://stat.ethz.ch/~buser/
--------------------------------------------------------------

example with repeated values
----------------------------

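# the keys are repeated: "a" three times and "b" twice in each data frame
v1 <- c("a", "b", "a", "b", "a")
n1 <- 1:5
v2 <- c("b", "b", "a", "a", "a")
n2 <- 6:10
(f1 <- data.frame(v1, n1))
(f2 <- data.frame(v2, n2))
# merge() pairs every matching row of f1 with every matching row of f2
(m12 <- merge(f1, f2, by.x = "v1", by.y = "v2", sort = FALSE))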
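
If what is wanted is instead a one-to-one pairing (first "a" with first
"a", second with second, and so on), one possible workaround is to merge
on the key plus an occurrence counter. This is only a sketch under that
assumption, which the thread does not confirm; the helper column occ is
hypothetical:

```r
f1 <- data.frame(v1 = c("a", "b", "a", "b", "a"), n1 = 1:5)
f2 <- data.frame(v2 = c("b", "b", "a", "a", "a"), n2 = 6:10)

# Number the occurrences of each key within each frame: 1, 2, 3, ...
f1$occ <- ave(seq_along(f1$v1), f1$v1, FUN = seq_along)
f2$occ <- ave(seq_along(f2$v2), f2$v2, FUN = seq_along)

# The pair (key, occ) is unique within each frame, so every row of f1
# matches at most one row of f2.
m <- merge(f1, f2, by.x = c("v1", "occ"), by.y = c("v2", "occ"))

# Order explicitly rather than relying on merge()'s own row order.
m <- m[order(m$v1, m$occ), ]
m
nrow(m)   # 5 rows instead of 13
```

The pairing depends entirely on the row order within each data frame,
so it is only meaningful if that order carries information.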




