
using sample() in data.table

Paulo van Breugel
Hi, I am new to this package and not sure how to implement the sample() function with data.table.

I have a data frame SPF with three columns: cat, pnvid and wdpaint. The pnvid variable has values 1:3 and wdpaint has values 1:10. I am interested in the count of all combinations of wdpaint and pnvid in my data set, which can be calculated using table or tapply (I use the latter in the example code below).

Normally I would use something like:

c <- tapply(SPF$cat, list(as.factor(SPF$pnvid), as.factor(SPF$wdpaint)), function(x) length(x))

If I understand correctly, I would use the below when working with data tables:

f <- SPF[,length(cat),by="wdpaint,pnvid"]

But what if I want to reshuffle the column wdpaint first? When using tapply, it would be something along the lines of:

a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint, replace=F)))
c <- tapply(SPF$cat, a, function(x) length(x))



But how to do this with data.table?

Paulo

_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: using sample() in data.table

Matthew Dowle

Hi,

Welcome to the list.

Rather than picking a column and calling length() on it, .N is a little
more convenient (and faster if that column isn't otherwise used, as in
this example). Search ?data.table for the string ".N" to find out more.

And to group by expressions of column names, wrap with list().  So,

    SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]

But that won't calculate any different statistics, just return the groups
in a different order. Seems like just an example, rather than the real
task, iiuc, which is fine of course.
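
A minimal sketch of the two points above (counting with .N, and grouping by an expression wrapped in list()). The table DT and its values are invented for illustration and are not Paulo's data:

```r
library(data.table)

# Toy data standing in for SPF; all values are made up.
set.seed(1)
DT <- data.table(cat     = 1:20,
                 wdpaint = rep(1:4, times = 5),
                 pnvid   = rep(1:2, each = 10))

# Count rows per (wdpaint, pnvid) combination with .N.
counts <- DT[, .N, by = list(wdpaint, pnvid)]

# Group by an expression: the shuffled column is computed once, then grouped on.
shuffled <- DT[, .N, by = list(wdpaint = sample(wdpaint), pnvid)]

# Both results account for every row exactly once.
sum(counts$N)    # 20
sum(shuffled$N)  # 20
```

Naming the expression inside list() (here wdpaint = ...) only controls the column name in the result.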

Matthew



Re: using sample() in data.table

Paulo van Breugel
Thanks Matthew

I am not sure I understand the code (actually, I am sure I do not :-( ). More specifically, I would expect the two expressions below to yield tables of the same dimension (basically all combinations of wdpaint and pnvid):

aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
dim(aa)
# 254   3
bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
dim(bb)
# 170   3

What I am looking for is a cross table of pnvid and wdpaint, i.e., the frequency or number of occurrences of each combination of pnvid and wdpaint. In that case, shuffling wdpaint should give a different frequency distribution, as in the example below:

table(c(1,1,2,2), c(3,3,4,4))
table(c(2,2,1,1), c(3,3,4,4))

Basically, what I want to do is run X permutations on a data set, which I will then use to create a confidence interval on the frequency distribution of sample points over wdpaint and pnvid.
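
One way the permutation idea could be sketched in base R, assuming made-up vectors in place of the real columns and a hypothetical permutation count n_perm. The quantile-based interval is an assumption about what is meant by a confidence interval here, not something stated in the thread:

```r
# Toy vectors standing in for the two columns; values are invented.
set.seed(42)
wdpaint <- sample(1:3, 200, replace = TRUE)
pnvid   <- sample(1:2, 200, replace = TRUE)

n_perm <- 99  # hypothetical number of permutations

# Fix the factor levels so every permutation yields a table of the same shape,
# including cells with zero observations.
wf <- factor(wdpaint, levels = 1:3)
pf <- factor(pnvid,   levels = 1:2)

# One column per permutation, one row per (pnvid, wdpaint) cell.
perm <- replicate(n_perm, as.vector(table(pf, sample(wf))))

# Empirical 2.5% and 97.5% quantiles of the count in each cell.
ci <- apply(perm, 1, quantile, probs = c(0.025, 0.975))
dim(perm)  # 6 cells by 99 permutations
```

Fixing the levels up front keeps the cell order identical across permutations, so the rows of perm can be compared directly.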

Cheers,

Paulo





On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle <[hidden email]> wrote:

>    SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]

Re: using sample() in data.table

Matthew Dowle

The shuffling can form a different number of groups, can't it?

table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
table(c(2,1,2,1), c(3,3,4,4))   # 4 groups
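
The same effect can be sketched with data.table grouping, where the number of result rows is the number of combinations that actually occur. The table DT below is a toy example built from the vectors above:

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 2), y = c(3, 3, 4, 4))

# Original pairing: only the combinations (1,3) and (2,4) occur.
n_orig <- nrow(DT[, .N, by = list(x, y)])                  # 2

# A different pairing of the same x values produces all four combinations.
n_shuf <- nrow(DT[, .N, by = list(x = c(2, 1, 2, 1), y)])  # 4
```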



Re: using sample() in data.table

Paulo van Breugel
I got some very useful further feedback from Matthew. Let me summarize some key points from his suggestions concerning the code below:

The following code is still fairly slow (although faster than using table or tapply):

  a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
  b <- a[,.N,by=list(V1,V2)]
  c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
  for(i in 1:11){
    a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
    b <- a[,.N,by=list(V1,V2)]
    c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))
  }

As pointed out by Matthew, the rbind at the end of the loop makes memory use grow on every iteration and is generally inefficient. How badly it affects performance will depend on the data size, though. So step 1 is to get that outside the loop (a useful link he provided is http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r). Based on a hint in R-inferno (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code as follows:

c <- vector('list', 12)
a1 <- data.table(SPFn$wdpaint,SPFn$pnvid)   # columns still numeric (double) at this stage
a2 <- SPFn$wdpaint
for(i in 1:12){
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)

This did improve the run time, but only a little (16.0 instead of 16.4 seconds). The next step was to profile the code to see which part takes the most time. This can be done with Rprof(). The results showed that ordernumtol, a data.table function which sorts numeric ('double' floating point) columns, was taking a lot of time. As it turns out, SPFn$wdpaint and SPFn$pnvid were both numeric. Changing these to integer speeds up the code a lot.

c <- vector('list', 12)
a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
a2 <- as.integer(SPFn$wdpaint)
for(i in 1:12){
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)

The second version took 16.0 seconds; the last attempt only 2.4 seconds! That is a serious (> 6x) improvement. And it shows I really need to be much more careful about my variables...
I checked, and it also makes a smaller but still very significant difference when using table (3x) or tapply (2x).
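
The two diagnostic steps described above, profiling with Rprof() and then checking how the columns are stored, might look roughly like this. The vectors and the tapply workload are invented stand-ins rather than the original data or loop, and the sketch does not try to reproduce the timings reported here:

```r
# Toy columns; as.numeric() mimics whole numbers that are stored as double.
set.seed(1)
wdpaint <- as.numeric(sample(10, 2e5, replace = TRUE))
pnvid   <- as.numeric(sample(3,  2e5, replace = TRUE))

# Step 1: profile a stand-in workload to see where the time goes.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                       # start the sampling profiler
for (i in 1:10) counts <- tapply(wdpaint, list(pnvid, wdpaint), length)
Rprof(NULL)                            # stop profiling
head(summaryRprof(prof_file)$by.self)  # functions ranked by own time

# Step 2: inspect the storage type of each column.
before <- vapply(list(wdpaint = wdpaint, pnvid = pnvid), typeof, character(1))

# Whole numbers stored as double can be converted without losing information.
wdpaint <- as.integer(wdpaint)
pnvid   <- as.integer(pnvid)
after <- vapply(list(wdpaint = wdpaint, pnvid = pnvid), typeof, character(1))
```

Here before holds "double" for both columns and after holds "integer".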

Big thanks to Matthew Dowle for all his help.. and any further suggestions for improvements are obviously welcome.

Cheers,

Paulo



On 06/19/2012 04:24 PM, Matthew Dowle wrote:
> The shuffling can form a different number of groups can't it?

YES, obvious.. I was half asleep I guess

> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups



Re: using sample() in data.table

Matthew Dowle

Great. Thanks for keeping the list updated.

One thing I don't quite see, instead of :

for (i in 1:12) {
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
}

why not :

for (i in 1:12){
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    b2 <- b[,list(N=sum(N)),by=list(V2,V1)]  # name the sum so it does not clash with grouping column V1
    c[[i]] <- b2$N
}

Idea being to save the tapply and the 2 as.factor. Further, I'm not sure
that sum() will be summing anything will it?  Isn't b2 the same as
b[order(V2,V1)], and if so that will be faster still?

Matthew


Re: using sample() in data.table

Matthew Dowle

Oh, and just to mention there is already FR#1636 to add a new internal
function is.reallyreal(), similar to base::is.unsorted().

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1636&group_id=240&atid=978

data.table would use that internally to warn you when sorting or grouping
on numeric data that didn't actually contain any fractional data (so could
be converted to integer). It would return very quickly as soon as it found
any fractional data, so it shouldn't add much to overhead.
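
As a rough illustration of the kind of check described, here is a user-level sketch. The helper has_fraction() is hypothetical: it is not data.table's is.reallyreal(), and its name and behaviour are assumptions made for this example:

```r
# Hypothetical helper, NOT data.table's internal function.
# Unlike the proposed internal check, this vectorised version scans the
# whole vector instead of returning at the first fractional value.
has_fraction <- function(x) {
  stopifnot(is.double(x))
  any(x != trunc(x), na.rm = TRUE)
}

whole <- has_fraction(c(1, 2, 3))      # FALSE: could be stored as integer
mixed <- has_fraction(c(1, 2.5, 3))    # TRUE: genuinely fractional data
```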

If that had been in place, it would have saved your time, and saved you
needing to know, in this case.

All, if there are any more "convenience"/"safety"/"helpfulness" features
like that, please do propose them.

Another example is that 3 of the 5 points in the wiki are no longer needed
since data.table optimizes for those things now (or soon will do),
provided the datatable.optimize option is left at the default value of
Inf.

Matthew



Re: using sample() in data.table

Paulo van Breugel
Hi Matthew,

Thanks for the suggestions. The tapply in the code below transforms the table from long format to a wide format with wdpaint as columns and pnvid as rows. The main reason is that it includes all combinations of the two variables, including those with 0 observations. The code you are suggesting indeed seems to be the same as ordering the table.
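
The long-to-wide step can be sketched on a toy table of counts. The data frame b below is invented; in it, one combination (pnvid = 2, wdpaint = 3) never occurs. Base R's tapply() still creates a cell for it, holding NA, which can then be set to 0:

```r
# Toy long-format counts; the combination (V2 = 2, V1 = 3) is absent.
b <- data.frame(V1 = c(1, 2, 3, 1, 2),   # wdpaint
                V2 = c(1, 1, 1, 2, 2),   # pnvid
                N  = c(4, 2, 1, 3, 5))

# tapply() lays the counts out with pnvid as rows and wdpaint as columns.
wide <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)

# Combinations with no rows come back as NA; set them to 0 explicitly.
wide[is.na(wide)] <- 0
wide
```

Note that the cell only exists because wdpaint = 3 occurs somewhere in b; a level that is missing entirely would not get a column unless the factor levels are fixed beforehand.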

Cheers,

Paulo



On Fri, Jun 22, 2012 at 4:55 PM, Matthew Dowle <[hidden email]> wrote:

Great. Thanks for keeping the list updated.

One thing I don't quite see, instead of :

for (i in 1:12) {
   a3 <- a1[,V1:=sample(a2,replace=F)]
   b <- a3[,.N,by=list(V1,V2)]
   c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
}

why not :

for (i in 1:12){
   a3 <- a1[,V1:=sample(a2,replace=F)]
   b <- a3[,.N,by=list(V1,V2)]
   b2 <- b[,list(N=sum(N)),by=list(V2,V1)]   # name the sum so it does not clash with the grouping column V1
   c[[i]] <- b2$N
}

The idea is to save the tapply and the two as.factor calls. Further, I'm not
sure that sum() will be summing anything, will it?  Isn't b2 the same as
b[order(V2,V1)]? If so, that will be faster still.
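
A minimal sketch of that idea, on invented data (the toy columns, the size n and the names wd, pn and perms are assumptions for illustration only). Since b already has exactly one row per (V1, V2) group, the sum() adds up single values, and ordering b gives the same counts:

```r
library(data.table)

set.seed(1)  # only to make the toy run repeatable

# Hypothetical stand-ins for the wdpaint and pnvid columns, stored as integers
n  <- 1000L
wd <- sample(1:10, n, replace = TRUE)
pn <- sample(1:3,  n, replace = TRUE)
a1 <- data.table(V1 = wd, V2 = pn)

perms <- vector("list", 12)               # preallocate instead of growing with rbind
for (i in seq_along(perms)) {
  a1[, V1 := sample(wd)]                  # reshuffle V1 in place (by reference)
  b <- a1[, .N, by = list(V1, V2)]        # one row per combination that occurs
  perms[[i]] <- b[order(V2, V1)]          # no sum() needed, just order the groups
}

# Check on the last permutation: summing N per group changes nothing
b2 <- b[, list(N = sum(N)), by = list(V2, V1)]
print(identical(b2[order(V2, V1)]$N, b[order(V2, V1)]$N))
```

The results are kept as small tables rather than bare count vectors because a shuffle can produce a different number of groups, so the vectors would not always line up.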

Matthew

> I got some very useful further feed back from Matthew. Let me summarize
> some key points from his suggestions concerning the code below:
>
> The following code is still fairly slow (although faster than using
> table or tapply):
>
>    a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
>
>    b <- a[,.N,by=list(V1,V2)]
>
>    c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
>    for(i in 1:11){
>
>      a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
>
>      b <- a[,.N,by=list(V1,V2)]
>
>      c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))
>
>   }
>
> As pointed out by Matthew, the rbind at the end of the loop keeps
> growing memory use and is generally inefficient. How badly it
> impacts performance will depend on the data size, though. So step 1 is
> to get that outside the loop (a useful link he provided is
> http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r).
> Based on a hint in R-inferno
> (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code
> as follows:
>
> c <- vector('list', 12)
>
> a1 <- data.table(SPFn$wdpaint,SPFn$pnvid)
>
> a2 <- SPFn$wdpaint
>
> for(i in 1:12){
>
>      a3 <- a1[,V1:=sample(a2,replace=F)]
>
>      b <- a3[,.N,by=list(V1,V2)]
>
>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
> }
>
> c <- do.call('rbind', c)
>
> This did improve the run time, but only by a very small amount (16.0 instead of
> 16.4 seconds). The next step was to profile the code, to see which part is
> taking the most time. This can be done with Rprof(). The results showed that
> ordernumtol, a data.table function which sorts numeric ('double'
> floating point) columns, was taking a lot of time. As it turns out,
> SPFn$wdpaint and SPFn$pnvid were both numeric. Changing these to
> integer speeds up the code a lot.
>
> c <- vector('list', 12)
>
> a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
>
> a2 <- as.integer(SPFn$wdpaint)
>
> for(i in 1:12){
>
>      a3 <- a1[,V1:=sample(a2,replace=F)]
>
>      b <- a3[,.N,by=list(V1,V2)]
>
>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
> }
>
> c <- do.call('rbind', c)
>
> The second version took 16.0 seconds; the last attempt took only 2.4 seconds!
> That is a serious (> 6x) improvement. And it shows I really need to be
> much more careful about my variable types...
> I checked, and it also makes a smaller but still very significant
> difference when using table (3x) or tapply (2x).
>
> Big thanks to Matthew Dowle for all his help.. and any further
> suggestions for improvements are obviously welcome.
>
> Cheers,
>
> Paulo
>
>
>
> On 06/19/2012 04:24 PM, Matthew Dowle wrote:
>> The shuffling can form a different number of groups can't it?
> YES, obvious.. I was half asleep I guess
>>
>> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
>> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
>> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups
>>
>>
>>> Thanks Matthew
>>>
>>> I am not sure I understand the code (actually, I am sure I do not :-( ).
>>> More specifically, I would expect the two expressions below to yield
>>> tables
>>> of the same dimension (basically all combinations of wdpaint and
>>> pnvid):
>>>
>>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>>> dim(aa)
>>>> 254  3
>>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
>>> dim(bb)
>>>> 170 3
>>> What I am looking for is creating a cross table of pnvid and wdpaint,
>>> i.e.,
>>> the frequency or number of occurrences of each combination of pnvid and
>>> wdpaint. Shuffling wdpaint should give in that case a different
>>> frequency
>>> distribution, like in the example below:
>>>
>>> table(c(1,1,2,2), c(3,3,4,4))
>>> table(c(2,2,1,1), c(3,3,4,4))
>>>
>>> Basically what I want to do is run X permutations on a data set which I
>>> will then use to create a confidence interval on the frequency
>>> distribution
>>> of sample points over wdpaint and pnvid
>>>
>>> Cheers,
>>>
>>> Paulo
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle
>>> <[hidden email]>wrote:
>>>
>>>> Hi,
>>>>
>>>> Welcome to the list.
>>>>
>>>> Rather than picking a column and calling length() on it, .N is a
>>>> little
>>>> more convenient (and faster if that column isn't otherwise used, as in
>>>> this example). Search ?data.table for the string ".N" to find out
>>>> more.
>>>>
>>>> And to group by expressions of column names, wrap with list().  So,
>>>>
>>>>     SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>>>>
>>>> But that won't calculate any different statistics, just return the
>>>> groups
>>>> in a different order. Seems like just an example, rather than the real
>>>> task, iiuc, which is fine of course.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>> Hi, I am new to this package and not sure how to implement the
>>>> sample()
>>>>> function with data.table.
>>>>>
>>>>> I have a data frame SPF with three columns cat, pnvid and wdpaint.
>>>>> The
>>>>> pnvid variable has values 1:3, the wdpaint has values 1:10. I am
>>>>> interested in the count of all combinations of wdpaint and pnvid in
>>>>> my
>>>>> data
>>>>> set, which can be calculated using table or tapply (I use the latter
>>>> in
>>>>> the
>>>>> example code below).
>>>>>
>>>>> Normally I would use something like:
>>>>>
>>>>> *c <- tapply(SPF$cat, list(as.factor(SPF$pnvid),
>>>> as.factor(SPF$wdpaint)),
>>>>> function(x) length(x))*
>>>>>
>>>>> If I understand correctly, I would use the below when working with
>>>> data
>>>>> tables:
>>>>>
>>>>> *f <- SPF[,length(cat),by="wdpaint,pnvid"]*
>>>>>
>>>>> But what if I want to reshuffle the column wdpaint first? When using
>>>>> tapply, it would be something along the lines of:
>>>>>
>>>>> *a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint,
>>>>> replace=F)))
>>>>> c <- tapply(SPF$cat, a, function(x) length(x))*
>>>>>
>>>>>
>>>>> But how to do this with data.table?
>>>>>
>>>>> Paulo