Splitting data.frame into a list of small data.frames given indices


Splitting data.frame into a list of small data.frames given indices

Witold E Wolski
It's the inverse problem to merging a list of data.frames into a large
data.frame, just discussed in the "performance of do.call("rbind")"
thread.

I would like to split a data.frame into a list of data.frames
according to the first column.
This SEEMS to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows this function CANNOT
BE USED (unless you have PLENTY OF TIME).

For 'by', runtime ~ nrow^2, i.e. formally O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.


> nrows <- c(1e5, 1e6, 2e6, 3e6, 5e6)
> timing <- list()
> for(i in nrows){
+   dum <- peaks[1:i, ]
+   timing[[length(timing)+1]] <- system.time(x <- by(dum[, 2:3],
+     INDICES = list(dum[, 1]), FUN = function(x){x}, simplify = FALSE))
+ }
> names(timing) <- nrows
> timing
$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09
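
For anyone who wants to reproduce the timings above, here is a minimal
stand-in for the undefined 'peaks' object (the column names, the number of
groups and the data itself are illustrative assumptions, not the original
data):

# synthetic stand-in for 'peaks': column 1 is a grouping id, columns 2-3 are values
set.seed(1)
n <- 1e6
peaks <- data.frame(id = sample(1e5, n, replace = TRUE),
                    mz = runif(n), intensity = runif(n))
system.time(x <- by(peaks[, 2:3], INDICES = list(peaks[, 1]),
                    FUN = function(x){x}, simplify = FALSE))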





--
Witold Eryk Wolski


Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Rolf Turner
I'm not sure that I follow what you're doing, and your example is not
reproducible, since we have no idea what "peaks" is, but on a toy
example with 5e6 rows in the data frame I got a timing result of

    user  system elapsed
   0.379   0.025   0.406

when I applied split().  Is this adequately fast? Seems to me that if
you want to split something, split() would be a good place to start.
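
A minimal sketch of the sort of toy example this may refer to (the grouping
structure is an assumption; the actual test data was not posted):

# 5e6 rows, a grouping factor with a modest number of levels
n <- 5e6
dum <- data.frame(g = sample(1000, n, replace = TRUE),
                  x = runif(n), y = runif(n))
system.time(res <- split(dum[, c("x", "y")], dum$g))  # a list of small data.frames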

cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276


Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Witold E Wolski
Hi,

Here is a complete example which shows that the complexity of split or
by is O(n^2):

nrows <- c(1e3, 5e3, 1e4, 5e4, 1e5, 2e5)
res <- list()

for(i in nrows){
  # worst case: splitting by 1:nrow(dum) makes every row its own group
  dum <- data.frame(x = runif(i, 1, 1000), y = runif(i, 1, 1000))
  res[[length(res)+1]] <- system.time(x <- split(dum, 1:nrow(dum)))
}
res <- do.call("rbind", res)
plot(nrows^2, res[, "elapsed"])  # elapsed time grows roughly as nrows^2

And I can't see a reason why this has to be so slow.


cheers







--
Witold Eryk Wolski


Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Ivan Calandra-4
Hi,

I don't really understand why you split on every row... that is what makes
it so slow. Try a more realistic example (with a factor to split on).
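
For illustration, a minimal sketch of the contrast described here (the sizes
and the grouping variable are assumptions):

n <- 2e5
dum <- data.frame(x = runif(n, 1, 1000), y = runif(n, 1, 1000))
system.time(split(dum, sample(100, n, replace = TRUE)))  # a few groups: fast
system.time(split(dum, seq_len(n)))                      # one group per row: slow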

Ivan

--
Ivan Calandra, PhD
Scientific Mediator
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
[hidden email]
--
https://www.researchgate.net/profile/Ivan_Calandra
https://publons.com/author/705639/



Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

R help mailing list-2
In reply to this post by Witold E Wolski
I won't go into why splitting data.frames (or factors) uses time
proportional to the number of input rows times the number of
levels in the splitting factor, but you will get much better mileage
if you call split individually on each 'atomic' (numeric, character, ...)
variable and use mapply on the resulting lists.
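
A minimal sketch of the per-column split plus mapply recombination described
above, assuming a data.frame 'dum' with a grouping column 'g' (the names and
sizes are illustrative):

n <- 1e5
dum <- data.frame(g = sample(letters, n, replace = TRUE),
                  x = runif(n), y = runif(n))
# split each atomic column separately ...
cols <- lapply(dum[c("x", "y")], split, f = dum$g)
# ... then reassemble one small data.frame per group with mapply
res <- do.call(mapply, c(list(FUN = data.frame, SIMPLIFY = FALSE), cols))
str(res[["a"]])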

The plyr and dplyr packages were developed to deal with this
sort of problem.  Check them out.


Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Ivan Calandra-4
In reply to this post by Ivan Calandra-4
Your answer did not make it to the list...

On 29/06/2016 at 16:06, Witold E Wolski wrote:

> If you do not understand, then why do you reply?


Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Witold E Wolski
In reply to this post by R help mailing list-2
Hi William,

I tested plyr's dlply function, and it seems to have O(N*log(R))
complexity (tested for R = N), so I do not know whether N is the
number of rows or the number of categories.

For the data.frame example with 2e5 rows and 2e5 categories it is
approx. 10 times faster than split. Still, it takes 10 seconds on an
i7-5930K 3.5 GHz Intel.
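
A minimal sketch of the plyr::dlply call being timed here (the data.frame and
its grouping column are assumptions based on the earlier example):

library(plyr)
n <- 2e5
dum <- data.frame(id = seq_len(n), x = runif(n, 1, 1000), y = runif(n, 1, 1000))
# dlply splits a data.frame into a list of data.frames by the given variable(s)
system.time(res <- dlply(dum, "id", function(df) df))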

It would be nice if the documentation contained runtime complexity
information, and if the documentation of base package functions pointed
to the functions which should be used instead.

Thanks







--
Witold Eryk Wolski


Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Bert Gunter-2
Inline.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Jul 1, 2016 at 7:40 AM, Witold E Wolski <[hidden email]> wrote:

> It would be nice if the documentation contained runtime complexity
> information, and if the documentation of base package functions pointed
> to the functions which should be used instead.

It would, indeed -- but these things are very dependent on exact data
context, the hardware in use, the OS, etc.  Moreover, how could base
documentation possibly keep track of what all other packages are
doing?! -- that seems unreasonable on the face of it.  I know that
from time to time R docs do mention base algorithmic complexity (e.g.
?sort and the data.table package, I believe), but generally it is
preferable to omit such details, imho: access to a function should be
through its relatively fixed API, while the underlying machinery may
be subject to considerable change.  Obviously, there are circumstances
where something still could (and perhaps should) be said about
efficiency -- and R docs often do say it -- but I think the level of
detail you request is unrealistic and might often even mislead.

Obviously, just my opinion, so contrary views welcome.

Cheers,
Bert




Re: [FORGED] Splitting data.frame into a list of small data.frames given indices

Witold E Wolski
Hi Bert,

My reply is inline:

On 1 July 2016 at 17:02, Bert Gunter <[hidden email]> wrote:

>> It would be nice if the documentation contained runtime complexity
>> information, and if the documentation of base package functions pointed
>> to the functions which should be used instead.
>
> It would, indeed -- but these things are very dependent on exact data
> context, the hardware in use, the OS, etc.  Moreover, how could base
> documentation possibly keep track of what all other packages are
> doing?! -- that seems unreasonable on the face of it.

Old methods need to stay for backward compatibility, but why not add new
packages which are a clear improvement over the old ones to the default
packages, and reference them?

> I know that
> from time to time R docs do mention base algorithmic complexity (e.g.
> ?sort and the data.table package, I believe), but generally it is
> preferable to omit such details, imho: access to a function should be
> through its relatively fixed API, while the underlying machinery may
> be subject to considerable change.

Just take a look at how the C++ STL or Python documents functions for
working with containers. They do provide information about the complexity
of operations and algorithms. Such information is needed to make
reasonable decisions, including the decision not to use a given function
or container type.

If there is a new implementation that changes the complexity, this
information needs to be updated.


> Obviously, there are circumstances
> where something still could (and perhaps should) be said about
> efficiency -- and R docs often do say it -- but I think the level of
> detail you request is unrealistic and might often even mislead.
>
> Obviously, just my opinion, so contrary views welcome.
>
> Cheers,
> Bert



--
Witold Eryk Wolski
