alternative to rbind within a loop

Denis Chabot
Hi,

I often have to do this:

select a folder (directory) containing a few hundred data files in csv  
format (up to 1000 files, in fact)

open each file, transform some character variables into date-time format

make it into a dataframe (this involves getting rid of a few variables I don't need)

concatenate to the master dataframe that will eventually contain the  
data from all the files in the folder.

I use a loop going from 1 to the number of files. I have added a  
command to print an incrementing number to the R console each time the  
loop completes one iteration, to judge the speed of the process.

At the beginning, 3-4 files are processed each second. After a few  
hundred iterations it slows down to about 1 file per second. Before I  
reach the last file (898 in the case at hand), it has become much  
slower, about 1 file every 2-3 seconds.

This progressive slowing down suggests the problem is linked to the
size of the growing "master" dataframe: each call to rbind copies the
entire accumulated dataframe before appending the new file, so the
total work grows roughly quadratically with the number of files.

In fact, the small script below confirms this, as nothing at all
happens within the loop but read.csv and rbind. You can cut the size
of this example down so as not to waste too much of your time:


# create a dummy data.frame and copy it in a large number of csv files

test <- file.path("test")
dir.create(test, showWarnings = FALSE) # the directory must exist before writing to it

a <- 1:350
b <- rnorm(350, 100, 10)
c <- runif(350, 0, 100)
# sample() rather than runif(): non-integer indices are truncated, so
# runif(350, 1, 12) would almost never pick month 12
d <- month.name[sample(1:12, 350, replace = TRUE)]

the.data <- data.frame(a, b, c, d)

for (i in 1:850) {
        write.csv(the.data, file = paste(test, "/file_", i, ".csv", sep = ""))
}

# now let's make a single dataframe from all these csv files

all.files <- list.files(path = test, full.names = TRUE, pattern = "\\.csv$")

new.data <- NULL

system.time({
        for (i in all.files) {
                in.data <- read.csv(i)
                if (is.null(new.data)) {
                        new.data <- in.data
                } else {
                        new.data <- rbind(new.data, in.data)
                }
                cat(paste(i, ", ", sep = ""))
        } # end for
}) # end system.time

       user  system elapsed
    156.206  44.859 202.150
This is with

sessionInfo()
R version 2.9.1 Patched (2009-07-16 r48939)
x86_64-apple-darwin9.7.0

locale:
fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] doBy_3.7        chron_2.3-30    timeDate_290.84

loaded via a namespace (and not attached):
[1] cluster_1.12.0  grid_2.9.1      Hmisc_3.5-2     lattice_0.17-25 tools_2.9.1


Would it be better to read each of the 850 files into its own
dataframe first, and then rbind them all in a single operation?

Can I combine all my files without using a loop? I've never quite
mastered the "apply" family of functions and have not seen examples
that use them to read files.

Thanks in advance,

Denis Chabot

Re: alternative to rbind within a loop

Greg Snow
Try something like (untested):

mylist <- lapply(all.files, read.csv) # read each file into a list of data frames
mydf <- do.call("rbind", mylist)      # bind them all in one call

If all the csv files are conformable so that rbind works on them (if the loop method works, that should be the case), then this will read in each file, store the data frames in a list, then rbind them all together in a single call. Building the result once avoids re-copying the accumulated dataframe on every iteration.

It seems that this should be faster than the loop, but testing will be needed to be sure.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111


> [quoted original message snipped]

Re: alternative to rbind within a loop

Denis Chabot
Hi Greg,

Thanks, very encouraging: with my example, this is 10x more efficient
than my loop:

       user  system elapsed
     13.819   5.510  20.204
>>        user  system elapsed
>>     156.206  44.859 202.150

In real life, I do some work on each file before the rbind. I'll see
if this work can be put into a custom-built function that would go
into the lapply call you suggested.
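
A rough sketch of what that might look like (the column names, date
format, and dropped variables here are hypothetical, just to
illustrate the pattern):

# hypothetical per-file preprocessing: parse date-times, drop unneeded columns
read.one <- function(f) {
        d <- read.csv(f, stringsAsFactors = FALSE)
        d$when <- as.POSIXct(strptime(d$when, "%Y-%m-%d %H:%M:%S")) # character -> date-time
        d$unused1 <- d$unused2 <- NULL # drop the variables I don't need
        d
}

mylist <- lapply(all.files, read.one)
mydf <- do.call("rbind", mylist)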

Denis

On 2009-07-23 at 17:27, Greg Snow wrote:

> Try something like (untested):
>
> mylist <- lapply(all.files, read.csv)
> mydf <- do.call("rbind", mylist)
>
> [rest of quoted message snipped]

Re: alternative to rbind within a loop

Don MacQueen
(in reply to the original post by Denis Chabot)
Another approach that might be worth trying is to create an empty
data frame with lots and lots of rows before looping, and then
replace rows rather than append. Of course, this requires knowing at
least approximately how many rows you will have in total. This
suggestion comes from the help page for read.table(), which says:

     Using 'nrows', even as a mild over-estimate, will help memory usage.
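
A minimal sketch of that preallocation idea, assuming every file has
the same four columns as the example above and using 350 rows per
file as a (generous) estimate:

est.rows <- 350 * length(all.files) # over-estimate of the total row count

# allocate the full-size result once, with columns of the right types
new.data <- data.frame(a = integer(est.rows), b = numeric(est.rows),
                       c = numeric(est.rows), d = character(est.rows),
                       stringsAsFactors = FALSE)

next.row <- 1
for (f in all.files) {
        in.data <- read.csv(f, stringsAsFactors = FALSE)
        n <- nrow(in.data)
        new.data[next.row:(next.row + n - 1), ] <- in.data[, c("a", "b", "c", "d")]
        next.row <- next.row + n
}
new.data <- new.data[1:(next.row - 1), ] # drop the unused tail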

You may be doing a lot of unnecessary processing if you are allowing
your character variables to be automatically converted to factors.
This would especially be the case if each data frame has new
character values not in the previous ones, since more levels would be
added to the factor variables each time a data frame is appended.

Another approach would be to concatenate the files outside of R (in
unix, this would be the "cat" command) and then read the single large
file into R. This can be controlled from within R, i.e., using the
system() command. It can even be done without actually writing the
extra file, with something like

   read.csv( pipe( 'cat *.csv') )

Note, though, that plain cat also concatenates each file's header row
into the data, so the headers need to be dealt with.
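
One way around the repeated headers, assuming awk is available and
using the test/ directory from the example, is to keep only the very
first header line (FNR is the per-file line number, NR the global one):

   new.data <- read.csv( pipe( "awk 'FNR > 1 || NR == 1' test/*.csv" ) )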

Despite those ideas, I like Greg Snow's approach; I'd try it before
any of these.

Finally, if you really want to find out where the cpu time is being
spent, look into profiling; see ?Rprof.
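
A minimal profiling sketch (the output file name is arbitrary;
summaryRprof() aggregates the samples by function):

Rprof("import.prof")                # start writing profiling samples to a file
# ... run the import loop or the lapply/do.call version here ...
Rprof(NULL)                         # stop profiling
summaryRprof("import.prof")$by.self # time spent, by function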

-Don

At 3:53 PM -0400 7/23/09, Denis Chabot wrote:

> [original message snipped]


--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
