Reading large files quickly


Reading large files quickly

Rob Steele-6
I'm finding that readLines() and read.fwf() take nearly two hours to
work through a 3.5 GB file, even when reading in large (100 MB) chunks.
 The unix command wc by contrast processes the same file in three
minutes.  Is there a faster way to read files in R?

Thanks!


Re: Reading large files quickly

Gabor Grothendieck
You could try sqldf and see whether it is any faster.  It uses
RSQLite/SQLite to read the data into a database without going through
R, and from there it reads all of the data, or just a specified
portion, into R.  It takes only a couple of lines of code of the form:

library(sqldf)
f <- file("myfile.dat")
DF <- sqldf("select * from f", dbname = tempfile())

with appropriate modification to specify the format of your file and,
possibly, to read only a portion of it.  See example 6 on the sqldf
home page: http://sqldf.googlecode.com
and ?sqldf
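
For instance, a rough sketch of pulling in only the first million rows,
assuming a comma-separated file (the file.format argument is described
in ?sqldf; adjust it to match your data):

library(sqldf)
f <- file("myfile.dat")
DF <- sqldf("select * from f limit 1000000",   # SQLite LIMIT clause
            dbname = tempfile(),
            file.format = list(header = FALSE, sep = ","))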


On Sat, May 9, 2009 at 12:25 PM, Rob Steele
<[hidden email]> wrote:

> I'm finding that readLines() and read.fwf() take nearly two hours to
> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>  The unix command wc by contrast processes the same file in three
> minutes.  Is there a faster way to read files in R?
>
> Thanks!
>


Re: Reading large files quickly

jholtman
In reply to this post by Rob Steele-6
First, 'wc' and readLines() are doing vastly different things: 'wc'
just streams through the file without having to hold it in memory,
while readLines() actually stores all of the data in memory.
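
For comparison, here is a rough wc-style pass in R that never stores
the lines, just counting newline bytes (a sketch, not how 'wc' itself
works):

con <- file("tempxx.txt", open = "rb")
nlines <- 0
repeat {
  bytes <- readBin(con, what = "raw", n = 1e6)  # 1 MB of raw bytes at a time
  if (length(bytes) == 0) break
  nlines <- nlines + sum(bytes == as.raw(10))   # count '\n' bytes
}
close(con)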

I tried it on a 150 MB file, and here is what 'wc' did on my Windows
system:

/cygdrive/c: time wc tempxx.txt
  1055808  13718468 151012320 tempxx.txt
real    0m2.343s
user    0m1.702s
sys     0m0.436s
/cygdrive/c:

If I multiply that by 25 to extrapolate to a 3.5 GB file, it should take
a little less than one minute to process on my relatively slow laptop.

'readLines' on the same file takes:

> system.time(x <- readLines('/tempxx.txt'))
   user  system elapsed
  37.82    0.47   39.23
If I extrapolate that to 3.5 GB, it would take about 16 minutes.  Now
considering that I only have 2 GB of memory on my system, I would not
be able to read the whole file in at once.

You never did specify what type of system you were running on and how much
memory you had.  Were you 'paging' due to lack of memory?

> system.time(x <- readLines('/tempxx.txt'))
   user  system elapsed
  37.82    0.47   39.23
> object.size(x)
84814016 bytes



On Sat, May 9, 2009 at 12:25 PM, Rob Steele <[hidden email]> wrote:

> I'm finding that readLines() and read.fwf() take nearly two hours to
> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>  The unix command wc by contrast processes the same file in three
> minutes.  Is there a faster way to read files in R?
>
> Thanks!
>



--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



Re: Reading large files quickly

Jakson A. Aquino
In reply to this post by Rob Steele-6
Rob Steele wrote:
> I'm finding that readLines() and read.fwf() take nearly two hours to
> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>  The unix command wc by contrast processes the same file in three
> minutes.  Is there a faster way to read files in R?

I use statist to convert the fixed-width data file into a CSV file,
because read.table() is considerably faster than read.fwf().  For example:

system("statist --na-string NA --xcols collist big.txt big.csv")
bigdf <- read.table(file = "big.csv", header = TRUE, as.is = TRUE)

The file collist is a text file whose lines contain the following
information:

variable begin end

where "variable" is the column name, and "begin" and "end" are integer
numbers indicating where in big.txt the columns begin and end.
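
For example, a collist describing two hypothetical fields, an
8-character id followed by a 12-character amount, might contain:

id 1 8
amount 9 20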

Statist can be downloaded from: http://statist.wald.intevation.org/

--
Jakson Aquino
Social Sciences Department
Federal University of Ceará, Brazil


Re: Reading large files quickly

Rob Steele-6
In reply to this post by Rob Steele-6
Thanks guys, good suggestions.  To clarify, I'm running on a fast
multi-core server with 16 GB RAM under 64-bit CentOS 5 and R 2.8.1.
Paging shouldn't be an issue since I'm reading in chunks and not trying
to store the whole file in memory at once; the read loop is sketched
below.  Thanks again.
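
For reference, the chunked read is essentially this shape (file name
and chunk size here are just placeholders):

con <- file("myfile.dat", open = "r")
repeat {
  chunk <- readLines(con, n = 100000)  # read a block of lines at a time
  if (length(chunk) == 0) break
  ## ... process the chunk here ...
}
close(con)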

Rob Steele wrote:
> I'm finding that readLines() and read.fwf() take nearly two hours to
> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>  The unix command wc by contrast processes the same file in three
> minutes.  Is there a faster way to read files in R?
>
> Thanks!
>


Re: Reading large files quickly

jholtman
Since you are reading it in chunks, I assume that you are writing out each
segment as you read it in.  How are you writing it out to save it?  Is the
time you are quoting both the reading and the writing?  If so, can you break
down the differences in what these operations are taking?

How do you plan to use the data?  Is it all numeric?  Are you keeping it
in a data frame?  Have you considered using scan() to read in the data
and to specify what the columns are?  If you would like more help, the
answers to these questions will make it easier to give; a rough scan()
sketch follows below.
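
For instance, assuming (hypothetically) three numeric columns in a
whitespace-delimited file, passing a 'what' list lets scan() skip the
type-guessing that read.table() and friends do:

cols <- scan("myfile.dat",
             what = list(x = numeric(0), y = numeric(0), z = numeric(0)),
             nmax = 100000)       # optionally cap the number of records
df <- as.data.frame(cols)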

On Sat, May 9, 2009 at 10:09 PM, Rob Steele <[hidden email]> wrote:

> Thanks guys, good suggestions.  To clarify, I'm running on a fast
> multi-core server with 16 GB RAM under 64 bit CentOS 5 and R 2.8.1.
> Paging shouldn't be an issue since I'm reading in chunks and not trying
> to store the whole file in memory at once.  Thanks again.
>
> Rob Steele wrote:
> > I'm finding that readLines() and read.fwf() take nearly two hours to
> > work through a 3.5 GB file, even when reading in large (100 MB) chunks.
> >  The unix command wc by contrast processes the same file in three
> > minutes.  Is there a faster way to read files in R?
> >
> > Thanks!
>  >
>



--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



Re: Reading large files quickly

Rob Steele-6
At the moment I'm just reading the large file to see how fast it goes.
Eventually, if I can get the read time down, I'll write out a processed
version.  Thanks for suggesting scan(); I'll try it.

Rob

jim holtman wrote:

> Since you are reading it in chunks, I assume that you are writing out each
> segment as you read it in.  How are you writing it out to save it?  Is the
> time you are quoting both the reading and the writing?  If so, can you break
> down the differences in what these operations are taking?
>
> How do you plan to use the data?  Is it all numeric?  Are you keeping it in
> a dataframe?  Have you considered using 'scan' to read in the data and to
> specify what the columns are?  If you would like some more help, the answer
> to these questions will help.
>
> On Sat, May 9, 2009 at 10:09 PM, Rob Steele <[hidden email]> wrote:
>
>> Thanks guys, good suggestions.  To clarify, I'm running on a fast
>> multi-core server with 16 GB RAM under 64 bit CentOS 5 and R 2.8.1.
>> Paging shouldn't be an issue since I'm reading in chunks and not trying
>> to store the whole file in memory at once.  Thanks again.
>>
>> Rob Steele wrote:
>>> I'm finding that readLines() and read.fwf() take nearly two hours to
>>> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>>>  The unix command wc by contrast processes the same file in three
>>> minutes.  Is there a faster way to read files in R?
>>>
>>> Thanks!
>>  >
>>


Re: Reading large files quickly; resolved

Rob Steele-6
In reply to this post by Rob Steele-6
Rob Steele wrote:
> I'm finding that readLines() and read.fwf() take nearly two hours to
> work through a 3.5 GB file, even when reading in large (100 MB) chunks.
>  The unix command wc by contrast processes the same file in three
> minutes.  Is there a faster way to read files in R?
>
> Thanks!
>

readChar() is fast.  I use strsplit(..., fixed = TRUE) to separate the
input data into lines and then use substr() to separate the lines into
fields.  I do a little light processing and write the result back out
with writeChar().  The whole thing takes thirty minutes where read.fwf()
took nearly two hours just to read the data.
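
Sketched roughly, it looks like this (the field positions are just
placeholders, and I've used writeLines() here for brevity where I
actually use writeChar(); the fiddly part is carrying a partial last
line across chunk boundaries):

con <- file("big.dat", open = "rb")
out <- file("big.out", open = "w")
leftover <- ""
repeat {
  txt <- readChar(con, nchars = 1e8)              # ~100 MB per chunk
  if (length(txt) == 0) break
  txt <- paste(leftover, txt, sep = "")
  complete <- substring(txt, nchar(txt)) == "\n"  # ended on a line break?
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  if (!complete) {
    leftover <- lines[length(lines)]              # partial line: carry forward
    lines <- lines[-length(lines)]
  } else {
    leftover <- ""
  }
  f1 <- substr(lines, 1, 10)                      # placeholder field positions
  f2 <- substr(lines, 11, 20)
  writeLines(paste(f1, f2, sep = ","), out)       # the "light processing" step
}
if (nzchar(leftover))                             # file lacked a final newline
  writeLines(paste(substr(leftover, 1, 10),
                   substr(leftover, 11, 20), sep = ","), out)
close(con)
close(out)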

Thanks for the help!
