large dataset


large dataset

n.vialma@libero.it
Hi, I have a question.
I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
Thanks a lot!


Re: large dataset

Orvalho Augusto
I do not know what the limit for R is, but for your problem you may try this:
- Install a MySQL server (available from www.mysql.com).
- From inside MySQL, import the CSV into a MySQL table.
- Then, using RMySQL or RODBC, select the fields you need and import them into R (see the sketch below).
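
A minimal sketch of that last step using RMySQL through the DBI interface (untested; the database name, credentials, table, and column names are placeholders for whatever you set up when loading the CSV):

library(DBI)
library(RMySQL)

# connect to the database that holds the imported CSV
con <- dbConnect(MySQL(), dbname = "mydb", user = "me", password = "secret")

# pull only the fields you actually need into R
dat <- dbGetQuery(con, "SELECT field1, field2 FROM mytable")

dbDisconnect(con)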

Good luck
Caveman


On Sat, Mar 27, 2010 at 11:19 AM, [hidden email] <[hidden email]> wrote:

> Hi, I have a question.
> I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!



--
Databases, Data Analysis and
OpenSource Software Consultant
CENFOSS (www.cenfoss.co.mz)
email: [hidden email]
cell: +258828810980


Re: large dataset

Stefan Grosse-2
In reply to this post by n.vialma@libero.it
On 27.03.2010 10:19, [hidden email] wrote:
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!

Maybe your physical memory is too limited. R holds data in RAM, and if your data are too large, Linux and Windows start to use the swap file, which slows down not only R but your whole computer.
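
A quick way to check whether memory is the limiting factor (a sketch; mydata stands for whatever object the import produced):

object.size(mydata)   # memory taken by the imported object
gc()                  # what R is currently using overall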

hth
Stefan


Re: large dataset

Jay Emerson
In reply to this post by n.vialma@libero.it
A little more information would help. First, how many columns are there? I imagine the number must be large, because 100,000 rows isn't overwhelming. Second, does read.csv() fail, or does it work but only after a long time? And third, how much RAM do you have available?

R Core provides some guidance in the R Installation and Administration manual, suggesting that a single object of around 10% of your RAM is reasonable; beyond that, things can become challenging, particularly once you start working with your data.
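
As a rough illustration of that guideline (assuming an all-numeric table, which may not match the poster's file):

# 100,000 rows x 100 double columns, before any copies R makes
100000 * 100 * 8 / 2^20   # roughly 76 MB, well under 10% of a few GB of RAM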

There is a wide range of packages to help with large data sets. For example, RMySQL supports MySQL databases. At the other end of the spectrum, there are possibilities discussed on a nice page by Dirk Eddelbuettel which you might look at:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Jay

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

(original message below)
------------------------------

Message: 128
Date: Sat, 27 Mar 2010 10:19:33 +0100
From: "n\.vialma\@libero\.it" <[hidden email]>
To: "r-help" <[hidden email]>
Subject: [R] large dataset
Message-ID: <KZXOKL$[hidden email]>
Content-Type: text/plain; charset=iso-8859-1

Hi, I have a question.
I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
Thanks a lot!



Re: large dataset

Gabor Grothendieck
In reply to this post by n.vialma@libero.it
Try using read.csv.sql in the sqldf package. See Example 13 on the sqldf home page:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
Also read ?read.csv.sql.
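
A minimal sketch of that route (the file name and column names are placeholders; see the linked example for details). read.csv.sql loads the file into a temporary SQLite database, runs the query, and returns only the result to R:

library(sqldf)
# inside the query the file is referred to by the table name "file";
# select only the columns and rows you need
DF <- read.csv.sql("myfile.csv",
                   sql = "select col1, col2 from file where col1 > 0")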

On Sat, Mar 27, 2010 at 5:19 AM, [hidden email] <[hidden email]> wrote:

> Hi, I have a question.
> I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!


Re: large dataset

Khanh Nguyen-2
In reply to this post by n.vialma@libero.it
This was *very* useful for me when I dealt with a 1.5Gb text file

http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/


On Sat, Mar 27, 2010 at 5:19 AM, [hidden email] <[hidden email]> wrote:

> Hi, I have a question.
> I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!


Re: large dataset

kman-4
> This was *very* useful for me when I dealt with a 1.5Gb text file
> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/

Two hours is a *very* long time to transfer a CSV file to a database. The author
of the linked article has not documented how to use scan() arguments
appropriately for the task. I take particular issue with the author's
statement that "R is said to be slow, memory hungry and only capable of
handling small datasets," which suggests he/she has crummy informants and has
not challenged the notion him/herself.

n.vialma, 100,000 records is likely not a lot of data. If it is taking more
than two or three minutes, something is wrong. Knowing the record limits in
R is a good starting point, but it will only get you part of the way. How many
records does your file contain? Do you know how to find out? What are the
data types of the records? What is the call you are using to import the
records into R? What OS are you using? How much RAM does your system have?
What is the size of the R environment on your system? Do you have resource-intensive
applications running (such as MS Office)?
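
A couple of quick checks along those lines (a sketch; "myfile.csv" is a placeholder for the actual file):

# how many records are in the file, without importing it all
length(count.fields("myfile.csv", sep = ","))

# peek at the first few rows to see what column types you are dealing with
str(read.csv("myfile.csv", nrows = 5))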

A lot of folks on this list have been through what you are now dealing with,
so there is plenty of help. I find myself smiling inside & wanting to say
"welcome!"

Sincerely,
KeithC.

-----Original Message-----
From: Khanh Nguyen [mailto:[hidden email]]
Sent: Saturday, March 27, 2010 8:59 AM
To: [hidden email]
Cc: r-help
Subject: Re: [R] large dataset

This was *very* useful for me when I dealt with a 1.5Gb text file

http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/


On Sat, Mar 27, 2010 at 5:19 AM, [hidden email] <[hidden email]> wrote:
> Hi, I have a question.
> I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!


Re: large dataset

Thomas Lumley
On Sun, 28 Mar 2010, kMan wrote:

>> This was *very* useful for me when I dealt with a 1.5Gb text file
>> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/
>
> Two hours is a *very* long time to transfer a CSV file to a database. The author
> of the linked article has not documented how to use scan() arguments
> appropriately for the task. I take particular issue with the author's
> statement that "R is said to be slow, memory hungry and only capable of
> handling small datasets," which suggests he/she has crummy informants and has
> not challenged the notion him/herself.


Ahem.

I believe that *I* am the author of the particular statement you take issue with (although not of the rest of the page).

However, when I wrote it, it continued:
---------
"R (and S) are accused of being slow, memory-hungry, and able to handle only small data sets.

This is completely true.

Fortunately, computers are fast and have lots of memory. Data sets with  a few tens of thousands of observations can be handled in 256Mb of memory, and quite large data sets with 1Gb of memory.  Workstations with 32Gb or more to handle millions of observations are still expensive (but in a few years Moore's Law should catch up).

Tools for interfacing R with databases allow very large data sets, but this isn't transparent to the user."
------------

I think this is a perfectly reasonable summary and has been (with appropriate changes to the memory numbers) for the nearly ten years I've been saying it.


      -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle


Re: large dataset

Gabor Grothendieck
On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <[hidden email]> wrote:

> Tools for interfacing R with databases allow very large data sets, but this isn't transparent to the user."

I don't think the last sentence is true if you use sqldf. Assuming the standard type of CSV file accepted by sqldf:

install.packages("sqldf")
library(sqldf)
DF <- read.csv.sql("myfile.csv")

is all you need.  The install.packages statement downloads and
installs sqldf, DBI and RSQLite (which in turn installs SQLite
itself), and then read.csv.sql sets up the database and table layouts,
reads the file into the database, reads the data from the database
into R (bypassing R's read routines) and then destroys the database
all transparently.


Re: large dataset

Thomas Lumley
On Mon, 29 Mar 2010, Gabor Grothendieck wrote:

>> Tools for interfacing R with databases allow very large data sets, but this isn't transparent to the user."
>
> I don't think the last sentence is true if you use sqldf. Assuming the standard type of CSV file accepted by sqldf:
>
> install.packages("sqldf")
> library(sqldf)
> DF <- read.csv.sql("myfile.csv")
>
> is all you need.  The install.packages statement downloads and
> installs sqldf, DBI and RSQLite (which in turn installs SQLite
> itself), and then read.csv.sql sets up the database and table layouts,
> reads the file into the database, reads the data from the database
> into R (bypassing R's read routines) and then destroys the database
> all transparently.
It's not the data reading that's the problem. As you say, sqldf handles that nicely.  It's using a data set larger than memory that is not transparent -- you need special packages and can still only do a quite limited set of operations.

      -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle


Re: large dataset

Douglas Bates-2
In reply to this post by n.vialma@libero.it
On Sat, Mar 27, 2010 at 4:19 AM, [hidden email] <[hidden email]> wrote:
> Hi, I have a question.
> I am not able to import a CSV file containing a large dataset (100,000 records). Does anyone know how many records R can handle without problems?
> What I am seeing when I try to import the file is that R generates more than 100,000 records and is very slow.
> Thanks a lot!

Did you read the sections of the "R Data Import/Export" manual (check the Help menu item under Manuals) relating to reading large data sets? There are many things you can do to make read.csv faster on files with a large number of records. You can pre-specify the number of records so that vectors are not continually being resized and, probably most importantly, you can specify the column types.

You can read the whole manual in less time than R is taking to read the file.
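
A minimal sketch of those two suggestions (the file name, row count, and column types are placeholders; adjust them to the actual data):

# nrows (a mild over-estimate is fine) avoids repeated re-allocation,
# and colClasses saves read.csv from guessing each column's type
dat <- read.csv("myfile.csv",
                nrows = 100000,
                colClasses = c("integer", "numeric", "character", "factor"))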


Re: large dataset

kman-4
In reply to this post by Thomas Lumley
Dear Thomas,

While it may be true that "R (and S) are *accused* of being slow,
memory-hungry, and able to handle only small data sets" (emphasis added), the
accusation is false, which makes the *accusers* misinformed. Transparency is
another, perhaps more interesting matter. R users can *experience* R as limited
in the ways described above (a functional limitation) while making a false
technical assertion, without generating a dichotomy. It is a bit like a cell
phone example from human-computer interaction circles in the 90s: the phone
could technically work, provided one is an engineer and so can make sense of
its interface, while for most people it may *functionally* be nothing more than
a paperweight. R is not "technically" limited in the way the accusation reads
(the point I was making), though many users are functionally limited in that
way (the point you seem to have made, or at least passed along).

An R user can get far more data into memory as single objects than with other
stats packages, including MATLAB, JMP, and, obviously, Excel. This is just a
simple comparison of the programs' documented environment sizes and object
limits. The difference in the same read/scan operation between R and JMP on
600 MB of data could easily be 25+ minutes (R perhaps taking 5-7 minutes, with
JMP taking 30+ minutes, on the 1.8 GHz machine with 3 GB of RAM I used back
when I made the comparison that sold me on R). R can do formal operations with
all that data in memory, assuming the environment is given enough space to work
with, while JMP will do the same operation in several smaller chunks, reference
the disk several times, and, on Windows machines, cause the OS to page. In that
case, the differences can be upwards of a day. With the ability to handle
larger chunks at once, and direct control over preventing one's OS from paging,
R users should be able to crank out analyses on very large datasets faster than
with other programs.

I am perfectly willing to accept that consumers of statistical software may
*experience* R as more limiting, in keeping with the accusations, that the
effect may be larger for newcomers, and even larger for newcomers after
controlling for transparency. I'd expect the effect to reverse at around three
years of experience, controlling for transparency or not. Large-scale data may
present technical problems that many users choose simply to avoid using R for,
so the effect may not reverse on those issues. Even when R is more than capable
of outperforming other programs, its usability (or access to suitable
documentation and training material) apparently isn't currently up to the
challenge. This is something the R community should be champing at the bit to
address.

I'd think a consortium of sorts showcasing large-scale data support in R would
be a stellar contribution, and perhaps an issue of the R Journal devoted to the
topic of, say, a near worst-case scenario: 10 GB of data containing different
data types (categorical, numeric, and embedded matrices) in a .csv file, with
the header information somewhere else. How would the authors explain to a
beginner (say, with less than a year of experience with I/O) how to get the
data into a more suitable format, and then how would they analyze it 300 MB at
a time, all using R, in a non-cluster, single-user, 32-bit environment, while
controlling the environment size, handling missing data, and preventing paging?
How would their solution differ when moving to 64 bit? When moving to a
cluster? One of the demos would certainly have to use scan() exclusively for
I/O, perhaps also demonstrating why the 'bad practice' reputation of working
with raw text files is something more than mere prescription.
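
For what it is worth, a minimal sketch of the chunked scan() approach alluded to above (untested; the file name, chunk size, and column types are placeholders):

con <- file("mybig.csv", open = "r")
invisible(readLines(con, n = 1))   # skip the header line
repeat {
  chunk <- scan(con, what = list(id = integer(), x = numeric(), grp = character()),
                sep = ",", nlines = 300000, quiet = TRUE)
  if (length(chunk[[1]]) == 0) break   # end of file
  # ... update running summaries or an incremental fit with this chunk ...
}
close(con)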

Sincerely,
KeithC.


Re: large dataset

Thomas Lumley
KeithC,

If you're arguing that there should be more documentation and examples explaining how to use very large data sets with R, then I agree. Feel free to write some.

I've been giving tutorials on this for years now.  I wrote the first netCDF interface package for R because I needed to use data that wouldn't fit on a 64Mb system. I wrote the biglm package to handle out-of-core regression. My presentation at the last useR meeting was on how to automatically load variables on demand from a SQL connection.

It's still true that you can't treat large data sets and small data sets the same way, and I still think that it's even more important to point out that nearly everyone doesn't have large data and doesn't need to worry about these issues.
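
For instance, a minimal sketch of the out-of-core idea behind biglm (the formula and the chunk1/chunk2 data frames are placeholders; chunks would typically be read from disk one at a time):

library(biglm)
fit <- biglm(y ~ x1 + x2, data = chunk1)   # fit on the first chunk
fit <- update(fit, chunk2)                 # fold in the next chunk
summary(fit)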

    -thomas


Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle
