Suggestion for big files [was: Re: A comment about R:]


Suggestion for big files [was: Re: A comment about R:]

François Pinard
[ronggui]

>R is weak when handling large data files.  I have a data file: 807 vars,
>118519 obs., in CSV format.  Stata can read it in in 2 minutes, but on
>my PC R can hardly handle it.  My PC has a 1.7 GHz CPU and 512 MB RAM.

Just (another) thought.  I used to use SPSS, many, many years ago, on
CDC machines, where the CPU had limited memory and no kind of paging
architecture.  Files did not need to be very large to be too large.

SPSS had a feature that was useful back then: the capability of
sampling a big dataset directly at file-read time, before processing
starts.  Maybe something similar could help in R (that is, instead of
reading the whole dataset into memory and _then_ sampling it).

One can read records from a file, up to a preset number of them.  If the
file happens to contain more records than that preset number (the number
of records in the whole file is not known beforehand), already-read
records may be dropped at random and replaced by other records coming
from the file being read.  If the random selection algorithm is properly
chosen, it can be arranged so that all records in the original file have
an equal probability of being kept in the final subset.

If such a sampling facility were built right into the usual R reading
routines (triggered by an extra argument, say), it could offer
a compromise for processing large files, and also sometimes accelerate
computations for big problems, even when memory is not at stake.

--
François Pinard   http://pinard.progiciels-bpi.ca


Re: Suggestion for big files [was: Re: A comment about R:]

Kort, Eric
> If such a sampling facility were built right into the usual R reading
> routines (triggered by an extra argument, say), it could offer
> a compromise for processing large files, and also sometimes accelerate
> computations for big problems, even when memory is not at stake.
>

Since I often work with images and other large data sets, I have been thinking about a "BLOb" (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R workspace as is actually needed.

So I see 3 possibilities:

1. The sort of functionality you describe is implemented in the R internals (by people other than me).
2. Some individuals (perhaps myself included) write such a package.
3. This thread fizzles out and we do nothing.

I guess I will see what discussion, if any, ensues from this point, and which of these three options seems worth pursuing.


Re: Suggestion for big files [was: Re: A comment about R:]

Prof Brian Ripley
Another possibility is to make use of the several DBMS interfaces already
available for R.  It is very easy to pull in a sample from one of those,
and surely keeping such large data files as ASCII is not good practice.

One problem with Francois Pinard's suggestion (the credit has got lost) is
that R's I/O is not line-oriented but stream-oriented.  So selecting lines
is not particularly easy in R.  That's a deliberate design decision, given
the DBMS interfaces.

I rather thought that using a DBMS was standard practice in the R
community for those using large datasets: it gets discussed rather often.

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Re: Suggestion for big files [was: Re: A comment about R:]

jholtman
In reply to this post by François Pinard
If what you are reading in is numeric data, then it would require (807 *
118519 * 8 bytes =) about 760MB just to store a single copy of the object
-- more memory than you have on your computer.  If you were managing to
read it in, then the problem was the paging that was occurring.

You have to look at storing this in a database and working on a subset of
the data.  Do you really need to have all 807 variables in memory at the
same time?

If you use 'scan', you could specify that you do not want some of the
variables read in, so it might make a more reasonably sized object.
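
For example, with read.table the same effect can be had through
'colClasses' (a rough sketch only; the file name and the columns kept
are made up):

cc <- rep("NULL", 807)            # "NULL" in colClasses skips that column
cc[c(1, 5, 10)] <- NA             # NA lets read.table guess the type
dat <- read.table("big.csv", header = TRUE, sep = ",", colClasses = cc)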


--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What is the problem you are trying to solve?


Re: Suggestion for big files [was: Re: A comment about R:]

Wincent
2006/1/6, jim holtman <[hidden email]>:
> If what you are reading in is numeric data, then it would require (807 *
> 118519 * 8) 760MB just to store a single copy of the object -- more memory
> than you have on your computer.  If you were reading it in, then the problem
> is the paging that was occurring.
In fact, if I read it in 3 pieces, each piece is about 170M.

>
> You have to look at storing this in a database and working on a subset of
> the data.  Do you really need to have all 807 variables in memory at the
> same time?

Yes, I don't need all the variables, but I don't know how to get just
the necessary variables into R.

In the end I read the data in pieces and used the RSQLite package to
write them to a database, and then do the analysis from there.  If
I were familiar with database software, using a database (and R) would
be the best choice, but converting the file into a database format is
not an easy job for me.  I asked for help on the SQLite list, but the
solution was not satisfying, as it required knowledge of a third
scripting language.  After searching the internet, I got this solution:

#begin
rm(list = ls())
library(RSQLite)

## use forward slashes (or doubled backslashes) in Windows paths
f   <- file("D:/wvsevs_sb_v4.csv", "r")
con <- dbConnect(SQLite(), dbname = "c:/sqlite/database.db3")
tim1 <- Sys.time()                  # start time, to see how long it takes

i <- 0
done <- FALSE
nms <- NULL                         # column names, taken from the first chunk
while (!done) {
  i <- i + 1
  tt <- readLines(f, 2500)          # read the csv 2500 lines at a time
  if (length(tt) == 0) break        # file length an exact multiple of 2500
  if (length(tt) < 2500) done <- TRUE
  tt <- textConnection(tt)
  if (i == 1) {
    dat <- read.table(tt, header = TRUE, sep = ",", quote = "")
    nms <- names(dat)
  } else {
    dat <- read.table(tt, header = FALSE, sep = ",", quote = "", col.names = nms)
  }
  close(tt)
  ## create the table on the first chunk, append to it afterwards
  if (dbExistsTable(con, "wvs")) {
    dbWriteTable(con, "wvs", dat, append = TRUE)
  } else {
    dbWriteTable(con, "wvs", dat)
  }
}
close(f)
#end
It's not the best solution, but it works.
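
Once the table is in the database, getting only the needed variables
back into R is then a single query (the column names below are only
placeholders for whatever the table was created with):

# read a few columns (and, if wanted, only some rows) from the database
dat <- dbGetQuery(con, "SELECT V1, V7, V23 FROM wvs LIMIT 50000")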



--
黄荣贵
Department of Sociology
Fudan University


Re: Suggestion for big files [was: Re: A comment about R:]

bogdan romocea
In reply to this post by François Pinard
ronggui wrote:
> If I were familiar with database software, using a database (and R)
> would be the best choice, but converting the file into a database
> format is not an easy job for me.

Good working knowledge of a DBMS is almost invaluable when it comes to
working with very large data sets.  In addition, learning SQL is a piece
of cake compared to learning R.  On top of that, knowledge of yet
another scripting language (beyond SQL) is not needed (except perhaps
for special tasks): you can easily use R to generate the SQL syntax to
import and work with arbitrarily wide tables.  (I'm not familiar with
SQLite, but MySQL comes with a command-line tool that can run syntax
files.)  Better to start learning SQL today.
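
For instance, a rough sketch of generating such a syntax file from R
(the file, table and column names, and the DOUBLE type, are all made up
for illustration):

vars <- paste("v", 1:807, sep = "")       # made-up column names
sql  <- c(paste("CREATE TABLE wvs (", paste(vars, "DOUBLE", collapse = ", "), ");"),
          "LOAD DATA LOCAL INFILE 'c:/textfile.csv' INTO TABLE wvs",
          "  FIELDS TERMINATED BY ',' IGNORE 1 LINES;")
writeLines(sql, "import_wvs.sql")   # then: mysql -u user -p dbname < import_wvs.sql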



Re: Suggestion for big files [was: Re: A comment about R:]

Neuro LeSuperHéros
Ronggui,

I'm not familiar with SQLite, but using MySQL would solve your problem.

MySQL has a "LOAD DATA INFILE" statement that loads text/csv files rapidly.

In R, assuming a test table exists in MySQL (a blank table is fine), something
like this would load the data directly into MySQL.

library(DBI)
library(RMySQL)
mycon <- dbConnect(MySQL(), dbname = "mydb", user = "user",
                   password = "password")     # placeholder connection details
dbSendQuery(mycon, "LOAD DATA INFILE 'C:/textfile.csv'
  INTO TABLE test3 FIELDS TERMINATED BY ','")  # for csv files

Then a normal SQL query would allow you to work with a manageable size of
data.
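
For example (the column names are placeholders for whatever the table
actually contains), once the data are loaded you can bring just what you
need into R:

# pull a subset of columns and rows back into R
dat <- dbGetQuery(mycon, "SELECT v1, v2, v3 FROM test3 LIMIT 20000")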




Re: Suggestion for big files [was: Re: A comment about R:]

François Pinard
In reply to this post by Prof Brian Ripley
[Brian Ripley]

>I rather thought that using a DBMS was standard practice in the
>R community for those using large datasets: it gets discussed rather
>often.

Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

>Another possibility is to make use of the several DBMS interfaces already
>available for R.  It is very easy to pull in a sample from one of those,
>and surely keeping such large data files as ASCII is not good practice.

Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
easily selecting a _random_ sample of the records of a given table.  On
the other hand, I'm no SQL specialist, others might know better.

We do not have a need yet for samples where I work, but if we ever need
such, they will have to be random, or else, I will always fear biases.

>One problem with Francois Pinard's suggestion (the credit has got lost)
>is that R's I/O is not line-oriented but stream-oriented.  So selecting
>lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random
selection of lines.  Once again, this chat comes out of reading someone
else's problem; this is not a problem I actually have.  SPSS was not
randomly accessing lines, as data files could well be held on magnetic
tapes, where random access is hardly possible in practice.  SPSS reads
(or was reading) lines sequentially from beginning to end, and the
_random_ sample is built as the reading goes.

Suppose the file (or tape) holds N records (N is not known in advance),
from which we want a sample of at most M records.  If N <= M, then we
use the whole file; no sampling is possible nor necessary.  Otherwise,
we first initialise the reservoir with the first M records of the file.
Then, for each record in the file after the M'th, the algorithm has to
decide whether the record just read will be discarded or will replace
one of the M records already saved, and in the latter case, which of
those records will be replaced.  If the algorithm is carefully designed,
when the last (N'th) record of the file has been processed this way, we
have M records randomly selected from the N records, in such a way that
each of the N records had an equal probability of ending up in the
selection of M records.  I can seek out the details if needed.

This is my suggestion, or in fact, more a thought than a suggestion.  It
might represent something useful either for flat ASCII files or even for
a stream of records coming out of a database, if the database does not
offer a ready random-sampling device.
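
A rough sketch of the scheme in R, just to make the idea concrete (the
function name and the chunked reading are made up for illustration; the
keep/replace rule is the usual reservoir-sampling one):

sample_lines <- function(filename, M, chunk = 1000) {
    con <- file(filename, "r")
    on.exit(close(con))
    reservoir <- character(0)
    n <- 0                                # records seen so far
    repeat {
        lines <- readLines(con, n = chunk)
        if (length(lines) == 0) break
        for (line in lines) {
            n <- n + 1
            if (n <= M) {
                reservoir[n] <- line              # fill the reservoir first
            } else if (runif(1) < M / n) {
                reservoir[sample(M, 1)] <- line   # replace a random kept record
            }
        }
    }
    reservoir      # each of the n input records is kept with probability M/n
}

The header line would have to be read and set aside first;
read.table(textConnection(...), sep = ",") can then turn the kept lines
into a data frame.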


P.S. - In the (rather unlikely, I admit) case the gang I'm part of would
have the need described above, and if I then dared to implement it
myself, would it be welcome?

--
François Pinard   http://pinard.progiciels-bpi.ca


Re: Suggestion for big files [was: Re: A comment about R:]

hadley wickham
> Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
> easily selecting a _random_ sample of the records of a given table.  On
> the other hand, I'm no SQL specialist, others might know better.

There are a number of such devices, which tend to be rather
SQL-variant specific.  Try googling for "select random rows mysql",
"select random rows pgsql", etc.

Another possibility is to generate a large table of randomly
distributed ids and then use that (with randomly generated limits) to
select the appropriate number of records.
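
For example, in MySQL one such device is ORDER BY RAND() with a LIMIT
(it sorts the whole table, so it is only reasonable for moderate sizes).
A sketch, with connection details and table name made up:

library(DBI)
library(RMySQL)
mycon <- dbConnect(MySQL(), dbname = "mydb")   # placeholder connection details
samp  <- dbGetQuery(mycon, "SELECT * FROM wvs ORDER BY RAND() LIMIT 10000")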

Hadley


Re: Suggestion for big files [was: Re: A comment about R:]

Prof Brian Ripley
In reply to this post by François Pinard
[Just one point extracted: Hadley Wickham has answered the random sample
one]

On Thu, 5 Jan 2006, François Pinard wrote:

> [Brian Ripley]
>> One problem with Francois Pinard's suggestion (the credit has got lost)
>> is that R's I/O is not line-oriented but stream-oriented.  So selecting
>> lines is not particularly easy in R.
>
> I understand that you mean random access to lines, instead of random
> selection of lines.  [...]

That was not my point.  R's standard I/O is through connections, which
allow for pushbacks, changing line endings and re-encoding character sets.
That does add overhead compared to C/Fortran line-buffered reading of a
file.  Skipping lines you do not need will take longer than you might
guess (based on some limited experience).

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Re: Suggestion for big files [was: Re: A comment about R:]

Martin Maechler
In reply to this post by François Pinard
>>>>> "FrPi" == François Pinard <[hidden email]>
>>>>>     on Thu, 5 Jan 2006 22:41:21 -0500 writes:

    FrPi> [...]


    FrPi> P.S. - In the (rather unlikely, I admit) case the gang
    FrPi> I'm part of would have the need described above, and
    FrPi> if I then dared implementing it myself, would it be welcome?

I think this would be a very interesting tool, and I'm also intrigued
by the details of the algorithm you outline above.

If it could be made to work on all kinds of read.table()-readable
files (of course including *.csv), that might be a valuable
tool for all those -- and there are many -- for whom working
with DBMSs is too daunting initially.

Martin Maechler, ETH Zurich


Re: Suggestion for big files [was: Re: A comment about R:]

Prof Brian Ripley
On Fri, 6 Jan 2006, Martin Maechler wrote:

>>>>>> "FrPi" == François Pinard <[hidden email]>
>>>>>>     on Thu, 5 Jan 2006 22:41:21 -0500 writes:
>    FrPi> [...]
>
> I think this would be a very interesting tool and
> I'm also intrigued about the details of the algorithm you
> outline above.

It's called `reservoir sampling' and is described in my simulation book,
in Knuth, and elsewhere.

> If it could be made to work on all kinds of read.table()-readable
> files (of course including *.csv), that might be a valuable
> tool for all those -- and there are many -- for whom working
> with DBMSs is too daunting initially.

It would be better (for the reasons I gave) to do this in a separate file
preprocessor: read.table reads from a connection, not a file, of course.
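
One way to plug in such a preprocessor is through a pipe() connection.
A sketch, assuming a Unix-style toolchain with GNU shuf on the PATH and
a made-up file name (the header line is read separately so it is not
shuffled away):

hdr  <- names(read.csv("big.csv", nrows = 1))    # column names only
samp <- read.csv(pipe("tail -n +2 big.csv | shuf -n 10000"),
                 header = FALSE, col.names = hdr)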

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Re: Suggestion for big files [was: Re: A comment about R:]

Wensui Liu
In reply to this post by Wincent
RG,

Actually, SQLite provides a way to read a *.csv file directly into the db.

Just for your consideration.


--
WenSui Liu
(http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children Hospital Medical Center


Re: Suggestion for big files [was: Re: A comment about R:]

Wensui Liu
RG,

I think the .import command in sqlite should work.  Plus, SQLite Browser
(http://sqlitebrowser.sourceforge.net) might do the job as well.

On 1/6/06, ronggui <[hidden email]> wrote:

>
> Can you give me some hints, or let me know how to do it?
>
> Thank you !

--
WenSui Liu
(http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children Hospital Medical Center


Re: Suggestion for big files [was: Re: A comment about R:]

François Pinard
In reply to this post by Prof Brian Ripley
[Brian Ripley]
>[François Pinard]
>>[Brian Ripley]

>>>One problem [...] is that R's I/O is not line-oriented but
>>>stream-oriented.  So selecting lines is not particularly easy in R.

>>I understand that you mean random access to lines, instead of random
>>selection of lines.

>That was not my point. [...] Skipping lines you do not need will take
>longer than you might guess (based on some limited experience).

Thanks for explaining (and also for the term "reservoir sampling").
OK, then.  All in all, if I ever need this for bigger datasets, the
selection might better be done outside of R.

--
François Pinard   http://pinard.progiciels-bpi.ca


Re: Suggestion for big files [was: Re: A comment about R:]

François Pinard
In reply to this post by hadley wickham
[hadley wickham]

>[François Pinard]

>> Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
>> easily selecting a _random_ sample of the records of a given table.
>> On the other hand, I'm no SQL specialist, others might know better.

>There are a number of such devices, which tend to be rather SQL variant
>specific.  Try googling for select random rows mysql, select random
>rows pgsql, etc.

Thanks as well for these hints.  Googling around as you suggested (yet
keeping my eyes in the MySQL direction, because this is what we use),
getting MySQL itself to do the selection looks a bit discouraging:
according to the comments I've read, MySQL does not seem to scale well
with the database size, especially when records have to be decorated
with random numbers and later sorted.

Yet, I did not run any benchmark myself, and would not blindly take
everything I read for granted, given that MySQL developers have speed in
mind, and there are ways to interrupt a sort before running it to full
completion when only a few sorted records are wanted.

>Another possibility is to generate a large table of randomly
>distributed ids and then use that (with randomly generated limits) to
>select the appropriate number of records.

I'm not sure I understand your idea (what confuses me is the "randomly
generated limits" part).  If the "large table" is much larger than the
size of the wanted sample, we might not be gaining much.

Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).

All in all, if I ever have such a problem, a practical solution probably
has to be outside of R, and maybe outside SQL as well.

--
François Pinard   http://pinard.progiciels-bpi.ca


Re: Suggestion for big files [was: Re: A comment about R:]

François Pinard
In reply to this post by Martin Maechler
[Martin Maechler]

>    FrPi> Suppose the file (or tape) holds N records (N is not known
>    FrPi> in advance), from which we want a sample of M records at
>    FrPi> most. [...] If the algorithm is carefully designed, when
>    FrPi> the last (N'th) record of the file will have been processed
>    FrPi> this way, we may then have M records randomly selected from
>    FrPi> N records, in such a a way that each of the N records had an
>    FrPi> equal probability to end up in the selection of M records.  I
>    FrPi> may seek out for details if needed.

>[...] I'm also intrigued about the details of the algorithm you
>outline above.

I went into my old SPSS books and related references to find it for you,
to no avail (yet I confess I did not try very hard).  I vaguely remember
it was related to Spearman's correlation computation: I did find notes
about the "severe memory limitation" of this computation, but nothing
about the implemented workaround.  I did find other sampling devices,
but not the very one I remember having read about, many years ago.

On the other hand, Googling tells that this topic has been much studied,
and that Vitter's algorithm Z seems to be popular nowadays (even if not
the simplest) because it is more efficient than others.  Google found
a copy of the paper:

   http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf

Here is an implementation for Postgres:

   http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php

yet I do not find it very readable -- but this is only an opinion: I'm
rather demanding in the area of legibility, while many or most people
are more courageous than me! :-).
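
For the record, the plain reservoir scheme (the simple variant, not the
faster algorithm Z from Vitter's paper above) fits in a few lines of R.
This is only a sketch, with a made-up function name, reading one line at
a time; given the textual I/O overhead mentioned earlier in the thread,
it would be slow on really big files, but it shows the idea:

    ## Keep the first m lines, then let line i replace a random reservoir
    ## slot with probability m/i: every line of the file ends up in the
    ## sample with equal probability.
    reservoir_lines <- function(path, m) {
      con <- file(path, open = "r")
      on.exit(close(con))
      reservoir <- character(m)
      i <- 0
      repeat {
        line <- readLines(con, n = 1)
        if (length(line) == 0) break          # end of file
        i <- i + 1
        if (i <= m) {
          reservoir[i] <- line                # fill the reservoir first
        } else if (runif(1) < m / i) {
          reservoir[sample(m, 1)] <- line     # replace a random slot
        }
      }
      reservoir[seq_len(min(i, m))]           # in case the file had fewer than m lines
    }

    ## e.g.  smp <- reservoir_lines("big.csv", 1000)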

--
François Pinard   http://pinard.progiciels-bpi.ca

Re: Suggestion for big files [was: Re: A comment about R:]

hadley wickham
In reply to this post by François Pinard
> Thanks as well for these hints.  Googling around as you suggested (yet
> keeping my eyes in the MySQL direction, because this is what we use),
> getting MySQL itself to do the selection looks a bit discouraging:
> according to the comments I've read, MySQL does not seem to scale well
> with the database size, especially when records have to be decorated
> with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you
need to think carefully about what operations are most common in your
database.  For example, the problem is much easier if you can assume
that the rows are numbered sequentially from 1 to n.  This could be
enforced using a trigger whenever a record is added/deleted.  This
would slow insertions/deletions but speed up selects.
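
As a sketch of what that buys you (assuming an indexed integer column
"id" running from 1 to n without gaps; the table, column and connection
names are only placeholders), the ids can then be drawn client-side and
only the chosen rows fetched:

    library(DBI)
    library(RMySQL)

    con <- dbConnect(MySQL(), dbname = "mydb")

    n   <- dbGetQuery(con, "SELECT COUNT(*) FROM bigtable")[[1]]
    ids <- sample(n, 1000)                    # 1000 distinct row ids

    smp <- dbGetQuery(con,
                      sprintf("SELECT * FROM bigtable WHERE id IN (%s)",
                              paste(ids, collapse = ", ")))

    dbDisconnect(con)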

> Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).

This is another example where greater knowledge of the problem can
yield speed increases.  Here (where the number of selections is much
smaller than the total number of objects) you are better off generating
10 numbers with runif(10, 0, 100000000) and then checking that they are
unique.
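
A sketch of that idea as a small helper (the name sample_small is made
up): draw integer candidates with runif, keep the unique ones, and top
up on the rare collision, so nothing proportional to N is ever built:

    sample_small <- function(N, M) {
      picked <- integer(0)
      while (length(picked) < M) {
        ## ceiling() maps a uniform draw on (0, N) to an integer in 1..N
        candidates <- ceiling(runif(M - length(picked), 0, N))
        picked <- unique(c(picked, candidates))
      }
      picked
    }

    sample_small(100000000, 10)   # ten distinct indices out of 1e8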

> >Another possibility is to generate a large table of randomly
> >distributed ids and then use that (with randomly generated limits) to
> >select the appropriate number of records.
>
> I'm not sure I understand your idea (what confuses me is the "randomly
> generated limits" part).  If the "large table" is much larger than the
> size of the wanted sample, we might not be gaining much.

Think about using a table of random numbers.  They are pregenerated
for you; you just choose a starting and ending index.  It will be slow
to generate the table the first time, but then it will be fast.  It
will also take up quite a bit of space, but space is cheap (and time
is not!).

Hadley

Re: Suggestion for big files [was: Re: A comment about R:]

François Pinard
[hadley wickham]

>> [...] according to the comments I've read, MySQL does not seem to
>> scale well with the database size, especially when records have to be
>> decorated with random numbers and later sorted.

>With SQL there is always a way to do what you want quickly, but you
>need to think carefully about what operations are most common in your
>database.  For example, the problem is much easier if you can assume
>that the rows are numbered sequentially from 1 to n.  This could be
>enforced using a trigger whenever a record is added/deleted.  This would
>slow insertions/deletions but speed up selects.

Sure, to take a caricature example: if database records are already
decorated with random numbers, and an index is built over that
decoration, random sampling may indeed be done more quickly :-).  The
fact is that (at least our) databases are not especially designed for
random sampling, and the people in charge would resist redesigning them
merely to accommodate a few occasional needs for random sampling.

What would be ideal is being able to build random samples out of any
big database or file, with equal ease.  The fact is that it's doable.
(Brian Ripley points out that R's textual I/O has too much overhead to
be usable here, so one should rather say, sadly: "It's doable outside
R".)

>> Just for fun: here, "sample(100000000, 10)" in R is slowish already
>> :-).

>This is another example where greater knowledge of the problem can
>yield speed increases.  Here (where the number of selections is much
>smaller than the total number of objects) you are better off generating
>10 numbers with runif(10, 0, 100000000) and then checking that they are
>unique.

Of course, my remark about "sample()" is related to the previous
discussion.  If "sample(N, M)" were more on the O(M) side than on the
O(N) side (both memory-wise and CPU-wise), it could be used for
preselecting which rows of a big database to include in a random sample,
thus building on your idea of using a set of IDs.  As the sample of
M records will have to be processed in memory by R anyway, computing
a vector of M indices does not (or should not) increase complexity.

However, "sample(N, M)" is likely less usable for randomly sampling
a database, if it is O(N) to start with.  About your suggestion of using
"runif" and later checking uniqueness, "sample()" could well be
implemented this way, when the arguments are proper.  The "greater
knowledge of the problem" could be built in right into the routine meant
to solve it.  "sample(N, M)" could even know how to take advantage of
some simplified case of a "reservoir sampling" technique :-).

>> >[...] a large table of randomly distributed ids [...] (with randomly
>> >generated limits) to select the appropriate number of records.

>[...] a table of random numbers [...] pregenerated for you, you just
>choose a starting and ending index.  It will be slow to generate the
>table the first time, but then it will be fast.  It will also take up
>quite a bit of space, but space is cheap (and time is not!)

Thanks for the explanation.

In the case under consideration here (random sampling of a big file or
database), I would be tempted to guess that the time required for
generating pseudo-random numbers is negligible compared to the overall
input/output time, so pregenerating randomized IDs might not be worth
the trouble, all the more so given that whenever the database size
changes, the list of pregenerated IDs is no longer valid.

--
François Pinard   http://pinard.progiciels-bpi.ca

Re: Suggestion for big files [was: Re: A comment about R:]

r.ghezzo
In reply to this post by François Pinard
I found "Reservoir-Sampling Algorithms of Time Complexity
O(n(1+log(N/n)))" by Kim-Hung Li, ACM Transactions on Mathematical
Software, Vol. 20, No. 4, Dec. 1994, pp. 481-492.
He mentions algorithms Z and K and proposes two improved versions,
algorithms L and M.  Algorithm L is really easy to implement but
relatively slow; M doesn't look very difficult and is the fastest.
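
Going only by the description in that paper, algorithm L amounts to
drawing how many records to skip before the next replacement, instead of
making a decision for every record.  A rough R sketch (names made up,
con being an already opened text connection):

    algorithm_L <- function(con, m) {
      reservoir <- readLines(con, n = m)           # the first m records
      w <- exp(log(runif(1)) / m)
      repeat {
        skip <- floor(log(runif(1)) / log(1 - w))  # records to pass over
        skipped <- readLines(con, n = skip)
        nxt <- readLines(con, n = 1)
        if (length(skipped) < skip || length(nxt) == 0) break  # end of data
        reservoir[sample(m, 1)] <- nxt             # replace a random slot
        w <- w * exp(log(runif(1)) / m)
      }
      reservoir
    }

    ## e.g.  con <- file("big.csv", open = "r")
    ##       smp <- algorithm_L(con, 1000)
    ##       close(con)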
Heberto Ghezzo
McGill University
Montreal - Canada

Quoting François Pinard <[hidden email]>:

> On the other hand, Googling tells that this topic has been much studied,
> and that Vitter's algorithm Z seems to be popular nowadays (even if not
> the simplest) because it is more efficient than others.  Google found
> a copy of the paper:
>
>    http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf
>
> [...]
