Large database help

Large database help

Rogerio Porto
Hello all.

I have a large .txt file whose variables are in fixed-width columns,
i.e., variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
This is a 60GB file with 90 variables and 60 million observations.

I'm working with a Pentium 4, 1GB RAM, Windows XP Pro.
I tried the following code just to see if I could work with two variables,
but it seems this is not possible:
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1  (2005-12-20 r36812)
ISBN 3-900051-07-0
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 169011  4.6     350000  9.4   350000  9.4
Vcells  62418  0.5     786432  6.0   289957  2.3
> memory.limit(size=4090)
NULL
> memory.limit()
[1] 4288675840
> system.time(a<-matrix(runif(1e6),nrow=1))
[1] 0.28 0.02 2.42   NA   NA
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  171344  4.6     350000  9.4   350000  9.4
Vcells 1063212  8.2    3454398 26.4  4063230 31.0
> rm(a)
> ls()
character(0)
> system.time(a<-matrix(runif(60e6),nrow=1))
Error: cannot allocate vector of size 468750 Kb
Timing stopped at: 7.32 1.95 83.55 NA NA
> memory.limit(size=5000)
Error in memory.size(size) : .....4GB

So my questions are:
1) (newbie) how can I read fixed-width text files like this?
2) is there a way to analyze (statistics like correlations, clustering, etc.)
    such a large database without increasing RAM or moving to a 64-bit
    machine, while still using R and not resorting to a sample? How?

Thanks in advance.

Rogerio.


Re: Large database help

Uwe Ligges
Rogerio Porto wrote:

> I have a large .txt file whose variables are in fixed-width columns,
> i.e., variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
> This is a 60GB file with 90 variables and 60 million observations.
> [...]
> So my questions are:
> 1) (newbie) how can I read fixed-width text files like this?
> 2) is there a way to analyze (statistics like correlations, clustering, etc.)
>     such a large database without increasing RAM or moving to a 64-bit
>     machine, while still using R and not resorting to a sample? How?


Use what you are already suggesting in your subject: a database.
Then you can access the variables separately and have no problem
reading the file.

Even with a real database, computing on all 60 million observations of a
single variable (~500Mb) at once puts you near the limit. This only works
if you do not need several variables at once, and it depends on the
methods you are going to apply.
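
For concreteness, a minimal, untested sketch of this approach using DBI with
RMySQL, assuming the file has already been loaded into a MySQL table called
"big" with columns V1 and V2 (connection details, table and column names are
all placeholders):

   library(DBI)
   library(RMySQL)

   ## Connect to a local MySQL server (user and database names are placeholders).
   con <- dbConnect(MySQL(), user = "rogerio", dbname = "survey")

   ## Pull only the two variables of interest rather than all 90 columns.
   v12 <- dbGetQuery(con, "SELECT V1, V2 FROM big")
   cor(v12$V1, v12$V2)   # two columns of 60e6 doubles is ~1GB, still tight on 1GB RAM

   dbDisconnect(con)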

Uwe Ligges




Re: Large database help

Roger Peng-2
In reply to this post by Rogerio Porto
You can read fixed-width files with read.fwf().  But my rough calculation says
that your dataset will require 40GB of RAM.  I don't think you'll be able to
read the entire thing into R.  Maybe look at a subset?
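
For illustration, a minimal, untested sketch for the widths described above
(7 characters for V1, 16 for V2; the file name, numeric column types and the
row count are assumptions):

   ## Read only V1 (columns 1-7) and V2 (columns 8-23) from the first
   ## 100,000 records; characters beyond column 23 should simply be ignored.
   dat <- read.fwf("bigfile.txt",
                   widths     = c(7, 16),
                   col.names  = c("V1", "V2"),
                   colClasses = "numeric",
                   n          = 100000)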

-roger

Rogerio Porto wrote:

> I have a large .txt file whose variables are in fixed-width columns,
> i.e., variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
> This is a 60GB file with 90 variables and 60 million observations.
> [...]


Re : Large database help

justin bem
Try to open your db with MySQL and use RMySQL


Re: Re : Large database help

Robert Citek

On May 16, 2006, at 8:15 AM, justin bem wrote:

> Try to open your db with MySQL and use RMySQL

I've seen this offered up as a suggestion a few times but with little  
detail.  In my experience, even using SQL to pull in data from a  
MySQL DB, R would need to load the entire data set into RAM before  
doing some calculations.  But perhaps I'm using RMySQL incorrectly[1].

As a toy problem, let's imagine a data set (foo) with a single  
numerical field (bar) and 1 billion records (1e9).  In MySQL one  
would do the following to calculate the mean:

   select avg(bar) from foo ;

For a smaller data set I would issue a select statement and then  
fetch the entire set into a data frame before calculating the mean.  
Given such a large data set, how would one calculate the mean using R  
connected to this MySQL database?  How would one calculate the median  
using R connected to this MySQL database?

Pointers to references appreciated.

[1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Reply | Threaded
Open this post in threaded view
|

Re: Re : Large database help

Prof Brian Ripley
On Tue, 16 May 2006, Robert Citek wrote:

> [...]
> Given such a large data set, how would one calculate the mean using R
> connected to this MySQL database?  How would one calculate the median
> using R connected to this MySQL database?
>
> Pointers to references appreciated.

Well, there *is* a manual about R Data Import/Export, and this does
discuss using R with DBMSs with examples.  How about reading it?

The point being made is that you can import just the columns you need, and
indeed summaries of those columns.
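
For the toy example above, a minimal sketch (untested; connection details are
placeholders) of having the database compute the summary:

   library(DBI)
   library(RMySQL)
   con <- dbConnect(MySQL(), dbname = "toy")
   ## The server computes the aggregate; only a single number is returned to R.
   dbGetQuery(con, "SELECT AVG(bar) FROM foo")
   dbDisconnect(con)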

> [1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: Re : Large database help

Robert Citek

On May 16, 2006, at 11:19 AM, Prof Brian Ripley wrote:
> Well, there *is* a manual about R Data Import/Export, and this does
> discuss using R with DBMSs with examples.  How about reading it?

Thanks for the pointer:

   http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases

Unfortunately, that manual doesn't really answer my question.  My
question is not how to make R interact with a database, but how to make R
interact with a database containing large data sets.

> The point being made is that you can import just the columns you  
> need, and indeed summaries of those columns.

That sounds great in theory.  Now I want to reduce it to practice.  
In the toy problem from the previous post, how can one compute the  
mean of a set of 1e9 numbers?  R has some difficulty generating a  
billion (1e9) number set let alone taking the mean of that set.  To wit:

   bigset <- runif(1e9,0,1e9)

runs out of memory on my system.  I realize that I can do some fancy  
data shuffling and hand-waving to calculate the mean.  But I was  
wondering if R has a module that already abstracts out that magic,  
perhaps using a database.
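
For what it's worth, the kind of shuffling I have in mind might look like the
following untested sketch, which fetches the toy table in blocks through
RMySQL and keeps a running sum and count (connection details are placeholders):

   library(DBI)
   library(RMySQL)
   con <- dbConnect(MySQL(), dbname = "toy")

   res   <- dbSendQuery(con, "SELECT bar FROM foo")
   total <- 0; n <- 0
   while (!dbHasCompleted(res)) {
     chunk <- fetch(res, n = 1e6)          # one million rows per block
     total <- total + sum(chunk$bar)
     n     <- n + nrow(chunk)
   }
   dbClearResult(res)
   dbDisconnect(con)
   total / n                               # the mean in a single pass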

Any pointers to more detailed reading are greatly appreciated.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: Re : Large database help

RKoenker
In ancient times, 1999 or so, Alvaro Novo and I experimented with an
interface to mysql that brought chunks of data into R and accumulated  
results.
This is still described and available on the web in its original form at

        http://www.econ.uiuc.edu/~roger/research/rq/LM.html

Despite claims of "future developments" nothing emerged, so anyone
considering further explorations with it may need training in  
Rchaeology.

The toy problem we were solving was a large least squares problem,
which was a stalking horse for large quantile regression problems.
Around the same time I discovered sparse linear algebra and realized
that virtually all the large problems I was interested in were better
handled from that perspective.

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    [hidden email]            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820


On May 16, 2006, at 3:57 PM, Robert Citek wrote:

> [...]


Re: Re : Large database help

Thomas Lumley
On Tue, 16 May 2006, roger koenker wrote:

> In ancient times, 1999 or so, Alvaro Novo and I experimented with an
> interface to mysql that brought chunks of data into R and accumulated
> results.
> This is still described and available on the web in its original form at
>
> http://www.econ.uiuc.edu/~roger/research/rq/LM.html
>
> Despite claims of "future developments" nothing emerged, so anyone
> considering further explorations with it may need training in
> Rchaeology.

A few hours ago I submitted to CRAN a package "biglm" that does large
linear regression models using a similar strategy (it uses incremental QR
decomposition rather than accumulating the crossproduct matrix). It also
computes the Huber/White sandwich variance estimate in the same single
pass over the data.

Assuming I haven't messed up the package checking, it will appear on CRAN
in the next couple of days. The syntax looks like
   a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
   a <- update(a, chunk2)
   a <- update(a, chunk3)
   summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.
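
As a rough, untested sketch of how this could be combined with chunked reading
of a fixed-width file like the original poster's (the file name, field
positions and the model formula are all made up for illustration):

   library(biglm)
   con <- file("bigfile.txt", open = "r")
   chunk_df <- function(lines)             # V1 in columns 1-7, V2 in columns 8-23
     data.frame(V1 = as.numeric(substr(lines, 1, 7)),
                V2 = as.numeric(substr(lines, 8, 23)))

   lines <- readLines(con, n = 1e6)
   fit   <- biglm(V1 ~ V2, data = chunk_df(lines))
   repeat {
     lines <- readLines(con, n = 1e6)
     if (length(lines) == 0) break
     fit <- update(fit, chunk_df(lines))
   }
   close(con)
   summary(fit)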


  -thomas


(no subject)

Roger Koenker-2
an upgrade: from the flintstones -- to the michelin man...


On May 16, 2006, at 4:40 PM, Thomas Lumley wrote:

> On Tue, 16 May 2006, roger koenker wrote:
>
>> In ancient times, 1999 or so, Alvaro Novo and I experimented with an
>> interface to mysql that brought chunks of data into R and accumulated
>> results.
>> This is still described and available on the web in its original  
>> form at
>>
>> http://www.econ.uiuc.edu/~roger/research/rq/LM.html
>>
>> Despite claims of "future developments" nothing emerged, so anyone
>> considering further explorations with it may need training in
>> Rchaeology.
>
> A few hours ago I submitted to CRAN a package "biglm" that does large
> linear regression models using a similar strategy (it uses  
> incremental QR
> decomposition rather than accumalating the crossproduct matrix). It  
> also
> computes the Huber/White sandwich variance estimate in the same single
> pass over the data.
>
> Assuming I haven't messed up the package checking it will appear
> in the next couple of day on CRAN. The syntax looks like
>    a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
>    a <- update(a, chunk2)
>    a <- update(a, chunk3)
>    summary(a)
>
> where chunk1, chunk2, chunk3 are chunks of the data.
>
>
>   -thomas
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting- 
> guide.html


Re: Re : Large database help

Gregory Snow
In reply to this post by justin bem
Thanks for doing this, Thomas. I have been thinking about what it would
take to do this, but if it were left to me, it would have taken a lot
longer.

Back in the 80's there was a statistical package called RUMMAGE that did
all computations based on sufficient statistics and did not keep the
actual data in memory.  Memory for computers became cheap before
datasets turned huge so there wasn't much demand for the program (and it
never had a nice GUI to help make it popular).  It looks like things are
switching back to that model now though.

Here are a couple of thoughts I had that might help with some future
development:

Another function that could be helpful is bigplot, which I imagine would
best be based on the hexbin package, accumulating the counts in chunks
like your biglm function.  Once I see the code for biglm I may be able to
contribute this piece.  I guess bigbarplot and bigboxplot may also be
useful.  Accumulating counts for the barplot will be easy, but does anyone
have ideas on the best way to get quantiles for the boxplots efficiently?
The best approach I can think of so far is to have the database sort the
variables, but sorting tends to be slow.

Another general approach that I thought of would be to read the data in
chunks, compute the statistic(s) of interest on each chunk (a vector of
coefficients for regression models) and then average the estimates across
chunks.  Each chunk could be treated as a cluster in a cluster sample when
averaging and estimating variances for the estimates (if only we can get
the author of the survey package involved :-).  This would probably be
less accurate than your biglm function for regression, but it would have
the flavor of the bootstrapping routines in that it would work for many
cases that don't have their own big methods written yet (logistic and
other glm models, correlations, ...).
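
A minimal, untested sketch of that idea, assuming chunks is a list of data
frames already read in and a made-up logistic regression model:

   ## Fit the same model on each chunk, average the coefficient vectors, and
   ## use the between-chunk spread as a rough variance estimate.
   chunk_fits <- lapply(chunks, function(d)
     coef(glm(y ~ x1 + x2, family = binomial, data = d)))
   B   <- do.call(rbind, chunk_fits)
   est <- colMeans(B)                      # averaged coefficient estimates
   se  <- apply(B, 2, sd) / sqrt(nrow(B))  # crude between-chunk standard errors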

Any other thoughts anyone?


--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
(801) 408-8111
 

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Thomas Lumley
Sent: Tuesday, May 16, 2006 3:40 PM
To: roger koenker
Cc: r-help list; Robert Citek
Subject: Re: [R] Re : Large database help

[...]


Re: Re : Large database help

Richard M. Heiberger
In reply to this post by justin bem
You might want to follow up by looking at the data squashing work
that Bill DuMouchel has done:

http://citeseer.ist.psu.edu/dumouchel99squashing.html


Re: Re : Large database help

Rogerio Porto
In reply to this post by Prof Brian Ripley
Thank you all for the discussion.

I'll try to summarize the suggestions and give some partial conclusions
for the sake of completeness of this thread.

First, I had read the I/O manual but had forgotten the function read.fwf
suggested by Roger Peng. I'm sorry. However, following the manual's guidance,
this function is not recommended for large files, so I need to work out how to
read fixed-width files with the scan function, since there is no such example
in that manual nor in ?scan. At a glance, it seems read.fwf inserts separators
at the column boundaries so that the file can then be read with a simple
scan() call.
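
Perhaps something along these lines would work for reading two variables in
one pass, though I haven't tested it (the file name and number of records are
placeholders):

   con <- file("bigfile.txt", open = "r")
   ## Read raw lines with scan() and slice the fields by character position.
   lines <- scan(con, what = "", sep = "\n", n = 100000, quiet = TRUE)
   V1 <- as.numeric(substr(lines, 1, 7))
   V2 <- as.numeric(substr(lines, 8, 23))
   close(con)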

I've also read the I/O manual, mainly chapter 4 on using relational databases.
This suggestion was made by Uwe Ligges and Justin Bem, who advocated the use
of MySQL with the RMySQL package. I'm still installing MySQL to try to convert
my fixed-width file into such a database, but, from the I/O manual, it seems I
can only calculate five descriptive statistics (the SQL aggregate functions),
so I couldn't calculate medians or more advanced statistics like a cluster
analysis. This point was raised by Robert Citek, and thus I'm not sure that
working with MySQL will solve my problem. RMySQL does have the dbApply
function, which applies R functions to groups (chunks) of database rows.
There was also a suggestion by Roger Peng to work with a subset of the file.
Almost all participants in this thread noted the need for lots of RAM even
when working with just a few variables, as suggested by Prof. Brian Ripley.

The future looks promising thanks to a collection of *big* packages specially
designed to handle big data files on almost any hardware and OS configuration,
although time-demanding in some cases. It seems the first one in this
collection is the biglm package by Thomas Lumley, cited by Greg Snow. The
obvious drawback is that one has to re-write every package that can't handle
big data files or, at least, their most memory-demanding operations. This
could perhaps be implemented as an option like big.file=TRUE added to some
functions. This point of view is one of *scaling up* the methods.

Another promising way is to *scale down* the dataset. Statisticians are
aware of such techniques from non-hierarchical cluster analysis and principal
component analysis, among others (mainly sampling). Engineers and signal
processing people know them from data compression. Computer scientists work
with training sets and data mining, which use methods to scale down datasets.
An example was given by Richard M. Heiberger, who cites a paper by William
DuMouchel et al. on squashing flat files. Maybe there could be some R
functions specialized in these methods that, using a DBMS, retrieve a
significant subset of the data (records and variables) small enough to be
handled by R.

That's all for now!

Rogerio.
