Handling 8GB .txt file in R?


iliketurtles
Hi,

I am mediocre at R, with maybe 1000 hours of experience, but I have received an 8GB dataset and I don't know what to do with it. I have to do extensive analysis on it for my Honours thesis.

I can't even import it. I've tried:
- Splitting it up using the free csv-splitter-1.1.zip, which seems to work for everyone else (it doesn't work for me; it just outputs one single line).
- Splitting it with Text Splitter, which doesn't work because it has to load the file into memory first.
- Importing it using bigmemory's read.big.matrix(), but my computer just freezes.
- Importing it using ff's read.table.ffdf() as well as read.csv.ffdf(), but for both I get the error message
"
Error in read.table.ffdf(FUN = "read.csv", ...) :
  only ffdf objects can be used for appending (and skipping the first.row chunk)
"

The dataset looks like this when I view its head with Large Text Viewer 5.2:

PERMNO          DATE    TICKER        PERMCO             PRC           VOL    NUMTRD        vwretd        ewretd

   10000    06/01/1986                    7952          .                  .         .     -0.000138      0.001926
   10000    07/01/1986    OMFGA           7952        -2.56250          1000         .      0.013809      0.011061
   10000    08/01/1986    OMFGA           7952        -2.50000         12800         .     -0.020744     -0.005117
   10000    09/01/1986    OMFGA           7952        -2.50000          1400         .     -0.011219     -0.011588
   10000    10/01/1986    OMFGA           7952        -2.50000          8500         .      0.000083      0.003651
   10000    13/01/1986    OMFGA           7952        -2.62500          5450         .      0.002749      0.002433
   10000    14/01/1986    OMFGA           7952        -2.75000          2075         .      0.000366      0.004474
   10000    15/01/1986    OMFGA           7952        -2.87500         22490         .      0.008206      0.007693
   10000    16/01/1986    OMFGA           7952        -3.00000         10900         .      0.004702      0.005670
   10000    17/01/1986    OMFGA           7952        -3.00000          8470         .     -0.001741      0.003297
   10000    20/01/1986    OMFGA           7952        -3.00000          1000         .     -0.003735     -0.001355
   10000    21/01/1986    OMFGA           7952        -3.00000          1000         .     -0.006992     -0.003472
   10000    22/01/1986    OMFGA           7952        -3.00000          2700         .     -0.009593     -0.004588
   10000    23/01/1986    OMFGA           7952        -3.75000         24000         .      0.002664      0.001397
   10000    24/01/1986    OMFGA           7952        -4.18750         11372         .      0.009684      0.006771
   10000    27/01/1986    OMFGA           7952        -4.43750         16570         .      0.004343      0.002140
   10000    28/01/1986    OMFGA           7952        -4.43750          9600         .      0.009632      0.003179

Can R do this on a computer with 4 GB of memory and a dual-core i5xx?
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS

Re: Handling 8GB .txt file in R?

Michael Weylandt
Despair not! Malcolm Gladwell would say you are 1/10 of the way to
becoming the next MozaRt!

You need to say how your data set is designed. Your problem with ff
seems to be that the lines are not of constant length: if they aren't
of a consistent CSV format, I wouldn't be surprised if a CSV splitter
had problems with them as well. If you are on a Unix-alike system,
this (the splitting) could be pretty easily done with awk/sed/perl,
but you need to define your problem much more clearly. If things
aren't nicely structured, you will almost certainly benefit from doing
a little bit of data preparation work with Unix utilities before
loading into R.

Michael

On Sat, Mar 24, 2012 at 4:08 AM, iliketurtles <[hidden email]> wrote:

> Hi,
>
> I am mediocre at R, maybe 1000 hours experience, but I received an 8GB
> dataset and I don't know what to do with it. I have to do extensive analysis
> over it for my Honours thesis.
>
> I can't even import it. I've tried;
> - Splitting it up using the free csv-splitter-1.1.zip that seems to be
> working for everyone else (it doesn't work for me, it just outputs 1 single
> line).
> - Splitting it with Text Splitter doesn't work because you have to load it
> into memory first.
> - Importing using BigMemory's big.matrix(), however my computer just
> freezes.
> - Importing using ff's read.table.ffdf(), however I get the error message
> " in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>  line 5 did not have 9 elements"
>
> Thanks for any ideas and assistance.
>
> Can R do this on a computer with 4 GB of memory and a dual core i5xx ?
>
> -----
> ----
>
> Isaac
> Research Assistant
> Quantitative Finance Faculty, UTS
> --
> View this message in context: http://r.789695.n4.nabble.com/Handling-8GB-txt-file-in-R-tp4500971p4500971.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: Handling 8GB .txt file in R?

iliketurtles
In reply to this post by iliketurtles
Thanks for all the suggestions. To the first person who replied: I can't do anything with Unix or Perl. All I know is R.

@KEN:
I'm using Windows 7, 64 bit.

@Steve:
Here's the readLines output. As we can see, lines 1-3 are blank and line 5 is empty, and there are also empty elements after line 5.

  [1] " "
  [2] "                                                                                                                                                                                                                                                                "
  [3] " "
  [4] "  PERMNO          DATE    TICKER        PERMCO             PRC           VOL    NUMTRD        vwretd        ewretd"
  [5] ""
  [6] "   10000    06/01/1986                    7952          .                  .         .     -0.000138      0.001926"
  [7] "   10000    07/01/1986    OMFGA           7952        -2.56250          1000         .      0.013809      0.011061"
  [8] "   10000    08/01/1986    OMFGA           7952        -2.50000         12800         .     -0.020744     -0.005117"
  [9] "   10000    09/01/1986    OMFGA           7952        -2.50000          1400         .     -0.011219     -0.011588"
 [10] "   10000    10/01/1986    OMFGA           7952        -2.50000          8500         .      0.000083      0.003651"
 [11] "   10000    13/01/1986    OMFGA           7952        -2.62500          5450         .      0.002749      0.002433"
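
For readers following along: output like the above can be obtained without loading the whole file, by reading only the first few lines from a connection. This is a minimal sketch; the file written here is a small made-up stand-in at a temporary path, not the actual 8GB file, and the variable names are illustrative.

```r
# Small stand-in for the real file (hypothetical contents modelled on the post)
path <- tempfile(fileext = ".txt")
writeLines(c(" ", "   ", " ",
             "  PERMNO          DATE    TICKER",
             "",
             "   10000    06/01/1986         ",
             "   10000    07/01/1986    OMFGA"), path)

# Read only the first n lines; the rest of the file is never loaded
con <- file(path, "rt")
first_lines <- readLines(con, n = 5)
close(con)

print(first_lines)

# Which of those lines are blank (empty or whitespace only)?
blank <- !nzchar(trimws(first_lines))
which(blank)
```

Knowing which leading lines are blank, and which one holds the header, is what makes it possible to tell an import function how many lines to skip.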
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS

Re: Handling 8GB .txt file in R?

Jan van der LAan-2

What you could try to do is skip the first 5 lines. After that the file
seems to be 'normal'. With read.table.ffdf you could try something like

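library(ff)
# open a connection to the file
con <- file('yourfile', 'rt')
# skip the first 5 lines (blank lines and the header line)
tmp <- readLines(con, n=5)
# read the remainder using read.table.ffdf; the result is called 'dat'
# so that it does not mask the ffdf() function
dat <- read.table.ffdf(file=con)
# close connection
close(con)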
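
One possible wrinkle: the first data row in the posted sample has a blank TICKER, so whitespace-delimited parsing would see 8 fields on that row rather than 9. A workaround that may help, sketched below in base R only, is to treat the file as fixed-width and process it chunk by chunk from an open connection. The column widths, the chunk size, the reduced set of columns and the helper name `parse_chunk` are all assumptions for illustration; the sample file is generated inside the code so that it matches those widths, and the real file's widths would need to be checked against its own lines.

```r
# Assumed fixed-width layout: column name = width in characters (hypothetical)
widths <- c(PERMNO = 8, DATE = 14, TICKER = 9, PRC = 16)
ends   <- cumsum(widths)
starts <- ends - widths + 1

# Small stand-in file: 5 leading lines to skip, then fixed-width data rows
fixed <- sprintf("%8s%14s%9s%16s",
                 "10000",
                 c("06/01/1986", "07/01/1986", "08/01/1986"),
                 c("", "OMFGA", "OMFGA"),
                 c(".", "-2.56250", "-2.50000"))
path <- tempfile(fileext = ".txt")
writeLines(c(" ", " ", " ", "header", "", fixed), path)

# Cut each line at the assumed column boundaries and convert the types
parse_chunk <- function(lines) {
  fields <- lapply(seq_along(widths), function(j)
    trimws(substring(lines, starts[j], ends[j])))
  names(fields) <- names(widths)
  out <- as.data.frame(fields, stringsAsFactors = FALSE)
  out$PERMNO <- as.integer(out$PERMNO)
  out$DATE   <- as.Date(out$DATE, format = "%d/%m/%Y")
  out$TICKER[out$TICKER == ""] <- NA   # blank ticker becomes missing
  out$PRC[out$PRC == "."] <- NA        # "." marks a missing value
  out$PRC    <- as.numeric(out$PRC)
  out
}

con <- file(path, "rt")
invisible(readLines(con, n = 5))       # skip the 5 leading lines
chunk_size <- 2                        # tiny here; much larger for a real file
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break        # end of file reached
  chunk <- parse_chunk(lines)
  # summarise or store `chunk` here instead of keeping everything in memory
  n_rows <- n_rows + nrow(chunk)
}
close(con)
n_rows
```

Because only one chunk is held at a time, memory use depends on the chunk size rather than on the size of the file.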

HTH

Jan


Re: Handling 8GB .txt file in R?

steven mosher
In reply to this post by iliketurtles
As the other poster noted, you can just skip lines.

A big.matrix should work just fine, except I am not sure how the dates will
be handled.

Here is some sample code from my own work:
txtName     is the name of the file you are reading
Directory   is the path where you want to write the file-backed matrix
filename    is the backing file for the big matrix
dname       is the name of the descriptor file that describes the data
sep         is your separator (comma or space?); below I use tab,
            because my file is tab delimited

Replace my column names with yours:

PERMNO          DATE    TICKER        PERMCO             PRC           VOL    NUMTRD        vwretd        ewretd

Your dates may be coerced to factors. Not sure how that will work.
You can also try ff.

library(bigmemory)
options(bigmemory.allow.dimnames = TRUE)
D <- read.big.matrix(txtName, skip = 5,
                     backingpath = Directory,
                     backingfile = filename,
                     descriptorfile = dname,
                     sep = "\t",
                     type = "double",
                     # my column names; replace with yours
                     col.names = c("Id", "SeriesNo", "Date", "Temp",
                                   "Unc", "Obs", "Tobs"))
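
On the open question about dates: as far as the bigmemory documentation describes it, a big.matrix holds a single numeric type, so text columns such as DATE and TICKER would probably need a numeric encoding (or to be dropped) before they fit. One possible encoding, sketched here in base R, turns the dd/mm/yyyy strings from the posted sample into yyyymmdd integers. The variable names are illustrative and this is not part of the code above.

```r
# Date strings in the dd/mm/yyyy layout shown in the posted sample
date_txt <- c("06/01/1986", "07/01/1986", "13/01/1986")

# Parse to Date, then encode as a sortable integer yyyymmdd
d        <- as.Date(date_txt, format = "%d/%m/%Y")
date_int <- as.integer(format(d, "%Y%m%d"))
date_int

# The encoding is reversible
back <- as.Date(as.character(date_int), format = "%Y%m%d")
all(back == d)
```

An integer like this sorts in date order and can sit in a numeric matrix next to the other columns.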




Re: Handling 8GB .txt file in R?

Rainer M Krug-6
In reply to this post by iliketurtles

1) You should check whether you really need to load the complete dataset - you might be able to load a
subset, sample it for the analysis, discard columns, ... Many things are possible.

2) With csv files of this size, it usually pays off to convert them into a database - SQLite comes to
mind as an easy-to-use one, with SQL support to select the columns and rows to load. SQLite has a tool to
import a csv file into a SQLite database.
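
Point 1 can be sketched with base R's read.table: `colClasses = "NULL"` drops a column while parsing, and `nrows` caps how many rows are read. The file below is a made-up, well-formed stand-in in which every row has all its fields; the real file's blank TICKER values and leading blank lines would need handling first.

```r
# Made-up, whitespace-delimited stand-in file
path <- tempfile(fileext = ".txt")
writeLines(c("PERMNO DATE TICKER PRC VOL",
             "10000 07/01/1986 OMFGA -2.56250 1000",
             "10000 08/01/1986 OMFGA -2.50000 12800",
             "10000 09/01/1986 OMFGA -2.50000 1400"), path)

# Keep PERMNO, DATE and PRC; skip TICKER and VOL; read only the first 2 rows
subset_df <- read.table(path, header = TRUE,
                        colClasses = c("integer", "character", "NULL",
                                       "numeric", "NULL"),
                        nrows = 2)
str(subset_df)
```

Columns that are skipped this way are never stored, which can cut memory use considerably when only a few variables are needed.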

Concerning the general format of the csv, see the other suggestions.
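
Point 2 could also be done from within R rather than with the sqlite command-line tool. This sketch assumes the DBI and RSQLite packages are installed; the table name `crsp`, the small data frames standing in for parsed chunks of the big file, and the temporary database path are all illustrative.

```r
library(DBI)

# File-backed SQLite database at a temporary path (illustrative)
db_path <- tempfile(fileext = ".sqlite")
db <- dbConnect(RSQLite::SQLite(), db_path)

# Stand-ins for chunks that would be parsed from the big file one at a time
chunks <- list(
  data.frame(PERMNO = 10000L, DATE = c("1986-01-07", "1986-01-08"),
             PRC = c(-2.5625, -2.5), stringsAsFactors = FALSE),
  data.frame(PERMNO = 10001L, DATE = "1986-01-07",
             PRC = 6.125, stringsAsFactors = FALSE)
)

# Append each chunk to the same table; the first call creates it
for (chunk in chunks) dbWriteTable(db, "crsp", chunk, append = TRUE)

# Later, pull back only the rows and columns needed for an analysis
res <- dbGetQuery(db,
  "SELECT DATE, PRC FROM crsp WHERE PERMNO = 10000 ORDER BY DATE")
dbDisconnect(db)
res
```

Once the data is in the database, each analysis step loads only the slice it asks for, so the full 8GB never has to be in memory at once.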

Cheers,

Rainer





--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

