Quantcast

Can't import this 4GB DATASET

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Can't import this 4GB DATASET

iliketurtles
Dear Experienced R Practitioners,

I have 4GB .txt data called "dataset.txt" and have attempted to use ff, bigmemory, filehash and sqldf packages to import it, but have had no success. The readLines output of this data is:

readLines("dataset.txt",n=20)
 [1] " "                                                                                                                                                                                                                                                              
 [2] "                                                                                                                                                                                                                                                                "
 [3] " "                                                                                                                                                                                                                                                              
 [4] "  PERMNO          DATE    SHRCD    COMNAM                                       PRC           VOL"                                                                                                                                                              
 [5] ""                                                                                                                                                                                                                                                                
 [6] "   10001    01/09/1986     11      GREAT FALLS GAS CO                      -5.75000         14160"                                                                                                                                                              
 [7] "   10001    01/10/1986     11      GREAT FALLS GAS CO                      -5.87500             0"                                                                                                                                                              
 [8] "   10001    01/13/1986     11      GREAT FALLS GAS CO                      -5.87500          2805"                                                                                                                                                              
 [9] "   10001    01/14/1986     11      GREAT FALLS GAS CO                      -5.87500          2070"                                                                                                                                                              
[10] "   10001    01/15/1986     11      GREAT FALLS GAS CO                      -6.06250          6000"                                                                                                                                                              
[11] "   10001    01/16/1986     11      GREAT FALLS GAS CO                      -6.25000          1500"                                                                                                                                                              
[12] "   10001    01/17/1986     11      GREAT FALLS GAS CO                      -6.25000          7100"                                                                                                                                                              
[13] "   10001    01/20/1986     11      GREAT FALLS GAS CO                      -6.31250          1700"                                                                                                                                                              
[14] "   10001    01/21/1986     11      GREAT FALLS GAS CO                      -6.18750          4000"                                                                                                                                                              
[15] "   10001    01/22/1986     11      GREAT FALLS GAS CO                      -6.18750          5200"                                                                                                                                                              
[16] "   10001    01/23/1986     11      GREAT FALLS GAS CO                      -6.18750          4100"                                                                                                                                                              
[17] "   10001    01/24/1986     11      GREAT FALLS GAS CO                      -6.18750          1500"                                                                                                                                                              
[18] "   10001    01/27/1986     11      GREAT FALLS GAS CO                      -6.18750          4000"                                                                                                                                                              
[19] "   10001    01/28/1986     11      GREAT FALLS GAS CO                      -6.12500          3500"                                                                                                                                                              
[20] "   10001    01/29/1986     11      GREAT FALLS GAS CO                      -6.06250          4600"                                                        

This data goes on for a huge number of rows (not sure exactly how many). Each element in each row is separated by and uneven number of (what seem to be) spaces (maybe TAB? not sure). Further, there are some rows that are "incomplete", i.e. there's missing elements.

Take the first 29 rows of "dataset.txt" into a separate data file, let's call it "dataset2.txt".  read.table("dataset2.txt",skip=5) gives the perfect table that I want to end up with, except I want it with the 4GB data through bigmemory, ff or filehash.

read.table('dataset2.txt',skip=5)
      V1         V2 V3    V4    V5  V6 V7      V8    V9
1  10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160
2  10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750     0
3  10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750  2805
4  10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750  2070
5  10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625  6000
6  10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500  1500
7  10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500  7100
8  10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125  1700
9  10001 01/21/1986 11 GREAT FALLS GAS CO -6.1875  4000
10 10001 01/22/1986 11 GREAT FALLS GAS CO -6.1875  5200
11 10001 01/23/1986 11 GREAT FALLS GAS CO -6.1875  4100
12 10001 01/24/1986 11 GREAT FALLS GAS CO -6.1875  1500
13 10001 01/27/1986 11 GREAT FALLS GAS CO -6.1875  4000
14 10001 01/28/1986 11 GREAT FALLS GAS CO -6.1250  3500
15 10001 01/29/1986 11 GREAT FALLS GAS CO -6.0625  4600
16 10001 01/30/1986 11 GREAT FALLS GAS CO -6.0625  3830
17 10001 01/31/1986 11 GREAT FALLS GAS CO -6.1250   675
18 10001 02/03/1986 11 GREAT FALLS GAS CO -6.1250  2300
19 10001 02/04/1986 11 GREAT FALLS GAS CO -6.1250  4200
20 10001 02/05/1986 11 GREAT FALLS GAS CO -6.1250  1000
21 10001 02/06/1986 11 GREAT FALLS GAS CO -6.1250  4200
22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250  1800
23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875   100
24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125  1500
25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500  2500
26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500  1000
27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875     0
28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875  7000
29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500  5200


###MY ATTEMPT USING FILEHASH###
#First, try and upload the 29 row dataset (the tiny subset of the whole dataset).

install.packages("filehash");require(filehash)
dumpList(read.table("dataset2.txt",skip=6),dbName="db02")  #Note that DumpDF also works.
env02 <- db2env(db="db02")

with(env02,V1)

#
[1] 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001
[27] 10001 10001
#

with(env02,V2)

#
[1] 01/10/1986 01/13/1986 01/14/1986 01/15/1986 01/16/1986 01/17/1986 01/20/1986 01/21/1986 01/22/1986 01/23/1986 01/24/1986 01/27/1986 01/28/1986 01/29/1986
[15] 01/30/1986 01/31/1986 02/03/1986 02/04/1986 02/05/1986 02/06/1986 02/07/1986 02/10/1986 02/11/1986 02/12/1986 02/13/1986 02/14/1986 02/18/1986 02/19/1986
#

#so this is all good. But when we try to import the ENTIRE data;

dumpList(read.table("dataset.txt",skip=6),dbName="db01")
#Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 1991 did not have 9 elements

#This confirms that there are missing elements in later rows. So, I put fill=TRUE:

dumpDF(read.table("dataset.txt",skip=6,fill=TRUE),dbName="db01")
Error: cannot allocate vector of size 500.0 Mb
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  Reached total allocation of 4078Mb: see help(memory.size)
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  Reached total allocation of 4078Mb: see help(memory.size)
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  Reached total allocation of 4078Mb: see help(memory.size)
4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  Reached total allocation of 4078Mb: see help(memory.size)

#This occurs after a 30 minute wait.



###MY ATTEMPT USING FF###
#First, try with the 29 row "datatset2.txt",
# open a connection to the file
con <- file('dataset2.txt', 'rt')
# read the remainder using read.table.ffdf
ffdf <- read.table.ffdf(file=con)
# close connection
close(con)

ffdf

#ffdf (all open) dim=c(29,9), dimorder=c(1,2) row.names=NULL
#ffdf virtual mapping
#  PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
V1           V1      integer       integer FALSE           FALSE            FALSE                 1                1               1           TRUE
V2           V2      integer       integer FALSE           FALSE            FALSE                 2                1               1           TRUE
V3           V3      integer       integer FALSE           FALSE            FALSE                 3                1               1           TRUE
V4           V4      integer       integer FALSE           FALSE            FALSE                 4                1               1           TRUE
V5           V5      integer       integer FALSE           FALSE            FALSE                 5                1               1           TRUE
V6           V6      integer       integer FALSE           FALSE            FALSE                 6                1               1           TRUE
V7           V7      integer       integer FALSE           FALSE            FALSE                 7                1               1           TRUE
V8           V8       double        double FALSE           FALSE            FALSE                 8                1               1           TRUE
V9           V9      integer       integer FALSE           FALSE            FALSE                 9                1               1           TRUE
ffdf data
           V1         V2         V3         V4         V5         V6         V7         V8         V9
1  10001      01/09/1986 11         GREAT      FALLS      GAS        CO         -5.7500    14160    
2  10001      01/10/1986 11         GREAT      FALLS      GAS        CO         -5.8750        0    
3  10001      01/13/1986 11         GREAT      FALLS      GAS        CO         -5.8750     2805    
4  10001      01/14/1986 11         GREAT      FALLS      GAS        CO         -5.8750     2070    
5  10001      01/15/1986 11         GREAT      FALLS      GAS        CO         -6.0625     6000    
6  10001      01/16/1986 11         GREAT      FALLS      GAS        CO         -6.2500     1500    
7  10001      01/17/1986 11         GREAT      FALLS      GAS        CO         -6.2500     7100    
8  10001      01/20/1986 11         GREAT      FALLS      GAS        CO         -6.3125     1700    
:           :          :          :          :          :          :          :          :          :
22 10001      02/07/1986 11         GREAT      FALLS      GAS        CO         -6.1250     1800    
23 10001      02/10/1986 11         GREAT      FALLS      GAS        CO         -6.1875      100    
24 10001      02/11/1986 11         GREAT      FALLS      GAS        CO         -6.3125     1500    
25 10001      02/12/1986 11         GREAT      FALLS      GAS        CO         -6.2500     2500    
26 10001      02/13/1986 11         GREAT      FALLS      GAS        CO         -6.2500     1000    
27 10001      02/14/1986 11         GREAT      FALLS      GAS        CO         -6.1875        0    
28 10001      02/18/1986 11         GREAT      FALLS      GAS        CO         -6.1875     7000    
29 10001      02/19/1986 11         GREAT      FALLS      GAS        CO         -6.2500     5200  


#GOOD!!! Now let's try with the 4GB "dataset.txt".

# open a connection to the file
con <- file('dataset.txt', 'rt')
# read the remainder using read.table.ffdf
ffdf <- read.table.ffdf(file=con)

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 4 did not have 9 elements


#Again, missing elements. Yet I don't know how to do something like fill=TRUE in read.table.ffdf (like you can in read.table()).



###MY ATTEMPT USING BIGMEMORY###
#I can't even import "dataset2.txt", so I'm trying to figure out this before I try on the big "dataset.txt".

 E    <- read.big.matrix("dataset2.txt", skip = 3,
                             backingpath = getwd(),
                             sep = " ",
                             type = "double",
)

#results in:

E[10,]
 [1]         NA         NA         NA 10001.0000         NA         NA         NA     1.0000         NA         NA         NA         NA    11.0000         NA
[15]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[29]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[43]         NA    -6.3125         NA         NA         NA         NA         NA         NA         NA         NA         NA  1700.0000         NA         NA
[57]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[71]         NA

#So all the character and dates are being screwed up.

 E    <- read.big.matrix("dataset2.txt", skip = 3,
                             backingpath = getwd(),
                             sep = " ",
                             type = "char",
)

E[10,]
[1]  NA  NA  NA  17  NA  NA  NA   1  NA  NA  NA  NA  11  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[41]  NA  NA  NA  -6  NA  NA  NA  NA  NA  NA  NA  NA  NA -92  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

#Even worse.
###MY ATTEMPT USING sqldf###
No idea what to do here.
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

Jan van der LAan-2

read.table imports the company name "GREAT FALLS GAS CO" as four  
separate columns. I think that needs to be one column. I can imagine  
that further one in your file you will have another company name that  
does not consist of four words which would cause the error you  
observed. From your output it seems that the columns are separated by  
spaces and not tabs (you would see \t in your output of readLines  
otherwise). As there are also spaces in your company names this makes  
it difficult to read the file in correctly.

Perhaps you have a fixed width file (columns are identified not by  
separator but by position in line) in which case all lines should have  
an equal length. You could check for this using the following code  
(not tested so could contain errors):

con <- file("dataset.txt", "rt")
# skip first 5 lines
lines <- readLines(con, n=5)
# initialize vector of line sizes (we'll have a growing vector which is not
# efficient, but we just want to have a a quick scan of the file)
line_sizes <- c()
# read the rest in blocks of 100.000 lines
while (TRUE) {
   lines <- readLines(con, n=1E5)
   if (length(lines) == 0) break;
   line_sizes <- c(line_sizes, nchar(lines))
}
# create a table of line sizes to check if they are all equal
table(lines_sizes)



HTH,
Jan



iliketurtles <[hidden email]> schreef:

> Dear Experienced R Practitioners,
>
> I have 4GB .txt data called "dataset.txt" and have attempted to use *ff,
> bigmemory, filehash and sqldf *packages to import it, but have had no
> success. The readLines output of this data is:
>
> readLines("dataset.txt",n=20)
>  [1] " "
>  [2] "
> "
>  [3] " "
>  [4] "  PERMNO          DATE    SHRCD    COMNAM
> PRC           VOL"
>  [5] ""
>  [6] "   10001    01/09/1986     11      GREAT FALLS GAS CO
> -5.75000         14160"
>  [7] "   10001    01/10/1986     11      GREAT FALLS GAS CO
> -5.87500             0"
>  [8] "   10001    01/13/1986     11      GREAT FALLS GAS CO
> -5.87500          2805"
>  [9] "   10001    01/14/1986     11      GREAT FALLS GAS CO
> -5.87500          2070"
> [10] "   10001    01/15/1986     11      GREAT FALLS GAS CO
> -6.06250          6000"
> [11] "   10001    01/16/1986     11      GREAT FALLS GAS CO
> -6.25000          1500"
> [12] "   10001    01/17/1986     11      GREAT FALLS GAS CO
> -6.25000          7100"
> [13] "   10001    01/20/1986     11      GREAT FALLS GAS CO
> -6.31250          1700"
> [14] "   10001    01/21/1986     11      GREAT FALLS GAS CO
> -6.18750          4000"
> [15] "   10001    01/22/1986     11      GREAT FALLS GAS CO
> -6.18750          5200"
> [16] "   10001    01/23/1986     11      GREAT FALLS GAS CO
> -6.18750          4100"
> [17] "   10001    01/24/1986     11      GREAT FALLS GAS CO
> -6.18750          1500"
> [18] "   10001    01/27/1986     11      GREAT FALLS GAS CO
> -6.18750          4000"
> [19] "   10001    01/28/1986     11      GREAT FALLS GAS CO
> -6.12500          3500"
> [20] "   10001    01/29/1986     11      GREAT FALLS GAS CO
> -6.06250          4600"
>
> This data goes on for a huge number of rows (not sure exactly how many).
> Each element in each row is separated by and uneven number of (what seem to
> be) spaces (maybe TAB? not sure). Further, there are some rows that are
> "incomplete", i.e. there's missing elements.
>
> Take the first 29 rows of "dataset.txt" into a separate data file, let's
> call it "dataset2.txt".  read.table("dataset2.txt",skip=5) gives the perfect
> table that I want to end up with, except I want it with the 4GB data through
> bigmemory, ff or filehash.
>
> read.table('dataset2.txt',skip=5)
>       V1         V2 V3    V4    V5  V6 V7      V8    V9
> 1  10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160
> 2  10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750     0
> 3  10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750  2805
> 4  10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750  2070
> 5  10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625  6000
> 6  10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500  1500
> 7  10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500  7100
> 8  10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125  1700
> 9  10001 01/21/1986 11 GREAT FALLS GAS CO -6.1875  4000
> 10 10001 01/22/1986 11 GREAT FALLS GAS CO -6.1875  5200
> 11 10001 01/23/1986 11 GREAT FALLS GAS CO -6.1875  4100
> 12 10001 01/24/1986 11 GREAT FALLS GAS CO -6.1875  1500
> 13 10001 01/27/1986 11 GREAT FALLS GAS CO -6.1875  4000
> 14 10001 01/28/1986 11 GREAT FALLS GAS CO -6.1250  3500
> 15 10001 01/29/1986 11 GREAT FALLS GAS CO -6.0625  4600
> 16 10001 01/30/1986 11 GREAT FALLS GAS CO -6.0625  3830
> 17 10001 01/31/1986 11 GREAT FALLS GAS CO -6.1250   675
> 18 10001 02/03/1986 11 GREAT FALLS GAS CO -6.1250  2300
> 19 10001 02/04/1986 11 GREAT FALLS GAS CO -6.1250  4200
> 20 10001 02/05/1986 11 GREAT FALLS GAS CO -6.1250  1000
> 21 10001 02/06/1986 11 GREAT FALLS GAS CO -6.1250  4200
> 22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250  1800
> 23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875   100
> 24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125  1500
> 25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500  2500
> 26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500  1000
> 27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875     0
> 28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875  7000
> 29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500  5200
>
>
> ###/*MY ATTEMPT USING FILEHASH*/###
> #First, try and upload the 29 row dataset (the tiny subset of the whole
> dataset).
>
> install.packages("filehash");require(filehash)
> dumpList(read.table("dataset2.txt",skip=6),dbName="db02")  #Note that DumpDF
> also works.
> env02 <- db2env(db="db02")
>
> with(env02,V1)
>
> #
> [1] 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001
> 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001
> 10001 10001
> [27] 10001 10001
> #
>
> with(env02,V2)
>
> #
> [1] 01/10/1986 01/13/1986 01/14/1986 01/15/1986 01/16/1986 01/17/1986
> 01/20/1986 01/21/1986 01/22/1986 01/23/1986 01/24/1986 01/27/1986 01/28/1986
> 01/29/1986
> [15] 01/30/1986 01/31/1986 02/03/1986 02/04/1986 02/05/1986 02/06/1986
> 02/07/1986 02/10/1986 02/11/1986 02/12/1986 02/13/1986 02/14/1986 02/18/1986
> 02/19/1986
> #
>
> #so this is all good. But when we try to import the ENTIRE data;
>
> dumpList(read.table("dataset.txt",skip=6),dbName="db01")
> #Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
> : line 1991 did not have 9 elements
>
> #This confirms that there are missing elements in later rows. So, I put
> fill=TRUE:
>
> dumpDF(read.table("dataset.txt",skip=6,fill=TRUE),dbName="db01")
> Error: cannot allocate vector of size 500.0 Mb
> In addition: Warning messages:
> 1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   Reached total allocation of 4078Mb: see help(memory.size)
> 2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   Reached total allocation of 4078Mb: see help(memory.size)
> 3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   Reached total allocation of 4078Mb: see help(memory.size)
> 4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   Reached total allocation of 4078Mb: see help(memory.size)
>
> #This occurs after a 30 minute wait.
>
>
>
> ###/*MY ATTEMPT USING FF*/###
> #First, try with the 29 row "datatset2.txt",
> # open a connection to the file
> con <- file('dataset2.txt', 'rt')
> # read the remainder using read.table.ffdf
> ffdf <- read.table.ffdf(file=con)
> # close connection
> close(con)
>
> ffdf
>
> #ffdf (all open) dim=c(29,9), dimorder=c(1,2) row.names=NULL
> #ffdf virtual mapping
> #  PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
> PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
> PhysicalIsOpen
> V1           V1      integer       integer FALSE           FALSE
> FALSE                 1                1               1           TRUE
> V2           V2      integer       integer FALSE           FALSE
> FALSE                 2                1               1           TRUE
> V3           V3      integer       integer FALSE           FALSE
> FALSE                 3                1               1           TRUE
> V4           V4      integer       integer FALSE           FALSE
> FALSE                 4                1               1           TRUE
> V5           V5      integer       integer FALSE           FALSE
> FALSE                 5                1               1           TRUE
> V6           V6      integer       integer FALSE           FALSE
> FALSE                 6                1               1           TRUE
> V7           V7      integer       integer FALSE           FALSE
> FALSE                 7                1               1           TRUE
> V8           V8       double        double FALSE           FALSE
> FALSE                 8                1               1           TRUE
> V9           V9      integer       integer FALSE           FALSE
> FALSE                 9                1               1           TRUE
> ffdf data
>            V1         V2         V3         V4         V5         V6
> V7         V8         V9
> 1  10001      01/09/1986 11         GREAT      FALLS      GAS        CO
> -5.7500    14160
> 2  10001      01/10/1986 11         GREAT      FALLS      GAS        CO
> -5.8750        0
> 3  10001      01/13/1986 11         GREAT      FALLS      GAS        CO
> -5.8750     2805
> 4  10001      01/14/1986 11         GREAT      FALLS      GAS        CO
> -5.8750     2070
> 5  10001      01/15/1986 11         GREAT      FALLS      GAS        CO
> -6.0625     6000
> 6  10001      01/16/1986 11         GREAT      FALLS      GAS        CO
> -6.2500     1500
> 7  10001      01/17/1986 11         GREAT      FALLS      GAS        CO
> -6.2500     7100
> 8  10001      01/20/1986 11         GREAT      FALLS      GAS        CO
> -6.3125     1700
> :           :          :          :          :          :          :
> :          :          :
> 22 10001      02/07/1986 11         GREAT      FALLS      GAS        CO
> -6.1250     1800
> 23 10001      02/10/1986 11         GREAT      FALLS      GAS        CO
> -6.1875      100
> 24 10001      02/11/1986 11         GREAT      FALLS      GAS        CO
> -6.3125     1500
> 25 10001      02/12/1986 11         GREAT      FALLS      GAS        CO
> -6.2500     2500
> 26 10001      02/13/1986 11         GREAT      FALLS      GAS        CO
> -6.2500     1000
> 27 10001      02/14/1986 11         GREAT      FALLS      GAS        CO
> -6.1875        0
> 28 10001      02/18/1986 11         GREAT      FALLS      GAS        CO
> -6.1875     7000
> 29 10001      02/19/1986 11         GREAT      FALLS      GAS        CO
> -6.2500     5200
>
>
> #GOOD!!! Now let's try with the 4GB "dataset.txt".
>
> # open a connection to the file
> con <- file('dataset.txt', 'rt')
> # read the remainder using read.table.ffdf
> ffdf <- read.table.ffdf(file=con)
>
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
> :
>   line 4 did not have 9 elements
>
>
> #Again, missing elements. Yet I don't know how to do something like
> fill=TRUE in read.table.ffdf (like you can in read.table()).
>
>
>
> ###/*MY ATTEMPT USING BIGMEMORY*/###
> #I can't even import "dataset2.txt", so I'm trying to figure out this before
> I try on the big "dataset.txt".
>
>  E    <- read.big.matrix("dataset2.txt", skip = 3,
>                              backingpath = getwd(),
>                              sep = " ",
>                              type = "double",
> )
>
> #results in:
>
> E[10,]
>  [1]         NA         NA         NA 10001.0000         NA         NA
> NA     1.0000         NA         NA         NA         NA    11.0000
> NA
> [15]         NA         NA         NA         NA         NA         NA
> NA         NA         NA         NA         NA         NA         NA
> NA
> [29]         NA         NA         NA         NA         NA         NA
> NA         NA         NA         NA         NA         NA         NA
> NA
> [43]         NA    -6.3125         NA         NA         NA         NA
> NA         NA         NA         NA         NA  1700.0000         NA
> NA
> [57]         NA         NA         NA         NA         NA         NA
> NA         NA         NA         NA         NA         NA         NA
> NA
> [71]         NA
>
> #So all the character and dates are being screwed up.
>
>  E    <- read.big.matrix("dataset2.txt", skip = 3,
>                              backingpath = getwd(),
>                              sep = " ",
>                              type = "char",
> )
>
> E[10,]
> [1]  NA  NA  NA  17  NA  NA  NA   1  NA  NA  NA  NA  11  NA  NA  NA  NA  NA
> NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
> NA  NA  NA
> [41]  NA  NA  NA  -6  NA  NA  NA  NA  NA  NA  NA  NA  NA -92  NA  NA  NA  NA
> NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
>
> #Even worse.
> ###/*MY ATTEMPT USING sqldf*/###
> No idea what to do here.
>
> -----
> ----
>
> Isaac
> Research Assistant
> Quantitative Finance Faculty, UTS
> --
> View this message in context:  
> http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

iliketurtles
Jan, thank you.

> table(line_sizes)
line_sizes
       0        1       97      256
    1430     2860 46869069     1430
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

David Winsemius
In reply to this post by iliketurtles

On May 4, 2012, at 1:34 AM, iliketurtles wrote:

> Dear Experienced R Practitioners,
>
> I have 4GB .txt data called "dataset.txt" and have attempted to use  
> *ff,
> bigmemory, filehash and sqldf *packages to import it, but have had no
> success. The readLines output of this data is:
>

Ther alignment of that output makes me wonder if the file is tab-
speparated. You have considered the possibility that tab is the  
separator but have you actually tried using sep = "\t" in your read  
operations?

--
David.

> readLines("dataset.txt",n=20)
> [1] " "
> [2] "
> "
> [3] " "
> [4] "  PERMNO          DATE    SHRCD    COMNAM
> PRC           VOL"
> [5] ""
> [6] "   10001    01/09/1986     11      GREAT FALLS GAS CO
> -5.75000         14160"
> [7] "   10001    01/10/1986     11      GREAT FALLS GAS CO
> -5.87500             0"
> [8] "   10001    01/13/1986     11      GREAT FALLS GAS CO
> -5.87500          2805"
> [9] "   10001    01/14/1986     11      GREAT FALLS GAS CO
> [20] "   10001    01/29/1986     11      GREAT FALLS GAS CO
> -6.06250          4600"
>
> This data goes on for a huge number of rows (not sure exactly how  
> many).
> Each element in each row is separated by and uneven number of (what  
> seem to
> be) spaces (maybe TAB? not sure). Further, there are some rows that  
> are
> "incomplete", i.e. there's missing elements.
>
> Take the first 29 rows of "dataset.txt" into a separate data file,  
> let's
> call it "dataset2.txt".  read.table("dataset2.txt",skip=5) gives the  
> perfect
> table that I want to end up with, except I want it with the 4GB data  
> through
> bigmemory, ff or filehash.

snipped several failed attempts

NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
>
> #Even worse.
> ###/*MY ATTEMPT USING sqldf*/###
> No idea what to do here.
>
> -----

David Winsemius, MD

West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

iliketurtles
Hi David,

I've tried using sep="\t" but it doesn't work, unfortunately.

Thanks for your help.
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

Jan van der LAan-2
In reply to this post by iliketurtles

OK, not all, but most lines have the same length. Perhaps you could  
write the lines with a different line size to a separate file to have  
a closer look at those lines. Modifying the previous code (again not  
tested):

con <- file("dataset.txt", "rt")
out <- file("strangelines.txt", "wt")
# skip first 5 lines
lines <- readLines(con, n=5)
# read the rest in blocks of 100.000 lines
while (TRUE) {
    lines <- readLines(con, n=1E5)
    if (length(lines) == 0) break;
    strangelines <- lines[nchar(lines) != 97]
    writeLines(strangelines, con=out)
}
close(con)
close(out)

Jan



Quoting iliketurtles <[hidden email]>:

> Jan, thank you.
>
>> table(line_sizes)
> line_sizes
>        0        1       97      256
>     1430     2860 46869069     1430
>
> -----
> ----
>
> Isaac
> Research Assistant
> Quantitative Finance Faculty, UTS
> --
> View this message in context:  
> http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862p4608172.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

iliketurtles
Your code works!

strangelines.txt was created, and it's a text file with just spacebars ... Seems like a few thousand lines of complete blanks (not 1 non-blank entry).

One thing, when I ran your code there was an error message;

> setwd("C:/Users/admin/Desktop/hons/Thesis")
> con <- file("dataset.txt", "rt")
> out <- file("strangelines.txt", "wt")
> # skip first 5 lines
> lines <- readLines(con, n=5)
> # read the rest in blocks of 100.000 lines
> while (TRUE) {
+     lines <- readLines(con, n=1E5)
+     if (length(lines) == 0) break;
+     strangelines <- lines[nchar(lines) != 97]
+     writeLines(strangelines, con=out)
+ }
Warning message:
In readLines(con, n = 1e+05) : incomplete final line found on 'dataset.txt'




I'm really not sure where to go from here. This has gone way out of my depth.
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Can't import this 4GB DATASET

Jan van der LAan-2

Perhaps you could contact the persons that supplied/created the file and
ask them what the format of the file exactly is. That is probably the
safest thing to do.

If you are sure that the lines containing only whitespace are
meaningless, then you could alter the previous code to make a copy of
the file containing only lines with a length equal to 97 characters (you
can do this by changing the '!=' to '==').

Since all lines are then of equal length, I suspect you have fixed width
file. You could open and read this file using the LaF package
(http://cran.r-project.org/web/packages/LaF/index.html; see the manual
vignette for more information). In the package ffbase
(http://cran.r-project.org/web/packages/ffbase/index.html) is a function
to convert from LaF to ff (laf_to_ffdf). I do not known if packages such
as rsqlite or bigmemory can import fixed width files.

The warning message indicates that the last line does not end with a new
line character which could indicate an incomplete file but often doesn't
mean anything. You could check the last line of the file to be sure.

HTH,

Jan



On 05/05/2012 05:21 AM, iliketurtles wrote:

> Your code works!
>
> strangelines.txt was created, and it's a text file with just spacebars ...
> Seems like a few thousand lines of complete blanks (not 1 non-blank entry).
>
> One thing, when I ran your code there was an error message;
>
>> setwd("C:/Users/admin/Desktop/hons/Thesis")
>> con<- file("dataset.txt", "rt")
>> out<- file("strangelines.txt", "wt")
>> # skip first 5 lines
>> lines<- readLines(con, n=5)
>> # read the rest in blocks of 100.000 lines
>> while (TRUE) {
> +     lines<- readLines(con, n=1E5)
> +     if (length(lines) == 0) break;
> +     strangelines<- lines[nchar(lines) != 97]
> +     writeLines(strangelines, con=out)
> + }
> Warning message:
> In readLines(con, n = 1e+05) : incomplete final line found on 'dataset.txt'
>
>
>
>
> I'm really not sure where to go from here. This has gone way out of my
> depth.
>
> -----
> ----
>
> Isaac
> Research Assistant
> Quantitative Finance Faculty, UTS
> --
> View this message in context: http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862p4610446.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...