|
Dear Experienced R Practitioners,
I have 4GB .txt data called "dataset.txt" and have attempted to use ff, bigmemory, filehash and sqldf packages to import it, but have had no success. The readLines output of this data is: readLines("dataset.txt",n=20) [1] " " [2] " " [3] " " [4] " PERMNO DATE SHRCD COMNAM PRC VOL" [5] "" [6] " 10001 01/09/1986 11 GREAT FALLS GAS CO -5.75000 14160" [7] " 10001 01/10/1986 11 GREAT FALLS GAS CO -5.87500 0" [8] " 10001 01/13/1986 11 GREAT FALLS GAS CO -5.87500 2805" [9] " 10001 01/14/1986 11 GREAT FALLS GAS CO -5.87500 2070" [10] " 10001 01/15/1986 11 GREAT FALLS GAS CO -6.06250 6000" [11] " 10001 01/16/1986 11 GREAT FALLS GAS CO -6.25000 1500" [12] " 10001 01/17/1986 11 GREAT FALLS GAS CO -6.25000 7100" [13] " 10001 01/20/1986 11 GREAT FALLS GAS CO -6.31250 1700" [14] " 10001 01/21/1986 11 GREAT FALLS GAS CO -6.18750 4000" [15] " 10001 01/22/1986 11 GREAT FALLS GAS CO -6.18750 5200" [16] " 10001 01/23/1986 11 GREAT FALLS GAS CO -6.18750 4100" [17] " 10001 01/24/1986 11 GREAT FALLS GAS CO -6.18750 1500" [18] " 10001 01/27/1986 11 GREAT FALLS GAS CO -6.18750 4000" [19] " 10001 01/28/1986 11 GREAT FALLS GAS CO -6.12500 3500" [20] " 10001 01/29/1986 11 GREAT FALLS GAS CO -6.06250 4600" This data goes on for a huge number of rows (not sure exactly how many). Each element in each row is separated by and uneven number of (what seem to be) spaces (maybe TAB? not sure). Further, there are some rows that are "incomplete", i.e. there's missing elements. Take the first 29 rows of "dataset.txt" into a separate data file, let's call it "dataset2.txt". read.table("dataset2.txt",skip=5) gives the perfect table that I want to end up with, except I want it with the 4GB data through bigmemory, ff or filehash. read.table('dataset2.txt',skip=5) V1 V2 V3 V4 V5 V6 V7 V8 V9 1 10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160 2 10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750 0 3 10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750 2805 4 10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750 2070 5 10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625 6000 6 10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500 1500 7 10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500 7100 8 10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125 1700 9 10001 01/21/1986 11 GREAT FALLS GAS CO -6.1875 4000 10 10001 01/22/1986 11 GREAT FALLS GAS CO -6.1875 5200 11 10001 01/23/1986 11 GREAT FALLS GAS CO -6.1875 4100 12 10001 01/24/1986 11 GREAT FALLS GAS CO -6.1875 1500 13 10001 01/27/1986 11 GREAT FALLS GAS CO -6.1875 4000 14 10001 01/28/1986 11 GREAT FALLS GAS CO -6.1250 3500 15 10001 01/29/1986 11 GREAT FALLS GAS CO -6.0625 4600 16 10001 01/30/1986 11 GREAT FALLS GAS CO -6.0625 3830 17 10001 01/31/1986 11 GREAT FALLS GAS CO -6.1250 675 18 10001 02/03/1986 11 GREAT FALLS GAS CO -6.1250 2300 19 10001 02/04/1986 11 GREAT FALLS GAS CO -6.1250 4200 20 10001 02/05/1986 11 GREAT FALLS GAS CO -6.1250 1000 21 10001 02/06/1986 11 GREAT FALLS GAS CO -6.1250 4200 22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250 1800 23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875 100 24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125 1500 25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500 2500 26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500 1000 27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875 0 28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875 7000 29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500 5200 ###MY ATTEMPT USING FILEHASH### #First, try and upload the 29 row dataset (the tiny subset of the whole dataset). install.packages("filehash");require(filehash) dumpList(read.table("dataset2.txt",skip=6),dbName="db02") #Note that DumpDF also works. env02 <- db2env(db="db02") with(env02,V1) # [1] 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 [27] 10001 10001 # with(env02,V2) # [1] 01/10/1986 01/13/1986 01/14/1986 01/15/1986 01/16/1986 01/17/1986 01/20/1986 01/21/1986 01/22/1986 01/23/1986 01/24/1986 01/27/1986 01/28/1986 01/29/1986 [15] 01/30/1986 01/31/1986 02/03/1986 02/04/1986 02/05/1986 02/06/1986 02/07/1986 02/10/1986 02/11/1986 02/12/1986 02/13/1986 02/14/1986 02/18/1986 02/19/1986 # #so this is all good. But when we try to import the ENTIRE data; dumpList(read.table("dataset.txt",skip=6),dbName="db01") #Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1991 did not have 9 elements #This confirms that there are missing elements in later rows. So, I put fill=TRUE: dumpDF(read.table("dataset.txt",skip=6,fill=TRUE),dbName="db01") Error: cannot allocate vector of size 500.0 Mb In addition: Warning messages: 1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 4078Mb: see help(memory.size) 2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 4078Mb: see help(memory.size) 3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 4078Mb: see help(memory.size) 4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 4078Mb: see help(memory.size) #This occurs after a 30 minute wait. ###MY ATTEMPT USING FF### #First, try with the 29 row "datatset2.txt", # open a connection to the file con <- file('dataset2.txt', 'rt') # read the remainder using read.table.ffdf ffdf <- read.table.ffdf(file=con) # close connection close(con) ffdf #ffdf (all open) dim=c(29,9), dimorder=c(1,2) row.names=NULL #ffdf virtual mapping # PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen V1 V1 integer integer FALSE FALSE FALSE 1 1 1 TRUE V2 V2 integer integer FALSE FALSE FALSE 2 1 1 TRUE V3 V3 integer integer FALSE FALSE FALSE 3 1 1 TRUE V4 V4 integer integer FALSE FALSE FALSE 4 1 1 TRUE V5 V5 integer integer FALSE FALSE FALSE 5 1 1 TRUE V6 V6 integer integer FALSE FALSE FALSE 6 1 1 TRUE V7 V7 integer integer FALSE FALSE FALSE 7 1 1 TRUE V8 V8 double double FALSE FALSE FALSE 8 1 1 TRUE V9 V9 integer integer FALSE FALSE FALSE 9 1 1 TRUE ffdf data V1 V2 V3 V4 V5 V6 V7 V8 V9 1 10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160 2 10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750 0 3 10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750 2805 4 10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750 2070 5 10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625 6000 6 10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500 1500 7 10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500 7100 8 10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125 1700 : : : : : : : : : : 22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250 1800 23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875 100 24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125 1500 25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500 2500 26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500 1000 27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875 0 28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875 7000 29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500 5200 #GOOD!!! Now let's try with the 4GB "dataset.txt". # open a connection to the file con <- file('dataset.txt', 'rt') # read the remainder using read.table.ffdf ffdf <- read.table.ffdf(file=con) Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 4 did not have 9 elements #Again, missing elements. Yet I don't know how to do something like fill=TRUE in read.table.ffdf (like you can in read.table()). ###MY ATTEMPT USING BIGMEMORY### #I can't even import "dataset2.txt", so I'm trying to figure out this before I try on the big "dataset.txt". E <- read.big.matrix("dataset2.txt", skip = 3, backingpath = getwd(), sep = " ", type = "double", ) #results in: E[10,] [1] NA NA NA 10001.0000 NA NA NA 1.0000 NA NA NA NA 11.0000 NA [15] NA NA NA NA NA NA NA NA NA NA NA NA NA NA [29] NA NA NA NA NA NA NA NA NA NA NA NA NA NA [43] NA -6.3125 NA NA NA NA NA NA NA NA NA 1700.0000 NA NA [57] NA NA NA NA NA NA NA NA NA NA NA NA NA NA [71] NA #So all the character and dates are being screwed up. E <- read.big.matrix("dataset2.txt", skip = 3, backingpath = getwd(), sep = " ", type = "char", ) E[10,] [1] NA NA NA 17 NA NA NA 1 NA NA NA NA 11 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [41] NA NA NA -6 NA NA NA NA NA NA NA NA NA -92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA #Even worse. ###MY ATTEMPT USING sqldf### No idea what to do here.
----
Isaac Research Assistant Quantitative Finance Faculty, UTS |
|
read.table imports the company name "GREAT FALLS GAS CO" as four separate columns. I think that needs to be one column. I can imagine that further one in your file you will have another company name that does not consist of four words which would cause the error you observed. From your output it seems that the columns are separated by spaces and not tabs (you would see \t in your output of readLines otherwise). As there are also spaces in your company names this makes it difficult to read the file in correctly. Perhaps you have a fixed width file (columns are identified not by separator but by position in line) in which case all lines should have an equal length. You could check for this using the following code (not tested so could contain errors): con <- file("dataset.txt", "rt") # skip first 5 lines lines <- readLines(con, n=5) # initialize vector of line sizes (we'll have a growing vector which is not # efficient, but we just want to have a a quick scan of the file) line_sizes <- c() # read the rest in blocks of 100.000 lines while (TRUE) { lines <- readLines(con, n=1E5) if (length(lines) == 0) break; line_sizes <- c(line_sizes, nchar(lines)) } # create a table of line sizes to check if they are all equal table(lines_sizes) HTH, Jan iliketurtles <[hidden email]> schreef: > Dear Experienced R Practitioners, > > I have 4GB .txt data called "dataset.txt" and have attempted to use *ff, > bigmemory, filehash and sqldf *packages to import it, but have had no > success. The readLines output of this data is: > > readLines("dataset.txt",n=20) > [1] " " > [2] " > " > [3] " " > [4] " PERMNO DATE SHRCD COMNAM > PRC VOL" > [5] "" > [6] " 10001 01/09/1986 11 GREAT FALLS GAS CO > -5.75000 14160" > [7] " 10001 01/10/1986 11 GREAT FALLS GAS CO > -5.87500 0" > [8] " 10001 01/13/1986 11 GREAT FALLS GAS CO > -5.87500 2805" > [9] " 10001 01/14/1986 11 GREAT FALLS GAS CO > -5.87500 2070" > [10] " 10001 01/15/1986 11 GREAT FALLS GAS CO > -6.06250 6000" > [11] " 10001 01/16/1986 11 GREAT FALLS GAS CO > -6.25000 1500" > [12] " 10001 01/17/1986 11 GREAT FALLS GAS CO > -6.25000 7100" > [13] " 10001 01/20/1986 11 GREAT FALLS GAS CO > -6.31250 1700" > [14] " 10001 01/21/1986 11 GREAT FALLS GAS CO > -6.18750 4000" > [15] " 10001 01/22/1986 11 GREAT FALLS GAS CO > -6.18750 5200" > [16] " 10001 01/23/1986 11 GREAT FALLS GAS CO > -6.18750 4100" > [17] " 10001 01/24/1986 11 GREAT FALLS GAS CO > -6.18750 1500" > [18] " 10001 01/27/1986 11 GREAT FALLS GAS CO > -6.18750 4000" > [19] " 10001 01/28/1986 11 GREAT FALLS GAS CO > -6.12500 3500" > [20] " 10001 01/29/1986 11 GREAT FALLS GAS CO > -6.06250 4600" > > This data goes on for a huge number of rows (not sure exactly how many). > Each element in each row is separated by and uneven number of (what seem to > be) spaces (maybe TAB? not sure). Further, there are some rows that are > "incomplete", i.e. there's missing elements. > > Take the first 29 rows of "dataset.txt" into a separate data file, let's > call it "dataset2.txt". read.table("dataset2.txt",skip=5) gives the perfect > table that I want to end up with, except I want it with the 4GB data through > bigmemory, ff or filehash. > > read.table('dataset2.txt',skip=5) > V1 V2 V3 V4 V5 V6 V7 V8 V9 > 1 10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160 > 2 10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750 0 > 3 10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750 2805 > 4 10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750 2070 > 5 10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625 6000 > 6 10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500 1500 > 7 10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500 7100 > 8 10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125 1700 > 9 10001 01/21/1986 11 GREAT FALLS GAS CO -6.1875 4000 > 10 10001 01/22/1986 11 GREAT FALLS GAS CO -6.1875 5200 > 11 10001 01/23/1986 11 GREAT FALLS GAS CO -6.1875 4100 > 12 10001 01/24/1986 11 GREAT FALLS GAS CO -6.1875 1500 > 13 10001 01/27/1986 11 GREAT FALLS GAS CO -6.1875 4000 > 14 10001 01/28/1986 11 GREAT FALLS GAS CO -6.1250 3500 > 15 10001 01/29/1986 11 GREAT FALLS GAS CO -6.0625 4600 > 16 10001 01/30/1986 11 GREAT FALLS GAS CO -6.0625 3830 > 17 10001 01/31/1986 11 GREAT FALLS GAS CO -6.1250 675 > 18 10001 02/03/1986 11 GREAT FALLS GAS CO -6.1250 2300 > 19 10001 02/04/1986 11 GREAT FALLS GAS CO -6.1250 4200 > 20 10001 02/05/1986 11 GREAT FALLS GAS CO -6.1250 1000 > 21 10001 02/06/1986 11 GREAT FALLS GAS CO -6.1250 4200 > 22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250 1800 > 23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875 100 > 24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125 1500 > 25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500 2500 > 26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500 1000 > 27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875 0 > 28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875 7000 > 29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500 5200 > > > ###/*MY ATTEMPT USING FILEHASH*/### > #First, try and upload the 29 row dataset (the tiny subset of the whole > dataset). > > install.packages("filehash");require(filehash) > dumpList(read.table("dataset2.txt",skip=6),dbName="db02") #Note that DumpDF > also works. > env02 <- db2env(db="db02") > > with(env02,V1) > > # > [1] 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 > 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 > 10001 10001 > [27] 10001 10001 > # > > with(env02,V2) > > # > [1] 01/10/1986 01/13/1986 01/14/1986 01/15/1986 01/16/1986 01/17/1986 > 01/20/1986 01/21/1986 01/22/1986 01/23/1986 01/24/1986 01/27/1986 01/28/1986 > 01/29/1986 > [15] 01/30/1986 01/31/1986 02/03/1986 02/04/1986 02/05/1986 02/06/1986 > 02/07/1986 02/10/1986 02/11/1986 02/12/1986 02/13/1986 02/14/1986 02/18/1986 > 02/19/1986 > # > > #so this is all good. But when we try to import the ENTIRE data; > > dumpList(read.table("dataset.txt",skip=6),dbName="db01") > #Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, > : line 1991 did not have 9 elements > > #This confirms that there are missing elements in later rows. So, I put > fill=TRUE: > > dumpDF(read.table("dataset.txt",skip=6,fill=TRUE),dbName="db01") > Error: cannot allocate vector of size 500.0 Mb > In addition: Warning messages: > 1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > Reached total allocation of 4078Mb: see help(memory.size) > 2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > Reached total allocation of 4078Mb: see help(memory.size) > 3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > Reached total allocation of 4078Mb: see help(memory.size) > 4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > Reached total allocation of 4078Mb: see help(memory.size) > > #This occurs after a 30 minute wait. > > > > ###/*MY ATTEMPT USING FF*/### > #First, try with the 29 row "datatset2.txt", > # open a connection to the file > con <- file('dataset2.txt', 'rt') > # read the remainder using read.table.ffdf > ffdf <- read.table.ffdf(file=con) > # close connection > close(con) > > ffdf > > #ffdf (all open) dim=c(29,9), dimorder=c(1,2) row.names=NULL > #ffdf virtual mapping > # PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix > PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol > PhysicalIsOpen > V1 V1 integer integer FALSE FALSE > FALSE 1 1 1 TRUE > V2 V2 integer integer FALSE FALSE > FALSE 2 1 1 TRUE > V3 V3 integer integer FALSE FALSE > FALSE 3 1 1 TRUE > V4 V4 integer integer FALSE FALSE > FALSE 4 1 1 TRUE > V5 V5 integer integer FALSE FALSE > FALSE 5 1 1 TRUE > V6 V6 integer integer FALSE FALSE > FALSE 6 1 1 TRUE > V7 V7 integer integer FALSE FALSE > FALSE 7 1 1 TRUE > V8 V8 double double FALSE FALSE > FALSE 8 1 1 TRUE > V9 V9 integer integer FALSE FALSE > FALSE 9 1 1 TRUE > ffdf data > V1 V2 V3 V4 V5 V6 > V7 V8 V9 > 1 10001 01/09/1986 11 GREAT FALLS GAS CO > -5.7500 14160 > 2 10001 01/10/1986 11 GREAT FALLS GAS CO > -5.8750 0 > 3 10001 01/13/1986 11 GREAT FALLS GAS CO > -5.8750 2805 > 4 10001 01/14/1986 11 GREAT FALLS GAS CO > -5.8750 2070 > 5 10001 01/15/1986 11 GREAT FALLS GAS CO > -6.0625 6000 > 6 10001 01/16/1986 11 GREAT FALLS GAS CO > -6.2500 1500 > 7 10001 01/17/1986 11 GREAT FALLS GAS CO > -6.2500 7100 > 8 10001 01/20/1986 11 GREAT FALLS GAS CO > -6.3125 1700 > : : : : : : : > : : : > 22 10001 02/07/1986 11 GREAT FALLS GAS CO > -6.1250 1800 > 23 10001 02/10/1986 11 GREAT FALLS GAS CO > -6.1875 100 > 24 10001 02/11/1986 11 GREAT FALLS GAS CO > -6.3125 1500 > 25 10001 02/12/1986 11 GREAT FALLS GAS CO > -6.2500 2500 > 26 10001 02/13/1986 11 GREAT FALLS GAS CO > -6.2500 1000 > 27 10001 02/14/1986 11 GREAT FALLS GAS CO > -6.1875 0 > 28 10001 02/18/1986 11 GREAT FALLS GAS CO > -6.1875 7000 > 29 10001 02/19/1986 11 GREAT FALLS GAS CO > -6.2500 5200 > > > #GOOD!!! Now let's try with the 4GB "dataset.txt". > > # open a connection to the file > con <- file('dataset.txt', 'rt') > # read the remainder using read.table.ffdf > ffdf <- read.table.ffdf(file=con) > > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, > : > line 4 did not have 9 elements > > > #Again, missing elements. Yet I don't know how to do something like > fill=TRUE in read.table.ffdf (like you can in read.table()). > > > > ###/*MY ATTEMPT USING BIGMEMORY*/### > #I can't even import "dataset2.txt", so I'm trying to figure out this before > I try on the big "dataset.txt". > > E <- read.big.matrix("dataset2.txt", skip = 3, > backingpath = getwd(), > sep = " ", > type = "double", > ) > > #results in: > > E[10,] > [1] NA NA NA 10001.0000 NA NA > NA 1.0000 NA NA NA NA 11.0000 > NA > [15] NA NA NA NA NA NA > NA NA NA NA NA NA NA > NA > [29] NA NA NA NA NA NA > NA NA NA NA NA NA NA > NA > [43] NA -6.3125 NA NA NA NA > NA NA NA NA NA 1700.0000 NA > NA > [57] NA NA NA NA NA NA > NA NA NA NA NA NA NA > NA > [71] NA > > #So all the character and dates are being screwed up. > > E <- read.big.matrix("dataset2.txt", skip = 3, > backingpath = getwd(), > sep = " ", > type = "char", > ) > > E[10,] > [1] NA NA NA 17 NA NA NA 1 NA NA NA NA 11 NA NA NA NA NA > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > NA NA NA > [41] NA NA NA -6 NA NA NA NA NA NA NA NA NA -92 NA NA NA NA > NA NA NA NA NA NA NA NA NA NA NA NA NA > > #Even worse. > ###/*MY ATTEMPT USING sqldf*/### > No idea what to do here. > > ----- > ---- > > Isaac > Research Assistant > Quantitative Finance Faculty, UTS > -- > View this message in context: > http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
|
Jan, thank you.
> table(line_sizes) line_sizes 0 1 97 256 1430 2860 46869069 1430
----
Isaac Research Assistant Quantitative Finance Faculty, UTS |
|
In reply to this post by iliketurtles
On May 4, 2012, at 1:34 AM, iliketurtles wrote: > Dear Experienced R Practitioners, > > I have 4GB .txt data called "dataset.txt" and have attempted to use > *ff, > bigmemory, filehash and sqldf *packages to import it, but have had no > success. The readLines output of this data is: > Ther alignment of that output makes me wonder if the file is tab- speparated. You have considered the possibility that tab is the separator but have you actually tried using sep = "\t" in your read operations? -- David. > readLines("dataset.txt",n=20) > [1] " " > [2] " > " > [3] " " > [4] " PERMNO DATE SHRCD COMNAM > PRC VOL" > [5] "" > [6] " 10001 01/09/1986 11 GREAT FALLS GAS CO > -5.75000 14160" > [7] " 10001 01/10/1986 11 GREAT FALLS GAS CO > -5.87500 0" > [8] " 10001 01/13/1986 11 GREAT FALLS GAS CO > -5.87500 2805" > [9] " 10001 01/14/1986 11 GREAT FALLS GAS CO > [20] " 10001 01/29/1986 11 GREAT FALLS GAS CO > -6.06250 4600" > > This data goes on for a huge number of rows (not sure exactly how > many). > Each element in each row is separated by and uneven number of (what > seem to > be) spaces (maybe TAB? not sure). Further, there are some rows that > are > "incomplete", i.e. there's missing elements. > > Take the first 29 rows of "dataset.txt" into a separate data file, > let's > call it "dataset2.txt". read.table("dataset2.txt",skip=5) gives the > perfect > table that I want to end up with, except I want it with the 4GB data > through > bigmemory, ff or filehash. snipped several failed attempts NA NA NA NA NA NA NA NA NA NA NA NA NA > > #Even worse. > ###/*MY ATTEMPT USING sqldf*/### > No idea what to do here. > > ----- David Winsemius, MD West Hartford, CT ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
|
Hi David,
I've tried using sep="\t" but it doesn't work, unfortunately. Thanks for your help.
----
Isaac Research Assistant Quantitative Finance Faculty, UTS |
|
In reply to this post by iliketurtles
OK, not all, but most lines have the same length. Perhaps you could write the lines with a different line size to a separate file to have a closer look at those lines. Modifying the previous code (again not tested): con <- file("dataset.txt", "rt") out <- file("strangelines.txt", "wt") # skip first 5 lines lines <- readLines(con, n=5) # read the rest in blocks of 100.000 lines while (TRUE) { lines <- readLines(con, n=1E5) if (length(lines) == 0) break; strangelines <- lines[nchar(lines) != 97] writeLines(strangelines, con=out) } close(con) close(out) Jan Quoting iliketurtles <[hidden email]>: > Jan, thank you. > >> table(line_sizes) > line_sizes > 0 1 97 256 > 1430 2860 46869069 1430 > > ----- > ---- > > Isaac > Research Assistant > Quantitative Finance Faculty, UTS > -- > View this message in context: > http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862p4608172.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
|
Your code works!
strangelines.txt was created, and it's a text file with just spacebars ... Seems like a few thousand lines of complete blanks (not 1 non-blank entry). One thing, when I ran your code there was an error message; > setwd("C:/Users/admin/Desktop/hons/Thesis") > con <- file("dataset.txt", "rt") > out <- file("strangelines.txt", "wt") > # skip first 5 lines > lines <- readLines(con, n=5) > # read the rest in blocks of 100.000 lines > while (TRUE) { + lines <- readLines(con, n=1E5) + if (length(lines) == 0) break; + strangelines <- lines[nchar(lines) != 97] + writeLines(strangelines, con=out) + } Warning message: In readLines(con, n = 1e+05) : incomplete final line found on 'dataset.txt' I'm really not sure where to go from here. This has gone way out of my depth.
----
Isaac Research Assistant Quantitative Finance Faculty, UTS |
|
Perhaps you could contact the persons that supplied/created the file and ask them what the format of the file exactly is. That is probably the safest thing to do. If you are sure that the lines containing only whitespace are meaningless, then you could alter the previous code to make a copy of the file containing only lines with a length equal to 97 characters (you can do this by changing the '!=' to '=='). Since all lines are then of equal length, I suspect you have fixed width file. You could open and read this file using the LaF package (http://cran.r-project.org/web/packages/LaF/index.html; see the manual vignette for more information). In the package ffbase (http://cran.r-project.org/web/packages/ffbase/index.html) is a function to convert from LaF to ff (laf_to_ffdf). I do not known if packages such as rsqlite or bigmemory can import fixed width files. The warning message indicates that the last line does not end with a new line character which could indicate an incomplete file but often doesn't mean anything. You could check the last line of the file to be sure. HTH, Jan On 05/05/2012 05:21 AM, iliketurtles wrote: > Your code works! > > strangelines.txt was created, and it's a text file with just spacebars ... > Seems like a few thousand lines of complete blanks (not 1 non-blank entry). > > One thing, when I ran your code there was an error message; > >> setwd("C:/Users/admin/Desktop/hons/Thesis") >> con<- file("dataset.txt", "rt") >> out<- file("strangelines.txt", "wt") >> # skip first 5 lines >> lines<- readLines(con, n=5) >> # read the rest in blocks of 100.000 lines >> while (TRUE) { > + lines<- readLines(con, n=1E5) > + if (length(lines) == 0) break; > + strangelines<- lines[nchar(lines) != 97] > + writeLines(strangelines, con=out) > + } > Warning message: > In readLines(con, n = 1e+05) : incomplete final line found on 'dataset.txt' > > > > > I'm really not sure where to go from here. This has gone way out of my > depth. > > ----- > ---- > > Isaac > Research Assistant > Quantitative Finance Faculty, UTS > -- > View this message in context: http://r.789695.n4.nabble.com/Can-t-import-this-4GB-DATASET-tp4607862p4610446.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
| Powered by Nabble | Edit this page |
