|
Dear R users,

I've just started using the ff package. There is a csv file (~4 GB) with 7 columns and 6e+7 rows. I want to read only one column from the file, skipping the first 100 rows. Below I've provided different outcomes, which should clarify my problem.

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
...

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ff_2.2-7  bit_1.1-8

##---------------------------------------------------------------------------------------
## I want to read the second column only:
x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')

## The following command works fine:
> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1e3)
ffdf (all open) dim=c(1000,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
993  -0.4865
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240

When I extend nrows by 1, I get a warning about the number of columns:

> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1001)
ffdf (all open) dim=c(1001,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240
1001 -0.3849
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  cols = 1 != length(data) = 7
>

Going much beyond 1000 rows then fails:

> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1e4)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

The question is: why? The number of columns does not change in the file.

I would appreciate any help.

Best, Robert
|
Having had a quick look at the source code for read.table.ffdf, I suspect that using 'NULL' in the colClasses argument is not allowed. Could you try read.table.ffdf while specifying the colClasses for all columns (thereby reading in all columns in the file)? If that works, you can be quite sure that the number of columns is indeed constant in the file (sometimes a stray ' or an unquoted , can mess things up).

Jan

threshold <[hidden email]> wrote:
> [...]
> --
> View this message in context:
> http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794.html
> Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
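Jan's point about a stray quote or unquoted comma changing the column count can be checked directly with base R's count.fields(), without loading any data. The following is only a minimal sketch: the temporary file and its contents are invented stand-ins for the real csv, and the expected count of 7 is taken from the original post.

```r
# Hypothetical stand-in for the real file: 7 columns, with one malformed line.
csvfile <- tempfile(fileext = ".csv")
good <- rep("1,-0.5412,3,4,5,6,7", 5)
bad  <- "1,-0.5842,3,4,5,6,7,8"          # one field too many
writeLines(c(good[1:3], bad, good[4:5]), csvfile)

# Number of comma-separated fields on each line. For the real file,
# skip = 100 would mirror the skip used in the read.csv.ffdf calls.
n_fields <- count.fields(csvfile, sep = ",", quote = "\"",
                         skip = 0, comment.char = "")

table(n_fields)                    # distribution of field counts
bad_lines <- which(n_fields != 7)  # positions (after skip) that deviate
bad_lines
```

On the real file this scans all 6e+7 lines once, so it takes a while, but it only keeps one integer per line in memory.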
|
Dear Jan, thank you for your answer.

I am basically following the code I've been using with read.table, where
x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')
has been working fine.

Reading all columns works for me but takes much longer than my time constraints allow (460 such sets, plus time for processing). The number of columns remains 7 over the whole data set.

Best, Robert
|
In reply to this post by Jan van der LAan-2
..plus I get the following message after reading the whole set (all 7 columns):
> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, first.rows=1000, next.rows=1e7, VERBOSE=TRUE)
read.table.ffdf 1..1000 (1000) csv-read=0.02sec ffdf-write=0.08sec
read.table.ffdf 1001..10001000 (10000000) csv-read=282.16sec ffdf-write=65.01sec
read.table.ffdf 10001001..20001000 (10000000) csv-read=240.3sec ffdf-write=63.84sec
read.table.ffdf 20001001..30001000 (10000000) csv-read=213.78sec ffdf-write=149.2sec
read.table.ffdf 30001001..40001000 (10000000) csv-read=217.36sec ffdf-write=379.8sec
read.table.ffdf 40001001..50001000 (10000000) csv-read=541.28sec
Error: cannot allocate vector of size 381.5 Mb
In addition: There were 14 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)
2: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)
|
In reply to this post by R Kozarski
Looking at the source code for read.table.ffdf, what seems to happen is that when the first block of data is read by read.table (by default 1000 lines), the specified colClasses are used. In subsequent calls the types of the columns of the ffdf object are used as colClasses. In your case the ffdf object had only one column. This probably causes the error.

What you could try is to use the packages ffbase and LaF (untested):

    library(ffbase)
    library(LaF)
    x.class <- c('character', 'numeric','character','character',
                 'character', 'character', 'character')
    laf <- laf_open_csv(file=csvfile, header=FALSE, skip=100, column_types=x.class)
    yourdata <- laf_to_ffdf(laf, columns=2)

I specify column type 'character' because a type is needed for every column. However, by using the columns=2 argument only the second column is read.

It looks like you have a decent amount of memory, so you could also try

    yourdata <- laf[,2]

to read the data in as a standard R vector.

HTH,
Jan
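The explanation above (the full seven-element colClasses is only honoured for the first block) suggests a workaround that stays in base R: read the file in chunks from one open connection and pass the same colClasses to every chunk. What follows is a rough sketch under stated assumptions, not what ff or LaF do internally; the small temporary file, the chunk size, and the names chunk_rows, chunks and v2 are all invented for illustration.

```r
# Hypothetical small file standing in for the real 7-column csv.
csvfile <- tempfile(fileext = ".csv")
set.seed(1)
m <- matrix(round(rnorm(7 * 2600), 4), ncol = 7)
write.table(m, csvfile, sep = ",", row.names = FALSE, col.names = FALSE)

x.class    <- c("NULL", "numeric", rep("NULL", 5))  # keep column 2 only
skip       <- 100
chunk_rows <- 1000          # assumption: tune to the available memory

con <- file(csvfile, open = "r")
invisible(readLines(con, n = skip))   # discard the first `skip` lines

chunks <- list()
repeat {
  # Reading from an open connection continues where the previous call
  # stopped, so the same full-length colClasses applies to every chunk.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, colClasses = x.class, nrows = chunk_rows),
    error = function(e) NULL  # read.table signals an error when no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk[[1]]
  if (nrow(chunk) < chunk_rows) break  # short chunk: end of file reached
}
close(con)

v2 <- unlist(chunks)   # second column as a plain numeric vector
length(v2)
```

The tryCatch swallows every error, which keeps the sketch short but would also hide a genuinely malformed chunk; in real use it would be safer to inspect the error message before treating it as end of file. For 6e+7 rows the result is a numeric vector of roughly 480 MB, so each chunk could instead be appended to an ff vector if memory is tight.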
|
In reply to this post by R Kozarski
You probably have a character column (which is converted to factor) or a factor column with a large number of distinct values. All the levels of a factor are stored in memory in ff.

Jan
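One way to find out which column is responsible is to read a modest sample as plain strings and count the distinct values per column. This is only an illustrative sketch in base R: the sample file, its three columns and the sample size are made up, and a high count in a sample merely hints that the levels would keep growing over the full file.

```r
# Hypothetical sample: V1 is an ID-like string, V3 a low-cardinality code.
set.seed(42)
csvfile <- tempfile(fileext = ".csv")
n <- 5000
d <- data.frame(V1 = sprintf("id%06d", seq_len(n)),
                V2 = round(rnorm(n), 4),
                V3 = sample(c("a", "b", "c"), n, replace = TRUE))
write.table(d, csvfile, sep = ",", row.names = FALSE, col.names = FALSE)

# Read the sample as character so nothing is converted to factor yet.
smp <- read.csv(csvfile, header = FALSE, nrows = n,
                colClasses = "character")

# Distinct values per column; a count close to nrow(smp) flags a column
# whose factor levels would grow with the number of rows read.
n_distinct <- vapply(smp, function(x) length(unique(x)), integer(1))
n_distinct
```

A column flagged this way could be skipped, or read as a numeric type if its content allows, so that no large set of levels has to be held in memory.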
