ff package: reading selected columns from csv

R Kozarski
Dear R users, I've just started using the ff package.

There is a csv file (~4 GB) with 7 columns and 6e+7 rows. I want to read only one column from the file, skipping the first 100 rows.
Below I've provided several outcomes that illustrate my problem.

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
...

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
[1] ff_2.2-7  bit_1.1-8

##---------------------------------------------------------------------------------------
## I want to read the second column only:
x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')

## The following command works fine:

>     read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1e3)
ffdf (all open) dim=c(1000,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
993  -0.4865
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240

When I increase nrows by 1, I get a warning about the number of columns:

>     read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1001)
ffdf (all open) dim=c(1001,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240
1001 -0.3849
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  cols = 1 != length(data) = 7
>

Going well beyond 1000 rows then fails with an error:
>     read.csv.ffdf(file=csvfile, header=FALSE, skip=100, colClasses=x.class, nrows=1e4)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

The question is: why? The number of columns does not change in the file.

I would appreciate any help.


Best, Robert



Re: ff package: reading selected columns from csv

Jan van der LAan-2

Having had a quick look at the source code for read.table.ffdf, I
suspect that using 'NULL' in the colClasses argument is not allowed.
Could you try read.table.ffdf while specifying the colClasses for all
columns (thereby reading in all columns in the file)? If that works,
you can be quite sure that the number of columns is indeed constant
in the file (sometimes a stray ' or an unquoted , can mess things up).

Jan
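
The constant-column check suggested above can also be done without loading the data at all. The following is a minimal sketch using count.fields from base R on a small made-up file; the file contents and the names demo_file, field_counts and bad_lines are assumptions for illustration, and on the real file one would also pass skip=100.

```r
# Build a small demo file: 7 comma-separated fields per row,
# with one row broken on purpose (hypothetical data).
demo_file <- tempfile(fileext = ".csv")
good_rows <- sprintf("%d,%.4f,a,b,c,d,e", 1:5, -0.5 - (1:5) / 100)
bad_row   <- "6,-0.5600,a,b,c,d,e,extra"   # stray comma adds an 8th field
writeLines(c(good_rows, bad_row), demo_file)

# Count the fields on every line, using the same quote and comment
# settings that read.csv uses by default.
field_counts <- count.fields(demo_file, sep = ",", quote = "\"",
                             comment.char = "")

# Tabulate the counts and list the lines that do not have 7 fields.
print(table(field_counts))
bad_lines <- which(field_counts != 7)
print(bad_lines)
```

If every line reports 7 fields, a varying column count can be ruled out as the cause.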




threshold <[hidden email]> wrote:

> --
> View this message in context:  
> http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: ff package: reading selected columns from csv

R Kozarski
Dear Jan, thank you for your answer.
I am basically following the code I've been using with read.table, where
x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')
has been working fine.

Reading all columns works for me, but it takes much longer than my time constraints allow (there are 460 such sets, plus the time for processing). The number of columns remains 7 over the whole data set.

Best, Robert
 

Re: ff package: reading selected columns from csv

R Kozarski
In reply to this post by Jan van der LAan-2
In addition, I get the following error when reading the whole set (all 7 columns):

> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, first.rows=1000, next.rows=1e7, VERBOSE=TRUE)

read.table.ffdf 1..1000 (1000)  csv-read=0.02sec ffdf-write=0.08sec
read.table.ffdf 1001..10001000 (10000000)  csv-read=282.16sec ffdf-write=65.01sec
read.table.ffdf 10001001..20001000 (10000000)  csv-read=240.3sec ffdf-write=63.84sec
read.table.ffdf 20001001..30001000 (10000000)  csv-read=213.78sec ffdf-write=149.2sec
read.table.ffdf 30001001..40001000 (10000000)  csv-read=217.36sec ffdf-write=379.8sec
read.table.ffdf 40001001..50001000 (10000000)  csv-read=541.28sec
Error: cannot allocate vector of size 381.5 Mb
In addition: There were 14 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)
2: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)

Re: ff package: reading selected columns from csv

Jan van der LAan-2
In reply to this post by R Kozarski

Looking at the source code for read.table.ffdf, what seems to happen is
that when the first block of data is read by read.table (1000 lines by
default), the specified colClasses are used. In subsequent calls the
types of the columns of the ffdf object are used as colClasses. In
your case the ffdf object has only one column. This probably causes
the error.
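
The mismatch described above can be reproduced with plain read.table. This is only a sketch of one plausible way it surfaces: it assumes that later blocks are parsed with the one-column layout (names and types) of the first result, which has not been checked against the ff source. The objects demo_lines, first_block and later_block are made up for illustration.

```r
# Seven comma-separated fields per line, as in the file from this thread.
demo_lines <- c("1,-0.5412,a,b,c,d,e",
                "2,-0.5842,a,b,c,d,e")

# First block: the full 7-element colClasses drops six columns, keeps V2.
x_class <- c("NULL", "numeric", rep("NULL", 5))
first_block <- read.table(text = demo_lines, sep = ",", header = FALSE,
                          colClasses = x_class)
print(names(first_block))

# Later block (assumption): parsed with the layout of the one-column
# result instead of the original 7-element specification.
later_block <- tryCatch(
  read.table(text = demo_lines, sep = ",", header = FALSE,
             col.names = names(first_block), colClasses = "numeric"),
  error = function(e) conditionMessage(e)
)
print(later_block)   # the error message, captured as a string
```

Seven fields meet a one-column description, so read.table stops instead of silently dropping the other six fields.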

What you could try is to use the packages ffbase and LaF (untested):

library(ffbase)
library(LaF)

# LaF column types are 'double', 'integer', 'categorical' and 'string'
x.class <- c('string', 'double', 'string', 'string',
     'string', 'string', 'string')
# laf_open_csv has no header argument; skip=100 skips the first 100 lines
laf <- laf_open_csv(filename=csvfile, column_types=x.class,
      skip=100)
yourdata <- laf_to_ffdf(laf, columns=2)

A type has to be specified for every column, so I use 'string' for the
columns that are not needed. However, by using the columns=2 argument
only the second column is read.

It looks like you have a decent amount of memory, so you could also try

yourdata <- laf[,2]

to read the data in as a standard R vector.

HTH,

Jan
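
For readers without ffbase or LaF, the underlying idea (apply the full colClasses to every block, not only to the first one) can be sketched in base R. This is a swapped-in base-R approach, not what read.csv.ffdf or LaF do internally. The function name read_one_column, the chunk size and the demo file are hypothetical, the result is an ordinary in-memory vector, and the line-based chunking assumes no quoted field contains a line break.

```r
# Read a single column from a large csv in blocks of lines.
# col_classes must keep exactly one column (all others "NULL").
read_one_column <- function(path, col_classes, skip = 0, chunk_rows = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  if (skip > 0) invisible(readLines(con, n = skip))   # discard leading lines
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break                     # end of file reached
    # The same full colClasses vector is used for every block.
    block <- read.table(text = lines, header = FALSE, sep = ",",
                        colClasses = col_classes)
    pieces[[length(pieces) + 1]] <- block[[1]]
  }
  unlist(pieces)
}

# Demo on a small made-up file: 3 lines to skip, then 25 rows of 7 columns.
demo_file <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:25, value = (1:25) / 10, c3 = "a", c4 = "b",
                   c5 = "c", c6 = 0L, c7 = 1L)
writeLines(c("skip me", "skip me", "skip me"), demo_file)
write.table(demo, demo_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE, append = TRUE)

x_class <- c("NULL", "numeric", rep("NULL", 5))
second_column <- read_one_column(demo_file, x_class, skip = 3,
                                 chunk_rows = 10)
print(length(second_column))
```

With 6e+7 rows the resulting double vector needs roughly 480 MB, so a larger chunk_rows mainly trades memory per block for fewer read.table calls.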




threshold <[hidden email]> wrote:

> --
> View this message in context:  
> http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794p4637896.html
> Sent from the R help mailing list archive at Nabble.com.

Re: ff package: reading selected columns from csv

Jan van der LAan-2
In reply to this post by R Kozarski

You probably have a character (which is converted to factor) or factor  
column with a large number of distinct values. All the levels of a  
factor are stored in memory in ff.

Jan
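
To illustrate the point about factor levels, here is a small base-R sketch comparing a factor with few distinct values to one with many. It does not use ff; the data and the names few_distinct, many_distinct and levels_bytes are made up, and the sizes only indicate how the level set grows with the number of distinct values.

```r
set.seed(1)
n <- 1e5

# Column with only three distinct values: a tiny set of levels.
few_distinct <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

# Column where every value is different (e.g. an id or timestamp string):
# the set of levels is as long as the column itself.
many_distinct <- factor(sprintf("id%06d", seq_len(n)))

print(c(few = nlevels(few_distinct), many = nlevels(many_distinct)))

# Memory taken by the levels alone, in bytes.
levels_bytes <- c(
  few  = as.numeric(object.size(levels(few_distinct))),
  many = as.numeric(object.size(levels(many_distinct)))
)
print(levels_bytes)
```

If such a column is not needed, leaving it out of the import, or reading it as a numeric type when that is appropriate, avoids building the large level set in the first place.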


threshold <[hidden email]> wrote:

> --
> View this message in context:  
> http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794p4637900.html
> Sent from the R help mailing list archive at Nabble.com.