New function fread() in v1.8.7

Matthew Dowle

Hi datatablers,

Feedback and bug reports much appreciated :

=====
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all auto detected.
* integers>2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns :
     read.csv("test.csv")                                   # 30-60 sec
     read.table("test.csv",<all known tricks, known nrows>) #    10 sec
     fread("test.csv")                                      #     3 sec
* airline data: 658MB csv (7 million rows x 29 columns)
     read.table("2008.csv",<all known tricks, known nrows>) #   360 sec
     fread("2008.csv")                                      #    50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
discussions and beta testing.
=====
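
For readers who want to try the string-input and speed claims above, here is a minimal sketch. It assumes data.table is installed; the synthetic file and its size are made up for illustration and are not the 50MB benchmark file from the announcement, so timings will differ.

```r
library(data.table)

# Text containing a newline is treated as the file contents, not a file name.
dt <- fread("A,B\n1,2\n3,4")
print(dt)

# Rough timing comparison on a small synthetic file (illustrative only).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1e5, y = rnorm(1e5)), tmp, row.names = FALSE)
print(system.time(read.csv(tmp)))
print(system.time(fread(tmp)))
unlink(tmp)
```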

1.8.7 is passing checks on Unix and Windows (but not Mac yet) :

   install.packages("data.table", repos="http://R-Forge.R-project.org")
   require(data.table)
   ?fread
   fread("your biggest baddest file")

Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64
until that can be resolved on R-Forge, unless you compile yourself. -O3
has some optimizations that fread may benefit from. But interested to hear.

Season's greetings!

Matthew


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: New function fread() in v1.8.7

Hideyoshi Maeda
Hi Matthew,

I am using the new `data.table` `fread()` function to read my csv files, which have the following format when read with read.csv:

            Date.and.Time Open High  Low Close Volume
    1 2007/01/01 22:51:00 5683 5683 5673  5673     64
    2 2007/01/01 22:52:00 5675 5676 5674  5674     17
    3 2007/01/01 22:53:00 5674 5674 5673  5674     42

The first column holds the whole timestamp, e.g. `2007/01/01 22:53:00`; the next 5 columns are separated by commas.

But when reading the same file using fread I get the following output:

        V1 V2                                             V3
    1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
    2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
    3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42

This is because the autodetect is using the "/" as a separator...

I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.

Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as the separator rather than ",".

Is there any way to fix this, so that it correctly reads all 6 columns separately?

Thanks

HLM

Re: New function fread() in v1.8.7

Hideyoshi Maeda
Oops, I forgot to add the verbose output. Here it is:

Detected eol as \r\n (CRLF) in that order, the Windows standard.
Starting format detection on line 30 (the last non blank line in the first 30)
Detected sep as '/' and 3 columns
Type codes: 003
Found first row with 3 fields occuring on line 1 (either column names or first row of data)
The first data row has some non character fields. Treating as a data row and using default column names.
Count of eol after pos: 1143699
Subtracted 1 for last eol and any trailing empty lines, leaving 1143698 data rows
   0.153s ( 21%) Memory map (quicker if you rerun)
   0.000s (  0%) Format detection
   0.095s ( 13%) Count rows (wc -l)
   0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
   0.480s ( 66%) Reading data
   0.000s (  0%) Bumping column type midread and coercing data already read
   0.002s (  0%) Changing na.strings to NA
   0.731s        Total


Re: New function fread() in v1.8.7

Matthew Dowle

Hi,

Ah yes, haven't hooked up the sep override yet, apologies, will fix.
Maybe setting autostart to the row number of the header row (probably 1)
might work.

Thanks,
Matthew
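
Until the override is available, the file can still be read with base R. The sketch below uses the header and rows quoted in this thread. The `fread(..., sep = ",")` call is an assumption: it only behaves as shown in a data.table release where the sep override has been hooked up, which per the message above is not the case in 1.8.7.

```r
library(data.table)

# Header and rows copied from the messages in this thread.
txt <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  "2007/01/01 22:53:00,5674.00,5674.00,5673.00,5674.00,42",
  sep = "\n"
)

# Base R fallback: slower on big files, but independent of the fread version.
df <- read.csv(text = txt, stringsAsFactors = FALSE)
str(df)

# Assumption: a later data.table release that honours an explicit sep.
dt <- fread(txt, sep = ",")
str(dt)
```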


Re: New function fread() in v1.8.7

Hideyoshi Maeda
Thanks for the quick response.

I wasn't sure if I understood you correctly, but isn't the problem the way that autostart finds separators?

Also, my file has a header row, so wouldn't detection need to start from row 2, i.e. the first row that holds data rather than column names?

Thanks

Re: New function fread() in v1.8.7

akhilsbehl
CONTENTS DELETED
The author has deleted this message.

Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by Hideyoshi Maeda

Yes, autostart is the line on which it detects the separator. It then searches
upwards to find the first row with the same number of columns. If that
row is all character, it treats that row as the column names. So if
you set autostart to 1, it is already at the top, and it might catch the
right separator because the data rows are no longer used for separator detection.
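
The upward search described above can be imitated in a few lines of plain R. This is only an illustration: `split_fields` and `find_first_row` are hypothetical helpers, the separator is passed in rather than detected, and fread's real logic is written in C and differs in detail.

```r
# Hypothetical helper: split one line on a literal separator.
split_fields <- function(line, sep) strsplit(line, sep, fixed = TRUE)[[1]]

# Hypothetical helper: count fields on the autostart line, walk upwards while
# the row above has the same count, then test whether the top row looks like
# column names (no field parses as a number).
find_first_row <- function(lines, autostart, sep) {
  start <- min(autostart, length(lines))
  ncol <- length(split_fields(lines[start], sep))
  first <- start
  while (first > 1L && length(split_fields(lines[first - 1L], sep)) == ncol) {
    first <- first - 1L
  }
  fields <- split_fields(lines[first], sep)
  is_header <- all(is.na(suppressWarnings(as.numeric(fields))))
  list(ncol = ncol, first_row = first, header = is_header)
}

# Header and rows quoted elsewhere in this thread.
lines <- c(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17"
)

comma <- find_first_row(lines, autostart = 3L, sep = ",")
slash <- find_first_row(lines, autostart = 3L, sep = "/")
str(comma)
str(slash)
```

With "," the search reaches line 1 and treats it as column names. With "/" the header line has a different field count, so the search stops at a data row and default names would be used.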

Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by akhilsbehl

Great. Looks like cols 3, 4, 9 and 12 are being detected as integer64 ok
(a width of 16 is ok; the limit is 18 digits for integer64), but later
on in the file there is a '.' or more digits in one of those columns, which
causes the bump to real. There is a nice message telling you which line,
which field and what contents are causing the bump, but the
'unimplemented' error happens first. Oops, will fix.

Thanks!
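
To see why these columns cannot be ordinary R integers, the following base-R sketch checks one of the 16-digit values from the sample rows quoted below. The optional last step assumes the bit64 package is installed. For context, the largest 64-bit signed integer (2^63 - 1) has 19 digits, so any 18-digit value fits.

```r
id <- "2012080150000306"   # 16-digit value from the sample rows
x <- as.numeric(id)

print(nchar(id))                  # number of digits
print(x > .Machine$integer.max)   # too large for R's 32-bit integer type?
print(x < 2^53)                   # within the range where doubles hold integers exactly?

# Optional: store the value as a 64-bit integer if bit64 is available.
if (requireNamespace("bit64", quietly = TRUE)) {
  print(bit64::as.integer64(id))
}
```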


On 24.12.2012 12:21, akhilsbehl wrote:

> Here is a new problem:
>
> I have a csv that looks like this:
>
> PO,CASH,2012080150000306,67389310793869,bbRELIANCE,EQ,74025,700,2012080150004326,1,3,2012080150001143,1,3
> PO,CASH,2012080150000307,67389310793884,bbRELIANCE,EQ,74025,2000,2012080150007969,1,3,2012080150001143,1,3
> PO,CASH,2012080150000308,67389310793896,bbRELIANCE,EQ,74025,1000,2012080150002222,1,3,2012080150001143,1,3
>
> read.csv(filename) gives me:
>
> 1 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025   700 2.01208e+15   1   3 2.01208e+15   1   3
> 2 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025  2000 2.01208e+15   1   3 2.01208e+15   1   3
> 3 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025  1000 2.01208e+15   1   3 2.01208e+15   1   3
>
> fread(filename, verbose=TRUE) gives me:
>
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Starting format detection on line 30 (the last non blank line in the first 30)
> Detected sep as ',' and 14 columns
> Type codes: 33113300100100
> Found first row with 14 fields occuring on line 1 (either column names or first row of data)
> The first data row has some non character fields. Treating as a data row and using default column names.
> Count of eol after pos: 54025
> Subtracted 1 for last eol and any trailing empty lines, leaving 54024 data rows
>
> Error in fread(data.files[[2]], verbose = TRUE) :
>   Coercing integer64 to real needs to be implemented
>
> Type codes show it is trying to read columns 3, 4, 9, 12 as real numbers.
> Now, I may be out of depth here but shouldn't they just be integers? Am I
> missing something?
>
> Thanks.
>
> --
> ASB.

Re: New function fread() in v1.8.7

Hideyoshi Maeda
In reply to this post by Matthew Dowle
Using autostart=1 gives the following error:

Error in fread(file.path, autostart = 1) :
' ends field 2 on line 1 when detecting types: Date and Time,Open,High,Low,Close,Volume
2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64


On 24 Dec 2012, at 13:48, Matthew Dowle <[hidden email]> wrote:

>
> Yes autostart is the line it detects separators, then it searches upwards to find the first row with the same number of columns. If that row is all character then it deems that as the column name row.  So if you start autostart on 1, it's already at the top and it might catch the right separator by avoiding the data rows for separator detection.
>
> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>> Thanks for the quick response.
>>
>> I wasn't sure if I understood you correctly, but isn't the problem
>> the way that autostart finds separators?
>>
>> and in my example, it had headers, so I think it would need to start
>> from row 2 wouldn't it, i.e. the first row that has non-header values?
>>
>> Thanks
>>
>> On 24 Dec 2012, at 11:44, Matthew Dowle <[hidden email]> wrote:
>>
>>>
>>> Hi,
>>>
>>> Ah yes, haven't hooked up the sep override yet, apologies, will fix.
>>> Maybe setting autostart to the row number of the header row (probably 1)
>>> might work.
>>>
>>> Thanks,
>>> Matthew
>>>
>>>
>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>> oups…forgot to add the output from the verbose part…here it is...
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Starting format detection on line 30 (the last non blank line in the
>>>> first 30)
>>>> Detected sep as '/' and 3 columns
>>>> Type codes: 003
>>>> Found first row with 3 fields occuring on line 1 (either column names
>>>> or first row of data)
>>>> The first data row has some non character fields. Treating as a data
>>>> row and using default column names.
>>>> Count of eol after pos: 1143699
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>> 1143698 data rows
>>>>  0.153s ( 21%) Memory map (quicker if you rerun)
>>>>  0.000s (  0%) Format detection
>>>>  0.095s ( 13%) Count rows (wc -l)
>>>>  0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>  0.480s ( 66%) Reading data
>>>>  0.000s (  0%) Bumping column type midread and coercing data already read
>>>>  0.002s (  0%) Changing na.strings to NA
>>>>  0.731s        Total
>>>>

Re: New function fread() in v1.8.7

akhilsbehl
In reply to this post by Matthew Dowle
CONTENTS DELETED
The author has deleted this message.

Re: New function fread() in v1.8.7

Matthew Dowle

Help is always welcome.

fread's source code is hopefully fairly easy to follow. It's all in one
file with very few library calls: just ifs, loops and a few switches.
Anyone is welcome to join the data.table project and commit changes.

Or there's the documentation and the test cases. If you're not sure what
needs to be done, there's a TO DO list at the top of fread.c. For the
project generally, there are 104 feature requests waiting to be picked
off. Subscribe to datatable-commits to keep up to date with the latest
commits.

Here's fread's source
https://r-forge.r-project.org/scm/viewvc.php/pkg/src/readfile.c?view=markup&root=datatable

and the help file :
https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?root=datatable&view=log

and the outstanding feature requests :
https://r-forge.r-project.org/tracker/?atid=978&group_id=240&func=browse

Matthew


On 24.12.2012 16:29, akhilsbehl wrote:

> Thanks a lot. Is there any way I can help? Besides the feedback, I
> mean? :)

Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by Hideyoshi Maeda

sep is now passed through, and I have added your example as a test.
Hopefully it works now.

Thanks,
Matthew
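
For anyone following along, here is a minimal sketch of the fixed behaviour, assuming a data.table build where sep is honoured. The sample rows are copied from the example earlier in this thread; the variable name csv_text is only illustrative. A string containing "\n" is read by fread as data rather than as a file name, so no file is needed.

```r
library(data.table)

# Inline sample in the same shape as the file discussed in this thread.
csv_text <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  "2007/01/01 22:53:00,5674.00,5674.00,5673.00,5674.00,42",
  sep = "\n"
)

# Name the separator explicitly instead of relying on auto detection,
# so the "/" inside the date field cannot be mistaken for a separator.
DT <- fread(csv_text, sep = ",")

ncol(DT)       # number of columns read
names(DT)[1]   # header text exactly as fread kept it
```

Passing sep explicitly is mainly useful when a field contains another plausible separator character, as the dates do here.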

On 24.12.2012 14:18, Hideyoshi Maeda wrote:

> using autostart=1 gives the following error
>
> Error in fread(file.path, autostart = 1) :
> ' ends field 2 on line 1 when detecting types: Date and
> Time,Open,High,Low,Close,Volume
> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64

Re: New function fread() in v1.8.7

Hideyoshi Maeda
The sep argument now works, thank you!

Just out of curiosity, and not a major problem: when I use fread(file.path,sep=",") on my csv file, the column names differ from those shown in my original email, which included ".". The output drops the "." from the column names, i.e. the first column becomes "Date and Time" when using fread, rather than "Date.and.Time" as with read.csv. Is there a way to stop it from doing that?


On 26 Dec 2012, at 22:21, Matthew Dowle <[hidden email]> wrote:

>
> sep is now passed through and have added your example as a test.
> Hope ok now.
>
> Thanks,
> Matthew

Re: New function fread() in v1.8.7

Matthew Dowle

Great. Thanks for confirming.

The file itself has "Date and Time" as the column name, doesn't it, i.e.
with spaces not dots? fread retains exactly what's in the file, whereas
read.csv runs the column names through base::make.names(), which converts
the spaces to dots to make the column names syntactically valid, if I
understand correctly. data.table's general policy is to allow spaces and
other unusual characters in column names and retain them throughout
(forgiving the odd bug, now fixed, caused by some make.names calls which
should have been make.unique).

To do the same as read.csv :

     DT = fread(...)
     setnames(DT,make.names(names(DT)))

Not sure I understood correctly and I didn't test.
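
A self-contained version of the two lines above can be sketched as follows. It assumes a small inline sample in place of the real file, and the names csv_text and old_names are only illustrative. copy() is used when saving the original names because setnames() changes names by reference.

```r
library(data.table)

# Small inline sample standing in for the real file (assumed shape).
csv_text <- paste(
  "Date and Time,Open,Volume",
  "2007/01/01 22:51:00,5683.00,64",
  "2007/01/01 22:52:00,5675.00,17",
  sep = "\n"
)

DT <- fread(csv_text, sep = ",")

# Keep a real copy of the names as read from the file.
old_names <- copy(names(DT))

# Convert the names to syntactically valid ones, as read.csv would.
setnames(DT, make.names(names(DT)))

old_names   # names as they appear in the file
names(DT)   # names after make.names()
```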


On 28.12.2012 21:36, Hideyoshi Maeda wrote:

> The sep argument now works thank you!
>
> But just out of curiosity…not a major problem of sorts but by using
> fread(file.path,sep=",") on my csv file, the column names includes
> "."
> as shown in my original email… but the output result automatically
> removes the "." in the column name…is there a way to stop it from
> doing that?, i.e. the first column becomes "Date and Time"  when
> using
> fread, rather than the original "Date.and.Time" when using read.csv
>

Re: New function fread() in v1.8.7

Hideyoshi Maeda
No problem, and thanks again for fixing it.

As for the file itself having "Date and Time", you are right. I just assumed that this function was designed to replace and speed up read.csv, i.e. to work in exactly the same way but faster. Thanks for letting me know about the make.names call, though.
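
To make the difference concrete, here is a small side-by-side sketch. It assumes an inline sample rather than the real file; csv_text, DF and DT are illustrative names. It also shows that a column name containing spaces remains usable in data.table when written with backticks or passed as a string.

```r
library(data.table)

# Inline sample with a header that contains spaces (assumed shape).
csv_text <- paste(
  "Date and Time,Open,Volume",
  "2007/01/01 22:51:00,5683.00,64",
  "2007/01/01 22:52:00,5675.00,17",
  sep = "\n"
)

DF <- read.csv(text = csv_text)   # base R: names pass through make.names()
DT <- fread(csv_text, sep = ",")  # data.table: names kept as in the file

names(DF)[1]
names(DT)[1]

# Two ways to refer to a column whose name contains spaces.
by_backticks <- DT[, `Date and Time`]
by_string    <- DT[["Date and Time"]]
```

So fread is not a drop-in replacement where code relies on the dotted names, but the original names stay fully accessible.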



On 28 Dec 2012, at 22:06, Matthew Dowle <[hidden email]> wrote:

>
> Great. Thanks for confirm.
>
> The file itself has "Date and Time" as the column name doesn't it i.e. with spaces not dots? fread retains exactly what's in the file, whereas read.csv runs the column names through base::make.names() which converts the spaces to dots to make the column names syntactically valid, iiuc. data.table's general policy is to allow spaces and other unusual characters in columns names and retain them throughout (forgiving the odd bug now fixed caused by some make.names calls which should have been make.unique).
>
> To do the same as read.csv :
>
>    DT = fread(...)
>    setnames(DT,make.names(names(DT)))
>
> Not sure I understood correctly and I didn't test.
>
>
> On 28.12.2012 21:36, Hideyoshi Maeda wrote:
>> The sep argument now works thank you!
>>
>> But just out of curiosity…not a major problem of sorts but by using
>> fread(file.path,sep=",") on my csv file, the column names includes "."
>> as shown in my original email… but the output result automatically
>> removes the "." in the column name…is there a way to stop it from
>> doing that?, i.e. the first column becomes "Data and Time"  when using
>> fread, rather than the original "Date.and.Time" when using read.csv
>>
>>
>> On 26 Dec 2012, at 22:21, Matthew Dowle <[hidden email]> wrote:
>>
>>>
>>> sep is now passed through and have added your example as a test.
>>> Hope ok now.
>>>
>>> Thanks,
>>> Matthew
>>>
>>> On 24.12.2012 14:18, Hideyoshi Maeda wrote:
>>>> using autostart=1 gives the following error
>>>>
>>>> Error in fread(file.path, autostart = 1) :
>>>> ' ends field 2 on line 1 when detecting types: Date and
>>>> Time,Open,High,Low,Close,Volume
>>>> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>
>>>>
>>>> On 24 Dec 2012, at 13:48, Matthew Dowle <[hidden email]> wrote:
>>>>
>>>>>
>>>>> Yes autostart is the line it detects separators, then it searches upwards to find the first row with the same number of columns. If that row is all character then it deems that as the column name row. So if you start autostart on 1, it's already at the top and it might catch the right separator by avoiding the data rows for separator detection.
>>>>>
>>>>> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>> I wasn't sure if I understood you correctly, but isn't the problem
>>>>>> the way that autostart finds separators?
>>>>>>
>>>>>> and in my example, it had headers, so I think it would need to start
>>>>>> from row 2 wouldn't it, i.e. the first row that has non-header values?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 24 Dec 2012, at 11:44, Matthew Dowle <[hidden email]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Ah yes, haven't hooked up the sep override yet, apologies, will fix.
>>>>>>> Maybe setting autostart to the row number of the header row (probably 1)
>>>>>>> might work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthew
>>>>>>>
>>>>>>>
>>>>>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>>>>>> oups…forgot to add the output from the verbose part…here it is...
>>>>>>>>
>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>> Starting format detection on line 30 (the last non blank line in the
>>>>>>>> first 30)
>>>>>>>> Detected sep as '/' and 3 columns
>>>>>>>> Type codes: 003
>>>>>>>> Found first row with 3 fields occuring on line 1 (either column names
>>>>>>>> or first row of data)
>>>>>>>> The first data row has some non character fields. Treating as a data
>>>>>>>> row and using default column names.
>>>>>>>> Count of eol after pos: 1143699
>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>>>>>> 1143698 data rows
>>>>>>>> 0.153s ( 21%) Memory map (quicker if you rerun)
>>>>>>>> 0.000s (  0%) Format detection
>>>>>>>> 0.095s ( 13%) Count rows (wc -l)
>>>>>>>> 0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>>>>> 0.480s ( 66%) Reading data
>>>>>>>> 0.000s (  0%) Bumping column type midread and coercing data already read
>>>>>>>> 0.002s (  0%) Changing na.strings to NA
>>>>>>>> 0.731s        Total
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda <[hidden email]> wrote:
>>>>>>>>
>>>>>>>>> Hi Matthew,
>>>>>>>>>
>>>>>>>>> I am using the new `data.table` `fread()` function to read my csv files, which has the format as follows when using the read.csv function
>>>>>>>>>
>>>>>>>>>        Date.and.Time Open High  Low Close Volume
>>>>>>>>> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>>>>>> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>>>>>> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>>>>>>
>>>>>>>>> The value of the first column is all of: `2007/01/01 22:53:00`, the next 5 columns are separated with commas.
>>>>>>>>>
>>>>>>>>> but when reading the same file using fread i get the following output
>>>>>>>>>
>>>>>>>>>    V1 V2                                             V3
>>>>>>>>> 1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>>>>>> 2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>>>>>> 3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>>>>>>
>>>>>>>>> This is because the autodetect is using the "/" as a separator...
>>>>>>>>>
>>>>>>>>> I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.
>>>>>>>>>
>>>>>>>>> Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as a separator rather than ",".
>>>>>>>>>
>>>>>>>>> Is there any way to fix this, so that it correctly reads all 6 columns separately?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> HLM


Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by Matthew Dowle

Btw we like backticks in data.table :

     DT[,`Date and Time`]
     setkey(DT,`Date and Time`)   # [*]

although you'd probably  setnames(DT,"Date and Time","datetime")  for a
core column like that.

[*] which I've just noticed doesn't work, will file new bug report.
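
A small sketch of the backtick usage above, assuming a toy table built by hand (the values come from the sample rows earlier in the thread; the names DT and dt_col are only for illustration). It renames the column before keying, since keying directly on the backticked name is the case reported above as not working at the time:

```r
library(data.table)

# Toy table; data.table() keeps the space in the column name as given.
DT <- data.table(
  `Date and Time` = c("2007/01/01 22:52:00", "2007/01/01 22:51:00"),
  Open = c(5675, 5683)
)

# Backticks let j refer to a column whose name contains spaces.
dt_col <- DT[, `Date and Time`]

# Rename to a syntactic name by reference, then key (sort) on it.
setnames(DT, "Date and Time", "datetime")
setkey(DT, datetime)
```

After the rename the column can be used without backticks anywhere in DT[...].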


On 28.12.2012 22:06, Matthew Dowle wrote:

> Great. Thanks for confirming.
>
> The file itself has "Date and Time" as the column name, doesn't it,
> i.e. with spaces not dots? fread retains exactly what's in the file,
> whereas read.csv runs the column names through base::make.names(),
> which converts the spaces to dots to make the column names
> syntactically valid, iiuc. data.table's general policy is to allow
> spaces and other unusual characters in column names and retain them
> throughout (forgiving the odd bug, now fixed, caused by some make.names
> calls which should have been make.unique).
>
> To do the same as read.csv :
>
>     DT = fread(...)
>     setnames(DT,make.names(names(DT)))
>
> Not sure I understood correctly and I didn't test.
>
>
> On 28.12.2012 21:36, Hideyoshi Maeda wrote:
>> The sep argument now works, thank you!
>>
>> Just out of curiosity (not a major problem): when using read.csv on
>> my csv file, the column names include "." as shown in my original
>> email, but the result of fread(file.path,sep=",") drops the "." from
>> the column names. Is there a way to stop it from doing that? I.e. the
>> first column becomes "Date and Time" when using fread, rather than
>> the original "Date.and.Time" when using read.csv.
>>
>>
>> On 26 Dec 2012, at 22:21, Matthew Dowle <[hidden email]>
>> wrote:
>>
>>>
>>> sep is now passed through and have added your example as a test.
>>> Hope ok now.
>>>
>>> Thanks,
>>> Matthew
>>>
>>> On 24.12.2012 14:18, Hideyoshi Maeda wrote:
>>>> using autostart=1 gives the following error
>>>>
>>>> Error in fread(file.path, autostart = 1) :
>>>> ' ends field 2 on line 1 when detecting types: Date and
>>>> Time,Open,High,Low,Close,Volume
>>>> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>
>>>>
>>>> On 24 Dec 2012, at 13:48, Matthew Dowle <[hidden email]>
>>>> wrote:
>>>>
>>>>>
>>>>> Yes, autostart is the line on which it detects separators; it then
>>>>> searches upwards to find the first row with the same number of
>>>>> columns. If that row is all character then it deems that to be the
>>>>> column name row. So if you set autostart to 1, it's already at the
>>>>> top and it might catch the right separator by avoiding the data
>>>>> rows for separator detection.
>>>>>
>>>>> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>> I wasn't sure if I understood you correctly, but isn't the problem
>>>>>> the way that autostart finds separators?
>>>>>>
>>>>>> And in my example it had headers, so I think it would need to start
>>>>>> from row 2, wouldn't it, i.e. the first row that has non-header
>>>>>> values?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 24 Dec 2012, at 11:44, Matthew Dowle <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Ah yes, haven't hooked up the sep override yet, apologies, will
>>>>>>> fix.
>>>>>>> Maybe setting autostart to the row number of the header row
>>>>>>> (probably 1)
>>>>>>> might work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthew
>>>>>>>


Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by Hideyoshi Maeda

It wasn't at the front of my mind to make it a drop-in replacement.
Maybe it should be, since it's not like data.table itself, where a drop-in
replacement for data.frame wasn't possible. If fread is supposed to be a
drop-in replacement then it shouldn't output integer64 types, shouldn't
produce list columns for dual-delimited files, and stringsAsFactors
should be TRUE by default rather than FALSE.

Perhaps an as.read.table=TRUE/FALSE option, then?
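
For reference, a base R sketch of the name handling being discussed: read.csv has a check.names argument (TRUE by default) that passes the header through make.names(). The inline csv text is made up from the sample rows in this thread, and the object names are only for illustration:

```r
csv <- "Date and Time,Open,Close\n2007/01/01 22:51:00,5683,5673\n"

# Default: names are made syntactically valid, so spaces become dots.
df_checked <- read.csv(text = csv)
names(df_checked)

# check.names = FALSE keeps the header exactly as it is in the file.
df_raw <- read.csv(text = csv, check.names = FALSE)
names(df_raw)

# The same conversion applied directly to a single name.
make.names("Date and Time")
```

So a drop-in replacement would have to reproduce that conversion, whereas fread leaves the header untouched.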


On 28.12.2012 22:21, Hideyoshi Maeda wrote:

> No problem, and thanks again for fixing it.
>
> As for the file itself having "Date and Time", you are right. I just
> assumed that this function was designed to replace/speed up the
> read.csv function, i.e. work in exactly the same way but faster.
> Thanks for letting me know about the make.names call though.


Re: New function fread() in v1.8.7

Matthew Dowle

Or, 2 new functions :

     fread.table
     fread.csv

that would be what you expected.  They would call fread first and do
the modifications afterwards such as convert character columns to
factors, call make.names on the names etc. That way we don't clutter
fread's argument list with arguments/options we only need for drop-in
compatibility.

If a user wanted to drop them in to be picked up by existing code, they
could just mask it themselves in .GlobalEnv by executing :

     read.table = fread.table

Since a user may want data.table just for fread, and not wish to change
to data.table syntax, I suppose fread.table and fread.csv should return
data.frame rather than data.table, too.

Just thinking out loud ...
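
A rough sketch of what such a wrapper might look like, purely as an illustration of the idea above. This fread.csv is hypothetical: it is written here for the example and is not claimed to exist in data.table, and the stringsAsFactors argument and the sample input are assumptions too:

```r
library(data.table)

# Hypothetical wrapper: call fread first, then apply read.csv-style
# modifications afterwards, and hand back a plain data.frame.
fread.csv <- function(input, ..., stringsAsFactors = TRUE) {
  DT <- fread(input, sep = ",", ...)

  # Syntactically valid, unique names, as read.csv would produce.
  setnames(DT, make.names(names(DT), unique = TRUE))

  # Convert character columns to factors, by reference.
  if (stringsAsFactors) {
    chr <- names(DT)[vapply(DT, is.character, logical(1))]
    for (col in chr) set(DT, j = col, value = factor(DT[[col]]))
  }

  as.data.frame(DT)  # data.frame rather than data.table
}

df <- fread.csv(
  "Date and Time,Open,Close\n2007/01/01 22:51:00,5683,5673\n2007/01/01 22:52:00,5675,5674\n"
)
```

Keeping the conversions in a wrapper leaves fread's own argument list unchanged, which is the point made above. Note that read.csv itself defaults to stringsAsFactors = FALSE from R 4.0.0 onwards, so the TRUE default here mirrors the behaviour at the time of this thread.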


On 28.12.2012 22:38, Matthew Dowle wrote:

> It wasn't at the front of my mind to make it a drop in replacement.
> Maybe it should be since it's not like data.table itself where a drop
> in replacement for data.frame wasn't possible. If fread is supposed
> to
> be a drop in replacement then it shouldn't output integer64 types,
> shouldn't produce list columns for dual delimited files and
> stringsAsFactors should be TRUE by default not FALSE, as well.
>
> Perhaps an as.read.table=TRUE/FALSE option, then?


Re: New function fread() in v1.8.7

Matthew Dowle
In reply to this post by Matthew Dowle

ASB,
All mid-read column type bumps are now implemented, with tests added.
This should work now. Turn on verbose=TRUE to see the messages
telling you which field on which line caused the bump.
You will need to upgrade to commit 785.
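
A side note on why the bump from integer64 to real matters, as a minimal base-R sketch (no data.table needed; the variable names are only illustrative, and the id is the third field of the first sample row quoted below). It suggests that a 16-digit id does not fit R's 32-bit integer type but is still stored exactly in a double, whereas doubles stop being exact for whole numbers above 2^53:

```r
# Third field of the first sample row in the quoted csv
id <- 2012080150000306

# R's native integer type is 32-bit, so this id cannot be an "integer" column
fits_int32 <- id <= .Machine$integer.max

# Doubles represent every whole number exactly up to 2^53
exact_double <- id < 2^53

# read.csv only *prints* 2.01208e+15; the stored value still has all its digits
as_text <- sprintf("%.0f", id)

# Above 2^53 neighbouring whole numbers collapse to the same double
collides <- (2^53 + 1) == 2^53

print(c(fits_int32 = fits_int32, exact_double = exact_double, collides = collides))
print(as_text)
```

So for ids of this width a bump to real loses nothing, but ids approaching the 18-digit integer64 limit could be rounded once they become doubles.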

On 24.12.2012 13:52, Matthew Dowle wrote:

> Great. Looks like cols 3, 4, 9 and 12 are being detected as integer64
> ok (16 digits is fine; the limit is 18 digits for integer64), but
> later on in the file there is a '.' or more digits in one of those
> columns, which causes the bump to real. There is a nice message telling
> you which line, which field and what contents are causing the bump,
> but the 'unimplemented' error happens first. Oops, will fix.
>
> Thanks!
>
>
> On 24.12.2012 12:21, akhilsbehl wrote:
>> Here is a new problem:
>>
>> I have a csv that looks like this:
>>
>> PO,CASH,2012080150000306,67389310793869,bbRELIANCE,EQ,74025,700,2012080150004326,1,3,2012080150001143,1,3
>> PO,CASH,2012080150000307,67389310793884,bbRELIANCE,EQ,74025,2000,2012080150007969,1,3,2012080150001143,1,3
>> PO,CASH,2012080150000308,67389310793896,bbRELIANCE,EQ,74025,1000,2012080150002222,1,3,2012080150001143,1,3
>>
>> read.csv(filename) gives me:
>>
>> 1 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025   700 2.01208e+15   1   3 2.01208e+15   1   3
>> 2 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025  2000 2.01208e+15   1   3 2.01208e+15   1   3
>> 3 PO CASH 2.01208e+15 6.738931e+13 bbRELIANCE EQ 74025  1000 2.01208e+15   1   3 2.01208e+15   1   3
>>
>> fread(filename, verbose=TRUE) gives me:
>>
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Starting format detection on line 30 (the last non blank line in the first 30)
>> Detected sep as ',' and 14 columns
>> Type codes: 33113300100100
>> Found first row with 14 fields occuring on line 1 (either column names or first row of data)
>> The first data row has some non character fields. Treating as a data row and using default column names.
>> Count of eol after pos: 54025
>> Subtracted 1 for last eol and any trailing empty lines, leaving 54024 data rows
>>
>> Error in fread(data.files[[2]], verbose = TRUE) :
>>   Coercing integer64 to real needs to be implemented
>>
>> Type codes show it is trying to read columns 3, 4, 9, 12 as real numbers.
>> Now, I may be out of my depth here, but shouldn't they just be integers?
>> Am I missing something?
>>
>> Thanks.
>>
>> --
>> ASB.


Re: New function fread() in v1.8.7

patricknic
Hit a snag reading some imperfect data. I'm not sure what it was exported from, but the file has some lines with consecutive quotation marks (i.e., a character field actually contained quotation marks before it was written to a text file). Not sure if this is a known issue. A reproducible example:

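library(data.table)

# 30,000 lines: plain fields, one quoted field, one field wrapped in doubled quotes
text <- paste(rep('a,b,c,d,e,f\na,b,c,"d",e,f\na,b,c,""d"",e,f', 10000), collapse="\n")
f <- tempfile()
writeLines(text, f)

df <- read.table(f, sep=",")            # reads without error
dt <- fread(f, sep=",", header=FALSE)   # gives the error shown below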


No error for read.table, but I get this error for fread:

Error in fread(f, sep = ",", header = FALSE) :
  Expected sep (',') but 'd' ends field 4 on line 30 when detecting types: a,b,c,""d


This also gave me an idea for a suggestion: text replacement in readfile.c. (I'm no C programmer, so I don't know if this would be more trouble than it's worth. Also, not sure if it is in your project scope.) An R mock-up (still using fread) of this would be something like:

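# Mock-up only: strip a character (default: double quote) from every line,
# then pass the cleaned text straight to fread
freadWrapper <- function(input, eliminate='"', ...) {
  A <- readLines(input)                      # raw lines of the file
  B <- gsub(eliminate, "", A, fixed=TRUE)    # remove it literally, not as a regex
  C <- paste(B, collapse="\n")               # back to one newline-separated string
  fread(C, ...)                              # fread accepts text input directly
}
freadWrapper(f, sep=",", stringsAsFactors=FALSE, header=FALSE)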
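
One caveat with stripping every quote, sketched in base R. The sample lines here are made up for illustration (the third one, with a comma inside a quoted field, is an assumption and does not come from the file above). The sketch counts fields per line after the quotes are removed:

```r
# Hypothetical sample lines; the last has a separator inside a quoted field
lines <- c('a,b,c,"d",e,f',
           'a,b,c,""d"",e,f',
           'a,"b,1",c,d,e,f')

# Same cleaning step as the wrapper: drop every double quote literally
cleaned <- gsub('"', "", lines, fixed = TRUE)

# Count comma-separated fields on each cleaned line
n_fields <- lengths(strsplit(cleaned, ",", fixed = TRUE))

print(cleaned)
print(n_fields)
```

The first two lines come out with 6 fields each, but the third ends up with 7, because the comma that the quotes were protecting is now treated as a separator. So a blanket replacement like this is probably only safe when no quoted field contains the separator.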
