R package BibTex entries: looking for a more general solution

Michael Friendly
== Summary ==
* Problem: BibTeX entries extracted from R packages via citation() require
  too much manual editing to be of general use.
* Proposal: Date: fields should be made mandatory in package DESCRIPTION
  files, perhaps beginning with warnings from R CMD check.
* Proposal: Package authors should be encouraged to use a (new)
  Contributors: field in the DESCRIPTION file rather than packing all
  information into the Author: field, which at present often cannot be
  parsed by BibTeX.
* Files: All test files referred to here can be found at

http://euclid.psych.yorku.ca/SCS/Private/Rbibs/

== Details ==
Around 16 Dec. 2009, I queried R-help about automating the extraction of
citation()s from R packages. The stimulus was that some journals, notably
JSS, now require a reference and citation for every R package mentioned,
and it is a pain to create these manually, much less maintain them for
current versions.

The result of that query was a function, Rpackage.bibs(), by Achim Zeileis,
which I have been using ever since.
Code in: http://euclid.psych.yorku.ca/SCS/Private/Rbibs/Rpackages.bib.R
On one current system I get the following:

 > Rpackage.bibs(file="Rpackages-R.2.11.1.bib")
Converted 230 of 230 package citations to BibTex
Results written to file Rpackages-R.2.11.1.bib
Warning messages:
1: In citation(x) :
  no date field in DESCRIPTION file of package 'codetools'
2: In citation(x) :
  no date field in DESCRIPTION file of package 'gridBase'
3: In citation(x) :
  no date field in DESCRIPTION file of package 'iplots'

See:
http://euclid.psych.yorku.ca/SCS/Private/Rbibs/Rpkg-test.pdf
for the result of processing this .bib file with latex/bibtex using the
jss.bst bibliography style.
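
The general approach (loop over packages, convert each citation() to BibTeX,
write one .bib file) can be sketched roughly as below. This is not the
Rpackage.bibs() function mentioned above; write_package_bibs() is a
hypothetical name, and using the package name as the BibTeX key is an
assumption made for illustration.

```r
# Hypothetical helper, NOT the Rpackage.bibs() function by Achim Zeileis.
write_package_bibs <- function(pkgs, file) {
  entries <- lapply(pkgs, function(pkg) {
    # citation() can fail for a broken package; skip it rather than abort
    bib <- tryCatch(as.character(toBibtex(citation(pkg))),
                    error = function(e) NULL)
    if (is.null(bib)) return(NULL)
    # Auto-generated entries start with an empty key ("@Manual{,");
    # fill it with the package name (assumption: one key per package)
    empty <- grep("^@[A-Za-z]+\\{,$", bib)
    if (length(empty) > 0) {
      keys <- make.unique(rep(pkg, length(empty)), sep = "")
      bib[empty] <- paste0(sub(",$", "", bib[empty]), keys, ",")
    }
    c(bib, "")  # blank line between entries
  })
  ok <- !vapply(entries, is.null, logical(1))
  writeLines(as.character(unlist(entries[ok])), file)
  message(sprintf("Converted %d of %d package citations to BibTeX",
                  sum(ok), length(pkgs)))
  invisible(sum(ok))
}
```

A call such as write_package_bibs(rownames(installed.packages()),
"Rpackages.bib") would cover every installed package; any warnings raised by
citation() are passed through unchanged.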

I'm writing to R-Devel because the DESCRIPTION and inst/CITATION files in R
packages provide the basic data used in citation() and any methods based on
this, and yet the information in these files is often insufficient to
generate well-formed BibTeX entries for use in vignettes and publications.

One easy case is illustrated above, where 3 packages have no Date: field, so
the BibTeX entry gets no
year = {},
field and references get printed as Murrell, P (????) for gridBase. (In my
original test under R 2.9.1, there were ~20 such warnings.) Thus, I propose
that Date: be a required field in DESCRIPTION files, and that R CMD check
complain if it is not found.
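
As a rough way to see which installed packages would trigger this warning,
one can ask installed.packages() for the Date: field. This is only a sketch:
packages_without_date() is a hypothetical name, and base packages are
excluded on the assumption that they are cited via R itself.

```r
# Hypothetical helper: names of installed (non-base) packages whose
# DESCRIPTION file has no Date: field.
packages_without_date <- function(lib.loc = NULL) {
  # 'fields' adds the requested DESCRIPTION field as an extra column
  inst <- installed.packages(lib.loc = lib.loc, fields = "Date")
  no_date <- is.na(inst[, "Date"])
  is_base <- inst[, "Priority"] %in% "base"
  unique(rownames(inst)[no_date & !is_base])
}

packages_without_date()
```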

The more difficult case has to do with the Author: field in the DESCRIPTION
file (when no CITATION file is present). People can write whatever they
want here, and the result looks sort of OK when printed by citation(), but
confuses BibTeX mightily. One example:

 > citation("akima")
To cite package ‘akima’ in publications use:

Fortran code by H. Akima R port by Albrecht Gebhardt aspline function
by Thomas Petzoldt <[hidden email]> enhancements and
corrections by Martin Maechler (2009). akima: Interpolation of
irregularly spaced data. R package version 0.5-4.
http://CRAN.R-project.org/package=akima

A BibTeX entry for LaTeX users is

@Manual{,
title = {akima: Interpolation of irregularly spaced data},
author = {Fortran code by H. Akima R port by Albrecht Gebhardt aspline
function by Thomas Petzoldt <[hidden email]>
enhancements and corrections by Martin Maechler},
year = {2009},
note = {R package version 0.5-4},
url = {http://CRAN.R-project.org/package=akima},
}

ATTENTION: This citation information has been auto-generated from the
package DESCRIPTION file and may need manual editing, see
‘help("citation")’ .
 >

Yes, the ATTENTION note does say that manual editing may be necessary, but I
think a worthy goal would be to try to reduce the need for this.

One simple way to do that would be to support an extra Contributors: field
in the DESCRIPTION file, so that Author: can be more cleanly separated for
the purpose of creating well-structured BibTeX.
Perhaps others have better ideas.

-Michael

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: R package BibTex entries: looking for a more general solution

Yihui Xie-2
I strongly support this proposal! I also find it inconvenient to cite
some R packages and really do not like editing the BibTeX entries
manually.

Regards,
Yihui
--
Yihui Xie <[hidden email]>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA



On Wed, Nov 3, 2010 at 8:44 AM, Michael Friendly <[hidden email]> wrote:

> [...]

Re: R package BibTex entries: looking for a more general solution

Michael Friendly
In reply to this post by Michael Friendly

On 11/3/2010 10:42 AM, John Fox wrote:
> Hi Michael,
>
> FWIW, I've added a CITATION file to the car package on R-Forge.
>
When I composed a CITATION file for the heplots package, I had to root
around in R/library to find something to use as a template.  The main reason
for doing this was to have citation("heplots") also provide a reference to
the theory paper on which this was based, and which should also be cited
by users of the package.

"Writing R Extensions", 1.10 gives a simple example of a CITATION file, but
even this is somewhat daunting for a package author, and thus is generally
ignored.

This would all be easier, and could evolve over time, if there were a
function, e.g., prompt.citation(package), in utils that would write a basic
CITATION file from a package DESCRIPTION, which a package author could
easily edit and then include in the package.

I can imagine it working something like:

 > prompt.citation("fubar")
A CITATION file for the fubar package was automatically generated from
the package DESCRIPTION file.  Please edit this file and move it to the
appropriate directory, inst/ in the package tree.
Warning: no Date: field provided in the package DESCRIPTION
Warning: Author: field does not appear to be a list of names joined with 'and'

Providing something like this would obviate my suggestions to impose
restrictions on DESCRIPTION files, and seems more in line with the way R
packages and documentation have developed.
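
To make the idea concrete, here is a minimal sketch of what such a helper
might look like. No such function exists in utils; the name
prompt_citation(), its arguments, the "FIXME" placeholders and the crude
Author: heuristic are all assumptions for illustration only.

```r
# Hypothetical sketch of the proposed helper; not part of utils.
prompt_citation <- function(desc_file, out = "CITATION") {
  desc <- read.dcf(desc_file)[1, ]
  # Fetch one DESCRIPTION field as a single-line string (NA if absent)
  field <- function(name) {
    if (!name %in% names(desc)) return(NA_character_)
    gsub("[[:space:]]+", " ", unname(desc[name]))
  }
  # Turn a value into an R string literal, escaping embedded quotes
  q <- function(x) encodeString(x, quote = "\"")

  pkg    <- field("Package")
  author <- field("Author")
  date   <- field("Date")

  year <- "FIXME"
  if (is.na(date)) {
    warning("no Date: field provided in the package DESCRIPTION")
  } else {
    # Assumption: the first four-digit number in Date: is the year
    y <- regmatches(date, regexpr("[0-9]{4}", date))
    if (length(y) == 1) year <- y
  }
  # Crude heuristic: free text such as "code by ..." or e-mail addresses
  if (is.na(author) || grepl("\\bby\\b|<|\\(", author, perl = TRUE)) {
    warning("Author: field does not appear to be a list of names ",
            "joined with 'and'")
  }

  lines <- c(
    sprintf("citHeader(%s)",
            q(sprintf("To cite package '%s' in publications use:", pkg))),
    "",
    "bibentry(",
    "  bibtype = \"Manual\",",
    sprintf("  title   = %s,", q(paste0(pkg, ": ", field("Title")))),
    sprintf("  author  = %s,  # FIXME: edit into a proper list of names",
            q(author)),
    sprintf("  year    = %s,", q(year)),
    sprintf("  note    = %s", q(paste("R package version", field("Version")))),
    ")"
  )
  writeLines(lines, out)
  invisible(out)
}
```

The generated file is only a starting point: the author would still be
expected to edit it and move it to inst/ in the package tree.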

-Michael



--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA


Re: R package BibTex entries: looking for a more general solution

Kurt Hornik-5
In reply to this post by Michael Friendly
>>>>> Michael Friendly writes:

Thanks for the suggestions. In fact, we are currently working on this
issue. A lot of improvements have already been made; see ?person and
?bibentry for R 2.12.0 or later, especially the details and examples
sections. Some more work still needs to be done, though. We will write a
primer that introduces the new features when all of them are available.
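
As a rough illustration of what person() and bibentry() allow, the akima
entry from the original post could be written with structured authors. This
is a sketch only: the split into given and family names and the key "akima"
are assumptions, and the role descriptions (Fortran code, R port, ...) are
simply dropped here.

```r
# Structured author list instead of one free-text Author: string
authors <- c(
  person("H.", "Akima"),
  person("Albrecht", "Gebhardt"),
  person("Thomas", "Petzoldt"),
  person("Martin", "Maechler")
)

akima_bib <- bibentry(
  bibtype = "Manual",
  key     = "akima",  # assumed key, chosen for illustration
  title   = "akima: Interpolation of irregularly spaced data",
  author  = authors,
  year    = "2009",
  note    = "R package version 0.5-4",
  url     = "http://CRAN.R-project.org/package=akima"
)

# The author field is now a list of names joined with "and"
print(toBibtex(akima_bib))
```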

Best
-k


Bug in read.table?

jgarcia-2
In reply to this post by Michael Friendly
Hi,

I'm writing to this list because I'm puzzled by the behaviour of
read.table(). It is hard to believe that there is a bug in this utils
function, but here is what I see with

R version 2.12.0 alpha (2010-09-28 r53056)

I'm using scan() and read.table() to read a number of files that look like
this:

---

Project:     Murta Sonda
Program:     GrafNav Version 8.30.1007
Profile:     javier
Source:      GPS Epochs(Combined)
ProcessInfo: Run (1) by Unknown on 11/04/2010 at 19:05:17

Datum:       WGS84, (processing datum)
Master 1:    Name LaMurta, Status ENABLED
             Antenna height 2.066 m, to L1-PC (NOV702GG, MeasDist 1.980 m
to mark/ARP)
             Position 37 49 38.15069, -1 12 27.55445, 368.197 m (WGS84,
Ellipsoidal hgt)
Remote:      Antenna height 1.781 m, to L1-PC (NOV702GG, MeasDist 1.695 m
to mark/ARP)
UTC Offset:  15 s
Local time:  +2.0 h, CEST [Central European Savings Time]
Geoid:       EGM2008-World.wpg (Absolute correction)

      Latitude      Longitude LonTextLoTextLongitudTextL
LatTextLaTextLatitudeTextL        H-Ell        H-MSL LocalUTCDa  
LocalUTC
         (Deg)          (Deg) (DeMi   (Sec)  (DeMi   (Sec)           (m)  
       (m)      (DMY)       (HMS)
 37.8275120694  -1.2077972583 001º12'28.07013"W 037º49'39.04345"N    
368.998      318.059 25/10/2010    16:59:00
 37.8275121083  -1.2077974806 001º12'28.07093"W 037º49'39.04359"N    
368.994      318.055 25/10/2010    16:59:15
 37.8275118539  -1.2077974338 001º12'28.07076"W 037º49'39.04267"N    
368.997      318.058 25/10/2010    16:59:30
 37.8275119923  -1.2077974626 001º12'28.07087"W 037º49'39.04317"N    
368.998      318.060 25/10/2010    16:59:45
 37.8275323099  -1.2078075891 001º12'28.10732"W 037º49'39.11632"N    
368.869      317.930 25/10/2010    17:00:00
 37.8275323374  -1.2078077002 001º12'28.10772"W 037º49'39.11641"N    
368.866      317.927 25/10/2010    17:00:15
 37.8275325076  -1.2078075314 001º12'28.10711"W 037º49'39.11703"N    
368.859      317.920 25/10/2010    17:00:30
 37.8275325306  -1.2078075056 001º12'28.10702"W 037º49'39.11711"N    
368.861      317.922 25/10/2010    17:00:45
 37.8275323639  -1.2078075917 001º12'28.10733"W 037º49'39.11651"N    
368.853      317.914 25/10/2010    17:01:00
 37.8275326222  -1.2078076861 001º12'28.10767"W 037º49'39.11744"N    
368.857      317.918 25/10/2010    17:01:15
---

with a number of different records for each file.

To read the data I'm using:

---
 # Read the column names from the header line
 dat.names <- scan(file.path("path_and_filename"),
                   what="character",
                   skip = 16, nlines=1)
 if(length(dat.names) != 8){
    stop("Input file seems to be wrong!")}

 # Read the data records that follow the header and units lines
 dat <- read.table(file.path("path_and_filename"),
                   header=FALSE, col.names=dat.names,
                   skip = 18, as.is=TRUE, blank.lines.skip=FALSE)
---
and I systematically obtain a number of repeated records at the start of
the resulting table (6 in this example). This is easily seen by looking at
the field "LocalUTC":

> dat
   Latitude Longitude LonTextLoTextLongitudTextL
LatTextLaTextLatitudeTextL   H.Ell   H.MSL LocalUTCDa LocalUTC
1  37.82753 -1.207808          001º12'28.10732"W        
037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
2  37.82753 -1.207808          001º12'28.10772"W        
037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
3  37.82753 -1.207808          001º12'28.10711"W        
037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
4  37.82753 -1.207808          001º12'28.10702"W        
037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
5  37.82753 -1.207808          001º12'28.10733"W        
037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
6  37.82753 -1.207808          001º12'28.10767"W        
037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15
7  37.82751 -1.207797          001º12'28.07013"W        
037º49'39.04345"N 368.998 318.059 25/10/2010 16:59:00
8  37.82751 -1.207797          001º12'28.07093"W        
037º49'39.04359"N 368.994 318.055 25/10/2010 16:59:15
9  37.82751 -1.207797          001º12'28.07076"W        
037º49'39.04267"N 368.997 318.058 25/10/2010 16:59:30
10 37.82751 -1.207797          001º12'28.07087"W        
037º49'39.04317"N 368.998 318.060 25/10/2010 16:59:45
11 37.82753 -1.207808          001º12'28.10732"W        
037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
12 37.82753 -1.207808          001º12'28.10772"W        
037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
13 37.82753 -1.207808          001º12'28.10711"W        
037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
14 37.82753 -1.207808          001º12'28.10702"W        
037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
15 37.82753 -1.207808          001º12'28.10733"W        
037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
16 37.82753 -1.207808          001º12'28.10767"W        
037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15

Thanks,

Javier
---


Re: Bug in read.table?

cberry
On Fri, 5 Nov 2010, [hidden email] wrote:

> Hi,
>
> I'm writting to this list as I'm puzzled about the behaviour of
> read.table(). It is hard to believe that there is a bug in this utils'
> function, but for my:
>
> R version 2.12.0 alpha (2010-09-28 r53056)
>
> I'm using scan and read.table to read a number of files, which are as:
>
There are line wraps here, so we can't just cut-and-paste.


> ---
>
> Project:     Murta Sonda
> Program:     GrafNav Version 8.30.1007
> Profile:     javier
> Source:      GPS Epochs(Combined)
> ProcessInfo: Run (1) by Unknown on 11/04/2010 at 19:05:17
>
> Datum:       WGS84, (processing datum)
> Master 1:    Name LaMurta, Status ENABLED
>             Antenna height 2.066 m, to L1-PC (NOV702GG, MeasDist 1.980 m
> to mark/ARP)
>             Position 37 49 38.15069, -1 12 27.55445, 368.197 m (WGS84,
> Ellipsoidal hgt)
> Remote:      Antenna height 1.781 m, to L1-PC (NOV702GG, MeasDist 1.695 m
> to mark/ARP)
> UTC Offset:  15 s
> Local time:  +2.0 h, CEST [Central European Savings Time]
> Geoid:       EGM2008-World.wpg (Absolute correction)
>
>      Latitude      Longitude LonTextLoTextLongitudTextL
> LatTextLaTextLatitudeTextL        H-Ell        H-MSL LocalUTCDa
> LocalUTC
>         (Deg)          (Deg) (DeMi   (Sec)  (DeMi   (Sec)           (m)
>       (m)      (DMY)       (HMS)
> 37.8275120694  -1.2077972583 001º12'28.07013"W 037º49'39.04345"N
> 368.998      318.059 25/10/2010    16:59:00
> 37.8275121083  -1.2077974806 001º12'28.07093"W 037º49'39.04359"N
> 368.994      318.055 25/10/2010    16:59:15
> 37.8275118539  -1.2077974338 001º12'28.07076"W 037º49'39.04267"N
> 368.997      318.058 25/10/2010    16:59:30
> 37.8275119923  -1.2077974626 001º12'28.07087"W 037º49'39.04317"N
> 368.998      318.060 25/10/2010    16:59:45
> 37.8275323099  -1.2078075891 001º12'28.10732"W 037º49'39.11632"N
> 368.869      317.930 25/10/2010    17:00:00
> 37.8275323374  -1.2078077002 001º12'28.10772"W 037º49'39.11641"N
> 368.866      317.927 25/10/2010    17:00:15
> 37.8275325076  -1.2078075314 001º12'28.10711"W 037º49'39.11703"N
> 368.859      317.920 25/10/2010    17:00:30
> 37.8275325306  -1.2078075056 001º12'28.10702"W 037º49'39.11711"N
> 368.861      317.922 25/10/2010    17:00:45
> 37.8275323639  -1.2078075917 001º12'28.10733"W 037º49'39.11651"N
> 368.853      317.914 25/10/2010    17:01:00
> 37.8275326222  -1.2078076861 001º12'28.10767"W 037º49'39.11744"N
> 368.857      317.918 25/10/2010    17:01:15
> ---
>
Uh, what about those quotes??

Using quote = '' yields 'dat' sans duplicates.

I'll leave it to others to decide if this is a bug.


> with a number of different records for each file.
>
> To read the data I'm using:
>
> ---
> dat.names <- scan(file.path("path_and_filename"),
>                   what="character",
>                   skip = 16, nlines=1)
> if(length(dat.names) != 8){
>    stop("Input file seems to be wrong!")}
>
> dat <- read.table(file.path("path_and_filename),
>                   header=FALSE, col.names=dat.names,
>                   skip = 18, as.is=TRUE, blank.lines.skip=FALSE)
> ---
> and systematically, I'm obtaining a number of repeated records at the
> starting of the input table (6 in this example). It is easily seen by
> looking at the field "LocalUTC":
Or looking at duplicated(dat)

HTH,

Chuck


>
>> dat
>   Latitude Longitude LonTextLoTextLongitudTextL
> LatTextLaTextLatitudeTextL   H.Ell   H.MSL LocalUTCDa LocalUTC
> 1  37.82753 -1.207808          001º12'28.10732"W
> 037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
> 2  37.82753 -1.207808          001º12'28.10772"W
> 037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
> 3  37.82753 -1.207808          001º12'28.10711"W
> 037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
> 4  37.82753 -1.207808          001º12'28.10702"W
> 037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
> 5  37.82753 -1.207808          001º12'28.10733"W
> 037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
> 6  37.82753 -1.207808          001º12'28.10767"W
> 037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15
> 7  37.82751 -1.207797          001º12'28.07013"W
> 037º49'39.04345"N 368.998 318.059 25/10/2010 16:59:00
> 8  37.82751 -1.207797          001º12'28.07093"W
> 037º49'39.04359"N 368.994 318.055 25/10/2010 16:59:15
> 9  37.82751 -1.207797          001º12'28.07076"W
> 037º49'39.04267"N 368.997 318.058 25/10/2010 16:59:30
> 10 37.82751 -1.207797          001º12'28.07087"W
> 037º49'39.04317"N 368.998 318.060 25/10/2010 16:59:45
> 11 37.82753 -1.207808          001º12'28.10732"W
> 037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
> 12 37.82753 -1.207808          001º12'28.10772"W
> 037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
> 13 37.82753 -1.207808          001º12'28.10711"W
> 037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
> 14 37.82753 -1.207808          001º12'28.10702"W
> 037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
> 15 37.82753 -1.207808          001º12'28.10733"W
> 037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
> 16 37.82753 -1.207808          001º12'28.10767"W
> 037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15
>
> Thanks,
>
> Javier
> ---
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
Charles C. Berry                            Dept of Family/Preventive Medicine
[hidden email]    UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



Re: Bug in read.table?

Tony Plate-3
In reply to this post by jgarcia-2
The problem has to do with the quote characters in the data (R is probably interpreting the minutes (') and seconds (") marks as quote delimiters).

With a smaller data file, I can reproduce the strange behavior.

read.table() can read the data correctly if given quote="" to disable the interpretation of quote chars.

Contents of tmp2.txt:
  37.8275120694  -1.2077972583 001º12'28.07013"W 037º49'39.04345"N
  37.8275121083  -1.2077974806 001º12'28.07093"W 037º49'39.04359"N
  37.8275118539  -1.2077974338 001º12'28.07076"W 037º49'39.04267"N
  37.8275119923  -1.2077974626 001º12'28.07087"W 037º49'39.04317"N


 > read.table(file.path("tmp2.txt"), header=FALSE, as.is=TRUE)
         V1        V2                V3                V4
1 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
2 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
3 37.82751 -1.207797 001º12'28.07013"W 037º49'39.04345"N
4 37.82751 -1.207797 001º12'28.07093"W 037º49'39.04359"N
5 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
6 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
Warning message:
In read.table(file.path("tmp2.txt"), header = FALSE, as.is = TRUE) :
   incomplete final line found by readTableHeader on 'tmp2.txt'
 > read.table(file.path("tmp2.txt"), header=FALSE, as.is=TRUE, quote="")
         V1        V2                V3                V4
1 37.82751 -1.207797 001º12'28.07013"W 037º49'39.04345"N
2 37.82751 -1.207797 001º12'28.07093"W 037º49'39.04359"N
3 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
4 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
 >

The docs for read.table() direct the reader to the docs for scan() regarding the behavior with embedded quote chars.  The behavior of read.table() on this data with the default quote chars is puzzling though.
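
A self-contained version of the workaround, as a sketch: the four records
from tmp2.txt are written to a temporary file (the temporary file is an
assumption so the example can be run anywhere) and read back with quote
interpretation switched off.

```r
# Recreate the small test file; the records contain both ' and "
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "37.8275120694  -1.2077972583 001º12'28.07013\"W 037º49'39.04345\"N",
  "37.8275121083  -1.2077974806 001º12'28.07093\"W 037º49'39.04359\"N",
  "37.8275118539  -1.2077974338 001º12'28.07076\"W 037º49'39.04267\"N",
  "37.8275119923  -1.2077974626 001º12'28.07087\"W 037º49'39.04317\"N"
), tmp)

# quote = "" stops ' and " from being treated as quote characters
dat <- read.table(tmp, header = FALSE, as.is = TRUE, quote = "")

nrow(dat)                      # one row per record in the file
any(duplicated(dat))           # check for repeated records
count.fields(tmp, quote = "")  # number of fields seen on each line
```

Comparing count.fields() with and without quote = "" is a quick way to spot
files in which embedded quote characters change how lines are split.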

-- Tony Plate

On 11/5/2010 5:22 PM, [hidden email] wrote:

> Hi,
>
> I'm writting to this list as I'm puzzled about the behaviour of
> read.table(). It is hard to believe that there is a bug in this utils'
> function, but for my:
>
> R version 2.12.0 alpha (2010-09-28 r53056)
>
> I'm using scan and read.table to read a number of files, which are as:
>
> ---
>
> Project:     Murta Sonda
> Program:     GrafNav Version 8.30.1007
> Profile:     javier
> Source:      GPS Epochs(Combined)
> ProcessInfo: Run (1) by Unknown on 11/04/2010 at 19:05:17
>
> Datum:       WGS84, (processing datum)
> Master 1:    Name LaMurta, Status ENABLED
>               Antenna height 2.066 m, to L1-PC (NOV702GG, MeasDist 1.980 m
> to mark/ARP)
>               Position 37 49 38.15069, -1 12 27.55445, 368.197 m (WGS84,
> Ellipsoidal hgt)
> Remote:      Antenna height 1.781 m, to L1-PC (NOV702GG, MeasDist 1.695 m
> to mark/ARP)
> UTC Offset:  15 s
> Local time:  +2.0 h, CEST [Central European Savings Time]
> Geoid:       EGM2008-World.wpg (Absolute correction)
>
>        Latitude      Longitude LonTextLoTextLongitudTextL
> LatTextLaTextLatitudeTextL        H-Ell        H-MSL LocalUTCDa
> LocalUTC
>           (Deg)          (Deg) (DeMi   (Sec)  (DeMi   (Sec)           (m)
>         (m)      (DMY)       (HMS)
>   37.8275120694  -1.2077972583 001º12'28.07013"W 037º49'39.04345"N
> 368.998      318.059 25/10/2010    16:59:00
>   37.8275121083  -1.2077974806 001º12'28.07093"W 037º49'39.04359"N
> 368.994      318.055 25/10/2010    16:59:15
>   37.8275118539  -1.2077974338 001º12'28.07076"W 037º49'39.04267"N
> 368.997      318.058 25/10/2010    16:59:30
>   37.8275119923  -1.2077974626 001º12'28.07087"W 037º49'39.04317"N
> 368.998      318.060 25/10/2010    16:59:45
>   37.8275323099  -1.2078075891 001º12'28.10732"W 037º49'39.11632"N
> 368.869      317.930 25/10/2010    17:00:00
>   37.8275323374  -1.2078077002 001º12'28.10772"W 037º49'39.11641"N
> 368.866      317.927 25/10/2010    17:00:15
>   37.8275325076  -1.2078075314 001º12'28.10711"W 037º49'39.11703"N
> 368.859      317.920 25/10/2010    17:00:30
>   37.8275325306  -1.2078075056 001º12'28.10702"W 037º49'39.11711"N
> 368.861      317.922 25/10/2010    17:00:45
>   37.8275323639  -1.2078075917 001º12'28.10733"W 037º49'39.11651"N
> 368.853      317.914 25/10/2010    17:01:00
>   37.8275326222  -1.2078076861 001º12'28.10767"W 037º49'39.11744"N
> 368.857      317.918 25/10/2010    17:01:15
> ---
>
> with a number of different records for each file.
>
> To read the data I'm using:
>
> ---
>   dat.names<- scan(file.path("path_and_filename"),
>                     what="character",
>                     skip = 16, nlines=1)
>   if(length(dat.names) != 8){
>      stop("Input file seems to be wrong!")}
>
>   dat<- read.table(file.path("path_and_filename"),
>                     header=FALSE, col.names=dat.names,
>                     skip = 18, as.is=TRUE, blank.lines.skip=FALSE)
> ---
> and, systematically, I obtain a number of repeated records at the
> start of the input table (6 in this example). This is easily seen by
> looking at the field "LocalUTC":
>
>> dat
>     Latitude Longitude LonTextLoTextLongitudTextL LatTextLaTextLatitudeTextL   H.Ell   H.MSL LocalUTCDa LocalUTC
> 1  37.82753 -1.207808          001º12'28.10732"W 037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
> 2  37.82753 -1.207808          001º12'28.10772"W 037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
> 3  37.82753 -1.207808          001º12'28.10711"W 037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
> 4  37.82753 -1.207808          001º12'28.10702"W 037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
> 5  37.82753 -1.207808          001º12'28.10733"W 037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
> 6  37.82753 -1.207808          001º12'28.10767"W 037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15
> 7  37.82751 -1.207797          001º12'28.07013"W 037º49'39.04345"N 368.998 318.059 25/10/2010 16:59:00
> 8  37.82751 -1.207797          001º12'28.07093"W 037º49'39.04359"N 368.994 318.055 25/10/2010 16:59:15
> 9  37.82751 -1.207797          001º12'28.07076"W 037º49'39.04267"N 368.997 318.058 25/10/2010 16:59:30
> 10 37.82751 -1.207797          001º12'28.07087"W 037º49'39.04317"N 368.998 318.060 25/10/2010 16:59:45
> 11 37.82753 -1.207808          001º12'28.10732"W 037º49'39.11632"N 368.869 317.930 25/10/2010 17:00:00
> 12 37.82753 -1.207808          001º12'28.10772"W 037º49'39.11641"N 368.866 317.927 25/10/2010 17:00:15
> 13 37.82753 -1.207808          001º12'28.10711"W 037º49'39.11703"N 368.859 317.920 25/10/2010 17:00:30
> 14 37.82753 -1.207808          001º12'28.10702"W 037º49'39.11711"N 368.861 317.922 25/10/2010 17:00:45
> 15 37.82753 -1.207808          001º12'28.10733"W 037º49'39.11651"N 368.853 317.914 25/10/2010 17:01:00
> 16 37.82753 -1.207808          001º12'28.10767"W 037º49'39.11744"N 368.857 317.918 25/10/2010 17:01:15
>
> Thanks,
>
> Javier
> ---
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Bug in read.table?

jgarcia-2
Thanks. Yes, quote="" solves the problem.

From the documentation, however, I would never have guessed that this was
causing the duplicate records. Rather, I would have expected some kind of
warning or error message.

And yes, I knew that duplicated() lets R handle this specific problem
gracefully. I just thought this could be of interest to R-devel.
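
For anyone who wants to check the workaround, here is a minimal sketch. It assumes ASCII-only stand-in records (the degree sign is replaced by "d"), and the names txt and dat are made up for illustration; quote = "" is the setting discussed above, and anyDuplicated() is used only to confirm that no row comes back twice.

```r
# Hypothetical stand-in for four of the GPS records (degree sign replaced by "d")
txt <- c(
  "37.8275120694 -1.2077972583 001d12'28.07013\"W 037d49'39.04345\"N",
  "37.8275121083 -1.2077974806 001d12'28.07093\"W 037d49'39.04359\"N",
  "37.8275118539 -1.2077974338 001d12'28.07076\"W 037d49'39.04267\"N",
  "37.8275119923 -1.2077974626 001d12'28.07087\"W 037d49'39.04317\"N"
)

# quote = "" switches off quote interpretation, so ' and " are ordinary characters
dat <- read.table(text = txt, header = FALSE, quote = "",
                  stringsAsFactors = FALSE)

dim(dat)            # rows and columns actually read
anyDuplicated(dat)  # 0 means no repeated record
```

If duplicates had already crept into a table, dat[!duplicated(dat), ] would drop them after the fact, but reading with quote = "" avoids creating them in the first place.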

Thanks to all,
Javier
---


> The problem has to do with the quote characters in the data (R is probably
> interpreting the 'minutes' and 'seconds' as delimiter characters).
>
> With a smaller data file, I can reproduce the strange behavior.
>
> read.table() can read the data correctly if given quote="" to disable the
> interpretation of quote chars.
>
> Contents of tmp2.txt:
>   37.8275120694  -1.2077972583 001º12'28.07013"W 037º49'39.04345"N
>   37.8275121083  -1.2077974806 001º12'28.07093"W 037º49'39.04359"N
>   37.8275118539  -1.2077974338 001º12'28.07076"W 037º49'39.04267"N
>   37.8275119923  -1.2077974626 001º12'28.07087"W 037º49'39.04317"N
>
>
>  > read.table(file.path("tmp2.txt"), header=FALSE, as.is=TRUE)
>          V1        V2                V3                V4
> 1 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
> 2 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
> 3 37.82751 -1.207797 001º12'28.07013"W 037º49'39.04345"N
> 4 37.82751 -1.207797 001º12'28.07093"W 037º49'39.04359"N
> 5 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
> 6 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
> Warning message:
> In read.table(file.path("tmp2.txt"), header = FALSE, as.is = TRUE) :
>    incomplete final line found by readTableHeader on 'tmp2.txt'
>  > read.table(file.path("tmp2.txt"), header=FALSE, as.is=TRUE, quote="")
>          V1        V2                V3                V4
> 1 37.82751 -1.207797 001º12'28.07013"W 037º49'39.04345"N
> 2 37.82751 -1.207797 001º12'28.07093"W 037º49'39.04359"N
> 3 37.82751 -1.207797 001º12'28.07076"W 037º49'39.04267"N
> 4 37.82751 -1.207797 001º12'28.07087"W 037º49'39.04317"N
>  >
>
> The docs for read.table() direct the reader to the docs for scan()
> regarding the behavior with embedded quote chars.  The behavior of
> read.table() on this data with the default quote chars is puzzling though.
>
> -- Tony Plate


Re: Bug in read.table?

bbolker
 <jgarcia <at> ija.csic.es> writes:

>
> Thanks. Yes, quote="" solves the problem.
>
> I would never say, however, from the documentations, that this was causing
> the duplicate records. Rather, I would have expected some kind of
> warning/error message.
>
> And, yes, I knew that, through duplicate(), R solves gracefully this
> specific problem. Just thought this could be of interests for R devel.
>

  A bit of a meta-point here: there may indeed be a bug here
(it's the kind of obscure "corner case" that someone may not have
tested), but it's unlikely to get noted as such and fixed unless you
can come up with a clear analysis of what is happening and how the
misinterpretation of quote characters is leading to duplication of
records. (You, or someone else -- recognizing that this may be beyond
your skill level.  It might be that 'just' very careful thought
and analysis of the behavior described in the documentation would
explain this, or one might have to dig through source code in R or C.)
Problems with unescaped/unrecognized quote characters are very
common.

 Otherwise, this will likely be dismissed as a "doctor, it hurts
when I do this" / "well then, don't do that!" sort of situation.

  Ben Bolker


Re: Bug in read.table?

bbolker
  Following up on my own point:

    The bottom line is that the internal readTableHead() function
handles newlines within quoted strings differently from scan().

  Explanation:

A simpler file that replicates the problem is:

a b'c"d"e
f g'h"i"j
k l'm"n"o

(didn't want to try reading this from a textConnection --
escaping all the quotes properly would have driven me nuts).
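
As a sketch of one way to build the file without much escaping pain: inside a double-quoted R string only the double quotes need a backslash, and writeLines() to a temporary file then gives something read.table() or scan() can open. The names demo_lines and tmp3 are just for illustration. count.fields() with quote = "" is used here as a sanity check that every line has two whitespace-separated fields when quotes are not interpreted.

```r
# The three-line example file; only the double quotes need escaping
demo_lines <- c("a b'c\"d\"e",
                "f g'h\"i\"j",
                "k l'm\"n\"o")

tmp3 <- tempfile(fileext = ".txt")
writeLines(demo_lines, tmp3)

# Round trip: the file holds exactly the intended text
identical(readLines(tmp3), demo_lines)

# With quote interpretation switched off, each line has two fields
count.fields(tmp3, quote = "")
```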

 One of the first things that happens in read.table is that
the first few lines are read with readTableHead:

  lines <- .Internal(readTableHead(file, nlines, comment.char,
       blank.lines.skip, quote, sep))

  In this case, this reads the first two lines as one line:
the single quote at position 4 of the first line is closed by the
one at position 4 of the second line, which prevents the first
newline from ending the line.

  R then pushes back two copies of the lines that have
been read (this is normal behavior, although I don't quite follow
the logic).
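
The push-back step can be tried in isolation with pushBack(), which is an ordinary R function. The sketch below is only an illustration on a made-up three-line text connection, not the read.table() code itself. My reading of the read.table() source (an assumption, not checked against this exact R version) is that one copy is consumed by scan() while counting fields and the other is then read as data, which would explain why two copies are pushed.

```r
con <- textConnection(c("line 1", "line 2", "line 3"))

# Peek at the first two lines, as a header-sniffing step might
head_lines <- readLines(con, n = 2)

# Push back two copies of what was read
pushBack(c(head_lines, head_lines), con)
n_pushed <- pushBackLength(con)

# The first copy is read (e.g. to count fields) ...
first_pass <- readLines(con, n = 2)
# ... and the second copy is then read again as data, followed by the rest
second_pass <- readLines(con, n = 3)

close(con)
```

If the first pass consumes fewer lines than were pushed back, the leftover lines from the first copy are still sitting in front of the second copy, which is one plausible route to repeated records.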

  The rest of the file is read with scan(), one line at a time.
However, there is a discrepancy between the way that
readTableHead interprets newlines in the middle of quoted
strings (it ignores them) and the way that scan() interprets
them (it takes them as the end of the quoted string).

In particular, if the  file "tmp3.txt" is as shown above, then
the command

.Internal(readTableHead(file("tmp3.txt"),nlines=1L,"#",FALSE,quote="\"'",sep=""))

produces

[1] "a b'c\"d\"e\nf g'h\"i\"j"
 
(i.e. it grabs the first two lines, including the \n)

and

scan(file("tmp3.txt"),nlines=1L,quote="\"'",what="")

produces

Read 2 items
[1] "a"         "b'c\"d\"e"

(it terminates the line in the middle of the string opened
by the single quote).
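
The scan() half of this can be reproduced without .Internal calls. The sketch below rebuilds the file in a temporary location (tmp3 is just an illustrative name) and scans it with the same quote set. The output is also consistent with another reading: with the default whitespace separator, scan() may only treat a quote character as special at the start of a field, so b'c"d"e is an ordinary token and no quoted string is ever opened. That is my interpretation, not something established in this thread.

```r
tmp3 <- tempfile(fileext = ".txt")
writeLines(c("a b'c\"d\"e", "f g'h\"i\"j", "k l'm\"n\"o"), tmp3)

# First line only, same quote set as in the call above
first_line <- scan(tmp3, what = "", quote = "\"'", nlines = 1, quiet = TRUE)
first_line

# Whole file: two tokens per line, quotes kept as literal characters
all_items <- scan(tmp3, what = "", quote = "\"'", quiet = TRUE)
length(all_items)
```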

 I don't know what the consequences would be of changing
readTableHead to match scan()'s behavior, or how much
trouble it would be to do so.


Re: Bug in read.table?

bbolker
   Can simplify this still further:

a b'c
d e'f
g h'i



Re: Bug in read.table?

bbolker
Ben Bolker <bbolker <at> gmail.com> writes:

>    Can simplify this still further:
>
> a b'c
> d e'f
> g h'i

  This example file leads to duplicate lines.
Arguably it should have behavior analogous to:

> scan(what="")
1: a b'c
3: d e'f
5: g h'i
7: 
Read 6 items
[1] "a"   "b'c" "d"   "e'f" "g"   "h'i"



  Ping?
  I think this counts as a small, but real, bug. Should I go ahead
and report it as such, or would someone explain why it's not a bug?

  cheers
    Ben Bolker


Re: Bug in read.table?

Peter Dalgaard-2

On Nov 16, 2010, at 02:59 , Ben Bolker wrote:

>  Ping?
>  I think this counts as a small, but real, bug. Should I go ahead
> and report it as such, or would someone explain why it's not a bug?
>

I think filing it as a bug can be defended, but it is tricky to pinpoint exactly what the issue is. For example, notice that adding a few spaces changes the behaviour of scan() considerably:

> scan(what="")
1:  a b 'c
1: d e' f
5: g h' i
8:
Read 7 items
[1] "a"      "b"      "c\nd e" "f"      "g"      "h'"     "i"    

(I'm confused... What is it that we really want here?)
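
For a non-interactive version of the comparison, the two inputs can be passed through the text argument of scan(). This is only a sketch with made-up variable names. It suggests that the difference comes from where the quote sits: in the middle of a token it is kept as a literal character, while at the start of a field it opens a string that can run across a line break. Checking tokens for an embedded "\n" is one rough way to spot that a quoted field has swallowed a newline.

```r
# Quote characters in the middle of tokens
inline <- scan(text = "a b'c\nd e'f\ng h'i", what = "", quiet = TRUE)

# Same letters, but the first quote now starts a field
leading <- scan(text = " a b 'c\nd e' f\ng h' i", what = "", quiet = TRUE)

inline
leading

# Which tokens contain a line break?
swallowed <- grepl("\n", leading, fixed = TRUE)
which(swallowed)
```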

Also, as you noted originally, beware the "Well don't do that then" aspect...

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]
