Inconsistency, may be bug in read.delim ?

Detlef Steuer-2
Dear friends,

I stumbled upon behaviour of read.delim which I would consider a bug
or at least an inconsistency that should be improved upon.

Recently we had to work with data that used "", two double quotes, as
the delimiter that starts and ends character fields.

Essentially the data looked like this

data.csv
========
V1, V2, V3
""data"", 3, """"

The last sequence of """" indicates a missing value.

One obvious solution to read in this data is using some gsub(),
but that's not the point I want to make.

Consider this case we found during tests:

test.csv
========
V1, V2, V3, V4
"""", """", 3, ""

and read it with
> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"")  

you get the following

  V1 V2 V3 V4
1 NA  "  3 NA  

(and a warning)

I would have expected an error message, or at
least the same result for both occurrences of """" in the
input file.
(The setting na.strings="\"" happened to work for
 a colleague and his specific data, although I think it shouldn't.)

My main concern is the different interpretation for the two """"
sequences.

Real bug? Minor inconsistency? I don't know.

All the best
Detlef


--
'People who say "I have nothing to hide" misunderstand the purpose of
surveillance. It was never about privacy. It's about power.' E. Snowden

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Inconsistency, may be bug in read.delim ?

Tomas Kalibera
On 03/19/2018 02:23 PM, Detlef Steuer wrote:

> Dear friends,
>
> I stumbled upon behaviour of read.delim which I would consider a bug
> or at least an inconsistency that should be improved upon.
>
> Recently we had to work with data that used "", two double quotes, as
> symbol to start and end character input.
>
> Essentially the data looked like this
>
> data.csv
> ========
> V1, V2, V3
> ""data"", 3, """"
>
> The last sequence of """" indicates a missing value.
After processing the quotes, this is internally parsed as

data 3 "

Which I think is correct; in particular, """" represents a single
double-quote character ("), in conformance with RFC 4180. "" in contrast
represents an empty string.

Based on my reading of RFC 4180, ""data"" is not a valid field, but not
every CSV file follows that RFC, and R handles this pattern the way your
data expects. So you should be fine here.
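
The parse described above can be checked without a file by passing the
contents of data.csv through the text argument. This is only a minimal
sketch: strip.white=TRUE and stringsAsFactors=FALSE are assumptions added
for the illustration and were not part of the original call.

```r
# Contents of data.csv from the original post, inlined so the example
# does not depend on a file on disk.
csv <- 'V1, V2, V3\n""data"", 3, """"\n'

# Assumptions for this sketch: strip.white = TRUE drops the blank that
# follows each comma; stringsAsFactors = FALSE keeps character columns
# as plain strings on older R versions.
d <- read.delim(text = csv, sep = ",", header = TRUE,
                strip.white = TRUE, stringsAsFactors = FALSE)

print(d$V1)  # the doubled quotes around data are consumed
print(d$V3)  # """" collapses to one double-quote character
```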

> One obvious solution to read in this data is using some gsub(),
> but that's not the point I want to make.
>
> Consider this case we found during tests:
>
> test.csv
> ========
> V1, V2, V3, V4
> """", """", 3, ""
>
> and read it with
>> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"")
After processing the quotes, this is internally parsed as
" " 3 <empty_string>

which again is, I think, correct (and conforms to RFC 4180).

> you get the following
>
>    V1 V2 V3 V4
> 1 NA  "  3 NA
>
> (and a warning)

I do not get the warning on my system. The reason why the second " is
not translated to NA by na.strings is the white space after the comma in
the CSV file. This works more consistently:

 > read.delim("test.csv", sep=",", header=TRUE, na.strings="\"", strip.white=TRUE)
   V1 V2 V3 V4
1 NA NA  3 NA
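
To see that the blank after the comma is what makes the difference, the
original call (without strip.white) can be repeated on inlined data and
the second field inspected. A rough sketch; the text argument,
stringsAsFactors=FALSE and the trimws() check are additions for
illustration only.

```r
# Contents of test.csv from the original post, inlined.
csv <- 'V1, V2, V3, V4\n"""", """", 3, ""\n'

# Same call as in the original post, reading from text instead of a file.
# stringsAsFactors = FALSE is an assumption so that V2 stays character.
d <- read.delim(text = csv, sep = ",", header = TRUE,
                na.strings = "\"", stringsAsFactors = FALSE)

is.na(d$V1)            # the first """" matched na.strings
is.na(d$V2)            # the second one did not
trimws(d$V2) == "\""   # compares equal once surrounding blanks are removed
```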

If one needed to differentiate between " and <empty_string>, then it
might be necessary to run without the na.strings argument.
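
In a one-row example like this, dropping na.strings alone may not be
enough, because a column that holds only empty fields can still come out
as NA during type conversion. One possible way around that, sketched
here under the assumption that reading every column as character is
acceptable (colClasses="character" is my addition, not something from
the original call):

```r
# Contents of test.csv from the original post, inlined.
csv <- 'V1, V2, V3, V4\n"""", """", 3, ""\n'

# No na.strings argument; colClasses = "character" (an assumption of this
# sketch) turns off type conversion, so empty fields stay empty strings.
d <- read.delim(text = csv, sep = ",", header = TRUE,
                strip.white = TRUE, colClasses = "character")

d$V1 == "\""   # a single double-quote character
d$V4 == ""     # an empty string, distinct from the quote above
```

Numeric columns then arrive as character and would need an explicit
as.numeric() afterwards.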

Best
Tomas

> I would have expected an error message, or at
> least the same result for both occurrences of """" in the
> input file.
> (the setting na.strings="\"" turned out to be working for
>   a colleague and his specific data, while I think it shouldn't)
>
> My main concern is the different interpretation for the two """"
> sequences.
>
> Real bug? Minor inconsistency? I don't know.
>
> All the best
> Detlef
>
>
