Re: download.file does not process gz files correctly (truncates them?)

7 messages

Re: download.file does not process gz files correctly (truncates them?)

hadley wickham
On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera
<[hidden email]> wrote:

> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>
>> Also, as mentioned in my
>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>> not specifying the mode argument, the default on Windows is mode = "w"
>> *except* for certain, case-sensitive, filename extensions:
>>
>>      if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>          mode <- "wb"
>>
>> Just like the need for mode = "wb" on Windows, the above
>> special-file-extension-hack is only happening on Windows, and is only
>> documented in ?download.file if you're on Windows; so someone who's on
>> Linux/macOS trying to help someone on Windows may not be aware of
>> this. This adds to even more confusions, e.g. "works for me".
>
> If we were designing the API today, it would probably make more sense not to
> convert any line endings by default. Today's editors _usually_ can cope with
> different line endings and it is probably easier to detect that a text file
> has incorrect line endings rather than detecting that a binary file has been
> corrupted by an attempt to convert line endings. But whether to change
> existing, documented behavior is a different question. In order to help
> users and programmers who do not read the documentation carefully we would
> create problems for users and programmers who do. The current heuristic/hack
> is in line with the compatibility approach: it detects files that are
> obviously binary, so it changes the default behavior only for cases when it
> would obviously cause damage.
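
The extension heuristic quoted above can be checked directly. The following is a minimal sketch that reuses the pattern from the quoted snippet; the example.org URLs are hypothetical and serve only to exercise the pattern. It suggests why relying on the default can surprise: the match is case-sensitive, anchored at the end of the URL, and does not cover every binary format.

```r
# Pattern copied from the snippet quoted above.
binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Hypothetical example URLs, used only to exercise the pattern.
urls <- c(
  "https://example.org/data.csv.gz",   # lower-case .gz at the end
  "https://example.org/data.CSV.GZ",   # upper-case extension
  "https://example.org/model.rds",     # .rds is not in the list
  "https://example.org/data.gz?dl=1"   # query string after the extension
)

# TRUE means the Windows default would switch to mode = "wb".
matches <- grepl(binary_ext, urls)
print(data.frame(url = urls, binary_default = matches))
```

Only the first URL matches. Passing mode = "wb" explicitly to download.file() avoids depending on the heuristic at all.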

From a purely utilitarian standpoint, there are far more users who do
not carefully read the documentation than users who do ;)

(I'd also argue that basing the decision on the file extension is
suboptimal, and it would be better to use the mime type if provided by
the server)
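
A rough sketch of what a MIME-type-based decision could look like. The helper name mode_from_content_type() is hypothetical and not part of R, and the rule it applies (text/* plus a couple of textual application types count as text, everything else as binary) is an assumption for illustration only.

```r
# Hypothetical helper: choose a download mode from a Content-Type header value.
mode_from_content_type <- function(content_type) {
  # No usable header: fall back to binary, which never alters bytes.
  if (length(content_type) != 1L || is.na(content_type) ||
      !nzchar(content_type)) {
    return("wb")
  }
  # Drop parameters such as "; charset=utf-8" and normalise case.
  type <- tolower(trimws(sub(";.*$", "", content_type)))
  # Assumed rule: only these types are treated as text.
  is_text <- startsWith(type, "text/") ||
    type %in% c("application/json", "application/xml")
  if (is_text) "w" else "wb"
}

print(mode_from_content_type("application/gzip"))
print(mode_from_content_type("text/csv; charset=utf-8"))
```

The header itself might be obtained with curlGetHeaders() before downloading, at the cost of an extra request; servers that send a generic or wrong type would still need a fallback.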

Hadley

--
http://hadley.nz

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: download.file does not process gz files correctly (truncates them?)

hadley wickham
On Tue, May 8, 2018 at 8:15 AM, Hadley Wickham <[hidden email]> wrote:

> On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera
> <[hidden email]> wrote:
>> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>>
>>> Also, as mentioned in my
>>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>>> not specifying the mode argument, the default on Windows is mode = "w"
>>> *except* for certain, case-sensitive, filename extensions:
>>>
>>>      if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>>          mode <- "wb"
>>>
>>> Just like the need for mode = "wb" on Windows, the above
>>> special-file-extension-hack is only happening on Windows, and is only
>>> documented in ?download.file if you're on Windows; so someone who's on
>>> Linux/macOS trying to help someone on Windows may not be aware of
>>> this. This adds to even more confusions, e.g. "works for me".
>>
>> If we were designing the API today, it would probably make more sense not to
>> convert any line endings by default. Today's editors _usually_ can cope with
>> different line endings and it is probably easier to detect that a text file
>> has incorrect line endings rather than detecting that a binary file has been
>> corrupted by an attempt to convert line endings. But whether to change
>> existing, documented behavior is a different question. In order to help
>> users and programmers who do not read the documentation carefully we would
>> create problems for users and programmers who do. The current heuristic/hack
>> is in line with the compatibility approach: it detects files that are
>> obviously binary, so it changes the default behavior only for cases when it
>> would obviously cause damage.
>
> From a purely utilitarian standpoint, there are far more users who do
> not carefully read the documentation than users who do ;)
>
> (I'd also argue that basing the decision on the file extension is
> suboptimal, and it would be better to use the mime type if provided by
> the server)

Also note that MS just announced support for unix line endings in notepad

https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/

Hadley

--
http://hadley.nz


Re: download.file does not process gz files correctly (truncates them?)

Tomas Kalibera
In reply to this post by hadley wickham
On 05/08/2018 05:15 PM, Hadley Wickham wrote:

> On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera
> <[hidden email]> wrote:
>> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>> Also, as mentioned in my
>>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>>> not specifying the mode argument, the default on Windows is mode = "w"
>>> *except* for certain, case-sensitive, filename extensions:
>>>
>>>       if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>>           mode <- "wb"
>>>
>>> Just like the need for mode = "wb" on Windows, the above
>>> special-file-extension-hack is only happening on Windows, and is only
>>> documented in ?download.file if you're on Windows; so someone who's on
>>> Linux/macOS trying to help someone on Windows may not be aware of
>>> this. This adds to even more confusions, e.g. "works for me".
>> If we were designing the API today, it would probably make more sense not to
>> convert any line endings by default. Today's editors _usually_ can cope with
>> different line endings and it is probably easier to detect that a text file
>> has incorrect line endings rather than detecting that a binary file has been
>> corrupted by an attempt to convert line endings. But whether to change
>> existing, documented behavior is a different question. In order to help
>> users and programmers who do not read the documentation carefully we would
>> create problems for users and programmers who do. The current heuristic/hack
>> is in line with the compatibility approach: it detects files that are
>> obviously binary, so it changes the default behavior only for cases when it
>> would obviously cause damage.
>  From a purely utilitarian standpoint, there are far more users who do
> not carefully read the documentation than users who do ;)
And for that reason the behavior should be as intuitive as possible when
designed. What was intuitive 15-20 years ago may not be intuitive now,
but that should probably not be a justification for a change in
documented behavior.
> (I'd also argue that basing the decision on the file extension is
> suboptimal, and it would be better to use the mime type if provided by
> the server)
Yes, that would be nice. Also some binary files could be detected via
magic numbers (yet not all, e.g. RDS do not have them). It won't be as
trivial as decoding the URL, though.
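
A minimal sketch of magic-number detection for a file that is already on disk. The function name detect_compression() is hypothetical; the byte signatures are the standard leading bytes of the gzip, bzip2, xz and zip formats. As noted above, formats without a signature cannot be recognised this way.

```r
# Hypothetical helper: guess the container format from the leading bytes.
detect_compression <- function(path) {
  # Read at most six bytes; shorter files simply return fewer bytes.
  header <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(magic) {
    length(header) >= length(magic) &&
      identical(header[seq_along(magic)], magic)
  }
  if (starts_with(as.raw(c(0x1f, 0x8b)))) return("gzip")
  if (starts_with(as.raw(c(0x42, 0x5a, 0x68)))) return("bzip2")
  if (starts_with(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) return("xz")
  if (starts_with(as.raw(c(0x50, 0x4b, 0x03, 0x04)))) return("zip")
  "unknown"
}

# Usage: write a small gzip file and a plain text file, then inspect both.
gz_path <- tempfile(fileext = ".bin")
con <- gzfile(gz_path, "w")
writeLines("hello", con)
close(con)

txt_path <- tempfile(fileext = ".txt")
writeLines("hello", txt_path)

print(detect_compression(gz_path))
print(detect_compression(txt_path))
```

For a download, a check like this can only run after the first bytes have arrived, which is part of why it is less trivial than inspecting the URL.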

Tomas

>
> Hadley
>


Re: download.file does not process gz files correctly (truncates them?)

Duncan Murdoch-2
In reply to this post by hadley wickham
On 08/05/2018 4:47 PM, Hadley Wickham wrote:

> On Tue, May 8, 2018 at 8:15 AM, Hadley Wickham <[hidden email]> wrote:
>> On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera
>> <[hidden email]> wrote:
>>> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>>>
>>>> Also, as mentioned in my
>>>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>>>> not specifying the mode argument, the default on Windows is mode = "w"
>>>> *except* for certain, case-sensitive, filename extensions:
>>>>
>>>>       if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>>>           mode <- "wb"
>>>>
>>>> Just like the need for mode = "wb" on Windows, the above
>>>> special-file-extension-hack is only happening on Windows, and is only
>>>> documented in ?download.file if you're on Windows; so someone who's on
>>>> Linux/macOS trying to help someone on Windows may not be aware of
>>>> this. This adds to even more confusions, e.g. "works for me".
>>>
>>> If we were designing the API today, it would probably make more sense not to
>>> convert any line endings by default. Today's editors _usually_ can cope with
>>> different line endings and it is probably easier to detect that a text file
>>> has incorrect line endings rather than detecting that a binary file has been
>>> corrupted by an attempt to convert line endings. But whether to change
>>> existing, documented behavior is a different question. In order to help
>>> users and programmers who do not read the documentation carefully we would
>>> create problems for users and programmers who do. The current heuristic/hack
>>> is in line with the compatibility approach: it detects files that are
>>> obviously binary, so it changes the default behavior only for cases when it
>>> would obviously cause damage.
>>
>>  From a purely utilitarian standpoint, there are far more users who do
>> not carefully read the documentation than users who do ;)
>>
>> (I'd also argue that basing the decision on the file extension is
>> suboptimal, and it would be better to use the mime type if provided by
>> the server)
>
> Also note that MS just announced support for unix line endings in notepad
>
> https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/

Perhaps soon RStudio will follow Notepad's lead, and not convert line
endings when it saves a non-native file.

Duncan Murdoch


Re: download.file does not process gz files correctly (truncates them?)

Peter Dalgaard-2
In reply to this post by hadley wickham
There was a hint in the Twitterverse that Excel has issues with line endings in .csv. Can anyone elaborate on that? Then again, Excel goes belly-up on comma separators in central European locales anyway...

-pd

> On 8 May 2018, at 22:47 , Hadley Wickham <[hidden email]> wrote:
>
>
> Also note that MS just announced support for unix line endings in notepad
>
> https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/
>
> Hadley
>

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: download.file does not process gz files correctly (truncates them?)

Dirk Eddelbuettel
In reply to this post by Tomas Kalibera

On 9 May 2018 at 10:37, Tomas Kalibera wrote:
| And for that reason the behavior should be as intuitive as possible when
| designed. What was intuitive 15-20 years ago may not be intuitive now,
| but that should probably not be a justification for a change in
| documented behavior.

Time for downloadFile() (or download_file()) to complement the existing
download.file() while providing what we now think of as intuitive behaviour?
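
One possible shape for such a function, as a sketch only: download_file() below is hypothetical and does not exist in R. It assumes that "intuitive" means never converting line endings, so it forwards everything to utils::download.file() with mode = "wb" unless the caller overrides it.

```r
# Hypothetical wrapper: same arguments as download.file(), binary by default.
download_file <- function(url, destfile, ..., mode = "wb") {
  utils::download.file(url, destfile, mode = mode, ...)
}

# Usage with a local file:// URL, so no network access is needed.
# The path handling below assumes a Unix-like absolute path.
src <- tempfile(fileext = ".gz")
writeBin(as.raw(c(0x1f, 0x8b, 0x0a, 0x0d, 0x0a, 0x00)), src)
dest <- tempfile(fileext = ".gz")

status <- download_file(paste0("file://", normalizePath(src)), dest,
                        quiet = TRUE)

# The copy should be byte-for-byte identical to the source.
same_bytes <- identical(readBin(src, "raw", n = 16L),
                        readBin(dest, "raw", n = 16L))
print(same_bytes)
```

Existing code that calls download.file() keeps its documented behaviour; only callers that opt in to the new name get the binary default.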

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: download.file does not process gz files correctly (truncates them?)

Joris FA Meys
In reply to this post by Peter Dalgaard-2
I can confirm that Excel does all kinds of strange things when opening a CSV
file and saving it again, including adding an unnecessary second set of
quotes around already quoted text fields. But I have never had problems with
Excel mishandling Linux-style line endings. I'll see if I can make
Excel mess it up, but given the amount of Excel crap I have had to endure over
the years, I'd be surprised if I had missed such behaviour until now.

Cheers
Joris

On Wed, May 9, 2018 at 3:09 PM, peter dalgaard <[hidden email]> wrote:

> There was a hint in the Twitterverse that Excel has issues with line
> endings in .csv. Can anyone elaborate on that? Then again, Excel goes
> belly-up on comma separators in central European locales anyway...
>
> -pd
>
> > On 8 May 2018, at 22:47 , Hadley Wickham <[hidden email]> wrote:
> >
> >
> > Also note that MS just announced support for unix line endings in notepad
> >
> > https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/
> >
> > Hadley
> >
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: [hidden email]  Priv: [hidden email]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
