read.table() fails with https in R 3.6 but not in R 3.5

Stephen Berman
In versions of R prior to 3.6.0 the following invocation succeeds,
returning the data frame shown:

> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
   Dekade   Anzahl
1    1900 11467254
2    1910 13023370
3    1920 13434601
4    1930 13296355
5    1940 12121250
6    1950 13191131
7    1960 10587420
8    1970 10944129
9    1980 11279439
10   1990 12052652

But in version 3.6.0 it fails:

> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
Error in file(file, "rt") :
  cannot open the connection to 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
In addition: Warning message:
In file(file, "rt") :
  cannot open URL 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text': HTTP status was '403 Forbidden'

The table at this URL is generated by a query processor, and the same
failure happens in 3.6.0 with other queries at this website.  The
website does not appear to serve data via http: replacing https with
http in the above gives the same results, and in 3.6.0 the error
message then shows the URL with http while the warning message shows it
with https.  I have also tried a few other websites that serve
(non-generated) tabular data via https
(e.g. https://graphchallenge.s3.amazonaws.com/synthetic/gc3/Theory-16-25-81-Bk.tsv),
and with these read.table() succeeds in 3.6.0, so the problem isn't
https in general.  Maybe it has to do with the page being generated
rather than static?  There's only one reference to https in the 3.6.0
NEWS, concerning libcurl; I can't tell if it's relevant.
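
When comparing two R installations for a problem like this, it may help to print the settings that influence how URLs are fetched. The following is only a small diagnostic sketch using base R functions; the variable names are illustrative and not part of the original report.

```r
# Collect the settings that influence how this R session fetches URLs.
ua <- getOption("HTTPUserAgent")        # user agent string R is configured to send
has_libcurl <- capabilities("libcurl")  # TRUE if this build of R can use libcurl
curl_version <- libcurlVersion()        # libcurl version string (with attributes)

cat("HTTPUserAgent:    ", ua, "\n")
cat("libcurl available:", has_libcurl, "\n")
cat("libcurl version:  ", as.vector(curl_version), "\n")
```

Running this in both R 3.5 and R 3.6 shows whether the configured user agent or the libcurl version differs between the two.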

In case it matters, this is with R packaged for openSUSE, and I've found
the above difference between 3.5 and 3.6 on both openSUSE Leap 15.0 and
openSUSE Tumbleweed.

Steve Berman

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: read.table() fails with https in R 3.6 but not in R 3.5

Ralf Stubner
On 04.05.19 19:04, Stephen Berman wrote:

> But in version 3.6.0 it fails:
>
>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
> Error in file(file, "rt") :
>   cannot open the connection to 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
> In addition: Warning message:
> In file(file, "rt") :
>   cannot open URL 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text': HTTP status was '403 Forbidden'

I can reproduce the behavior on Debian using the CRAN-supplied package
for R 3.6.0. Reading the page with 'curl' also produces a 403 error,
plus some HTML text (in German) explaining that I am being treated as a
'robot' because of the supplied User-Agent (here: curl/7.52.1). One
solution suggested there is to adjust that value, which does solve the
issue:

> options(HTTPUserAgent='mozilla')
> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
   Dekade   Anzahl
1    1900 11467254
2    1910 13023370
3    1920 13434601
4    1930 13296355
5    1940 12121250
6    1950 13191131
7    1960 10587420
8    1970 10944129
9    1980 11279439
10   1990 12052652

Other solutions are to simulate a login or to get in touch with DWDS
directly.
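
Since options(HTTPUserAgent=...) changes the setting for the whole session, one might prefer to apply it only around the one request that needs it. The sketch below assumes a hypothetical helper, with_user_agent(), which is not part of R; it relies only on options() returning the previous values and on on.exit() for restoring them.

```r
# Hypothetical helper: evaluate `expr` with a temporary HTTPUserAgent,
# then restore the previous value even if `expr` fails.
with_user_agent <- function(agent, expr) {
  old <- options(HTTPUserAgent = agent)  # options() returns the old values
  on.exit(options(old), add = TRUE)      # restore on exit, error or not
  force(expr)                            # evaluate the lazily passed expression
}

# Usage (needs network access, so it is left commented out):
# dwds <- with_user_agent("mozilla", read.table(
#   "https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
#   header = TRUE))
```

Because function arguments in R are evaluated lazily, the read.table() call only runs at force(expr), i.e. while the temporary user agent is in effect.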

Greetings
Ralf


Re: read.table() fails with https in R 3.6 but not in R 3.5

Stephen Berman
On Mon, 6 May 2019 11:12:25 +0200 Ralf Stubner <[hidden email]> wrote:

> I can reproduce the behavior on Debian using the CRAN supplied package
> for R 3.6.0. Trying to read the page with 'curl' produces also a 403
> error plus some HTML text (in German) explaining that I am treated as a
> 'robot' due to the supplied User-Agent (here: curl/7.52.1). One
> suggested solution is to adjust that value which does solve the issue:
>
>  > options(HTTPUserAgent='mozilla')

I confirm that works for me, too.  Thanks!  FWIW, the default value of
HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64
linux-gnu)", and using this (in R 3.6) fails as I reported, while the
default value of HTTPUserAgent in R 3.5 here is "R (3.5.0
x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5)
succeeds.  However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0"
fails just as it does in 3.6.  It's not clear to me if this particular
website is being too restrictive or if R 3.6 should deal with it, or at
least mention the issue in NEWS or somewhere else.
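
The comparison described above can be repeated systematically for several user agent strings. The following sketch assumes a hypothetical helper, try_user_agents(), that takes the request as a zero-argument function (here called fetch); neither name comes from R itself.

```r
# Hypothetical helper: call `fetch()` once per user agent string and record
# whether the call succeeded. `fetch` is any zero-argument function that
# performs the request, e.g. a wrapper around read.table().
try_user_agents <- function(agents, fetch) {
  old <- options("HTTPUserAgent")    # remember the current setting
  on.exit(options(old), add = TRUE)  # and restore it afterwards
  vapply(agents, function(agent) {
    options(HTTPUserAgent = agent)
    tryCatch({
      suppressWarnings(fetch())
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
}

# Usage (needs network access, so it is left commented out):
# u <- "https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text"
# try_user_agents(c("mozilla", "libcurl/7.60.0"),
#                 function() read.table(u, header = TRUE))
```

The result is a named logical vector with one entry per user agent string. Note that, according to the rest of this thread, strings starting with "R (" may not be sent unchanged in R 3.6, so results for such strings need to be read with that in mind.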

Steve Berman


Re: read.table() fails with https in R 3.6 but not in R 3.5

Tomas Kalibera
On 5/6/19 2:27 PM, Stephen Berman wrote:

> I confirm that works for me, too.  Thanks!  FWIW, the default value of
> HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64
> linux-gnu)", and using this (in R 3.6) fails as I reported, while the
> default value of HTTPUserAgent in R 3.5 here is "R (3.5.0
> x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5)
> succeeds.  However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0"
> fails just as it does in 3.6.  It's not clear to me if this particular
> website is being too restrictive or if R 3.6 should deal with it, or at
> least mention the issue in NEWS or somewhere else.

This is because (from NEWS):

    The default ‘user agent’ has been changed when accessing http://
    and https:// sites using libcurl.  (A site was found which caused
    libcurl to infinite-loop with the previous default.)

This website accepts the default R user agent string (also the ones for
R 3.6 and R-devel), but it rejects "libcurl/...". Setting the user
agent to anything starting with "R (" will not help in R 3.6, because
it is automatically changed to "libcurl/..." when libcurl is used (note
that wget and curl on the command line also fail on this website). I am
afraid this has to be solved on the user side, e.g. as hinted in the
German text one gets when requesting the page with curl. R should not
attempt to circumvent access restrictions on external websites.
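
The substitution rule described above can be written down as a small model. This is only an illustration of the behavior as stated in this message, not R's actual C implementation; the function name effective_user_agent() and the default libcurl version (taken from the example earlier in the thread) are assumptions.

```r
# Illustrative model only; this is NOT R's actual implementation.
# Returns the user agent that would be sent under the rule described above:
# a configured string starting with "R (" is replaced by "libcurl/<version>".
effective_user_agent <- function(agent, libcurl_version = "7.60.0") {
  if (startsWith(agent, "R (")) {
    paste0("libcurl/", libcurl_version)
  } else {
    agent
  }
}

effective_user_agent("R (3.6.0 x86_64-suse-linux-gnu x86_64 linux-gnu)")
# "libcurl/7.60.0"
effective_user_agent("mozilla")
# "mozilla"
```

Under this model, any string that does not begin with "R (" is passed through unchanged, which is consistent with options(HTTPUserAgent='mozilla') working on this website.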

Best
Tomas

Re: read.table() fails with https in R 3.6 but not in R 3.5

Gábor Csárdi
Hi Tomas,

On Mon, May 13, 2019 at 11:42 AM Tomas Kalibera
<[hidden email]> wrote:
[...]
> This is because (from NEWS:)
>
> The default ‘user agent’ has been changed when accessing http://
>        and https:// sites using libcurl.  (A site was found which caused
>        libcurl to infinite-loop with the previous default.)

Which site was this? Maybe it can be fixed on their end?

The current behavior is not really ideal, because the `libcurl/x.y.z`
string is not only a default: as you mention above, anything that
starts with `R (` is replaced with it, so it is basically impossible to
send out a user agent that starts with `R (`. This was very surprising
to me, and I had to go to the C source code to see why R does not
respect my `HTTPUserAgent` option. Would it make sense to document
this in `?options`?

Actually, the default that includes R's version number seems more
sensible to me. Maybe we can just add `libcurl/x.y.z` to that to work
around that buggy site? I would be happy to test this and send a
patch, if you could let me know which website it was.
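
To make the suggestion concrete, a combined string of that kind could be assembled at user level roughly as follows. This is a sketch only: the function name proposed_user_agent() and the exact layout of the combined string are assumptions, and, as discussed above, a value starting with `R (` set via options() would currently still be replaced in R 3.6.

```r
# Sketch of a combined user agent: R's documented default form,
# "R (<version> <platform> <arch> <os>)", followed by "libcurl/<version>".
# The layout of the combined string is an assumption for illustration.
proposed_user_agent <- function() {
  r_part <- paste0(
    "R (",
    paste(as.character(getRversion()), R.version$platform,
          R.version$arch, R.version$os),
    ")"
  )
  # as.vector() drops the attributes that libcurlVersion() attaches
  paste0(r_part, " libcurl/", as.vector(libcurlVersion()))
}

ua <- proposed_user_agent()
cat(ua, "\n")
```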

Thanks!
Gabor

[...]
