URL checks


URL checks

R devel mailing list
Hi


The URL checks in R CMD check test all links in the README and vignettes
for broken or redirected links. In many cases this improves
documentation, but I see problems with this approach, which I have
detailed below.

I'm writing to this mailing list because I think the change needs to
happen in R's check routines. I propose to introduce an "allow-list" for
URLs, to reduce the burden on both CRAN and package maintainers.

Comments are greatly appreciated.


Best regards

Kirill


# Problems with the detection of broken/redirected URLs

## 301 should often be 307, how to change?

Many web sites use a 301 redirection code that probably should be a 307.
For example, https://www.oracle.com and https://www.oracle.com/ both
redirect to https://www.oracle.com/index.html with a 301. I suspect the
company still wants oracle.com to be recognized as the primary entry
point of their web presence (to reserve the right to move the
redirection to a different location later), though I haven't checked
with their PR department. If that's true, the redirect should probably
be a 307 -- a fix that would be up to their IT department, which I
haven't contacted yet either.

$ curl -i https://www.oracle.com
HTTP/2 301
server: AkamaiGHost
content-length: 0
location: https://www.oracle.com/index.html
...
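
The same status is visible from R itself, for example with base R's
curlGetHeaders() -- just an illustration, I'm not claiming this is
exactly what the check code does:

h <- curlGetHeaders("https://www.oracle.com", redirect = FALSE)
attr(h, "status")  # 301, matching the curl output above
h <- curlGetHeaders("https://www.oracle.com")  # now follow redirects
attr(h, "status")  # status of the final target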

## User agent detection

twitter.com responds with a 400 error to requests whose user agent
string does not hint at an accepted browser.

$ curl -i https://twitter.com/
HTTP/2 400
...
<body>...<p>Please switch to a supported browser...</p>...</body>

$ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
HTTP/2 200
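
For reference, the same comparison can be made from R with the curl
package -- a sketch only, assuming that package is installed, and not
what R CMD check does internally (the R curl bindings also send their
own default user agent, so the first call is not exactly equivalent to
the bare curl request):

library(curl)

status_with_ua <- function(url, ua = NULL) {
  h <- if (is.null(ua)) new_handle() else new_handle(useragent = ua)
  # curl_fetch_memory() does not raise an error for HTTP error statuses,
  # so the status code can be inspected directly
  curl_fetch_memory(url, handle = h)$status_code
}

status_with_ua("https://twitter.com/")
status_with_ua("https://twitter.com/",
  "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
# compare with the 400 and 200 seen in the curl calls above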

# Impact

While the latter problem *could* be fixed by supplying a browser-like
user agent string, the former problem is virtually unfixable -- so many
web sites should use 307 instead of 301 but don't. The above list is
also incomplete -- think of unreliable links, HTTP links, other failure
modes...

This affects me as a package maintainer: I have the choice of either
changing the links to incorrect versions or removing them altogether.

I could also explain each broken link to CRAN, but I think this puts an
undue burden on the team. Submitting a package with NOTEs delays the
release of a package that I must release very soon to avoid having it
pulled from CRAN; I'd rather not risk that -- hence I need to remove the
link and put it back later.

I'm aware of https://github.com/r-lib/urlchecker; it alleviates the
problem but ultimately doesn't solve it.

# Proposed solution

## Allow-list

A file inst/URL that lists all URLs where failures are allowed --
possibly with a list of the HTTP codes accepted for that link.

Example:

https://oracle.com/ 301
https://twitter.com/drob/status/1224851726068527106 400
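
To illustrate how check code could consume such a file -- the file name,
format and function names are only a sketch, everything is up for
discussion:

# returns a data frame with one row per allowed URL and a list column
# holding the HTTP codes accepted for it (empty = any failure allowed)
read_url_allowlist <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  parts <- strsplit(lines[nzchar(lines)], "[[:space:]]+")
  data.frame(
    url   = vapply(parts, `[[`, character(1), 1),
    codes = I(lapply(parts, function(x) as.integer(x[-1]))),
    stringsAsFactors = FALSE
  )
}

# TRUE if the observed status for a URL is explicitly allowed
is_allowed <- function(url, status, allowlist) {
  i <- match(url, allowlist$url)
  !is.na(i) &&
    (length(allowlist$codes[[i]]) == 0L || status %in% allowlist$codes[[i]])
}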


Re: URL checks

R devel mailing list
One other failure mode: SSL certificates that browsers trust but that
are not installed on the check machine, e.g. the "GEANT Vereniging"
certificate from https://relational.fit.cvut.cz/ .


K




Re: URL checks

Hugo Gruson

I encountered the same issue today with https://astrostatistics.psu.edu/.

This is a trust chain issue, as explained here:
https://whatsmychaincert.com/?astrostatistics.psu.edu.

I've worked for a couple of years on a project to increase HTTPS
adoption on the web, and we noticed that this type of error is very
common and that website maintainers are often unresponsive to requests
to fix it.

Therefore, I totally agree with Kirill that a list of known
false-positive/exceptions would be a great addition to save time for
both the CRAN team and package developers.

Hugo



Re: URL checks

Spencer Graves-3
          I also would be pleased to be allowed to provide "a list of known
false-positive/exceptions" to the URL tests.  I've been challenged
multiple times regarding URLs that worked fine when I checked them.  We
should not be required to do a partial lobotomy to pass R CMD check ;-)


          Spencer Graves




Re: URL checks

Viechtbauer, Wolfgang (SP)
Instead of a separate file to store such a list, would it be an idea to add versions of the \href{}{} and \url{} markup commands that are skipped by the URL checks?
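
For illustration, something along these lines -- the command names are
made up, nothing like this exists today:

\urlnocheck{https://twitter.com/drob/status/1224851726068527106}
\hrefnocheck{https://oracle.com/}{Oracle}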

Best,
Wolfgang


Re: URL checks

Dirk Eddelbuettel
In reply to this post by Spencer Graves-3

The idea of 'white lists' to prevent known (and 'tolerated') issues,
notes, warnings, ... from needlessly reappearing is very powerful and
general, and can go much further than just URL checks.

I suggested several times in the past that we can look at the format Debian
uses in its 'lintian' package checker and its override files -- which are
used across thousands of packages there.  But that went nowhere so I stopped.
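
Roughly speaking, a lintian override file is a plain-text list of lines
that name a tag plus optional context. An R CMD check analogue could
look something like the following -- file name and tag names invented
purely for illustration:

# hypothetical file: tools/check-overrides
url-check: https://oracle.com/ 301
url-check: https://twitter.com/drob/status/1224851726068527106 400
note: no-visible-binding-for-global-variable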

This issue needs a champion or two to implement a prototype as well as a
potential R Core / CRAN sponsor to adopt it.  But in all those years no smoke
has come out of any chimneys so ...  ¯\_(ツ)_/¯ is all we get.

Dirk

--
https://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: URL checks

J C Nash
Is this a topic for Google Summer of Code? See
https://github.com/rstats-gsoc/gsoc2021/wiki




Re: URL checks

Martin Maechler
In reply to this post by Viechtbauer, Wolfgang (SP)
>>>>> Viechtbauer, Wolfgang (SP)
>>>>>     on Fri, 8 Jan 2021 13:50:14 +0000 writes:

    > Instead of a separate file to store such a list, would it be an idea to add versions of the \href{}{} and \url{} markup commands that are skipped by the URL checks?
    > Best,
    > Wolfgang

I think John Nash and you misunderstood -- or else I
misunderstood -- the original proposal:

My understanding has been that there should be a "central repository" of
URL exceptions that is maintained by volunteers,

and rather *not* that package authors should get ways to skip
URL checking.

Martin




Re: URL checks

Viechtbauer, Wolfgang (SP)

Hi Martin,

Kirill suggested: "A file inst/URL that lists all URLs where failures are allowed -- possibly with a list of the HTTP codes accepted for that link."

So, if it is a file in inst/, then this sounds to me like it is part of the package and not part of some central repository.

Best,
Wolfgang


Re: URL checks

Martin Maechler

Dear Wolfgang,
you are right and indeed it's *me* who misunderstood.

But then I don't think it's a particularly good idea: from a
CRAN point of view it is important that URLs in documents it
hosts do not raise errors (*), hence the validity checking of URLs.

So, CRAN (and other repository hosts) would need another option to
still check all URLs -- and would definitely want to do that before
accepting a package, and also to run such checks regularly on a
per-package basis, reported as part of the CRAN checks of the
respective package, right?

So this will get involved ... and maybe it *is* a good idea for a
Google Summer of Code (GSoC) project -- well, *if* it is supervised by
someone who's in close contact with the CRAN or Bioc maintainer teams.

Martin

--
*) Such URL errors then lead to e-mails or other reports from web
 site checking engines reporting that you are hosting (too many)
 web pages with invalid links.


Re: URL checks

J C Nash
In reply to this post by Martin Maechler

Sorry, Martin, but I've NOT commented on this matter, unless someone has been impersonating me.
Someone else?

JN


