readLines interaction with gsub different in R-dev

Hugh Parsonage
I was told to re-raise this issue with R-dev:

In the documentation of R-dev and R-3.4.3, under ?gsub

> replacement
>    ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.

However, the following code runs differently:

tempf <- tempfile()
writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
gsub("(\\w)", "\\U\\1", entry, perl = TRUE)


"AUTHOR: AMÉLIE"  # R-3.4.3

"A"               # R-dev



Best,

Hugh Parsonage.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: readLines interaction with gsub different in R-dev

Dirk Eddelbuettel

On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
|
| In the documentation of R-dev and R-3.4.3, under ?gsub
|
| > replacement
| >    ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.
|
| However, the following code runs differently:
|
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
|
|
| "AUTHOR: AMÉLIE"  # R-3.4.3
|
| "A"                              # R-dev

Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
you use wrong, ie isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AMÉLIE"
R>

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]

Re: readLines interaction with gsub different in R-dev

Hugh Parsonage
| Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie"   # OK, but very different from "A", even though the only change is omitting the uppercase conversion

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE"  # OK, but very different from "A"

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR"  # Where did everything after the first group go?

I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AMéLIE"  # latin1 encoding


A call to `readLines()` (or possibly `scan()`, `read.table()` and friends)
is essential to reproduce the problem.
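
For reference, the documented behaviour of the case-conversion escapes can be illustrated on ASCII-only input, where the results reported in this thread are the same in R-3.4.3 and R-devel. This is only a minimal sketch; the variable names (`ascii_entry`, `upper_all`, `upper_key`, `mixed`) are made up for illustration.

```r
# ASCII-only input: the case-conversion escapes behave as ?gsub describes.
ascii_entry <- "author: Amelie"

# \U uppercases each captured word character; the colon and the space
# are not matched by \w and are left untouched.
upper_all <- gsub("(\\w)", "\\U\\1", ascii_entry, perl = TRUE)

# \E ends the conversion, so only the first group is uppercased.
upper_key <- gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", ascii_entry, perl = TRUE)

# \L and \U can be combined within a single replacement.
mixed <- gsub("(\\w)(\\w)", "<\\L\\1\\U\\2>", "Amelia", perl = TRUE)

print(c(upper_all, upper_key, mixed))
```

The non-ASCII case is deliberately left out here, since that is exactly where the results differ between versions.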

Re: readLines interaction with gsub different in R-dev

William Dunlap via R-devel
I think the problem in R-devel happens when there are non-ASCII characters in
any of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))), rawToChar, "")
txt
#[1] "Amélie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] "<a" "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] "<aM><eL><iA>"

I can change the Encoding to "latin1" or "UTF-8" and get similar results
from gsub.
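
One way to see which elements of a vector are affected by the observation above is to flag the strings that contain bytes outside the ASCII range. The following is a rough sketch, not anything from R itself: `has_non_ascii` is a hypothetical helper name, and the byte-level test assumes the strings are in an ASCII-compatible encoding such as UTF-8 or latin1.

```r
# Build the two test strings from raw bytes so this script stays ASCII-only.
txt <- vapply(
  list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),  # A m e-acute l i e
       as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))),        # A m e l i a
  rawToChar, "")

# Hypothetical helper: TRUE for elements containing any byte above 0x7f.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

flags <- has_non_ascii(txt)
print(flags)

# The ASCII-only element, processed on its own, gives the expected result.
ascii_result <- gsub("(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[!flags], perl = TRUE)
print(ascii_result)
```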


Bill Dunlap
TIBCO Software
wdunlap tibco.com

Re: readLines interaction with gsub different in R-dev

Tomas Kalibera
Thank you for the report and analysis. Now fixed in R-devel.
Tomas
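
A quick way to check whether a given build still truncates the result is to compare character counts before and after the conversion. This is a sketch only, assuming a build of R that includes the fix; the names `original`, `converted` and `not_truncated` are illustrative, and the check says nothing about whether the accented character itself is uppercased.

```r
# Write a UTF-8 line to a temporary file and read it back, as in the report.
# The string is built from raw bytes so this script itself stays ASCII-only.
original <- rawToChar(as.raw(c(0x61, 0x75, 0x74, 0x68, 0x6f, 0x72, 0x3a, 0x20,
                               0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)))
Encoding(original) <- "UTF-8"

tempf <- tempfile()
writeLines(original, con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
unlink(tempf)

converted <- gsub("(\\w)", "\\U\\1", entry, perl = TRUE)

# Case conversion maps each character to one character, so the character
# count should be unchanged; a shorter result signals truncation.
not_truncated <- nchar(converted, type = "chars") == nchar(entry, type = "chars")
print(not_truncated)
```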