Possible encoding bug in sub()

Possible encoding bug in sub()

Korpela Mikko (MML)
I noticed that sub() gives unexpected results for the following test
case. In the test case, the (initial) input is ASCII but the
replacements are UTF-8. The first sub() produces a UTF-8 result with
an "unknown" Encoding. This makes the result garbled on Windows (no
UTF-8 locale there). The second sub() produces a correct result,
although for some reason it is converted to the native Encoding on
Windows.

I think the best result would be UTF-8 output marked as such.

foo <- c("a", "b")
foo <- sub("a", "\u00e4", foo)
print(Encoding(foo))
## [1] "unknown" "unknown"
foo <- sub("b", "\u00f6", foo)
print(Encoding(foo))
## [1] "unknown" "unknown" # Windows
## [1] "unknown" "UTF-8"   # Linux
print(foo)
## [1] "Ã¤" "ö"            # Windows
## [1] "ä" "ö"             # Linux
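
Until sub() marks its result, one possible workaround is to mark the
strings by hand after each call. The following is only a minimal sketch,
not a fix for sub() itself. mark_utf8() is a hypothetical helper name, and
it assumes that any element with an "unknown" Encoding and valid UTF-8
bytes really is UTF-8, which can misfire for native-encoded strings whose
bytes happen to form valid UTF-8.

```r
# Hypothetical helper (not part of base R): mark as UTF-8 every element whose
# Encoding is "unknown" and whose bytes are valid UTF-8. Pure ASCII elements
# keep the "unknown" mark, because R never marks ASCII strings.
mark_utf8 <- function(x) {
  unmarked <- Encoding(x) == "unknown" & validUTF8(x)
  Encoding(x[unmarked]) <- "UTF-8"
  x
}

foo <- c("a", "b")
foo <- mark_utf8(sub("a", "\u00e4", foo))
foo <- mark_utf8(sub("b", "\u00f6", foo))

print(Encoding(foo))
print(charToRaw(foo[1]))  # raw bytes of the first element
```

Inspecting the raw bytes with charToRaw() alongside Encoding() shows
whether an element holds UTF-8 bytes without the matching mark, which is
the situation that leads to garbled output in a non-UTF-8 locale.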

The output of sessionInfo() for both test systems follows.

> sessionInfo()
R version 3.5.1 Patched (2018-11-28 r75713)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.1252  LC_CTYPE=Finnish_Finland.1252
[3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

> sessionInfo()
R Under development (unstable) (2018-12-08 r75801)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /home/mikko/root_R-devel-r75801/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fi_FI.UTF-8        LC_COLLATE=fi_FI.UTF-8    
 [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=fi_FI.UTF-8  
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] compiler_3.6.0

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Possible encoding bug in sub()

Martin Maechler
>>>>> Korpela Mikko (MML)
>>>>>     on Sat, 8 Dec 2018 18:42:30 +0000 writes:

    > I noticed that sub() gives unexpected results for the following test
    > case. In the test case, the (initial) input is ASCII but the
    > replacements are UTF-8. The first sub() produces an UTF-8 result with
    > an "unknown" Encoding. This makes the result garbled in Windows (no
    > UTF-8 locale there). The second sub() produces a correct result,
    > although for some reason it is converted to the native Encoding in
    > Windows.

    > I think the best result would be UTF-8 output marked as such.

    > foo <- c("a", "b")
    > foo <- sub("a", "\u00e4", foo)
    > print(Encoding(foo))
    > ## [1] "unknown" "unknown"
    > foo <- sub("b", "\u00f6", foo)
    > print(Encoding(foo))
    > ## [1] "unknown" "unknown" # Windows
    > ## [1] "unknown" "UTF-8"   # Linux
    > print(foo)
    > ## [1] "Ã¤" "ö"            # Windows
    > ## [1] "ä" "ö"             # Linux

I can confirm the problem on Windows,
also for a recent version of R-devel.

Why not file this as a proper bug report at R's bugzilla?
There's still no certainty that it will be fixed quickly, but
bug PRs there are less easily forgotten.

Martin




Re: Possible encoding bug in sub()

Korpela Mikko (MML)
Thanks for the confirmation. The bug report is now online at
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17509

- Mikko

-----Original Message-----
From: Martin Maechler [mailto:[hidden email]]
Sent: Monday, December 10, 2018 12:09 PM
To: Korpela Mikko (MML)
Cc: [hidden email]
Subject: Re: [Rd] Possible encoding bug in sub()

>>>>> Korpela Mikko (MML)
>>>>>     on Sat, 8 Dec 2018 18:42:30 +0000 writes:

    > I noticed that sub() gives unexpected results for the following test
    > case. In the test case, the (initial) input is ASCII but the
    > replacements are UTF-8. The first sub() produces an UTF-8 result with
    > an "unknown" Encoding. This makes the result garbled in Windows (no
    > UTF-8 locale there). The second sub() produces a correct result,
    > although for some reason it is converted to the native Encoding in
    > Windows.

    > I think the best result would be UTF-8 output marked as such.

    > foo <- c("a", "b")
    > foo <- sub("a", "\u00e4", foo)
    > print(Encoding(foo))
    > ## [1] "unknown" "unknown"
    > foo <- sub("b", "\u00f6", foo)
    > print(Encoding(foo))
    > ## [1] "unknown" "unknown" # Windows
    > ## [1] "unknown" "UTF-8"   # Linux
    > print(foo)
    > ## [1] "Ã¤" "ö"            # Windows
    > ## [1] "ä" "ö"             # Linux

I can confirm the problem on Windows,
also for a recent version of R-devel.

Why not file this as a proper bug report at R's bugzilla?
There's still no certainty that it will be fixed quickly, but bug PRs there are less easily forgotten.

Martin

