Encoding issue

Sebastien Bihorel
Hi,

I get different output when processing the same markdown files on 2 different Linux systems (one is a laptop running Linux Mint 18.3, the other is a production server running CentOS 7). I think this boils down to an encoding issue, but I am not sure whether it is a system-wide issue or an R issue. This is what I have so far.

I have a very small dummy html file (with the same md5sum on both systems) which contains only 3 characters. An "od -cx" call gives the same output on both systems:
0000000   r 342 200 231   s  \n
           e272    9980    0a73

The middle character is some form of single quote produced by the conversion of a ' character from markdown to html. Reading the same file on both systems and applying a gsub replacement gives very different results.
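For anyone who wants to check the bytes from within R rather than with od, here is a minimal sketch. It assumes the file holds exactly the six bytes shown in the dump (od -x prints 16-bit words, little-endian on x86, so e272 corresponds to the bytes 72 e2). The temporary file path is made up for illustration and stands in for test.html.

```r
# Recreate the 6-byte test file from the od dump (illustrative temporary path).
bytes <- as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a))
path <- tempfile(fileext = ".html")
writeBin(bytes, path)

# Read the file back as raw bytes, bypassing any locale-dependent decoding.
raw_in <- readBin(path, what = "raw", n = file.info(path)$size)
print(raw_in)

# Drop the trailing newline and declare the remaining bytes to be UTF-8.
s <- rawToChar(raw_in[-length(raw_in)])
Encoding(s) <- "UTF-8"

n_bytes <- nchar(s, type = "bytes")  # 5 bytes
n_chars <- nchar(s, type = "chars")  # 3 characters
code_points <- utf8ToInt(s)          # 8217 is U+2019
print(code_points)
```

If the byte count and the character count differ in this way on both machines, the file content itself is probably not what differs between them.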

####On my laptop
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] "r’s"
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] compiler_3.4.4

####On the server
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "râs"
> gsub('\\s{2,}', ' ', x)
[1] " "
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

(The overarching issue is that I have to use the production server for SOP reasons, so I cannot simply ignore the problem and use my laptop).

I would appreciate any suggestions on how to approach this issue.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Encoding issue

Ivan Krylov
On Mon, 5 Nov 2018 08:36:13 -0500 (EST)
Sebastien Bihorel <[hidden email]> wrote:

> [1] "râs"

Interesting. This is what I get if I decode the bytes 72 e2 80 99 73 0a
as latin-1 instead of UTF-8. It looks as if there are only three
characters, but there are actually five (plus the newline):

$ perl -CSD -Mcharnames=:full -MEncode=decode \
 -E'for (split //, decode latin1 => pack "H*", "72e28099730a")
 { say ord, " ", $_, " ", charnames::viacode(ord) }'
114 r LATIN SMALL LETTER R
226 â LATIN SMALL LETTER A WITH CIRCUMFLEX
128  PADDING CHARACTER
153  SINGLE GRAPHIC CHARACTER INTRODUCER
115 s LATIN SMALL LETTER S
10
 LINE FEED
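The same comparison can be sketched in R. This is only an illustration of the two interpretations of the bytes, not a claim about what the server actually does. The variable names are made up, and the trailing newline is left out.

```r
# The five bytes from the file, without the trailing newline.
bytes <- as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73))
s <- rawToChar(bytes)

# Interpretation 1: the bytes are UTF-8, giving three characters.
as_utf8 <- s
Encoding(as_utf8) <- "UTF-8"
n_utf8 <- nchar(as_utf8, type = "chars")

# Interpretation 2: every byte is one latin-1 character; convert that
# reading to UTF-8 so the resulting characters can be counted.
as_latin1 <- iconv(s, from = "latin1", to = "UTF-8")
n_latin1 <- nchar(as_latin1, type = "chars")

cat("characters when read as UTF-8:  ", n_utf8, "\n")
cat("characters when read as latin-1:", n_latin1, "\n")
```

The second count is larger because each of the three bytes e2 80 99 becomes a character of its own.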

Does it help if you explicitly specify the file encoding by passing
fileEncoding="UTF-8" argument to scan()?
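A sketch of reading the file with a declared encoding, for reference. scan() has two related arguments: fileEncoding= re-encodes the input into the native encoding while reading, whereas encoding= only marks the strings that were read as UTF-8 without converting them. The sketch below uses encoding=, since it does not depend on what the native encoding can represent. The temporary file is a stand-in for test.html.

```r
# Write the same six bytes as in the od dump to an illustrative temporary file.
path <- tempfile(fileext = ".html")
writeBin(as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a)), path)

# Read one item per line and mark the result as UTF-8 (no re-encoding happens).
x <- scan(path, what = "character", sep = "\n",
          encoding = "UTF-8", quiet = TRUE)

Encoding(x)                  # how R has marked the string
nchar(x, type = "chars")     # number of characters R sees
print(utf8ToInt(x))          # code points of those characters
```

Comparing Encoding(x) on the two machines, with and without the argument, is one way to see whether R treats the input differently.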

--
Best regards,
Ivan


Re: Encoding issue

Sebastien Bihorel

Hi Ivan,

0xe2 0x80 0x99 is the UTF-8 encoding of the Unicode character U+2019 'RIGHT SINGLE QUOTATION MARK', which would make sense in this context.

Using the encoding argument for the scan call does not change the outcome.

Looking at the server side a bit more, some colleagues pointed out that the "râs" display could be a side effect of an encoding setting in PuTTY (which I used to connect to the remote server). After changing the PuTTY display settings, I get the correct display "r’s"... However, that does not change the gsub result: I still get " ".
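Since the display problem and the gsub problem now look like separate issues, a few diagnostics could help narrow down the second one. The following is only a sketch. It builds the string from the known bytes instead of reading the file, and it compares the default regex engine with perl = TRUE (PCRE). Whether the PCRE variant behaves differently on the server is an assumption to be tested, not something established in this thread.

```r
# Build the string directly from the bytes in the file and mark it as UTF-8.
x <- rawToChar(as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73)))
Encoding(x) <- "UTF-8"

Encoding(x)         # "UTF-8"
validUTF8(x)        # TRUE
str(l10n_info())    # what R believes about the current locale

# Same pattern as in the original post, with the default engine and with PCRE.
default_re <- gsub("\\s{2,}", " ", x)
pcre_re <- gsub("\\s{2,}", " ", x, perl = TRUE)

# The string has no whitespace, so an unchanged result is the expected outcome.
print(identical(default_re, x))
print(identical(pcre_re, x))
```

If the first comparison is FALSE on the server while the second is TRUE, that would point at the default regex engine in that R build rather than at the file or the terminal.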

Sebastien
