Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

IAGO GINÉ VÁZQUEZ
I have a Chinese character in a data frame, but printing it outputs its Unicode code point instead of the character. Concretely, the character is 會 and the code point is U+6703. Tracing through the code, I arrive at the call

> base::format.default("會")

which prints

[1] "<U+6703>"

I do not know the extent of this behaviour, nor whether it persists in more recent versions of R.

Is it expected?

Thank you!

Iago

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Tomas Kalibera
If you are running this on Windows in an encoding where the character
cannot be represented (e.g. non-Chinese locale), then yes, this is
expected behavior.

On Unix systems where R can run in UTF-8 encoding (Linux, macOS), the
character will be formatted/displayed properly.

Best
Tomas
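
To check whether a session falls into the case described in the reply above, one can inspect the locale R is running in. The following is a minimal sketch using base R only; the helper name `describe_native_encoding` is hypothetical and not part of R.

```r
# Hypothetical helper (not part of R): summarise the locale and native
# encoding of the current session using base R only.
describe_native_encoding <- function() {
  info <- l10n_info()
  list(
    ctype = Sys.getlocale("LC_CTYPE"),  # locale name used for character handling
    utf8  = isTRUE(info[["UTF-8"]]),    # TRUE when the native encoding is UTF-8
    mbcs  = isTRUE(info[["MBCS"]])      # TRUE for multi-byte native encodings
  )
}

str(describe_native_encoding())
```

If `utf8` is `FALSE` and the locale is not a Chinese one, output such as `<U+6703>` from formatting is what the reply above describes as expected.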

Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

IAGO GINÉ VÁZQUEZ
But if I type
> "會"
the output is

[1] "會"

so seemingly it can be represented. Or am I wrong?

Best
Iago

Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Tomas Kalibera
In RGui you can print the string because RGui is a Windows Unicode
application (it uses UTF-16LE and bypasses the C runtime for strings).
But that is just the GUI: R itself, and hence also packages, uses the
current native encoding as defined by the C runtime. RGui makes sure R
gets the string in UTF-8, but as soon as you do anything even slightly
non-trivial, which includes formatting, the string is converted to the
current native encoding. Some R functions let you do certain things in
UTF-8 without conversion to the native encoding, but you would have to
read the documentation for each function very carefully. For practical
use, you need to either live with the misinterpretation of some
characters, use Windows in a locale where your characters can be
represented (e.g. a Chinese locale when working with Chinese strings),
or use Linux/macOS. On Linux/macOS the current native encoding can be
UTF-8, so there is no problem. On Windows, with the current toolchain
based on mingw, this is not possible.


Best
Tomas
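
The conversion described above can be illustrated with a minimal sketch in base R. Two things here are assumptions made for illustration: `latin1` stands in for a native encoding that cannot represent the character, and the helper name `survives_native` is hypothetical. The exact substitution text produced by `iconv()` may depend on the platform's iconv implementation.

```r
# The character from the thread, written as an escape so the script itself
# stays ASCII; the parser stores such strings as UTF-8.
x <- "\u6703"
utf8ToInt(x) == 0x6703L  # TRUE: the Unicode code point is U+6703

# Stand-in for a native encoding that cannot represent the character
# (latin1 is an assumption for illustration). With sub = "Unicode",
# non-convertible characters are replaced by a <U+xxxx> placeholder.
iconv(x, from = "UTF-8", to = "latin1", sub = "Unicode")

# Hypothetical helper: does a string survive a round trip through the
# session's native encoding? enc2native() and enc2utf8() are base R.
survives_native <- function(s) {
  identical(enc2utf8(enc2native(s)), enc2utf8(s))
}

survives_native("abc")  # TRUE: ASCII is representable in every locale
survives_native(x)      # depends on the session's native encoding
```

A check like `survives_native()` can be run on a column before formatting it, to find out in advance which strings would be altered in the current locale.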

Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Ray Donnelly-2
mingw-w64 is capable of processing UTF-8 (it can process bytes, after all).
Can you explain what you mean here? Would any other compiler on Windows not
suffer from this problem?


Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Tomas Kalibera
The problem is using UTF-8 as the current locale as understood by the C
runtime/C library. By default mingw uses msvcrt, which does not allow
UTF-8 as the current locale (via setlocale()). mingw has recently gained
the ability to build with UCRT, and I hope that one day we will be able
to use it, but it is not yet the default: msys2 does not yet use it for
its mingw_ packages, and we also need the external packages. Note that
R (CRAN, and also BIOC) provides binary versions of all packages for
Windows; these have to be built, and that requires all of their library
dependencies. All of those would have to be rebuilt with UCRT, which
will be a huge task. Fixing R on its own to support UTF-8 natively on
Windows when the C runtime allows it won't be hard, because R can
already do it on Unix; the problem is all the dependencies.

Tomas
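
Whether the C runtime accepts a UTF-8 locale can be probed from R, because `Sys.setlocale()` calls through to the C library's `setlocale()`. The following is a minimal sketch under stated assumptions: the helper name `accepts_utf8_locale` is hypothetical, and the candidate locale names are guesses, since valid names differ between platforms and C runtimes.

```r
# Hypothetical helper (not part of R): try to switch LC_CTYPE to a UTF-8
# locale and report whether the C runtime accepted it. The candidate names
# are assumptions; valid names differ between platforms and C runtimes.
accepts_utf8_locale <- function(candidates = c("en_US.UTF-8", "C.UTF-8",
                                               ".UTF-8")) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original setting on exit, whatever happens below.
  on.exit(suppressWarnings(Sys.setlocale("LC_CTYPE", old)), add = TRUE)
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the request fails.
    res <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(res) && isTRUE(l10n_info()[["UTF-8"]])) return(TRUE)
  }
  FALSE
}

accepts_utf8_locale()
```

Switching the locale inside a running session is meant here only as a probe; the original setting is restored before the function returns.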



Re: Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Ray Donnelly-2
Thanks. We build R for the Anaconda Distribution and are considering our
options around our Windows compilers, including the UCRT (and clang,
possibly from MSYS2, possibly from conda-forge, or a hybrid of some sort if
necessary).

