Unicode display problem with data frames under Windows

Unicode display problem with data frames under Windows

Richard Cotton-2
Here's a data frame with some Unicode symbols (set union, U+222A, and set intersection, U+2229).

d <- data.frame(x = "A \u222a B \u2229 C")

Printing this data frame under R 3.2.0 patched (r68378) and Windows 7, I see

d
##                  x
## 1 A <U+222A> B n C

Printing the column itself works fine.

d$x
## [1] A ∪ B ∩ C
## Levels: A ∪ B ∩ C

The encoding is correctly UTF-8.

Encoding(as.character(d$x))
## [1] "UTF-8"
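One way to rule out a data problem, independently of how the console renders the text, is to look at the underlying code points. The following is a minimal sketch using only base R; the variable names are introduced here for illustration and are not from the original post.

```r
# Illustrative check: confirm the string holds the intended code points,
# regardless of how any print method displays them.
s <- "A \u222a B \u2229 C"

# utf8ToInt() returns the Unicode code points of a single UTF-8 string
cp <- utf8ToInt(enc2utf8(s))

# U+222A (union) and U+2229 (intersection) should both be present
has_union        <- 0x222A %in% cp
has_intersection <- 0x2229 %in% cp

cat("union present:       ", has_union, "\n")
cat("intersection present:", has_intersection, "\n")
```

If both checks come back TRUE, the data are intact and any oddity is confined to the display step.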

Under Linux both forms of printing are fine for me.

I'm not quite sure whether I've missed a setting or if this is a bug, so

Am I doing something silly?
Can anyone else reproduce this?

--
Regards,
Richie

Learning R
4dpiecharts.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Unicode display problem with data frames under Windows

Ista Zahn
AFAIK this is the way it works on Windows. It has been discussed in several
places, e.g.
http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r
,
http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r
(both of these came up when I googled the subject line of your email).

Best,
Ista

Re: Unicode display problem with data frames under Windows

Duncan Murdoch-2
On 25/05/2015 11:37 AM, Ista Zahn wrote:
> AFAIK this is the way it works on Windows. It has been discussed in several
> places, e.g.
> http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r
> ,
> http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r
> (both of these came up when I googled the subject line of your email).

Yes, but it is a bug, just a hard one to fix.  It needs someone to
dedicate a serious amount of time to deal with it.

Since most of the people who tend to do that generally use systems in
UTF-8 locales where this isn't a problem, or don't use Windows, it is
languishing.

Duncan Murdoch

Re: Unicode display problem with data frames under Windows

Duncan Murdoch-2
On 25/05/2015 12:43 PM, Duncan Murdoch wrote:

> Yes, but it is a bug, just a hard one to fix.  It needs someone to
> dedicate a serious amount of time to deal with it.
>
> Since most of the people who tend to do that generally use systems in
> UTF-8 locales where this isn't a problem, or don't use Windows, it is
> languishing.

Oops, I meant to write "or don't use non-ASCII characters"; a UTF-8
locale already implies non-Windows.
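Whether a given session runs in a UTF-8 locale can be checked from within R. This is a small sketch with base functions only; the variable names are illustrative, and the example locale string in the comment is an assumption about a typical Windows setup rather than output taken from this thread.

```r
# Report whether the current session runs in a UTF-8 locale.
info  <- l10n_info()                 # list with MBCS, `UTF-8` and `Latin-1` flags
ctype <- Sys.getlocale("LC_CTYPE")   # e.g. "English_United States.1252" on Windows

in_utf8_locale <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:    ", ctype, "\n")
cat("UTF-8 locale:", in_utf8_locale, "\n")
```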

Duncan Murdoch

Re: Unicode display problem with data frames under Windows

Peter Meissner-3
On .05.2015, 18:43, Duncan Murdoch <[hidden email]> wrote:

> Yes, but it is a bug, just a hard one to fix.  It needs someone to  
> dedicate a serious amount of time to deal with it.
>
> Since most of the people who tend to do that generally use systems in  
> UTF-8 locales where this isn't a problem, or don't use Windows, it is  
> languishing.
>
> Duncan Murdoch


I understand that these problems are not easy to fix, but ...

I think that
"most of the people who tend to do that generally use systems in UTF-8
locales"
is a biased perception. Developers might tend to use Mac or Linux most
often. For others, Windows still is and probably will remain the OS most
often used. For most of them, switching to something else is a major hurdle.

What I often witness is that those supposedly non-existent Windows users
try to muddle through with numerous calls to Encoding(), iconv() and the
like, while never being sure whether the strange behavior is due to their
own lack of understanding, to Windows specifics, or to R. In the end they
either succeed with their muddling or give up, but do not change
the system.
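For readers unsure what those two functions actually do, here is a hedged sketch of the difference: Encoding() only changes the declared mark on a string, while iconv() re-encodes the bytes. The byte value is built by hand so the example does not depend on the encoding of the source file; all variable names are illustrative.

```r
# The single byte 0xE9 is "e with acute accent" in Latin-1.
x <- rawToChar(as.raw(0xE9))

Encoding(x)               # no encoding mark yet
Encoding(x) <- "latin1"   # declares the encoding; the bytes stay the same

# iconv() converts the bytes themselves
y <- iconv(x, from = "latin1", to = "UTF-8")

charToRaw(x)   # still the single byte e9
charToRaw(y)   # the two-byte UTF-8 sequence c3 a9
```

Declaring the wrong encoding with Encoding<- never repairs the bytes, which is one reason trial-and-error with these calls tends to be confusing.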

So whoever might attempt the Herculean task will be praised by thousands ;-)

Best, Peter


Re: Unicode display problem with data frames under Windows

Duncan Murdoch-2
On 25/05/2015 3:12 PM, Peter Meissner wrote:

> So whoever might attempt the Herculean task will be praised by thousands ;-)

I'm not sure we disagree.  R is a volunteer project, and the things that
get done are the things that someone volunteers to do.  But in this
particular case, the volunteer needs a lot of knowledge about R
internals to make progress, and there just aren't that many people like
that.  They are all "developers".

If you aren't one of those people, you need to motivate one of them to
volunteer to take this on.  I don't think a financial contribution would
work, but people do return favours:  so do something that makes one of
the developers' lives a lot easier, and then point out how this
particular bug is causing trouble for you, and maybe they'll choose to
return the favour.

Duncan Murdoch

Re: Unicode display problem with data frames under Windows

Richard Cotton-2
On 25 May 2015 at 19:43, Duncan Murdoch <[hidden email]> wrote:
>> http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r

> Yes, but it is a bug, just a hard one to fix.  It needs someone to dedicate
> a serious amount of time to deal with it.
>
> Since most of the people who tend to do that generally use systems in UTF-8
> locales where this isn't a problem, or don't use Windows, it is languishing.

Thanks for the link and the explanation of why the bug exists.

>> On May 25, 2015 9:39 AM, "Richard Cotton" <[hidden email]> wrote:
>>
>> > Here's a data frame with some Unicode symbols (set intersection and
>> > union).
>> >
>> > d <- data.frame(x = "A \u222a B \u2229 C")
>> >
>> > Printing this data frame under R 3.2.0 patched (r68378) and Windows 7, I
>> > see
>> >
>> > d
>> > ##                  x
>> > ## 1 A <U+222A> B n C

For future readers searching for a solution to this: you can get
correct printing by setting the LC_CTYPE category of the locale to
Chinese, Japanese, or Korean.

Sys.setlocale("LC_CTYPE", "Chinese")
## [1] "Chinese (Simplified)_People's Republic of China.936"

d
##            x
## 1 A ∪ B ∩ C
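Since changing LC_CTYPE affects the whole session, it may be preferable to switch it only for the duration of one call. The helper below is a sketch under that assumption; `with_ctype` is a hypothetical name, not a base R function, and "Chinese" is a Windows locale name that other platforms will most likely reject, which the example handles explicitly.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with a temporary
# LC_CTYPE setting and restore the previous setting afterwards.
with_ctype <- function(locale, expr) {
    old <- Sys.getlocale("LC_CTYPE")
    on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
    # Sys.setlocale() returns "" (with a warning) if the request is not honoured
    new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
    if (identical(new, "")) {
        stop("locale '", locale, "' is not available on this system")
    }
    force(expr)   # evaluated lazily, i.e. only after the locale was switched
}

d <- data.frame(x = "A \u222a B \u2229 C")

# A failure is reported as a message instead of aborting the script.
invisible(tryCatch(with_ctype("Chinese", print(d)),
                   error = function(e) message(conditionMessage(e))))
```

Note that the R documentation for Sys.setlocale() warns that changing the character set during a session may not work reliably, so this remains a workaround rather than a fix.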

--
Regards,
Richie

Re: Unicode display problem with data frames under Windows

Peter Meissner-3
On .05.2015, 09:01, Richard Cotton <[hidden email]> wrote:

> For future readers searching for a solution to this, you can get
> correct printing by setting the CTYPE part of the locale to
> Chinese/Japanese/Korean.
>
> Sys.setlocale("LC_CTYPE", "Chinese")
> ## [1] "Chinese (Simplified)_People's Republic of China.936"
>
> d
> ##            x
> ## 1 A ∪ B ∩ C
>


There is another workaround.

The problem with the character transformation on printing data frames
stems from the format.data.frame() call used within print.data.frame().
Defining your own class and print method that does not use that call
allows for correct printing in all locales.

Like this:


d <- data.frame(x = "A \u222a B \u2229 C")
d
##                  x
## 1 A <U+222A> B n C


class(d) <- c("unicode_df","data.frame")

# This is print.data.frame from base R with only two lines modified, see #old#.
# Because format.data.frame() is skipped, `digits` is accepted but no longer used.
print.unicode_df <- function(x, ..., digits = NULL, quote = FALSE,
                             right = TRUE, row.names = TRUE)
{
    n <- length(row.names(x))
    if (length(x) == 0L) {
        cat(sprintf(ngettext(n, "data frame with 0 columns and %d row",
            "data frame with 0 columns and %d rows", domain = "R-base"),
            n), "\n", sep = "")
    }
    else if (n == 0L) {
        print.default(names(x), quote = FALSE)
        cat(gettext("<0 rows> (or 0-length row.names)\n"))
    }
    else {
        #old# m <- as.matrix(format.data.frame(x, digits = digits,
        #old#     na.encode = FALSE))
        m <- as.matrix(x)
        if (!isTRUE(row.names))
            dimnames(m)[[1L]] <- if (identical(row.names, FALSE))
                rep.int("", n)
            else row.names
        print(m, ..., quote = quote, right = right)
    }
    invisible(x)
}


d
##              x
## [1,] A ∪ B ∩ C
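The same idea can be sketched more compactly when a full print method is not needed. The function below is a hypothetical helper (the name `print_df_raw` is made up for this example); it relies on the `rownames.force` argument of as.matrix() for data frames so that rows are labelled "1", "2", ... rather than "[1,]", "[2,]", ...

```r
# Hypothetical helper: print a data frame via its character matrix,
# skipping format.data.frame() altogether.
print_df_raw <- function(x, ...) {
    stopifnot(is.data.frame(x))
    # rownames.force = TRUE keeps the data frame's row names on the matrix
    m <- as.matrix(x, rownames.force = TRUE)
    print(m, quote = FALSE, right = TRUE, ...)
    invisible(x)
}

d <- data.frame(x = "A \u222a B \u2229 C")
print_df_raw(d)
```

As with the print method above, this trades the column-wise number formatting of print.data.frame() for unmangled text, so it is best suited to data frames that are mostly character data.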



