Problem comparing two strings

Björn Fisseler
Hello,

I'm struggling to compare two strings that come from different data
sets. The strings look identical: "Alexander Jäger"

But when I compare these strings: string1 == string2
the result is FALSE.

Looking at the raw bytes used to encode the strings, the results are
different:

string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72

string2 comes from the names of files on my machine (macOS); string1
comes from a data file (CSV, UTF-8 encoding).

It's obviously the umlaut "ä" in this example, which is encoded with two
bytes in string1 and three bytes in string2. The question is how to
change this. This problem makes it impossible to join the two data sets
based on the names. I already checked the settings on my machine:
Sys.getlocale() returns
"de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
Changing/forcing the encoding of the data didn't produce the results I
expected.

What else can I try?

Best regards

         Björn



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Problem comparing two strings

Ivan Krylov
On Mon, 18 Nov 2019 16:11:44 +0100
"Björn Fisseler" <[hidden email]> wrote:

> It's obviously the umlaut "ä" in this example which is encoded with
> two respectively three bytes. The question is how to change this?

Welcome to the wonderful world of Unicode-related problems! It is,
indeed, possible to represent the same glyph using either one
code point (LATIN SMALL LETTER A WITH DIAERESIS) or two code points
(LATIN SMALL LETTER A followed by COMBINING DIAERESIS). (Other
combinations of code points resulting in the same glyph are probably
also possible.)
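
To make the two representations concrete, here is a minimal base R
sketch. The variable names nfc and nfd are made up for illustration; the
strings are built from \u escapes so the example does not depend on how
this message itself was encoded.

```r
# Precomposed form: a single code point, U+00E4
nfc <- "Alexander J\u00e4ger"
# Decomposed form: "a" (U+0061) followed by COMBINING DIAERESIS (U+0308)
nfd <- "Alexander Ja\u0308ger"

# They print the same but compare as different
nfc == nfd                 # FALSE

# Raw bytes: c3 a4 in one string versus 61 cc 88 in the other
charToRaw(nfc)
charToRaw(nfd)

# Code points at the position of the umlaut (character 12)
utf8ToInt(nfc)[12]         # 228, i.e. U+00E4
utf8ToInt(nfd)[12:13]      # 97 776, i.e. U+0061 U+0308
```

The byte sequences match the ones shown in the original question: 16
bytes for the precomposed string and 17 for the decomposed one.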

What you are looking for is called "Unicode normalization" and it is
implemented in the stringi package, in functions stri_trans_nfc
(normalization: there are multiple normal forms to choose from but W3C
guidelines recommend NFC) and stri_compare / stri_cmp (test for
canonical equivalence).

See also: ?stringi::stri_cmp and https://stackoverflow.com/a/20684794
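
A rough sketch of how the functions named above might be used, assuming
the stringi package is installed. The two strings are rebuilt from \u
escapes here as stand-ins for the real data; stri_cmp_equiv and
stri_trans_isnfc are further stringi functions not mentioned above.

```r
library(stringi)  # assumption: stringi is installed

string1 <- "Alexander J\u00e4ger"    # composed form, as in the csv file
string2 <- "Alexander Ja\u0308ger"   # decomposed form, as in the file names

# Option 1: bring both strings to NFC, then compare as usual
same_after_nfc <- stri_trans_nfc(string1) == stri_trans_nfc(string2)
same_after_nfc             # TRUE

# Option 2: leave the strings untouched and ask stringi whether they are
# canonically equivalent (see ?stringi::stri_cmp for the details)
stri_cmp_equiv(string1, string2)

# Check which inputs are already in NFC
already_nfc <- stri_trans_isnfc(c(string1, string2))
already_nfc                # TRUE FALSE
```

Option 1 is probably the better fit when the strings are later used as
join keys, since the normalized values can be stored back into the data.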

--
Best regards,
Ivan


Re: Problem comparing two strings

Björn Fisseler
Thank you! That solved my problem!

Best

         Björn


Re: Problem comparing two strings

Duncan Murdoch-2
Characters like ä have two (or more) representations in Unicode:  a
single code point, or the code point for "a" followed by a code point
that says "add an umlaut".

If you want to compare strings, you need a consistent representation.
This is called normalizing the string.

There are several possible normalizations; for your purposes it doesn't
matter which one you use, as long as you use the same normalization for
both strings.  See
<https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html>
for details.

In R, several functions do the normalization for you. Two of them are
utf8::utf8_normalize and stringi::stri_trans_nfc.  So you'd want
something like

   library(utf8)
   string1 <- utf8_normalize(string1)
   string2 <- utf8_normalize(string2)
   string1 == string2  # Should now work as expected
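
Since the original goal was to join two data sets on the names, the same
idea can be applied to the key columns before merging. This is only a
sketch: the data frames from_csv and from_files and their columns are
hypothetical stand-ins, and it assumes the utf8 package is installed.

```r
library(utf8)  # assumption: utf8 is installed

# Hypothetical stand-ins for the two data sets
from_csv <- data.frame(name = "Alexander J\u00e4ger", score = 1,
                       stringsAsFactors = FALSE)
from_files <- data.frame(name = "Alexander Ja\u0308ger", file = "a.pdf",
                         stringsAsFactors = FALSE)

# Without normalization the keys differ, so nothing matches
before <- merge(from_csv, from_files, by = "name")
nrow(before)   # 0

# Normalize the key column in both data sets, then join again
from_csv$name   <- utf8_normalize(from_csv$name)
from_files$name <- utf8_normalize(from_files$name)

after <- merge(from_csv, from_files, by = "name")
nrow(after)    # 1
```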

Duncan Murdoch


Re: Problem comparing two strings

Peter Dalgaard-2
A version of this came up not long ago in a slightly different context (bug 17369: parse() doesn't honor unicode in NFD normalization).

The basic issue is that there are different Unicode normalizations (look it up...).

Briefly, accented characters exist in two forms: one as a single code point, and another as the base letter followed by the accent.

I.e., there is the single letter "ä" ("\u00e4"), and then there is "a\u0308", which is "a" followed by "combining diaeresis", which effectively puts a ¨ on top of the preceding character.

The utf8 package has code for normalizing strings.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]
