Error in substring: invalid multibyte string

Toby Hocking-2
Hi all,
I'm getting the following error from substring:

> substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
Error in substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100) :
  invalid multibyte string at '<e4>gel-A<6b>iyoshi'

Is that normal / intended? I've tried setting the Encoding/locale to
Latin-1/UTF-8 but that does not help. nchar gives me a similar error:

> nchar("<I>Jens Oehlschl\xe4gel-Akiyoshi")
Error in nchar("<I>Jens Oehlschl\xe4gel-Akiyoshi") :
  invalid multibyte string, element 1

I find it strange that substr/nchar give an error but regexpr works for
telling me the length:

> regexpr(".*", "<I>Jens Oehlschl\xe4gel-Akiyoshi")
[1] 1
attr(,"match.length")
[1] 29

Is that inconsistency normal/intended?

btw this example comes from our very own list:

> readLines("https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html")[28]
[1] "<I>Jens Oehlschl\xe4gel-Akiyoshi"

Best,
Toby

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: Error in substring: invalid multibyte string

Ivan Krylov
On Fri, 26 Jun 2020 15:57:06 -0700
Toby Hocking <[hidden email]> wrote:

>invalid multibyte string at '<e4>gel-A<6b>iyoshi'

>https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html

The server says that the text is UTF-8:

curl -sI \
 https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
 grep Content-Type
# Content-Type: text/html; charset=UTF-8

But it's not, at least not all of it. If you ask readLines to mark
the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the
mojibake and invalid multi-byte characters:

x <- readLines(
 'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
 encoding = 'latin1'
)[28]
substr(x, 1, 100)
# [1] "<I>Jens Oehlschlägel-Akiyoshi"

When encoding = 'latin1' is not specified, the returned lines have
"unknown" encoding, which explains the behaviour we observe. The substr()
implementation tries to interpret such strings according to the
multi-byte rules of the current C locale (using mbrtowc(3)). On my system
(and probably yours too, if it's GNU/Linux or macOS), the locale encoding
is UTF-8, and this Latin-1 string does not yield valid code points when
decoded as UTF-8.
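
A minimal sketch of the two usual ways out, assuming the bytes really are Latin-1 (the string literal below is the one from the original report; the variable names are only for illustration). Either declare the encoding without touching the bytes, or convert the bytes to UTF-8:

```r
# The byte 0xE4 is "ä" in Latin-1, but on its own it is not valid UTF-8.
x <- "<I>Jens Oehlschl\xe4gel-Akiyoshi"

# Option 1: declare the existing bytes to be Latin-1 (no bytes change).
y <- x
Encoding(y) <- "latin1"
nchar(y)                    # 29 characters
substr(y, 1, 100)           # no error once the encoding is declared

# Option 2: convert the bytes from Latin-1 to UTF-8.
z <- iconv(x, from = "latin1", to = "UTF-8")
nchar(z)                    # 29 characters
nchar(z, type = "bytes")    # 30 bytes, because "ä" takes two bytes in UTF-8
```

Option 1 is what readLines(..., encoding = 'latin1') does for each line it returns; option 2 is useful when downstream code expects UTF-8.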

--
Best regards,
Ivan

Re: Error in substring: invalid multibyte string

Toby Hocking-2
Thanks for the quick response Ivan. readLines with encoding='latin1' works
for me (on Ubuntu).

However I was more concerned with the inconsistency in results between
substr and regexpr. I was expecting that if one of them errors because of
an unknown encoding then the other should as well. Even better, if regexpr
works, why shouldn't substr work as well?

Incidentally the analogous stringi function stri_sub works fine in this
case:

> stringi::stri_sub("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
[1] "<I>Jens Oehlschl\xe4gel-Akiyoshi"

But the stringi analog to nchar gives a similar warning:

> stringi::stri_length("<I>Jens Oehlschl\xe4gel-Akiyoshi")
[1] NA
Warning message:
In stringi::stri_length("<I>Jens Oehlschl\xe4gel-Akiyoshi") :
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()


Re: Error in substring: invalid multibyte string

Tomas Kalibera
From the user's (or package author's) point of view, all strings should
always be valid in their declared encoding. If they are not, the result of
string operations is undefined: it may be an error or a warning, but it may
also be a silently produced correct or incorrect result. There are R
functions that check whether a string is valid. In this example, the string
was invalid in its declared encoding.
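
As a rough illustration of such checks: base R provides validUTF8(), which ignores any declared encoding and tests the raw bytes, and validEnc(), which tests against the declared (or native) encoding and is therefore locale-dependent. The helper assert_utf8() below is hypothetical, not part of R, and assumes the caller intends all strings to be UTF-8:

```r
# Hypothetical helper: fail early instead of relying on undefined behaviour.
# Assumption: every element of x is meant to be valid UTF-8.
assert_utf8 <- function(x) {
  bad <- !validUTF8(x)
  if (any(bad)) {
    stop("invalid UTF-8 in element(s): ", paste(which(bad), collapse = ", "))
  }
  invisible(x)
}

x <- c("plain ASCII", "Oehlschl\xe4gel")
validUTF8(x)    # TRUE FALSE
validEnc(x)     # result depends on the session locale

res <- try(assert_utf8(x), silent = TRUE)
inherits(res, "try-error")    # TRUE: the second element is rejected
```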

From the viewpoint of the R implementation (or of external software), some
operations, such as substring, can be carried out in a well-defined way
even on strings with invalid characters, or with characters invalid in
specific ways, usually only in some encodings (e.g. UTF-8); the
implementation is then more complicated. Other operations can't be well
defined on such strings.
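
One way to see an operation that stays well defined on an invalid string is to ask base R explicitly for byte-level semantics. This is only a sketch of that idea, not a description of how substr() or stri_sub() are implemented:

```r
x <- "<I>Jens Oehlschl\xe4gel-Akiyoshi"

# Counting bytes needs no decoding, so it works on invalid input.
nchar(x, type = "bytes")          # 29

# Matching with useBytes = TRUE treats the input as a byte sequence.
m <- regexpr(".*", x, useBytes = TRUE)
attr(m, "match.length")           # 29

# A byte-wise "substring": keep the 16 bytes before the offending 0xE4.
bytes <- charToRaw(x)
rawToChar(bytes[1:16])            # "<I>Jens Oehlschl"
```

Byte offsets and character offsets coincide here only because every character before the 0xE4 byte is ASCII; in general a byte-wise cut can split a multi-byte character.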

It may seem that it would make sense to ban all invalid strings (not allow
their creation) so as not to mask errors like the one you have encountered,
but it is sometimes better for debugging to be able to include invalid
strings in error and diagnostic messages. Moreover, some systems support
invalid strings in some operations because they may appear in file
names. On Windows, file names may include unpaired UTF-16 surrogates,
which can't be represented in UTF-8. Some systems allow representing
invalid strings in a custom way that is a valid string but preserves the
information, though only in some encodings (e.g. UTF-8).

So differences in how invalid strings are treated by different R
functions are to be expected. The same applies to differences with respect
to external software. Some may be optimized for UTF-8 and support invalid
strings in more cases (R does not support substring on invalid strings);
others may have bugs, or may intentionally skip validity checks when
those are perceived as too slow for a given operation.

Best
Tomas

