Quantcast

[PATCH] Improve utf8clen and remove utf8_table4

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH] Improve utf8clen and remove utf8_table4

Sahil Kang
Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.

Before this patch, the utf8clen function would return either:
     * 1 if `c' was an ascii character or a utf8 continuation byte
     * An int in the range [2, 6] indicating the byte length of the utf8
character

With this patch, the utf8clen function will now return either:
     * -1 if `c' is not a valid utf8 start byte
     * The byte length of the utf8 character (the number of leading 1's,
really)

I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.

Sahil

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

patch.diff (1K) Download Attachment
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Improve utf8clen and remove utf8_table4

Duncan Murdoch-2
On 19/03/2017 2:31 AM, Sahil Kang wrote:

> Given a char `c' which should be the start byte of a utf8 character,
> the utf8clen function returns the byte length of the utf8 character.
>
> Before this patch, the utf8clen function would return either:
>      * 1 if `c' was an ascii character or a utf8 continuation byte
>      * An int in the range [2, 6] indicating the byte length of the utf8
> character
>
> With this patch, the utf8clen function will now return either:
>      * -1 if `c' is not a valid utf8 start byte
>      * The byte length of the utf8 character (the number of leading 1's,
> really)
>
> I believe returning -1 for continuation bytes makes utf8clen less error
> prone.
> The utf8_table4 array is no longer needed and has been removed.

utf8clen is used internally by R in more than a dozen places, and is
likely used in packages as well.  Have you checked that this change in
semantics won't break any of those uses?

Duncan Murdoch

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Improve utf8clen and remove utf8_table4

Sahil Kang
Some of the code that uses utf8clen checks the validity of the utf8
string before making the call.
However, there were some hairy areas where I felt that the new semantics
may cause issues (if not now, then in future changes).

I've attached two patches:
     * new_semantics.diff keeps the new semantics and updates those
hairy areas above.
     * old_semantics.diff maintains the old semantics (return 1 even for
continuation bytes).

I don't think the new semantics will cause issues, especially with the
updates, but we can err on the side of caution and keep the old
semantics. I feel that the new semantics provide a clearer interface
though (the function expects a start byte and should return an error if
a start byte is not supplied).
In either case, the utf8_table4 array has been removed.

Sahil

On 03/19/2017 05:38 AM, Duncan Murdoch wrote:

> On 19/03/2017 2:31 AM, Sahil Kang wrote:
>> Given a char `c' which should be the start byte of a utf8 character,
>> the utf8clen function returns the byte length of the utf8 character.
>>
>> Before this patch, the utf8clen function would return either:
>>      * 1 if `c' was an ascii character or a utf8 continuation byte
>>      * An int in the range [2, 6] indicating the byte length of the utf8
>> character
>>
>> With this patch, the utf8clen function will now return either:
>>      * -1 if `c' is not a valid utf8 start byte
>>      * The byte length of the utf8 character (the number of leading 1's,
>> really)
>>
>> I believe returning -1 for continuation bytes makes utf8clen less error
>> prone.
>> The utf8_table4 array is no longer needed and has been removed.
>
> utf8clen is used internally by R in more than a dozen places, and is
> likely used in packages as well.  Have you checked that this change in
> semantics won't break any of those uses?
>
> Duncan Murdoch
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

old_semantics.diff (1K) Download Attachment
new_semantics.diff (6K) Download Attachment
Loading...