A question about the API mkchar()


A question about the API mkchar()

Fán Lóng
Hi guys,


I've got a question about the API mkchar(). I have run into some difficulty
passing a UTF-8 string to mkchar() in R 2.7.0.



I was intending to pass a UTF-8 string str_jan (some Japanese
characters such as ふ, whose utf-8 code is E381B5) to the R API SEXP
mkChar(const char *name); we only need to create the SEXP from the
string that was passed in.



Unfortunately, I found that when passing the variable str_jan, R
automatically converts str_jan according to the current locale
setting, so the function works correctly only in the English locale;
under other locales, such as Japanese or Chinese, the string is
converted incorrectly. As a matter of fact, that UTF-8 code is
already a Unicode string and doesn't need to be converted at all.



I also tried using SEXP Rf_mkCharCE(const char *, cetype_t);,
passing CE_UTF8 as the cetype_t argument, but the result was
worse: it returned the result as a UCS code, a kind of Unicode
representation on the Windows platform.



All I want is a SEXP object containing the original UTF-8
string, no matter what locale is currently set. What can I do?





Thanks,

Long

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: A question about the API mkchar()

Simon Urbanek
On Oct 28, 2008, at 6:26 , Fán Lóng wrote:

> Hi guys,
>

Hey guy :)


> I've got a question about the API mkchar(). I have met some  
> difficulty in parsing utf-8 string to mkchar() in R-2.7.0.
>

There is no mkchar() in R. Did you perhaps mean mkChar()?


> I was intending to parse an utf-8 string str_jan (some Japanese
> characters such asふ, whose utf-8 code is E381B5

There is no such "UTF-8" code. I'm not sure if you meant Unicode, but  
that would be \u3075 (Hiragana hu) for that character. The UTF-8  
encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if  
that's what you meant.


> ) to R API SEXP
> mkChar(const char *name) , we only need to create the SEXP using the
> string that we parsed.
>
>
>
> Unfortunately, I found when parsing the variable str_jan, R will
> automatically convert the str_jan according to the current locale
> setting,

That is not true - it will be kept as-is regardless of the encoding.
Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE); no
conversion takes place when the string is created, but you have told R
that it is in the native encoding. If that is not true (which in your
case it probably isn't), all bets are off since you're lying to R ;).


> so only in the English locale could the function work correctly,  
> under other locale, such as Japanese or Chinese, the string will be  
> convert incorrectly.

That is clearly nonsense, since the encoding has nothing to do with
the locale language itself (Japanese, Chinese, ...). We are talking
about the encoding (note that both English and Japanese locales can
use UTF-8 encoding, but don't have to). I think you'll need to get the
concepts right here - for each string you must define the encoding in
order to be able to reproduce the Unicode sequence that the string
represents. At this point it has nothing to do with the language.


> As a matter of fact, those utf-8 code already is Unicode string, and  
> don't need to be converted at all.
>
> I also tried to use the SEXP Rf_mkCharCE(const char *, cetype_t);,  
> Parsing the CE_UTF8 as the argument of cetype_t, but the result is
> worse. It returned the result as ucs code, an kind of Unicode under  
> windows platform.
>

Well, that's exactly what you want, isn't it? The string is correctly
flagged as UTF-8, so R is finally able to find out what exactly is
represented by that string. However, your locale apparently doesn't
support such characters, so they cannot be displayed. If you use a
locale that supports them, it works just fine; for example, in a locale
with SJIS encoding R will still know how to convert the string from
UTF-8 to SJIS *for display*. The actual string is not touched.

Here is a small piece of code that shows you the difference between  
native encoding and UTF8-strings:

#include <R.h>
#include <Rinternals.h>

SEXP me() {
   const char c[] = { 0xe3, 0x81, 0xb5, 0 };
   SEXP a = allocVector(STRSXP, 2);
   PROTECT(a);
   SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE));
   SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));
   UNPROTECT(1);
   return a;
}

In a UTF-8 locale it doesn't matter:

ginaz:sandbox$ LANG=ja_JP.UTF-8 R
 > .Call("me")
[1] "ふ" "ふ"

But in any other, let's say SJIS, it does:

ginaz:sandbox$ LANG=ja_JP.SJIS R
 > .Call("me")
[1] "縺オ" "ふ"

Note that the first string is wrong, because we have supplied UTF-8  
encoding but the current one is SJIS. The second one is correct since  
we told R that it's UTF-8 encoded.

Finally, if the character cannot be displayed in the given encoding:

ginaz:sandbox$ LANG=en_US.US-ASCII R
 > .Call("me")
[1] "\343\201\265" "<U+3075>"

The first one is wrong again, since it's not flagged as UTF-8, but the
second one is exactly as expected - Unicode 3075, which is the Hiragana
"hu". It doesn't exist in US-ASCII, so the Unicode designation is all
you can display.


> All I want to get is just a SEXP object containing the original  
> utf-8 string, no matter what locale is set currently. Normally what  
> can I do?
>

mkCharCE(X, CE_UTF8);

Cheers,
Simon


Re: A question about the API mkchar()

王永智
 Hi, Simon
 

Thanks for your detailed explanation of mkCharCE.

Concerning the UTF-8 encoding, mkCharCE(X, CE_UTF8) is the correct way to pass a Unicode string.

However, I have met another problem:

My program is intended to read the content of a text file r.tmp, which is encoded in UTF-8. After reading it, every line is sent to another C function ext_show(const char** text, int* length, int* errLevel) for further handling. Attached is the text file "r.tmp".

 I tried to use the following R code to accomplish the process:

checkoutput <- scan("r.tmp",
                    what = 'character',
                    blank.lines.skip = FALSE,
                    sep = '\n',
                    skip = 0,
                    quiet = TRUE,
                    encoding = "unknown")

lines <- length(checkoutput)
print(checkoutput)

err <- 0L  # error level, to be filled in by ext_show
for (i in 1:lines) {
    inputstring <- checkoutput[i]
    out <- .C('ext_show', as.character(inputstring),
              as.integer(nchar(inputstring)),
              as.integer(err),
              PACKAGE = "mypkg")
}

 

 

I don't know why, but if I type the commands in the R GUI environment, the Japanese characters are shown correctly. Also, if I sink inputstring into another text file, the content of that file is written correctly as well.

But if I use the above code to pass inputstring into the function ext_show, the string has been changed by the time it arrives inside ext_show().

My current environment is Windows XP, R 2.7.0; the R encoding is "UTF-8":

> getOption("encoding")
[1] "UTF-8"

> Sys.getlocale()
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"


Since the current encoding is UTF-8, I don't think the Chinese locale will prevent the correct result.

The function ext_show is defined as follows:

    void ext_show(const char** text,
                  int* length,
                  int* errLevel)
    {
        *errLevel = LoadLib();
        int real_length = strlen(*text);
        if (LOAD_SUCCESS == *errLevel)
            *errLevel = ShowInScreen(*text, real_length);
    }

I am new to R programming and not very familiar with encoding handling in R; I wonder whether it is necessary to convert the encoding of inputstring before passing it to ext_show().

 Many Thanks!

Joey

 
On 2008-10-28, "Simon Urbanek" <[hidden email]> wrote:


Re: A question about the API mkchar()

Prof Brian Ripley
1) 2.7.0 is rather old, and you were asked to update your R before
posting.

2) No file was attached.  But how to handle encodings is covered in the 'R
Internals' manual.  This is a tricky, advanced topic in C-level R
programming.  It is your responsibility, not ours, to get yourself up to
the level of understanding required.  Sorry, but it is not reasonable to
expect a personal tutorial in this forum.

On Mon, 3 Nov 2008, 王永智 wrote:

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595