build package with unicode (farsi) strings

build package with unicode (farsi) strings

Faridedin Cheraghi
Hi,

I have an R script file with Persian letters in it, defined as a variable:

#' @export
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')

I have specified the encoding field in my DESCRIPTION file of my package.

...
Encoding: UTF-8
...

I also included Sys.setlocale(locale="Persian") in my .Rprofile, so it is
executed when R CMD is called. However, after a BUILD and INSTALL, when I
access the variable from the package, the characters are not printed
correctly:
> futils::letters_fa
 [1] "<d8><a7><d9><84><d9><81>" "<d8><a8>"                 "<d9><be>"
           "<d8><aa>"                 "<d8><ab>"
 [6] "<d8><ac>"                 "<da><86>"                 "<d8><ad>"
           "<d8><ae>"                 "<d8><b1>"
[11] "<d8><b2>"                 "<d8><af>"
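
For readers following along, here is a minimal sketch for checking what such output contains. It only assumes base R; the variable name `x` is illustrative. The bytes shown in angle brackets above are the same bytes that UTF-8 uses for these letters, which can be confirmed like this:

```r
# The first element, written with ASCII-only \u escapes
x <- "\u0627\u0644\u0641"

# Strings built from \u escapes are marked as UTF-8
Encoding(x)

# Raw bytes of the string: d8 a7 d9 84 d9 81, the same pairs
# that appear as <d8><a7><d9><84><d9><81> in the printed output
charToRaw(x)
```

This does not explain why the console falls back to byte escapes; it only shows that the bytes themselves match the UTF-8 encoding of the letters.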


thanks
Farid

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: build package with unicode (farsi) strings

Thierry Onkelinx
Dear Farid,

Try using the ASCII notation. letters_fa <- c("\u0627", "\u0641"). The full
code table is available at https://www.utf8-chartable.de

Best regards,



ir. Thierry Onkelinx


Re: build package with unicode (farsi) strings

Ista Zahn
On Thu, Aug 30, 2018 at 3:11 AM Thierry Onkelinx
<[hidden email]> wrote:
>
> Dear Farid,
>
> Try using the ASCII notation. letters_fa <- c("\u0627", "\u0641").

... as recommended in the manual:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Encoding-issues
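
As a rough aid for applying that advice, the sketch below lists the lines of a source file that still contain non-ASCII bytes. The helper name `non_ascii_lines` is hypothetical and not part of any package; the `tools` package also ships `showNonASCIIfile()` for a similar purpose.

```r
# Hypothetical helper: report which lines of a file contain bytes above 0x7F.
non_ascii_lines <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  has_high_byte <- vapply(
    lines,
    function(l) any(as.integer(charToRaw(l)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  which(has_high_byte)
}

# Usage on a throwaway file: line 2 holds a raw UTF-8 letter,
# line 3 holds the same letter as an ASCII-only escape.
tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 'abc'", "y <- '\u0628'", "z <- '\\u0628'"), tmp, useBytes = TRUE)
non_ascii_lines(tmp)
```

Only the line with the literal Persian letter is reported; the escaped form is plain ASCII in the file.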

Best,
Ista

Re: build package with unicode (farsi) strings

hadley wickham
In reply to this post by Thierry Onkelinx
On Thu, Aug 30, 2018 at 2:11 AM Thierry Onkelinx
<[hidden email]> wrote:
>
> Dear Farid,
>
> Try using the ASCII notation. letters_fa <- c("\u0627", "\u0641"). The full
> code table is available at https://www.utf8-chartable.de

It's a little easier to do this with code:

letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
writeLines(stringi::stri_escape_unicode(letters_fa))
#> \u0627\u0644\u0641
#> \u0628
#> \u067e
#> \u062a
#> \u062b
#> \u062c
#> \u0686
#> \u062d
#> \u062e
#> \u0631
#> \u0632
#> \u062f
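
Assembled from the escapes printed above, the definition in the package source could look like the sketch below. The roxygen tag is copied from the original post; the layout is just one way to write it.

```r
#' @export
letters_fa <- c(
  "\u0627\u0644\u0641", "\u0628", "\u067e", "\u062a",
  "\u062b", "\u062c", "\u0686", "\u062d",
  "\u062e", "\u0631", "\u0632", "\u062f"
)

# Quick sanity checks: twelve elements, the first one three characters long
length(letters_fa)
nchar(letters_fa[1])
```

The file then contains only ASCII characters, while the strings themselves are still the Persian letters at run time.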

Hadley

--
http://hadley.nz


Re: build package with unicode (farsi) strings

Faridedin Cheraghi
Thank you all for your valuable insights. The most viable workaround is a modification of Hadley's line of code:



library(magrittr)  # provides the %>% pipe

stringi::stri_escape_unicode(letters_fa) %>%
  paste0("'", ., "'", collapse = ",") %>%
  paste0("c(", ., ")")



The output string can then be copied and pasted without manual editing. However, imagine having to do this for all of the English strings you write daily! That would not be very productive, would it?
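
For anyone without stringi or magrittr installed, the same idea can be sketched in base R. The function names `escape_non_ascii` and `as_r_literal` are made up for this illustration, and the sketch assumes the strings contain no double quotes, backslashes or NA values.

```r
# Hypothetical helper: replace every non-ASCII character with a \u or \U escape.
escape_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)                         # Unicode code points
    chars <- intToUtf8(cp, multiple = TRUE)    # one string per character
    out <- ifelse(
      cp < 128L,
      chars,                                   # keep ASCII as-is
      ifelse(cp > 0xFFFF,
             sprintf("\\U%08x", cp),           # outside the basic plane
             sprintf("\\u%04x", cp))
    )
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: build a c(...) literal that can be pasted into source.
# Assumption: no element contains a double quote or a backslash.
as_r_literal <- function(x) {
  paste0("c(", paste0('"', escape_non_ascii(x), '"', collapse = ", "), ")")
}

cat(as_r_literal(c("\u0627\u0644\u0641", "\u0628")), "\n")
```

Parsing the printed literal gives back the original strings, so it can be pasted over the non-ASCII definition.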



I think R deserves better support for internationalization. I know this implies fundamental revisions to the code to avoid the unnecessary conversion to the (OS) native locale, i.e. reading and writing directly as Unicode.



Farid


