Please guide -- UTF-8 locale setting fails on Windows on writing


Please guide -- UTF-8 locale setting fails on Windows on writing

Sunny Singha
Hi,
I think I'm experiencing an issue with the system locale. I have
exported data frames, gathered from various social media platforms
(Facebook, Twitter, G+, etc.), to '.csv' files.

Many of the columns contain strings formatted like the one below:
"<U+0645><U+062D><U+0645><U+062F>
<U+0627><U+0644><U+0633><U+0648><U+0627><U+062D>"

As expected with social media data, and as I have confirmed, these are
strings in different languages.
Platform details are provided at the end of this mail. The OS locale is
set to English (United States), hence the R locale is
'English_United States.1252'.

I attempted to change it to UTF-8 but received the warning message below:

Warning message:
In Sys.setlocale("LC_ALL", "UTF-8") :
  OS reports request to set locale to "UTF-8" cannot be honored
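
The warning is consistent with Windows builds of R from this period, which, as far as I know, do not accept "UTF-8" as a locale name (native UTF-8 support on Windows arrived only in much later R releases). A minimal sketch for inspecting what the session actually uses; try_set_locale is a hypothetical helper name, not part of R:

```r
# Report the character-type locale and whether the native encoding is UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
info  <- l10n_info()
cat("LC_CTYPE:", ctype, "\n")
cat("Native encoding is UTF-8:", info[["UTF-8"]], "\n")

# Hypothetical helper: attempt a locale change without printing the warning.
# Sys.setlocale() returns an empty string when the OS refuses the request,
# so the helper returns FALSE in that case and TRUE otherwise.
try_set_locale <- function(locale, category = "LC_CTYPE") {
  res <- suppressWarnings(Sys.setlocale(category, locale))
  isTRUE(nzchar(res))
}
```

Calling try_set_locale("UTF-8") on the setup described here would be expected to return FALSE, matching the warning above.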


I have gone through the pages below but found no resolution so far:
--- http://stackoverflow.com/questions/20571147/how-to-set-unicode-locale-in-r
--- https://stat.ethz.ch/pipermail/r-devel/2013-November/067940.html
--- http://stackoverflow.com/questions/19877676/write-utf-8-files-from-r
--- https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
--- http://withr.me/configure-character-encoding-for-r-under-linux-and-windows/

I'm not sure whether the issue arises while reading/extracting the data
from the social media platforms or while writing/exporting to a Windows
directory, but I don't see a similar issue on my personal Mac. I need
some clarification here.

How can I export the data just as I see it on the web? Please guide.

Regards,
Sunny

Platform I'm using:
Operating System : Windows 7 Professional SP1
R version details:
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          2.3
year           2015
month          12
day            10
svn rev        69752
language       R
version.string R version 3.2.3 (2015-12-10)
nickname       Wooden Christmas-Tree

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Please guide -- UTF-8 locale setting fails on Windows on writing

Milan Bouchet-Valat
On Monday 28 March 2016 at 19:16 +0530, Sunny Singha wrote:

> Hi,
> I think I'm experiencing an issue with the system locale. I have
> exported data frames, gathered from various social media platforms
> (Facebook, Twitter, G+, etc.), to '.csv' files.
>
> Many of the columns contain strings formatted like the one below:
> "<U+0645><U+062D><U+0645><U+062F>
> <U+0627><U+0644><U+0633><U+0648><U+0627><U+062D>"
>
> As expected with social media data, and as I have confirmed, these are
> strings in different languages.
> Platform details are provided at the end of this mail. The OS locale is
> set to English (United States), hence the R locale is
> 'English_United States.1252'.
>
> I attempted to change it to UTF-8 but received the warning message below:
>
> Warning message:
> In Sys.setlocale("LC_ALL", "UTF-8") :
>   OS reports request to set locale to "UTF-8" cannot be honored
You don't need to set the locale. Just pass an appropriate value (e.g.
"UTF-8") to read.csv() or write.csv()'s fileEncoding argument.

You also didn't tell us what program you used to read these files. Some
might guess the encoding incorrectly, or require you to choose it
manually.
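
A minimal sketch of that suggestion, assuming a session whose native encoding can represent the text (for example a UTF-8 locale). The data frame and the temporary file path are made up for illustration. In a session whose native encoding cannot represent the characters, R may still substitute <U+XXXX> escapes on output:

```r
# Illustrative data: an Arabic string built from Unicode escapes, so the
# script itself stays ASCII-only.
df <- data.frame(
  name = "\u0645\u062d\u0645\u062f",
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")

# Ask for the file to be written as UTF-8, whatever the locale is called.
write.csv(df, file = path, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back, declaring the same encoding.
back <- read.csv(path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(identical(back$name, df$name))
```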


Regards


Re: Please guide -- UTF-8 locale setting fails on Windows on writing

Sunny Singha
Milan,
OK, let me take the case of Facebook. I used the Rfacebook package
to get posts with getPost(), which returns a list() of data frames
(post, comments, likes).

Let me demonstrate two cases, one write and one read, as you suggested.
Case 1:
Let's say one of the Facebook comments has the string value below, in
Chinese:
"世界餐福事工 - 餐廳員工沒精打采 老是打盤子"

On the R console I assign the above string to a variable:
x <- "世界餐福事工 - 餐廳員工沒精打采 老是打盤子"
and write it as below:
write.csv(x, file='x.csv', row.names=FALSE, fileEncoding='UTF-8')
I get this string in the file:
""<U+4E16><U+754C><U+9910><U+798F><U+4E8B><U+5DE5> -
<U+9910><U+5EF3><U+54E1><U+5DE5><U+6C92><U+7CBE><U+6253><U+91C7> "
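
One commonly cited workaround for this kind of output, offered as a sketch rather than a confirmed fix for this case, is to convert the strings to UTF-8 explicitly and have R write the bytes untouched, so that nothing passes through the native encoding. The example uses writeLines() on a character vector; the path is a temporary file chosen for illustration:

```r
# The first part of the Case 1 string, written with \u escapes so the
# example does not depend on the encoding of the script file.
x <- "\u4e16\u754c\u9910\u798f\u4e8b\u5de5"

path <- tempfile(fileext = ".txt")

# enc2utf8() guarantees UTF-8 bytes; useBytes = TRUE writes them as they
# are instead of re-encoding them to the native code page.
con <- file(path, open = "w")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Inspect the raw bytes on disk: each of these characters takes three
# bytes in UTF-8, followed by the line ending.
bytes <- readBin(path, what = "raw", n = 100L)
print(bytes)
```

For a data frame, the same idea would need the CSV lines to be assembled as strings first, which this sketch does not attempt.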

Case 2:
I create a Notepad file 'x.txt', save the Chinese string
"世界餐福事工 - 餐廳員工沒精打采 老是打盤子" in it, and read it as below:
read.table('x.txt', fileEncoding='UTF-8')
I get the output below:

  V1
1  ?
Warning messages:
1: In read.table("x.txt", fileEncoding = "UTF-8") :
  invalid input found on input connection 'x.txt'
2: In read.table("x.txt", fileEncoding = "UTF-8") :
  incomplete final line found by readTableHeader on 'x.txt'

The above was for demonstration. I'm in fact reading extracted social
media data, which ultimately comes through the httr package somewhere
and is returned as data frames.
I'm not sure how I should handle this on Windows, as I don't observe
this behavior on the Mac, where the system locale is set to 'en_US.UTF-8'.

Regards,
Sunny





Re: Please guide -- UTF-8 locale setting fails on Windows on writing

Milan Bouchet-Valat
On Monday 28 March 2016 at 20:12 +0530, Sunny Singha wrote:

> Milan,
> OK, let me take the case of Facebook. I used the Rfacebook package
> to get posts with getPost(), which returns a list() of data frames
> (post, comments, likes).
>
> Let me demonstrate two cases, one write and one read, as you suggested.
> Case 1:
> Let's say one of the Facebook comments has the string value below, in
> Chinese:
> "世界餐福事工 - 餐廳員工沒精打采 老是打盤子"
>
> On the R console I assign the above string to a variable:
> x <- "世界餐福事工 - 餐廳員工沒精打采 老是打盤子"
> and write it as below:
> write.csv(x, file='x.csv', row.names=FALSE, fileEncoding='UTF-8')
> I get this string in the file:
> ""<U+4E16><U+754C><U+9910><U+798F><U+4E8B><U+5DE5> -
> <U+9910><U+5EF3><U+54E1><U+5DE5><U+6C92><U+7CBE><U+6253><U+91C7> "
But how do you read back the contents of the file? You need to specify
the encoding when reading it too.
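
For reading, a sketch of one option: the encoding argument of read.csv() marks the strings as UTF-8 without converting them to the native encoding, whereas fileEncoding re-encodes the input. The sample file is created inside the example so that it is self-contained; its contents and the column name are illustrative:

```r
path <- tempfile(fileext = ".csv")

# Create a small UTF-8 file: a header line and one value.
value <- "\u4e16\u754c\u9910\u798f\u4e8b\u5de5"
con <- file(path, open = "w")
writeLines(c("comment", enc2utf8(value)), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" declares how the bytes should be interpreted;
# no re-encoding to the native code page is attempted.
dat <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
print(Encoding(dat$comment))
print(identical(dat$comment, value))
```

Whether the console can then display the characters is a separate question from whether the data frame holds them correctly.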

> Case 2:
> I create a Notepad file 'x.txt', save the Chinese string
> "世界餐福事工 - 餐廳員工沒精打采 老是打盤子" in it, and read it as below:
> read.table('x.txt', fileEncoding='UTF-8')
> I get the output below:
>
>   V1
> 1  ?
> Warning messages:
> 1: In read.table("x.txt", fileEncoding = "UTF-8") :
>   invalid input found on input connection 'x.txt'
> 2: In read.table("x.txt", fileEncoding = "UTF-8") :
>   incomplete final line found by readTableHeader on 'x.txt'
Are you sure the notepad saved the text as UTF-8?
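
One way to check is to look at the first bytes of the file, since Windows Notepad typically writes a byte-order mark (BOM) when saving as "UTF-8" or "Unicode". A sketch, where detect_bom and starts_with are hypothetical helpers and not R functions; it only distinguishes three common marks:

```r
# Hypothetical helper: TRUE if 'bytes' begins with the raw vector 'prefix'.
starts_with <- function(bytes, prefix) {
  length(bytes) >= length(prefix) &&
    identical(bytes[seq_along(prefix)], prefix)
}

# Hypothetical helper: classify a file by its leading byte-order mark.
detect_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)
  if (starts_with(b, as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8-BOM")
  if (starts_with(b, as.raw(c(0xFF, 0xFE))))       return("UTF-16LE")
  if (starts_with(b, as.raw(c(0xFE, 0xFF))))       return("UTF-16BE")
  "none"
}
```

If detect_bom('x.txt') reports "UTF-8-BOM", read.table() accepts fileEncoding = "UTF-8-BOM" so that the mark is dropped on input. A result of "UTF-16LE" would mean the file was saved with Notepad's "Unicode" option instead of UTF-8.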


Re: Please guide -- UTF-8 locale setting fails on Windows on writing

Sunny Singha
Milan,
Answers to your queries:
-- But how do you read back the contents of the file? You need to specify
the encoding when reading it too.
Answer: I read back as stated in 'Case 2'

-- Are you sure the notepad saved the text as UTF-8?
Answer: Yes.

Regards,
Sunny
