Multibyte strings

Multibyte strings

Fisher Dennis
R 3.2.0
OS X

Colleagues,

Earlier today, I initiated a series of emails regarding SASxport (which was removed from CRAN).  David Winsemius proposed downloading the source code and installing with the following command:
        install.packages('~/Downloads/SASxport_1.5.0.tar.gz', repos = NULL, type = "source")

That works and I am grateful to David for his recommendation.  However, the package fails on some of the many objects that I attempted to write with:
        write.xport

The error message was:
        Error in nchar(var) : invalid multibyte string 3157

One work-around would be to edit out multibyte strings.  Is there a simple way to find and replace them?  Or is there some other clever approach that bypasses the problem?

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Multibyte strings

David Winsemius



On Sep 25, 2015, at 2:23 PM, Dennis Fisher wrote:

> R 3.2.0
> OS X
>
> Colleagues,
>
> Earlier today, I initiated a series of emails regarding SASxport (which was removed from CRAN).  David Winsemius proposed downloading the source code and installing with the following command:
> install.packages('~/Downloads/SASxport_1.5.0.tar.gz', repos = NULL, type = "source")
>
> That works and I am grateful to David for his recommendation.  However, the package fails on some of the many objects that I attempted to write with:
> write.xport
>
> The error message was:
> Error in nchar(var) : invalid multibyte string 3157

Consider using traceback() to see which section of code is actually reporting the error.

The error reported in your earlier message indicated a problem with a particular word starting with DIARRH and ending in æéñåºA. When I drop that unquoted into an R console line I get:

> DIARRH¸æéñåºA
Error: unexpected input in "DIARRH¬"

My word processor tells me that the little comma-like glyph is a cedilla.

However, I'm not sure this is a reproducible problem, since I am unable to produce a similar error with the toy data frame built from the write.xport help page code:

> abc <- data.frame( x=c(1, 2, NA, NA ), y=c('a', 'DIARRH¸æéñåºA', NA, '*' ) )
> abc
   x             y
1  1             a
2  2 DIARRH¸æéñåºA
3 NA          <NA>
4 NA             *
> SASformat(abc$x) <- 'date7.'
> label(abc$y) <- 'character variable'
> label(abc) <- 'Simple example'
> SAStype(abc) <- 'MYTYPE'
> str(abc)
'data.frame': 4 obs. of  2 variables:
 $ x: atomic  1 2 NA NA
  ..- attr(*, "SASformat")= chr "date7."
 $ y: Factor w/ 3 levels "*","a","DIARRH¸æéñåºA": 2 3 NA 1
  ..- attr(*, "label")= chr "character variable"
 - attr(*, "label")= chr "Simple example"
 - attr(*, "SAStype")= chr "MYTYPE"
> write.xport( abc, file="xxx.dat" )
> abc <- data.frame( x=c(1, 2, NA, NA ), y=c('a', 'DIARRH¸æéñåºA', NA, '*' ) )
> abc
   x             y
1  1             a
2  2 DIARRH¸æéñåºA
3 NA          <NA>
4 NA             *
> SASformat(abc$x) <- 'date7.'
> label(abc$y) <- '"DIARRH¸æéñåºA"'
> label(abc) <- 'Simple example'
> SAStype(abc) <- 'MYTYPE'
> str(abc)
'data.frame': 4 obs. of  2 variables:
 $ x: atomic  1 2 NA NA
  ..- attr(*, "SASformat")= chr "date7."
 $ y: Factor w/ 3 levels "*","a","DIARRH¸æéñåºA": 2 3 NA 1
  ..- attr(*, "label")= chr "\"DIARRH¸æéñåºA\""
 - attr(*, "label")= chr "Simple example"
 - attr(*, "SAStype")= chr "MYTYPE"
> write.xport( abc, file="xxx.dat" )


>
> One work-around would be to edit out multibyte strings.  Is there a simple way to find and replace them?  

On a Mac I have used the Zap Gremlins option in TextWrangler.app. It would change the spelling of words that were originally constructed using ligature characters.
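
For finding the offending entries from within R rather than in a text editor, something along these lines might help. It is only a sketch: the function name non_ascii_report is made up for illustration, and it assumes the problem strings sit in character columns or factor levels of a data frame. It relies on iconv() returning NA when a string cannot be represented in the target encoding.

```r
# Hypothetical helper (not part of SASxport): for each character or factor
# column, report the positions of values (or of levels, for factors) that
# contain non-ASCII bytes.
non_ascii_report <- function(df) {
  is_text <- vapply(df, function(col) is.character(col) || is.factor(col),
                    logical(1))
  lapply(df[is_text], function(col) {
    vals <- if (is.factor(col)) levels(col) else col
    # Reading the bytes as Latin-1 makes every byte a legal input character,
    # so an NA result means "cannot be written as ASCII".
    bad <- is.na(iconv(vals, from = "latin1", to = "ASCII")) & !is.na(vals)
    which(bad)
  })
}

# Example with one Latin-1 encoded value (byte 0xE9 is "e acute" in Latin-1).
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
df  <- data.frame(x = c(1, 2, NA), y = c("a", bad, NA),
                  stringsAsFactors = FALSE)
non_ascii_report(df)
```

The result is a named list with one integer vector per text column, so empty vectors mean the column is clean.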


Best of luck;
David.

> Or is there some other clever approach that bypasses the problem?
>
> Dennis

David Winsemius
Alameda, CA, USA


Re: Multibyte strings

Peter Dalgaard-2
In reply to this post by Fisher Dennis
Dennis,

The invalid multibyte issue is almost certainly a symptom of being in a UTF-8 locale and trying to handle strings that aren't in UTF-8. (UTF-8 uses particular 8-bit patterns to say that the following k bytes contain a Unicode value outside ASCII; other "8-bit ASCII" encodings, like Latin-1, just use the extra 128 character codes for special characters. Treating the latter as the former causes errors; the other way around just looks weird.)
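
To make the mechanism concrete, here is a minimal sketch. The byte values and the helper name is_valid_utf8 are chosen purely for illustration and have nothing to do with the actual data in question. It builds a Latin-1 encoded string by hand and checks whether it is valid UTF-8 by round-tripping it through iconv(), which returns NA for input that is not valid in the declared source encoding.

```r
# "cafe" with an acute accent, as Latin-1 bytes: the accented letter is the
# single byte 0xE9. In UTF-8 that byte announces a multi-byte sequence, so
# standing alone at the end of a string it is invalid.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))

# Illustrative helper: TRUE if the string can be read as UTF-8.
is_valid_utf8 <- function(s) !is.na(iconv(s, from = "UTF-8", to = "UTF-8"))

is_valid_utf8(x)

# Declaring the true source encoding lets iconv() produce proper UTF-8,
# after which character counting works.
y <- iconv(x, from = "latin1", to = "UTF-8")
is_valid_utf8(y)
nchar(y)
```

In a UTF-8 locale, calling nchar(x) on the unconverted string is the kind of call that would be expected to raise the "invalid multibyte string" error.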

So perhaps you should try diddling your locale settings and/or look for encoding arguments for the functions that you use. Then again, the XPT format may not be happy with non-ASCII characters, whatever the encoding, in which case you may need to massage the input data sets and change variable names and factor labels (iconv() should be your friend).
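
As a rough sketch of the iconv() route: the helper name to_ascii_df is hypothetical, and the assumption that the input is Latin-1 is exactly that, an assumption that would need checking against the real data. The function converts column names, character columns and factor levels to plain ASCII, replacing anything that cannot be represented with a substitute character.

```r
# Hypothetical helper: force the text parts of a data frame into ASCII.
# 'from' is the ASSUMED source encoding; 'sub' replaces each byte that has
# no ASCII equivalent.
to_ascii_df <- function(df, from = "latin1", sub = "?") {
  conv <- function(s) iconv(s, from = from, to = "ASCII", sub = sub)
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.factor(col)) {
      levels(col) <- conv(levels(col))   # factor labels
    } else if (is.character(col)) {
      col <- conv(col)                   # plain character data
    }
    df[[nm]] <- col
  }
  names(df) <- conv(names(df))           # variable names
  df
}

# Example: one Latin-1 value (byte 0xE9) in a character column.
raw_df <- data.frame(x = 1:2,
                     y = c("a", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))),
                     stringsAsFactors = FALSE)
clean <- to_ascii_df(raw_df)
clean$y
```

Using to = "ASCII//TRANSLIT" instead may give closer approximations (an unaccented letter rather than "?"), but the result of transliteration is platform-dependent, so it is worth inspecting the output before writing the XPT file.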

By the way, I don't think the FDA "requests" XPT files. As far as I recall, they say somewhere that they _accept_ them (possibly defending themselves against the platform-specific SAS files that once abounded), but I think even Excel goes for submissions - the important thing is that they can get at the actual data reasonably easily. I can see the attraction of taking the well-trodden path, though.

-pd


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]


Re: Multibyte strings

Fisher Dennis
Peter

Thanks for the explanation.  One further comment — you wrote:
> I don't think the FDA "requests" XPT files

In fact, they do make such a request.  Here is the actual language received this week (and repeatedly in the past):
> Program/script files should be submitted using text files (*.TXT) and the data should be submitted using SAS transport files (*.XPT).

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com



> On Sep 26, 2015, at 5:52 AM, peter dalgaard <[hidden email]> wrote:
>
> Dennis,
>
> The invalid multibyte issue is almost certainly a symptom of being in a UTF-8 locale and trying to handle strings that aren't in UTF-8. (UTF-8 uses particular 8-bit patterns to say that the following k bytes contain a Unicode value outside ASCII; other "8-bit ASCII" encodings, like Latin-1, just use the extra 128 character codes for special characters. Treating the latter as the former causes errors; the other way around just looks weird.)
>
> So perhaps you should try diddling your locale settings and/or look for encoding arguments for the functions that you use. Then again, the XPT format may not be happy with non-ASCII characters, whatever the encoding, in which case you may need to massage the input data sets and change variable names and factor labels (iconv() should be your friend).
>
> By the way, I don't think the FDA "requests" XPT files. As far as I recall, they say somewhere that they _accept_ them (possibly defending themselves against the platform-specific SAS files that once abounded), but I think even Excel goes for submissions - the important thing is that they can get at the actual data reasonably easily. I can see the attraction of taking the well-trodden path, though.
>
> -pd
