text analysis errors


text analysis errors

gdb1986
Hello all,

I have asked this question on many forums without a response, and although
I've made some progress myself, I am stuck on how to deal with a particular
error message.

My question concerns text-analysis packages and code. The general idea is
that I am trying to run readability analyses on a collection of about 4,000
Word files. I would like to run any of a number of such analyses, but the
immediate problem is getting R to recognize the loaded files as data ready
for analysis; so far I have only been getting error messages. Let me show
what I have done. I use three separate commands because the collection of
4,000 files was evidently too large to be read in one pass, so I split it
into three roughly equal folders, called ‘WPSCASES’ one through three. Here
is my code, with the error message for each command recorded below:

token <- tokenize("/Users/Gordon/Desktop/WPSCASESONE/", lang = "en", doc_id = "sample")

The code for the other folders is identical apart from the folder name.

The error message reads:

Error in nchar(tagged.text[, "token"], type = "width") : invalid multibyte
string, element 348

The error messages for the other two commands are the same, but the
'element' number differs: it is 925 for the second folder and 4302 for the
third.

token2 <- tokenize("/Users/Gordon/Desktop/WPSCASES2/", lang = "en", doc_id = "sample")

token3 <- tokenize("/Users/Gordon/Desktop/WPSCASES3/", lang = "en", doc_id = "sample")

These are the other commands if that's helpful.

I’ve tried to discover whether the ‘element’ that the error message
mentions corresponds to the file at that position in the folder’s ordering,
but since folder 3 does not contain 4,300 files, that seems unlikely.
Please let me know if you can figure out how to fix this so that I can
start using ‘koRpus’ functions such as ‘readability’.
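In case it helps, here is a sketch of how one might hunt down the files
that contain the invalid multibyte strings before tokenizing. It assumes
the folder contains plain-text files (for binary .docx it would need a
binary reader instead); validUTF8() is base R (>= 3.3), and the path is one
of the folders above.

```r
# List the files in a folder whose contents are not valid UTF-8 --
# these are the likely culprits behind "invalid multibyte string" errors.
find_invalid_utf8 <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  bad <- vapply(files, function(f) {
    txt <- readLines(f, warn = FALSE)  # read the file line by line
    any(!validUTF8(txt))               # TRUE if any line has invalid bytes
  }, logical(1))
  files[bad]
}

# find_invalid_utf8("/Users/Gordon/Desktop/WPSCASESONE/")
```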

Thank you,
Gordon


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: text analysis errors

Jim Lemon-4
Hi Gordon,
It looks to me as though you may have to extract the text from the Word
files first, e.g. via "Export As Text".
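Once the text is out of Word, any stray bytes left over from the conversion
can be scrubbed in R before tokenizing. A minimal sketch, assuming the
exported files are meant to be UTF-8 ("exported.txt" is a hypothetical file
name); iconv() with sub = "" drops byte sequences that are not valid in the
declared encoding:

```r
# Read an exported text file and drop invalid UTF-8 byte sequences,
# so downstream calls such as nchar() do not fail on them.
raw_lines <- readLines("exported.txt", warn = FALSE)
clean <- iconv(raw_lines, from = "UTF-8", to = "UTF-8", sub = "")
```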

Jim

On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud
<[hidden email]> wrote:



Re: text analysis errors

Rasmus Liland-3
On 2021-01-07 11:34 +1100, Jim Lemon wrote:

> On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud
> <[hidden email]> wrote:
>
> Hi Gordon,
> Looks to me as though you may have to extract the text from the Word
> files. Export As Text.
Hi!  

quanteda::tokenize says it needs a
character vector or «corpus» as input:

        https://www.rdocumentation.org/packages/quanteda/versions/0.99.12/topics/tokenize

... or is this the tokenize from the
tokenizers package? I found something
about «doc_id» here:

        https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html

You can convert docx to markdown using
pandoc:

        pandoc --from docx --to markdown $inputfile

odt also works, and many others.  
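To convert a whole folder at once, the pandoc call can be wrapped in a
shell loop. A sketch, assuming pandoc is on the PATH and the input files
end in .docx; each output lands next to its input as a .txt file:

```shell
# Convert every .docx in the folder to plain text for analysis in R.
for f in /Users/Gordon/Desktop/WPSCASESONE/*.docx; do
    pandoc --from docx --to plain "$f" -o "${f%.docx}.txt"
done
```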

I believe pandoc is bundled with RStudio,
but I have never used it from there
myself, so I cannot vouch for that route.

To read doc, I use wvHtml:

        wvHtml $inputfile - 2> /dev/null | w3m -dump -T text/html

Rasmus
