TM reader with text

TM reader with text

Mickael R problem
Hello everybody,
I am trying to work with the tm package, but I have a problem with some special words in French. I think it comes from the way the PDFs were converted to text, but I am not entirely sure.
Here is an example:

findFreqTerms(tdm1,30)
    [33] "<U+F0A3>"            "<U+FB01>n"           "<U+FB01>nancement"   "<U+FB01>nancier"     "<U+FB01>nancière"    "<U+FB01>nancières"   "<U+FB01>nanciers"    "<U+FB01>xe"        

Some French words are not read correctly by tm with the readPlain reader. I tried reader = readPDF, but it did not work, so I had to convert the PDFs to plain text first. Some words are not recognized, so when I build a TermDocumentMatrix a word like "inflation" disappears. This is a big problem for me and I have spent a lot of time on it. Any ideas? Thanks for your time.
Best regards,
Mickaël

Re: TM reader with text

David Winsemius

On Feb 29, 2012, at 6:00 PM, Mickael R problem wrote:

> [...] I think this is due to the way the PDFs were converted to text,
> but I'm not perfectly sure.
>
> findFreqTerms(tdm1,30)
>    [33] "<U+F0A3>"  "<U+FB01>n"  "<U+FB01>nancement"  "<U+FB01>nancier"
>    "<U+FB01>nancière"  "<U+FB01>nancières"  "<U+FB01>nanciers"  "<U+FB01>xe"
>
> Some French words are not read correctly by tm with the readPlain
> reader. [...] when I use TermDocumentMatrix, a word like "inflation"
> disappears.

You included no information about your platform, locale settings, or  
encoding of the text.

?Encoding
?sessionInfo
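
For readers following along, here is a minimal base-R sketch of how the information asked for above could be collected. The sample string is an assumption made up for illustration; it only mimics the ligature seen in the original output.

```r
# Platform, R version and attached packages (what ?sessionInfo documents).
print(sessionInfo())

# Locale category that governs how characters are interpreted.
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# A sample string starting with U+FB01 (illustrative, not from the real corpus).
x <- "\ufb01nancement"

# Declared encoding of the string (what ?Encoding documents).
print(Encoding(x))

# Convert to UTF-8 explicitly and check that the result is valid UTF-8.
y <- enc2utf8(x)
print(validUTF8(y))
```

Posting the output of sessionInfo() together with the declared encoding of a problematic string is usually enough for others to reproduce an encoding problem.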

--

David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: TM reader with text

Mickael R problem
My computer runs Windows Vista 64-bit SP2. As for the question about encoding, I am sorry, but I don't understand it.

Re: TM reader with text

Richard M. Heiberger
In reply to this post by David Winsemius
Most, maybe all, of the example words you posted include ligatures.
With "financier", for example, the leading "fi" is rendered in PDF and in
most typesetting situations as a ligature, that is, a single complex
character representing the "fi" combination.

ﬁ ﬂ

I pasted the "fi" and "fl" ligatures in this email. I hope they get through.

I don't know the package you are using; I hope it has arguments that tell
it how to handle ligatures.
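
To make the point concrete, here is a small base-R sketch showing that the "fi" ligature is one code point, so a word containing it is a different string from the plain spelling. The variable names are only illustrative.

```r
# U+FB01 is LATIN SMALL LIGATURE FI: one character, not "f" followed by "i".
lig  <- "\ufb01"
word <- paste0(lig, "nancier")

print(nchar(word))           # ligature spelling: 8 characters
print(nchar("financier"))    # plain spelling: 9 characters

# The code point of the ligature, as an integer (0xFB01).
print(utf8ToInt(lig))

# The two spellings do not compare equal, so they are counted as
# different terms in a term-document matrix.
print(word == "financier")
```

This is why terms such as "<U+FB01>nancier" show up separately: the text extracted from the PDF kept the ligature character instead of the two letters.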

Rich



On Wed, Feb 29, 2012 at 6:49 PM, David Winsemius <[hidden email]> wrote:

> You included no information about your platform, locale settings, or
> encoding of the text.
>
> ?Encoding
> ?sessionInfo



Re: TM reader with text

Mickael R problem
Hi Richard,
Clearly there is a problem with Latin ligatures, because findFreqTerms returns terms such as
"<U+FB01>n"  "<U+FB01>nancement"  "<U+FB01>nancier"  "<U+FB01>nancière"  "<U+FB01>nancières"  "<U+FB01>nanciers"  "<U+FB01>xe"
where U+FB01 is the code for a Latin ligature. So the problem is now well identified.

Now, how can I deal with it? The tau package seems to offer a solution for text, but not for a corpus.

Quotation from the tau documentation:

translate    Translate Unicode Latin Ligatures

Description
  Translate Unicode “Latin ligature” characters to their respective constituents.

Usage
  translate_Unicode_latin_ligatures(x)

Arguments
  x    a character vector in UTF-8 encoding.

Details
  In typography, a ligature occurs where two or more graphemes are joined as a single glyph. (See
  http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
  Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:

  Code  Name
  0132  LATIN CAPITAL LIGATURE IJ
  0133  LATIN SMALL LIGATURE IJ
  0152  LATIN CAPITAL LIGATURE OE
  0153  LATIN SMALL LIGATURE OE
  FB00  LATIN SMALL LIGATURE FF
  FB01  LATIN SMALL LIGATURE FI
  FB02  LATIN SMALL LIGATURE FL
  FB03  LATIN SMALL LIGATURE FFI
  FB04  LATIN SMALL LIGATURE FFL
  FB05  LATIN SMALL LIGATURE LONG S T
  FB06  LATIN SMALL LIGATURE ST

  translate_Unicode_latin_ligatures translates these to their respective constituent characters.

I need this kind of function for a corpus, not only for text or characters. Any ideas?
Thanks for your comments and answers. We are making progress!
Mickaël
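
As an illustration of what such a translation does, here is a base-R sketch that expands the FB00 to FB04 and FB06 ligatures on a character vector. The helper name expand_ligatures is hypothetical and is not part of tau or tm; tau's own translate_Unicode_latin_ligatures covers the full table quoted above.

```r
# Hypothetical helper: expand a few Latin ligatures to their constituent
# letters. Base R only; not the tau implementation.
expand_ligatures <- function(x) {
  ligs  <- c("\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04", "\ufb06")
  plain <- c("ff",     "fi",     "fl",     "ffi",    "ffl",    "st")
  for (i in seq_along(ligs)) {
    # fixed = TRUE: treat the ligature as a literal string, not a regex.
    x <- gsub(ligs[i], plain[i], x, fixed = TRUE)
  }
  x
}

terms <- c("\ufb01nancement", "in\ufb02ation", "\ufb01xe")
print(expand_ligatures(terms))
```

Because each ligature has its own code point, the order of the replacements does not matter here.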

Re: TM reader with text

Milan Bouchet-Valat
On Thursday, March 1, 2012 at 07:07 -0800, Mickael R problem wrote:

> [...] The tau package seems to offer a solution for text, but not for
> a corpus. [...]
>
> I need this kind of function for a corpus, not only for text or
> characters. Any ideas?
Try:
corpus <- tm_map(corpus, translate_Unicode_latin_ligatures)
(with 'corpus' your corpus, of course ;-)
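
A note for readers on newer versions of tm: as far as I know, since tm 0.6 a plain character function passed to tm_map should be wrapped in content_transformer() so that the documents keep their class. The sketch below assumes the tm and tau packages are installed; the two sample documents are made up for illustration.

```r
# Sketch only: assumes the 'tm' and 'tau' packages are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tau", quietly = TRUE)) {
  # Two made-up documents containing the "fi" and "fl" ligatures.
  docs <- c("Le \ufb01nancement est \ufb01xe.", "L'in\ufb02ation augmente.")
  corpus <- tm::VCorpus(tm::VectorSource(docs))

  # content_transformer() turns a character function into a corpus
  # transformation that preserves the document structure.
  corpus <- tm::tm_map(
    corpus,
    tm::content_transformer(tau::translate_Unicode_latin_ligatures)
  )

  # Pull the text back out to check the result.
  cleaned <- unlist(lapply(corpus, as.character))
  print(cleaned)
} else {
  message("tm and/or tau not installed; skipping the corpus demo.")
}
```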


Re: TM reader with text

Mickael R problem
Hello everybody,
I am not giving up the fight, but it is hard. I found a solution to the ligature problem by using a better converter, which translates PDF to plain text more precisely. But a new problem has come up. In French in particular, though it should also be the case in English, I have a big problem with quote characters (' and "), which pollute the word counts. At the moment, the removePunctuation function is not able to handle this "punctuation".

The solution would be a more precise removePunctuation function that strips any quote character. The problem is that these characters are not separated from the word by a space; they normally sit just before or after the word. So removePunctuation treats the quote character as part of the word.
Another, less important, problem is that words are counted separately depending on whether or not they end in "s", and so on. Maybe TermDocumentMatrix has an option to count only the base word, but I have not found it.

For the moment I deal with this problem by hand. I open all the texts in Word and eliminate all the quote characters with a small macro. But that is not easy with several hundred texts in my corpus.
For plurals, I take the word with and without "s" and handle the difference myself. Fortunately, I only want to keep the 40 or so most meaningful words of the corpus.
I know what kind of improvement could be made, but I am just a user, not an engineer. I think small improvements could be made by the magical engineers who work for the community, as I modestly try to do with my comments.
Thanks for everything,
Mickaël

Re: TM reader with text

Milan Bouchet-Valat
On Saturday, March 3, 2012 at 16:56 -0800, Mickael R problem wrote:

> [...] In French in particular, but it should be the case in English
> too, I have a big problem with quote characters (' and "), which
> pollute the word counts. [...] These characters are not separated from
> the word by a space; they normally sit just before or after the word.
> So removePunctuation treats the quote character as part of the word.
Sadly, removePunctuation() only handles English punctuation correctly.
In English, this problem never happens, or only with a trailing 's, which
does not really matter.

Try this before running removePunctuation():
corpus <- tm_map(corpus, function(x) gsub("['\u2019«»]", " ", x))

It will replace apostrophes and quotation marks with a space, and that's
enough to separate them from the rest of the word.
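
Here is a small base-R sketch of that substitution on plain character strings, so the effect of the pattern can be checked before applying it to a corpus. The sample words are made up; the important detail is that all the quote characters sit inside one bracketed character class, so any one of them matches on its own.

```r
# Character class: ASCII apostrophe, typographic apostrophe (U+2019),
# and French guillemets (U+00AB, U+00BB).
quote_chars <- "['\u2019\u00ab\u00bb]"

x <- c("l\u2019inflation", "d'inflation", "\u00abinflation\u00bb")

# Replace each quote character with a space so that the word is
# separated from the elided article or the surrounding marks.
spaced <- gsub(quote_chars, " ", x)
print(spaced)

# Leading and trailing spaces can be trimmed afterwards if needed.
print(trimws(spaced))
```

After this step, a tokenizer that splits on whitespace sees "inflation" as its own token in all three cases.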

> Another, less important, problem is that words are counted separately
> depending on whether or not they end in "s", and so on. [...]
> For plurals, I take the word with and without "s" and handle the
> difference myself. [...]
This is called stemming, and it's implemented by the Snowball package.
You can do this with:
corpus <- tm_map(corpus, stemDocument, language="french")
(after installing Snowball)
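
For readers on a current R installation: as far as I know, the stemmer is nowadays provided by the SnowballC package rather than Snowball. The sketch below assumes SnowballC is installed and calls its wordStem function directly on a few words from this thread, to show how singular and plural forms collapse to a common stem.

```r
# Words taken from the output earlier in the thread (ligatures expanded).
words <- c("financier", "financiers", "financi\u00e8re", "financi\u00e8res")

# Sketch only: assumes the 'SnowballC' package is installed.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # Stem each word with the French Snowball stemmer.
  stems <- SnowballC::wordStem(words, language = "french")
  # Count how many of the original forms map to each stem.
  print(table(stems))
} else {
  message("SnowballC is not installed; skipping the stemming demo.")
}
```

Counting stems instead of raw terms avoids merging the "with s" and "without s" counts by hand.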

You can also try the GUI I'm currently writing to do that easily [1]. No
warranty it will work, but it usually does quite well, though it's still
in development. To install it:
install.packages("RcmdrPlugin.TextMiningSuite",
repos="http://R-Forge.R-project.org")

Hope this helps

1: https://r-forge.r-project.org/projects/rcmdr-tms/


Re: TM reader with text

Mickael R problem
> Try this before running removePunctuation():
> corpus <- tm_map(corpus, function(x) gsub("['\u2019«»]", " ", x))
> It will replace quotation marks with a space, and that's enough to
> separate them from the rest of the word.

I tried your solution. It works only on characters, not on a Corpus, but it was an idea.

> You can also try the GUI I'm currently writing to do that easily [1]. No
> warranty it will work, but it usually does quite well, though it's still
> in development. To install it:
> install.packages("RcmdrPlugin.TextMiningSuite",
> repos="http://R-Forge.R-project.org")

I would be interested, and I have already tried to install this package, but there is a problem with the Rgraphviz version. I don't understand exactly what the problem is.

Best regards,
Mickaël