Feature space problem regarding text classification using SVM

R help mailing list-2
Dear list members,

I'm currently working on text classification of students' essays, trying
to identify whether or not a text belongs to a certain class. I use texts
from one semester (A) for training and texts from another semester (B) for
testing the classifier. My workflow is like this:

  * read all texts from A, build a DTM(A) with about 1387 terms
  * read all texts from B, build a DTM(B) with about 626 terms
  * train the classifier with DTM(A), using an SVM (package e1071)

Now I want to classify all texts in DTM(B) using the classifier. But
when I try to use predict(), I always get the error message: Error in
eval(expr, envir, enclos) : object 'XY' not found. As I found out, the
reason for this is that DTM(A) and DTM(B) have a different number of
terms and consequently not every term used for training the model is
available in DTM(B).
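
A minimal sketch of how the mismatch described above could be checked, using base R only. The two small matrices and their term names are made-up stand-ins for DTM(A) and DTM(B), not the poster's data:

```r
# Toy stand-ins for DTM(A) and DTM(B): rows are documents, columns are terms.
# All values and term names here are hypothetical.
dtm_a <- matrix(c(1, 0, 2,
                  0, 3, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("a1", "a2"),
                                c("essay", "theory", "method")))
dtm_b <- matrix(c(2, 1,
                  0, 4),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("b1", "b2"),
                                c("essay", "exam")))

# Terms the model was trained on that do not occur in the test data.
missing_in_b <- setdiff(colnames(dtm_a), colnames(dtm_b))

# Terms in the test data that the model has never seen.
unseen_in_a <- setdiff(colnames(dtm_b), colnames(dtm_a))

print(missing_in_b)
print(unseen_in_a)
```

Any term listed in missing_in_b is a column the trained model expects but cannot find in the new data, which is consistent with the "object 'XY' not found" error.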

My question is: how should/do I deal with this? Should I match the terms
used in DTM(A) and DTM(B), in order to get an identical feature space?
This could be achieved either by reducing the number of terms in DTM(A) or
by adding several empty/NA columns to DTM(B). Or is there another solution
to my problem?
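
One possible way to get an identical feature space is sketched below, under the assumption that both DTMs can be converted to ordinary numeric matrices with terms as column names. The helper align_terms() and the toy data are hypothetical. Training terms that are absent from the new texts are filled with 0 rather than NA, since a term that does not occur has a count of zero; terms that occur only in the new texts are dropped, because the model has no weights for them.

```r
# Hypothetical helper: re-express a new document-term matrix in the
# vocabulary used for training.
align_terms <- function(dtm_new, train_terms) {
  # Start from an all-zero matrix with exactly the training columns.
  aligned <- matrix(0,
                    nrow = nrow(dtm_new),
                    ncol = length(train_terms),
                    dimnames = list(rownames(dtm_new), train_terms))
  # Copy counts for terms present in both vocabularies; terms found only
  # in dtm_new are ignored.
  shared <- intersect(train_terms, colnames(dtm_new))
  aligned[, shared] <- dtm_new[, shared, drop = FALSE]
  aligned
}

# Toy example (made-up terms and counts).
train_terms <- c("essay", "theory", "method")
dtm_b <- matrix(c(2, 1,
                  0, 4),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("b1", "b2"), c("essay", "exam")))

b_aligned <- align_terms(dtm_b, train_terms)
print(b_aligned)
```

The aligned matrix has the same columns, in the same order, as the training data, so it can be supplied as the new data to predict(). If the DTMs are built with the tm package, its dictionary control option for DocumentTermMatrix() appears intended for this kind of restriction to a fixed vocabulary; the package documentation should be checked for details.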

Kind regards

   Björn




______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Feature space problem regarding text classification using SVM

Bert Gunter-2
This question is about statistics and is therefore off topic for this list.
However, if I understand correctly, isn't the answer obvious? How can you
classify by features whose values are unknown?

Cheers,
Bert



(Reply to the message sent on Feb 17, 2017 5:28 AM by "Björn Fisseler via R-help" <[hidden email]>.)