Classifying large text corpora using R


andy1234
Dear everyone,

I am new to R, and I am looking at doing text classification on a huge collection of documents (>500,000) which are distributed among 300 classes (so basically, this is my training data). Would someone please be kind enough to let me know about the R packages to use and their scalability (time and space)?

I do not know which packages are the right ones to use. I started off by trying the tm package (http://cran.r-project.org/package=tm) for pre-processing and the FSelector package (http://cran.r-project.org/web/packages/FSelector/index.html) for feature selection, but both are far too slow to be usable for my task.

So the question is what are the right packages to use (for pre-processing, feature selection, and classification)? Please consider the fact that I may be dealing with data of millions of dimensions which may not even fit in memory.
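
To put rough numbers on the memory concern, here is a back-of-the-envelope sketch in base R. The figure of 1 million terms and the figure of 100 distinct terms per document are assumptions chosen only for illustration, not measurements from the corpus; the sparse estimate assumes a compressed sparse column layout (one double and one integer index per non-zero entry, plus one integer pointer per column).

```r
# Back-of-the-envelope memory estimate for a document-term matrix.
n_docs      <- 500000  # from the post: more than 500,000 documents
n_terms     <- 1e6     # assumption: "millions of dimensions", taken as 1 million
nnz_per_doc <- 100     # assumption (hypothetical): distinct terms per document

bytes_per_double <- 8
bytes_per_int    <- 4

# Dense storage: one double per cell.
dense_bytes <- n_docs * n_terms * bytes_per_double

# Compressed sparse column storage: one double and one integer row index per
# non-zero entry, plus one integer column pointer per column (and one extra).
nnz <- n_docs * nnz_per_doc
sparse_bytes <- nnz * (bytes_per_double + bytes_per_int) +
  (n_terms + 1) * bytes_per_int

cat(sprintf("dense:  %.1f TB\n", dense_bytes / 1e12))
cat(sprintf("sparse: %.1f MB\n", sparse_bytes / 1e6))
```

Under these assumptions the dense matrix would need about 4 TB, while the sparse layout would need roughly 0.6 GB, so the representation used by a package matters as much as its speed.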

I posted on this issue twice (http://r.789695.n4.nabble.com/Entropy-based-feature-selection-in-R-td3708056.html , http://r.789695.n4.nabble.com/R-s-handling-of-high-dimensional-data-td3741758.html) but did not get any response. This is a very critical piece of my research and I have been struggling with this issue for a long time. Please consider helping me out, directly or by pointing me to any other software/website that you think may be more appropriate.

Many thanks in advance.
Re: Classifying large text corpora using R

Daniel Malter
Take a look here: http://www.jstatsoft.org/v25/i05/paper

HTH,
Da.

Re: Classifying large text corpora using R

andy1234
Hi,

Many thanks for your reply.

I did in fact mention in my e-mail that I have looked at the tm package. It does not scale well at all.

There are also other stages in the pipeline (feature selection, classification, etc.), and I need to find suitable R packages for those as well.

Any other thoughts?

Thanks.
Andy