SVM. How to use categorical attributes?

Alekseiy Beloshitskiy
Hi All,

Here is the case. I want to build a classification model (SVM). Some of the variables for this model are categorical attributes that represent words (usually 3-10 words: the query for a Google search). For example:
search_id | query_words                | .. | result
----------+----------------------------+----+-------
1         | how,to,grow,tree           | .. | 4
2         | smartfone,htc,buy,price    | .. | 7
3         | buy,house,realty,london    | .. | 6
4         | where,to,go,weekend,cinema | .. | 4
...
As you can see, the words in a query are unordered and may occur in several queries. The total number of unique words across all queries is several thousand.
The question is how to represent this variable (query_words) for use in an SVM.

Thank you for any advice!

Alex

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: SVM. How to use categorical attributes?

Steve Lianoglou-6
Hi,

One approach is to wire up a "bag of words" type of design matrix.

That is to say the matrix has as many columns as there are unique
words. Each row is an observation (query), and the words that appear
in the query have a value of 1 (or you can count the number of times
each word appears).
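
A minimal base-R sketch of such a design matrix, using the four example queries from the original post. The names `queries`, `vocab`, and `X` are only illustrative, and with several thousand words a sparse representation would probably be preferable to the dense matrix built here.

```r
# Example queries from the original post, one character vector per query.
queries <- list(
  c("how", "to", "grow", "tree"),
  c("smartfone", "htc", "buy", "price"),
  c("buy", "house", "realty", "london"),
  c("where", "to", "go", "weekend", "cinema")
)

# Vocabulary: every unique word across all queries, one column each.
vocab <- sort(unique(unlist(queries)))

# Binary indicator matrix: X[i, w] is 1 if word w appears in query i.
X <- t(vapply(queries,
              function(q) as.integer(vocab %in% q),
              integer(length(vocab))))
colnames(X) <- vocab

# For counts instead of indicators, one could tabulate each query, e.g.
# table(factor(q, levels = vocab)) for a single query q.
dim(X)
```

Each row can then be fed to an SVM as an ordinary numeric feature vector, alongside the other variables.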

You can maybe get smarter and try to group like words together, but
... now you'll have two problems ...

Hope you have lots of data!

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Re: SVM. How to use categorical attributes?

Alekseiy Beloshitskiy
Thank you, Steve,

I was thinking about something like this. I am just not sure how efficient it is to use several thousand additional variables. The second problem will be the time needed to manage all this data in memory.

I posted a briefer description here:
http://stats.stackexchange.com/questions/25355/multi-value-categorical-attributes-how-r

Thank you,
-Alex


Re: SVM. How to use categorical attributes?

Ulrich Bodenhofer
Alex,

To avoid the memory issue, you can directly use a "bag of words" kernel (which corresponds to using the linear kernel on the sparse bag of words matrix Steve suggested). Here is a little toy example of how this is done for two samples:

> x1 <- c("how", "to", "grow", "tree")
> x2 <- c("where", "to", "go", "weekend", "cinema")
> k12 <- length(intersect(x1, x2))
> k12
[1] 1

If you run this for every pair of samples (additionally exploiting the symmetry of the resulting matrix), you will get an L x L matrix of kernel values (where L is the number of samples) without the need of having to store the large bag of words matrix. That's exactly one of the beauties of SVMs, in my humble opinion.
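
Running this for every pair could look roughly like the following base-R sketch. The names `samples`, `bow_kernel`, and `K` are made up for illustration; only the `length(intersect(...))` step comes from the example above.

```r
# Toy data: each sample is a character vector of words (illustrative).
samples <- list(
  c("how", "to", "grow", "tree"),
  c("smartfone", "htc", "buy", "price"),
  c("buy", "house", "realty", "london"),
  c("where", "to", "go", "weekend", "cinema")
)

# Kernel value for two bags of words: number of distinct shared words.
bow_kernel <- function(a, b) length(intersect(a, b))

n <- length(samples)
K <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in i:n) {              # upper triangle only, including diagonal
    K[i, j] <- bow_kernel(samples[[i]], samples[[j]])
    K[j, i] <- K[i, j]          # symmetry: fill the mirror entry
  }
}
K
```

A precomputed matrix like `K` can then be handed to an SVM implementation that accepts kernel matrices. In kernlab, for instance, I believe this is done by wrapping it with `as.kernelMatrix()` before calling `ksvm()`, but please check the package documentation.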

Just as a side note: the result above is 1 because there is one overlap in the two bags of words, the word "to". Maybe it is a good idea to remove such unspecific words first and, moreover, to do word stemming, as is the standard in analyses like the one you are aiming at.
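
A small sketch of the stop-word step in base R. The `stop_words` list and the helper `remove_stop_words` are hypothetical and far from complete; stemming is not shown because it needs an add-on package (SnowballC's `wordStem()` is one option, as far as I know).

```r
# Hypothetical stop-word list; a real analysis would use a fuller one.
stop_words <- c("how", "to", "where", "the", "a", "of")

# Drop stop words from a bag of words (setdiff also removes duplicates).
remove_stop_words <- function(words) setdiff(words, stop_words)

x1 <- remove_stop_words(c("how", "to", "grow", "tree"))
x2 <- remove_stop_words(c("where", "to", "go", "weekend", "cinema"))

# With "to" filtered out, the two queries no longer share any word.
k12 <- length(intersect(x1, x2))
k12
```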

Best regards,
Ulrich

Re: SVM. How to use categorical attributes?

Ulrich Bodenhofer
Sorry, I forgot to mention the following: all I wrote is only valid as long as your number of samples is smaller than the number of different words. If the number of samples exceeds the total number of different words, you are better off using the explicit matrix representation with some kernel (e.g. linear) on this matrix.
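
To make the trade-off concrete, here is a rough base-R sketch on toy data (all names are illustrative). It assumes binary indicators and no repeated words within a sample; under those assumptions the linear kernel on the explicit matrix gives the same values as the set-intersection kernel, so the choice is only about which matrix is smaller to store.

```r
# Toy samples (illustrative); words are not repeated within a sample.
samples <- list(
  c("how", "to", "grow", "tree"),
  c("smartfone", "htc", "buy", "price"),
  c("buy", "house", "realty", "london"),
  c("where", "to", "go", "weekend", "cinema")
)

# Explicit bag-of-words matrix: one row per sample, one column per word.
vocab <- unique(unlist(samples))
X <- t(vapply(samples,
              function(s) as.integer(vocab %in% s),
              integer(length(vocab))))

# Linear kernel on the explicit matrix: X %*% t(X).
K_linear <- tcrossprod(X)

# The same values from pairwise set intersections.
idx <- seq_along(samples)
K_sets <- outer(idx, idx, Vectorize(function(i, j)
  length(intersect(samples[[i]], samples[[j]]))))

all(K_linear == K_sets)

# Entries to store: samples x words versus samples x samples.
c(explicit = length(X), kernel = length(K_linear))
```

Here there are fewer samples than words, so the kernel matrix is the smaller of the two; with more samples than words the comparison flips.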

Best regards,
Ulrich

Re: SVM. How to use categorical attributes?

Alekseiy Beloshitskiy
Thank you so much, Ulrich,

Will play with this.

Best,
-Alex

Re: SVM. How to use categorical attributes?

Steve Lianoglou-6
In reply to this post by Ulrich Bodenhofer
Hi,

These suggestions still require you to explicitly compute your feature
space or kernel matrix first, which might kill you memory-wise.

You might consider taking a look at the shogun toolbox:

http://www.shogun-toolbox.org/

With some digging, I'm pretty sure you'll find a bag-of-words type of
kernel there (it's related to the spectrum kernel, which you can find
by searching the code base for something like "commword") ... you
might consider posting to their mailing list after you give it the
"good old college try" of sorting this out for yourself for a bit.

The R interface to the toolbox is a bit ... alien, though. I'm working
on making a nicer one but it's not quite ready for public consumption.

-steve



Re: SVM. How to use categorical attributes?

Steve Lianoglou-6
Sorry -- I should add that I'm pointing out the potential shogun
implementation because I suspect their implementation of a
bag-of-words-like kernel would use the kernel trick, so you won't
have to map all of your data explicitly into some huge feature space
that will blow your memory away.

I'm not 100% sure they have what you're looking for, but as I said ...
it's worth checking out.

-steve
