variable selections to avoid multicollinearity

Kristi Glover
Hi R users,
I am trying to reduce my independent variables before I run models. My dependent variable is presence (TRUE) only, with no absence (FALSE) records, and I have more than 20 independent variables that are highly correlated. I found that PCA is used for feature selection.
However, the PCA feature selection I found uses the dependent variable (in a linear model) together with the independent variables to select variables based on the variation explained. In my case the dependent data are all "1", so I could not run it.

Could you give me some suggestions on how to reduce the variables to a smaller number? I have included sample data below. In this data set, the dependent variable is "sp" and the other 20 variables are the independent variables.

dat<-structure(list(sp = c(1L, 1L, 1L, 1L, 1L), var1 = c(32L, 222L,
134L, 114L, 121L), var2 = c(188L, 175L, 167L, 166L, 167L), var3 = c(123L,
129L, 136L, 138L, 137L), var4 = c(40L, 35L, 37L, 38L, 37L), var5 = c(6756L,
8080L, 7856L, 7899L, 7891L), var6 = c(334L, 352L, 341L, 340L,
341L), var7 = c(29L, -9L, -18L, -22L, -20L), var8 = c(305L, 361L,
359L, 362L, 361L), var9 = c(108L, 217L, 167L, 166L, 166L), var10 = c(237L,
67L, 61L, 59L, 60L), var11 = c(270L, 276L, 265L, 264L, 264L),
    var12 = c(97L, 67L, 61L, 59L, 60L), var13 = c(1491L, 916L,
    1245L, 1282L, 1250L), var14 = c(168L, 127L, 154L, 155L, 154L
    ), var15 = c(99L, 43L, 67L, 70L, 68L), var16 = c(15L, 32L,
    22L, 21L, 21L), var17 = c(432L, 313L, 390L, 400L, 392L),
    var18 = c(308L, 148L, 254L, 269L, 257L), var19 = c(332L,
    213L, 269L, 277L, 271L), var20 = c(430L, 148L, 254L, 269L,
    257L)), .Names = c("sp", "var1", "var2", "var3", "var4",
"var5", "var6", "var7", "var8", "var9", "var10", "var11", "var12",
"var13", "var14", "var15", "var16", "var17", "var18", "var19",
"var20"), class = "data.frame", row.names = c(NA, -5L))

thanks

     

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: variable selections to avoid multicollinearity

Bert Gunter
OFFTOPIC! This is a statistical question, not an R question. Post on a
statistics site like stats.stackexchange.com.

However, your post suggests that you are completely out of your depth
here (0/1 responses suggest that glm modeling via logistic regression
is called for). Remote internet advice is unlikely to fill the gap
between what you seem to need and what you seem to know. I strongly
suggest you find a local statistical expert to help if you wish to
avoid producing nonsense.

(Once you have figured out what you need to do, questions about how to
use R tools to do it are of course appropriate).

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sun, May 17, 2015 at 1:06 PM, Kristi Glover
<[hidden email]> wrote:

> HI R user,
> I was trying to reduce my independent variables before I run models. I have a dependent variable as a present or TRUE only (no Absence or False) whereas I have more than 20 independent variables but they are highly correlated. I was trying to reduce the independent variables . I found  PCA for feature  selection are used.
> but for the PCA feature selection, I realized that it used dependent variable (as a linear model) with independent variables to select the variables based on variation explained. But, for me , the dependent data are only "1". Therefore, I could not run it.
>
> would you give me some suggestions on how I reduce the variables into a certain numbers ? I have attached a sample data. In this data set, the dependent variable is "sp" and other 20 variables are the independent variables

Re: variable selections to avoid multicollinearity

Patrick_Schn01
This post has NOT been accepted by the mailing list yet.
In reply to this post by Kristi Glover
I am a little confused about your data. What could you possibly intend to model with a vector of only 1s? You say you have no absence/false data, but do you have any empty cells, or any values that are not 1? The assumption with species sampling is that a 0 does not necessarily indicate the absence of a particular taxon, but rather a lack of detection. That said, a lack of detection or presence would still be coded as a zero.
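To make the concern concrete, here is a small check using only the "sp" and "var1" columns copied from the posted sample. It is a sketch of why a constant response carries no information, not a diagnosis of the full data set, which may look different.

```r
# Columns copied from the sample data in the original post
sp   <- c(1L, 1L, 1L, 1L, 1L)
var1 <- c(32L, 222L, 134L, 114L, 121L)

table(sp)                         # only one class is present
var(sp)                           # a constant response has zero variance

# With zero variance in sp, its correlation with any predictor is undefined,
# so R returns NA (and a warning, suppressed here).
r <- suppressWarnings(cor(sp, var1))
is.na(r)
```

Any method that relates the predictors to the response (logistic regression, PLS, and so on) needs at least some rows where the response is not 1.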

Beyond that, an alternative to Bert's suggestion of logistic regression would be partial least squares (PLS) using the 0/1 vector as the response. PLS is similar to PCA but with a different objective function: here you would find linear combinations of X that maximize the covariance between X and the 0/1 vector. You could use this PLS model directly and determine the important variables from the magnitude of the weights, the standardized regression coefficients, and/or the variable importance in projection (VIP) scores:

http://www.researchgate.net/profile/Lars_Snipen/publication/233748490_A_review_of_variable_selection_methods_in_Partial_Least_Squares_Regression/links/0912f50bca8a9cc896000000.pdf 
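As a rough illustration of the "weights" idea, here is a base-R sketch of the first PLS component for a single response. The data are simulated and purely hypothetical (the posted sample has no 0s, so it cannot be used), and the variable names are made up. For one response, the first weight vector is the unit vector proportional to X'y, which is the direction that maximizes the covariance between the component and the centred response.

```r
set.seed(1)
n <- 60

# Hypothetical data: var1 and var2 are highly correlated, var3 is unrelated noise
z <- rnorm(n)
X <- cbind(var1 = z + rnorm(n, sd = 0.3),
           var2 = z + rnorm(n, sd = 0.3),
           var3 = rnorm(n))
y <- as.numeric(z + rnorm(n, sd = 0.5) > 0)   # simulated 0/1 response

Xs <- scale(X)        # centre and scale the predictors
yc <- y - mean(y)     # centre the response

# First PLS weight vector: proportional to X'y, normalised to unit length
w <- drop(crossprod(Xs, yc))
w <- w / sqrt(sum(w^2))

t1 <- drop(Xs %*% w)  # scores on the first PLS component

round(w, 2)           # larger absolute weight = stronger link to the response
cov(t1, yc)           # the covariance that this direction maximises
```

A full PLS fit with several components, cross-validation and VIP scores would normally come from a dedicated package rather than from hand-written code like this.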

You could also use the orthogonal PLS components to model 0/1 in a subsequent analysis, using logistic regression or linear discriminant analysis. I would suppose you are interested in getting predictions of 0/1 and an estimate of the misclassification error rate.
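A minimal sketch of that two-step idea, again on simulated and entirely hypothetical data: compute one PLS component by hand (the unit vector proportional to X'y), then use it as the only predictor in a logistic regression. The 0.5 cut-off and the single component are assumptions made for illustration.

```r
set.seed(2)
n <- 80

# Hypothetical correlated predictors and a noisy 0/1 response
z <- rnorm(n)
X <- cbind(var1 = z + rnorm(n, sd = 0.5),
           var2 = z + rnorm(n, sd = 0.5),
           var3 = rnorm(n))
y <- rbinom(n, 1, plogis(1.5 * z))

# Step 1: first PLS component of the scaled predictors
Xs <- scale(X)
w  <- drop(crossprod(Xs, y - mean(y)))
w  <- w / sqrt(sum(w^2))
comp1 <- drop(Xs %*% w)

# Step 2: logistic regression of the 0/1 response on that component
fit  <- glm(y ~ comp1, family = binomial)
pred <- as.integer(fitted(fit) > 0.5)   # classify at an assumed 0.5 cut-off

table(observed = y, predicted = pred)
err <- mean(pred != y)                  # in-sample misclassification rate
err
```

The error rate here is computed on the same data used for fitting, so it is optimistic. Cross-validation, with the PLS step repeated inside each fold, would give a more honest estimate.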

See these and other pubs:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.4538&rep=rep1&type=pdf
http://cedric.cnam.fr/fichiers/RC906.pdf

PLS-LDA can be performed in R with the plsgenomics package:

http://cran.r-project.org/web/packages/plsgenomics/plsgenomics.pdf

I believe PLS-glm can be performed with this package:

http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf


HTH.

