

Hi R users,
I am trying to reduce my independent variables before I run models. My dependent variable is presence-only (TRUE only, with no absence or FALSE records), and I have more than 20 independent variables that are highly correlated. I read that PCA can be used for feature selection.
However, the PCA feature-selection approach I found uses the dependent variable (in a linear model) together with the independent variables to select variables based on the variation explained. In my case the dependent variable is always "1", so I could not run it.
Could you suggest how I can reduce the variables to a smaller number? I have attached a sample of the data. In this data set, the dependent variable is "sp" and the other 20 variables are the independent variables.
dat <- structure(list(sp = c(1L, 1L, 1L, 1L, 1L), var1 = c(32L, 222L,
134L, 114L, 121L), var2 = c(188L, 175L, 167L, 166L, 167L), var3 = c(123L,
129L, 136L, 138L, 137L), var4 = c(40L, 35L, 37L, 38L, 37L), var5 = c(6756L,
8080L, 7856L, 7899L, 7891L), var6 = c(334L, 352L, 341L, 340L,
341L), var7 = c(29L, 9L, 18L, 22L, 20L), var8 = c(305L, 361L,
359L, 362L, 361L), var9 = c(108L, 217L, 167L, 166L, 166L), var10 = c(237L,
67L, 61L, 59L, 60L), var11 = c(270L, 276L, 265L, 264L, 264L),
var12 = c(97L, 67L, 61L, 59L, 60L), var13 = c(1491L, 916L,
1245L, 1282L, 1250L), var14 = c(168L, 127L, 154L, 155L, 154L
), var15 = c(99L, 43L, 67L, 70L, 68L), var16 = c(15L, 32L,
22L, 21L, 21L), var17 = c(432L, 313L, 390L, 400L, 392L),
var18 = c(308L, 148L, 254L, 269L, 257L), var19 = c(332L,
213L, 269L, 277L, 271L), var20 = c(430L, 148L, 254L, 269L,
257L)), .Names = c("sp", "var1", "var2", "var3", "var4",
"var5", "var6", "var7", "var8", "var9", "var10", "var11", "var12",
"var13", "var14", "var15", "var16", "var17", "var18", "var19",
"var20"), class = "data.frame", row.names = c(NA, 5L))
thanks
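
Editorial sketch, not part of the original message: ordinary PCA is unsupervised, so it can be run on the predictors alone without the "sp" column. The example below uses a subset of the columns posted above; the 0.9 correlation cutoff and the greedy filter are illustrative assumptions, and with only five rows the numbers are not meaningful in themselves.

```r
# Subset of the predictors from the post above (values copied from `dat`);
# the response `sp` is deliberately left out because PCA does not use it.
X <- data.frame(
  var1  = c(32, 222, 134, 114, 121),
  var5  = c(6756, 8080, 7856, 7899, 7891),
  var10 = c(237, 67, 61, 59, 60),
  var12 = c(97, 67, 61, 59, 60),
  var18 = c(308, 148, 254, 269, 257),
  var20 = c(430, 148, 254, 269, 257)
)

# PCA on centred, unit-variance predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how strongly each original variable contributes to the
# first two components.
print(round(pca$rotation[, 1:2], 2))

# Greedy correlation filter: keep a variable only if its absolute
# correlation with every variable kept so far is below the cutoff.
cutoff <- 0.9  # hypothetical threshold, chosen for illustration
R <- abs(cor(X))
keep <- character(0)
for (v in colnames(X)) {
  if (length(keep) == 0 || all(R[v, keep] < cutoff)) keep <- c(keep, v)
}
print(keep)
```

The component scores in `pca$x` could replace the original predictors in a later model, while `keep` is one simple way to retain original variables instead of linear combinations.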
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


OFF-TOPIC! This is a statistical question, not an R question. Post on a
statistics site like stats.stackexchange.com .
However, your post suggests that you are completely out of your depth
here (0/1 responses suggest that glm modeling via logistic regression
is called for). Remote internet advice is unlikely to fill the gap
between what you seem to need and what you seem to know. I strongly
suggest you find a local statistical expert to help if you wish to
avoid producing nonsense.
(Once you have figured out what you need to do, questions about how to
use R tools to do it are of course appropriate).
Cheers,
Bert
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll
On Sun, May 17, 2015 at 1:06 PM, Kristi Glover wrote:
> [...]


I am a little confused about your data. What could you possibly intend to model with a vector of only 1s? You say you have no absence/false data, but do you have any empty cells, or any values that are not 1? The usual assumption in species sampling is that a 0 does not necessarily indicate the absence of a particular taxon, only a lack of detection. That said, a lack of detection (or of presence) would still be coded as a zero.
Beyond that, an alternative to Bert's suggestion of logistic regression would be partial least-squares (PLS) regression using the 0/1 vector as the response. PLS is similar to PCA but with a different objective function: in this case you would be finding linear combinations of X that maximize the covariance between X and the 0/1 vector. You could use this PLS model directly and determine important variables from the magnitude of the weights, the standardized regression coefficients, and/or the variable importance in projection (VIP) scores:
http://www.researchgate.net/profile/Lars_Snipen/publication/233748490_A_review_of_variable_selection_methods_in_Partial_Least_Squares_Regression/links/0912f50bca8a9cc896000000.pdf
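
As a rough base-R sketch of the weight-based ranking just described (not taken from the linked review): for a single response, the first PLS weight vector is proportional to X'y once X and y are centred. The data below are simulated and the names x1, x2, x3 are hypothetical; a real analysis needs a response that contains both 0s and 1s.

```r
set.seed(1)  # simulated data, for illustration only

n <- 100
# Three hypothetical predictors: x2 is nearly a copy of x1, x3 is pure noise.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Hypothetical presence/absence response driven by x1.
y <- rbinom(n, size = 1, prob = plogis(2 * x1))

# Centre and scale X, centre y.
Xs <- scale(X)
yc <- y - mean(y)

# First PLS weight vector: the unit-length direction w that maximises
# cov(Xs %*% w, yc). For a single response it is proportional to t(Xs) %*% yc.
w <- crossprod(Xs, yc)
w <- w / sqrt(sum(w^2))

# Scores on the first PLS component (one value per observation).
t1 <- Xs %*% w

# Rank variables by absolute weight on the first component.
print(sort(abs(w[, 1]), decreasing = TRUE))
```

This covers only the first component. For further components, cross-validation, or VIP scores, a dedicated package such as pls (its `plsr()` function) would be the more usual route.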
You could also use the orthogonal PLS components to model 0/1 in a subsequent analysis, using logistic regression or linear discriminant analysis. I would suppose you are interested in getting predictions of 0/1 and an estimation of misclassification error rates.
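
A minimal sketch of that two-step idea, again on simulated data with hypothetical variable names: compute the first PLS component score, then fit a logistic regression on it with `glm()`. The error rate printed here is in-sample and therefore optimistic; an honest estimate would need cross-validation or a held-out set.

```r
set.seed(1)  # simulated data, for illustration only

n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # strongly correlated with x1
x3 <- rnorm(n)                  # unrelated noise
Xs <- scale(cbind(x1 = x1, x2 = x2, x3 = x3))
y  <- rbinom(n, size = 1, prob = plogis(2 * x1))

# First PLS weight vector (proportional to t(Xs) %*% centred y), unit length.
w <- crossprod(Xs, y - mean(y))
w <- w / sqrt(sum(w^2))

# One component score per observation, used as the only predictor.
scores <- data.frame(y = y, t1 = as.numeric(Xs %*% w))

# Logistic regression of the 0/1 response on the PLS score.
fit <- glm(y ~ t1, family = binomial, data = scores)
print(coef(fit))

# In-sample misclassification rate at a 0.5 probability threshold.
p_hat <- fitted(fit)
err <- mean((p_hat > 0.5) != (y == 1))
print(err)
```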
See these and other pubs:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.4538&rep=rep1&type=pdf
http://cedric.cnam.fr/fichiers/RC906.pdf
PLS-LDA can be performed in R with the plsgenomics package:
http://cran.r-project.org/web/packages/plsgenomics/plsgenomics.pdf
I believe PLS-glm can be performed with this package:
http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf
HTH.

