Selecting A List of Columns

Selecting A List of Columns

Sparks, John James
Dear R Helpers,

I need help with a slightly unusual situation in which I am trying to
select some columns from a data frame.  I know how to use the subset
statement with column names as in:


x <- as.data.frame(matrix(c(1,2,3,
        1,2,3,
        1,2,2,
        1,2,2,
        1,1,1), ncol=3, byrow=TRUE))

all.cols<-colnames(x)
to.keep<-all.cols[1:2]

Kept<-subset(x,select=to.keep)
Kept

However, if I want to select some columns based on a selection of the most
important variables from a random forest then I find myself stuck.  The
example below demonstrates the problem.


library(randomForest)

data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars,importance=TRUE)
Importance<-data.frame(mtcars.rf$importance)
Importance



MSEImportance<-head(Importance[order(Importance$X.IncMSE,
decreasing=TRUE),],3)
MSEVars<-row.names(MSEImportance)
MSEVars<-data.frame(MSEVars,stringsAsFactors = FALSE)
colnames(MSEVars)<-"Vars"

NodeImportance<-head(Importance[order(Importance$IncNodePurity,decreasing=TRUE),],
3)
NodeVars<-row.names(NodeImportance)
NodeVars<-data.frame(NodeVars,stringsAsFactors = FALSE)
colnames(NodeVars)<-"Vars"


ImportantVars<-rbind(MSEVars,NodeVars)
ImportantVars<-unique(ImportantVars)
nrow(ImportantVars)
ImportantVars<-as.character(ImportantVars)
ImportantVars
CarsVarsKept<-subset(mtcars,select=ImportantVars)
Error in `[.data.frame`(x, r, vars, drop = drop) :
  undefined columns selected

Any help on how to select these columns from the data frame would be most
appreciated.

--John J. Sparks, Ph.D.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Selecting A List of Columns

Pascal Oettli-2
Hello,

It works for me if I replace

> ImportantVars <- as.character(ImportantVars)

by

> ImportantVars <- ImportantVars$Vars

Hope this helps,
Pascal
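
To see why that replacement works, here is a small sketch. It uses a hand-built one-column data frame as a stand-in for ImportantVars (the three names in it are chosen for illustration, not taken from an actual random forest run). The point it shows: as.character() applied to a data frame converts each column into a single deparsed string, so it does not return the values stored in the column.

```r
# Stand-in for ImportantVars just before the as.character() call:
# a one-column data frame of variable names (values chosen for illustration).
ImportantVars <- data.frame(Vars = c("disp", "wt", "hp"),
                            stringsAsFactors = FALSE)

# as.character() on a data frame yields one string per *column*,
# so the result has length 1 here, not length 3.
bad <- as.character(ImportantVars)
length(bad)
bad %in% names(mtcars)   # no column of mtcars has that name

# Extracting the column gives the character vector of names.
good <- ImportantVars$Vars
length(good)

CarsVarsKept <- subset(mtcars, select = good)
names(CarsVarsKept)
```

Printing bad at the console makes the problem visible: it is one long string that merely looks like a call to c().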




2013/5/17 Sparks, John James <[hidden email]>



Re: Selecting A List of Columns

Peter Dalgaard-2
In reply to this post by Sparks, John James

On May 17, 2013, at 08:51 , Sparks, John James wrote:

> Dear R Helpers,
>
> I need help with a slightly unusual situation in which I am trying to
> select some columns from a data frame.  I know how to use the subset
> statement with column names as in:

Notice that subset() is a convenience function for command line use. The non-standard evaluation tricks in it tend to become inconveniences if you try to use subset() in a function (I can say that, I wrote the blasted thing...). Just use normal subsetting functions instead and everything behaves much more predictably. If ImportantVars is a vector of column names, use

mtcars[ImportantVars]

(or mtcars[,ImportantVars], which also works for matrices).
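
A brief sketch of the two indexing forms, plus one way the non-standard evaluation in subset() can surprise. The vector names keep and wt below are made up for the illustration; wt is deliberately given the same name as a column of mtcars.

```r
keep <- c("mpg", "wt")

# List-style indexing: selects columns, always returns a data frame.
a <- mtcars[keep]

# Matrix-style indexing: selects the same columns,
# and the same syntax also works on a matrix.
b <- mtcars[, keep]
m <- as.matrix(mtcars)[, keep]

names(a)
names(b)
colnames(m)

# subset() evaluates 'select' with the column names in scope first,
# so a variable that shares its name with a column is shadowed.
wt <- c("mpg", "cyl")                # a vector of names that is itself called "wt"
names(subset(mtcars, select = wt))   # selects the column wt
names(mtcars[wt])                    # selects mpg and cyl
```

With plain bracket indexing the variable is looked up by the ordinary scoping rules, which is why it behaves the same at the command line and inside a function.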



--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]


Re: Selecting A List of Columns

Peter Dalgaard-2

On May 17, 2013, at 12:02 , peter dalgaard wrote:

> Notice that subset() is a convenience function for command line use. The non-standard evaluation tricks in it tend to become inconveniences if you try to use subset() in a function (I can say that, I wrote the blasted thing...). Just use normal subsetting functions instead and everything behaves much more predictably. If ImportantVars is a vector of column names, use
>
> mtcars[ImportantVars]
>
> (or mtcars[,ImportantVars], which also works for matrices).
>


Oops, Pascal has the right end of the stick. The above is correct, but the "If" is important: you need a character vector of names, and that's not what is in ImportantVars once as.character() has been applied to the data frame.
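
One way to avoid the problem altogether is to keep the names as a plain character vector throughout, instead of wrapping them in data frames. The sketch below assumes an importance table laid out like data.frame(mtcars.rf$importance) in the original post; the numbers are made up so that the example runs without the randomForest package, and top_vars() is a hypothetical helper, not part of any package.

```r
# Hypothetical importance table: made-up numbers, with the same column
# names as data.frame(mtcars.rf$importance) in the original post.
Importance <- data.frame(
  X.IncMSE      = c(7.1, 9.8, 6.9, 1.2, 9.5, 1.9),
  IncNodePurity = c(190, 250, 200, 60, 260, 35),
  row.names     = c("cyl", "disp", "hp", "drat", "wt", "qsec")
)

# Hypothetical helper: names of the n rows with the largest values
# in the given column, returned as a character vector.
top_vars <- function(imp, column, n = 3) {
  head(rownames(imp)[order(imp[[column]], decreasing = TRUE)], n)
}

MSEVars  <- top_vars(Importance, "X.IncMSE")
NodeVars <- top_vars(Importance, "IncNodePurity")

# union() removes duplicates and returns a plain character vector,
# which is what column selection needs.
ImportantVars <- union(MSEVars, NodeVars)
is.character(ImportantVars)

CarsVarsKept <- mtcars[ImportantVars]
names(CarsVarsKept)
```

With a real fit, Importance would come from the model object as in the original post; the rest stays the same.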

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]
