sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present


Dario Strbenac-2
Good day,

Sometimes, sparse.model.matrix outputs a dgCMatrix whose column names contain factor levels that were not in the original dataset. The first factor appears to be transformed correctly, but the subsequent factors are not. For example:

> library(Matrix)
> diamonds <- as.data.frame(ggplot2::diamonds)
> colnames(sparse.model.matrix(~ . -1, diamonds))
 [1] "carat"        "cutFair"      "cutGood"      "cutVery Good" "cutPremium"   "cutIdeal"     "color.L"      "color.Q"      "color.C"      "color^4"      "color^5"    
[12] "color^6"      "clarity.L"    "clarity.Q"    "clarity.C"    "clarity^4"    "clarity^5"    "clarity^6"    "clarity^7"    "depth"        "table"        "price"      
[23] "x"            "y"            "z"

In the transformed matrix, the column names for color and clarity are not suffixed with the levels of those factors, and the values in those columns also appear to be wrong. Converting the Ord.factor columns into plain factors fixes the problem.

> diamonds[, "cut"] <- factor(as.character(diamonds[, "cut"]))
> diamonds[, "color"] <- factor(as.character(diamonds[, "color"]))
> diamonds[, "clarity"] <- factor(as.character(diamonds[, "clarity"]))

> colnames(sparse.model.matrix(~ . -1, diamonds)) # No more invented factor levels.
 [1] "carat"        "cutFair"      "cutGood"      "cutIdeal"     "cutPremium"   "cutVery Good" "colorE"       "colorF"       "colorG"       "colorH"      
[11] "colorI"       "colorJ"       "clarityIF"    "claritySI1"   "claritySI2"   "clarityVS1"   "clarityVS2"   "clarityVVS1"  "clarityVVS2"  "depth"      
[21] "table"        "price"        "x"            "y"            "z"

Can it be made to work correctly for both plain and ordered factors?

> sessionInfo()
R Under development (unstable) (2018-02-06 r74231)
Platform: i386-w64-mingw32/i386 (32-bit)

other attached packages:
[1] Matrix_1.2-12

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.5.0     compiler_3.5.0   lazyeval_0.2.1  
 [5] plyr_1.8.4       pillar_1.1.0     gtable_0.2.0     tibble_1.4.2    
 [9] Rcpp_0.12.15     ggplot2_2.2.1    grid_3.5.0       rlang_0.1.6    
[13] munsell_0.4.3    lattice_0.20-35

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present

bbolker

color and clarity are ordered factors, so sparse.model.matrix is
generating orthogonal-polynomial contrasts (see ?contr.poly). This is
by design ... what are you trying to do? Are you interested in fac2sparse?
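
A minimal sketch of the behaviour described above, using base R's model.matrix and a made-up ordered factor (the `size` variable is hypothetical, not part of the diamonds data). It assumes the default `contrasts` option is in effect, under which ordered factors are coded with contr.poly; the resulting columns are polynomial terms (.L, .Q, ...) rather than factor levels.

```r
# Hypothetical toy ordered factor standing in for diamonds$color.
size <- factor(c("small", "medium", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# With default options, unordered factors use contr.treatment and
# ordered factors use contr.poly.
getOption("contrasts")

# The contrast matrix applied to a 3-level ordered factor:
# one linear (.L) and one quadratic (.Q) column.
contr.poly(3)

# The design matrix therefore has polynomial columns, not level names.
mm <- model.matrix(~ size, data.frame(size = size))
colnames(mm)
```

The same coding rule is what produces names such as "color.L" and "clarity^7" in the output quoted in the original post.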


Re: sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present

Dario Strbenac-2
Good day,

The intention is to convert the dataset into a format suitable for the gradient boosting classifier implemented by the CRAN package xgboost. The input data must be transformed into one-hot format using the sparse.model.matrix function, as described in the package's vignette (https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html). I did not know to read the help page for contr.poly after reading the sparse.model.matrix help page. Perhaps a helpful mention could be added to it?
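
For readers with the same goal, here is one possible way to get indicator (one-hot) columns for every factor, ordered or not. It is a sketch on a hypothetical toy data frame (`toy` is made up to mimic a few diamonds columns), not a recommendation from the Matrix or xgboost authors. It rests on two assumptions: that `factor(x, ordered = FALSE)` is an acceptable way to drop the ordering (it keeps the level order, unlike the as.character() round trip above, which re-sorts levels alphabetically), and that passing full identity contrasts through `contrasts.arg` is wanted so that factors after the first also get one column per level.

```r
library(Matrix)

# Hypothetical toy data standing in for a few diamonds columns.
toy <- data.frame(
  carat = c(0.2, 0.3, 0.5, 0.7),
  cut   = factor(c("Fair", "Good", "Ideal", "Good"),
                 levels = c("Fair", "Good", "Ideal"), ordered = TRUE),
  color = factor(c("D", "E", "J", "D"),
                 levels = c("D", "E", "J"), ordered = TRUE)
)

# Drop the "ordered" class while keeping the existing level order.
is_ord <- vapply(toy, is.ordered, logical(1))
toy[is_ord] <- lapply(toy[is_ord], factor, ordered = FALSE)

# Request a full set of indicator columns (no reference level dropped)
# for every factor column.
is_fac <- vapply(toy, is.factor, logical(1))
full_dummies <- lapply(toy[is_fac], contrasts, contrasts = FALSE)

onehot <- sparse.model.matrix(~ . - 1, data = toy,
                              contrasts.arg = full_dummies)
colnames(onehot)
```

Without `contrasts.arg`, converting to plain factors alone gives treatment coding, so every factor after the first loses its first level, as in the second output of the original post (no "colorD" column).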

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia