On 09.02.2012 22:39, Yang Zhang wrote:
> I always bump into a few (very minor) problems when building model
> matrices with e.g.:
> train = model.matrix(label~., read.csv('train.csv'))
> target = model.matrix(label~., read.csv('target.csv'))
> (1) The two may have different factor levels, yielding different
> matrices. I usually first rbind the data frames together to "meld"
> the factors, and then split them apart and matrixify them.
You can preprocess the data and explicitly define the levels of the
factor variables in both data frames, so that both model matrices get
the same columns.
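A minimal sketch of what defining the levels explicitly could look like.
The data frames, the column name "colour" and the level set are made up
for illustration and stand in for whatever train.csv and target.csv
actually contain:

```r
# Hypothetical stand-ins for read.csv('train.csv') and read.csv('target.csv')
train_df  <- data.frame(label = c(1, 0, 1),
                        colour = c("red", "green", "red"),
                        stringsAsFactors = FALSE)
target_df <- data.frame(label = c(0, 0),
                        colour = c("blue", "green"),
                        stringsAsFactors = FALSE)

# One shared level set (assumed known in advance), applied to both frames
colour_levels <- c("blue", "green", "red")
train_df$colour  <- factor(train_df$colour,  levels = colour_levels)
target_df$colour <- factor(target_df$colour, levels = colour_levels)

train  <- model.matrix(label ~ ., train_df)
target <- model.matrix(label ~ ., target_df)

# Both matrices now have the same columns, even though "blue" never
# occurs in train_df and "red" never occurs in target_df
identical(colnames(train), colnames(target))
```

Levels that do not occur in one of the data sets simply give an all-zero
column there, which keeps the two matrices aligned.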
> (2) The target set that I'm predicting on typically doesn't have
> labels. I usually manually append dummy labels to the target data
R cannot know labels if you do not provide any.
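That said, if only the predictor columns are needed for the target set,
one possible workaround is to drop the response from the terms before
building the matrix, so no dummy labels have to be appended. A sketch,
again with made-up data and a hypothetical predictor "x":

```r
# Hypothetical training data with labels, target data without
train_df  <- data.frame(label = c(1, 0, 1), x = c(0.5, 1.2, 3.1))
target_df <- data.frame(x = c(2.0, 0.7))

# Expand "." against the training data, then remove the response
tt <- delete.response(terms(label ~ ., data = train_df))

# The design matrix for the target set no longer needs a label column
target <- model.matrix(tt, target_df)
colnames(target)
```

This only builds the design matrix; whether the downstream modelling
function accepts it depends on that function.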
> (3) I almost always remove the Intercept from the model matrices,
> since it seems to always be redundant (I usually use caret).
Then change your model formula to "label ~ . - 1". But note that the
interpretation of the remaining columns changes, and the intercept is
*not* redundant in general.
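A small illustration of the change in interpretation, using a made-up
data frame with one factor (the names are hypothetical):

```r
# Hypothetical data: one factor predictor with three levels
df <- data.frame(label = c(1, 0, 1, 0),
                 colour = factor(c("red", "green", "blue", "red")))

with_int <- model.matrix(label ~ ., df)
no_int   <- model.matrix(label ~ . - 1, df)

colnames(with_int)  # "(Intercept)" "colourgreen" "colourred"
colnames(no_int)    # "colourblue"  "colourgreen" "colourred"
```

With the intercept, "blue" is the baseline and the other columns are
contrasts against it. Without it, the first factor gets one indicator
column per level, so the column count is the same but the columns mean
something different.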
> None of these is a big deal at all, but I'm just curious if I'm
> missing something simple in how I'm doing things. Thanks.