I have two questions about factors in regression:

Q1.
In a table, there are a few categorical/factor variables, a few numerical
variables, and the response variable is numeric. Some factors are important
but others are not.
How can I determine which categorical variables are significant to the
response variable?

Q2.
As we know, lm can deal with categorical variables.
I thought that, when there is a categorical predictor, we may use lm directly
without quantifying the factor, and that assigning different values to the
factor levels would not change the fit, as shown:

    x <- 1:20                                         ## numeric predictor
    yes.no <- c("yes","no")
    factors <- gl(2,10,20,yes.no)                     ## factor predictor
    factors.quant <- rep(c(18.8,29.9),c(10,10))       ## quantification of factors
    factors.quant.1 <- rep(c(16.9,38.9),c(10,10))     ## second quantification of factors
    response <- 0.8*x + 18 + factors.quant + rnorm(20) ## response
    lm.quant <- lm(response ~ x + factors.quant)      ## lm with quantifications
    lm.fact <- lm(response ~ x + factors)             ## lm with factors
    lm.quant.1 <- lm(response ~ x + factors.quant.1)  ## lm with quantifications
    lm.fact.1 <- lm(response ~ x + factors)           ## lm with factors

    par(mfrow=c(2,2))                                 ## comparisons of two fittings
    plot(x, response)
    lines(x,fitted(lm.quant),col="blue")
    grid()
    plot(x,response)
    lines(x,fitted(lm.fact),col = "red")
    grid()
    plot(x, response)
    lines(x,fitted(lm.quant.1),lty =2,col="blue")
    grid()
    plot(x,response)
    lines(x,fitted(lm.fact.1),lty =2,col = "red")
    grid()
    par(mfrow = c(1,1))

So, is it right that we can assign any numeric values to factors,
for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
before doing lm, glm, aov, even nls?

Please drop a few lines and/or direct me to some references.

Thanks,
-james

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Aug 20, 2009, at 1:46 PM, [hidden email] wrote:

> Q1.
> In a table, there a few categorical/factor variables, a few numerical
> variables and the response variable is numeric. Some factors are
> important but others not.
> How to determine which categorical variables are significant to the
> response variable?

Seems that you should engage the services of a consulting statistician
for that sort of question. Or post in a venue where statistical
consulting is supposed to occur, such as one of the sci.stat.* newsgroups.

> Q2.
> As we knew, lm can deal with categorical variables.
> I thought, when there is a categorical predictor, we may use lm directly
> without quantifying these factors and assigning different values to
> factors would not change the fittings as shown:

The "numbers" that you are attempting to assign are really just labels
for the factor levels. The regression functions in R will not use them
for any calculations. They should not be thought of as having "values".
Even if the factor is an ordered factor, the labels may not be
interpretable as having the same numerical order as the string values
might suggest.

> factors.quant <- rep(c(18.8,29.9),c(10,10)) ##quantification of factors

Not sure what that is supposed to mean. It is not a factor object, even
though you may be misleading yourself into believing it should be.
It's a numeric vector.

    > str(factors.quant)
     num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...

The three fits from your code give:

    > lm.quant

    Call:
    lm(formula = response ~ x + factors.quant)

    Coefficients:
      (Intercept)              x  factors.quant
          14.9098         0.5385         1.2350

    > lm.fact

    Call:
    lm(formula = response ~ x + factors)

    Coefficients:
    (Intercept)            x    factorsno
        38.1286       0.5385      13.7090

    > lm.quant.1

    Call:
    lm(formula = response ~ x + factors.quant.1)

    Coefficients:
        (Intercept)                x  factors.quant.1
            27.5976           0.5385           0.6231

> So, is it right that we can assign any numeric values to factors,
> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
> before doing lm, glm, aov, even nls?

You can give factor levels any name you like, including any sequence
of digit characters. Unlike "ordinary" R, where unquoted numbers cannot
start variable names, factor functions will coerce numeric vectors to
character vectors when assigning level names. But you seem to be
conflating factors with numeric vectors that have many ties. Those two
entities would have different handling by R's regression functions.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
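The difference between a factor and a numeric vector with many ties can be checked directly in R. The following is only a minimal sketch that reuses the toy objects from the original post; the name f.num is introduced here for illustration and does not appear in the thread.

```r
# Toy objects copied from the original post
yes.no <- c("yes", "no")
factors <- gl(2, 10, 20, labels = yes.no)        # a real factor with two levels
factors.quant <- rep(c(18.8, 29.9), c(10, 10))   # a plain numeric vector with ties

is.factor(factors)         # TRUE
is.factor(factors.quant)   # FALSE
is.numeric(factors.quant)  # TRUE

# f.num (hypothetical name): numbers passed to factor() become character labels
f.num <- factor(factors.quant)
levels(f.num)              # "18.8" "29.9"

# Columns that lm() would build for each kind of predictor
colnames(model.matrix(~ factors))        # "(Intercept)" "factorsno"
colnames(model.matrix(~ factors.quant))  # "(Intercept)" "factors.quant"
```

The factor enters the design matrix as a 0/1 indicator column for the level "no", while the numeric vector enters with its actual values 18.8 and 29.9.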
Thanks!

> On Aug 20, 2009, at 1:46 PM, [hidden email] wrote:
>
> Seems that you should engage the services of a consulting statistician
> for that sort of question. Or post in a venue where statistical
> consulting is supposed to occur, such as one of the sci.stat.*
> newsgroups.

I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
Are they good?
I have no idea how to do this, so any clue will be appreciated.

> Not sure what that is supposed to mean. It is not a factor object even
> though you may be misleading yourself in to believing it should be.
> It's a numeric vector.

Yes, levels are not numeric but just labels. But after the factor levels
were assigned numeric values as factors.quant and factors.quant.1,
lm(response ~ x + factors.quant) and lm(response ~ x + factors.quant.1)
produced the same fitted curve as lm(response ~ x + factors). This is
what I could not understand.
On Aug 20, 2009, at 3:42 PM, [hidden email] wrote:

> I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
> Are they good?

The quality of responses varies. You may get what you pay for. On the
other hand, sometimes you get high-quality advice for free.

> I have no idea to do this. So any clue will be appreciated.

http://groups.google.com/?hl=en

> Yes, levels are not numeric but just labels. But
> after the levels factors being assigned to numeric values as
> factors.quant and factors.quant.1,
> lm(response ~ x + factors.quant) and lm(response ~ x + factors.quant1)
> produced the same fitted curve as lm(response ~ x + factors). This
> is what I could not understand.

In both the factor variable case and the numeric variable case there was
no variation in the predictor variable within a level. So the
predictions will all be the same within levels in each case. There will
be differences in the coefficients arrived at to achieve that result,
however.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
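The point about identical predictions but different coefficients can be made concrete. With only two levels, any numeric coding is an affine transformation of the 0/1 indicator that lm() builds for the factor, so the fitted values coincide and the numeric slope times the gap between the two codes equals the factor coefficient. The sketch below reruns the original code under an assumed seed (set.seed(1) is not in the thread, so the numbers will differ from those posted).

```r
set.seed(1)  # assumption: the original post did not set a seed
x <- 1:20
factors <- gl(2, 10, 20, labels = c("yes", "no"))
factors.quant   <- rep(c(18.8, 29.9), c(10, 10))
factors.quant.1 <- rep(c(16.9, 38.9), c(10, 10))
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

lm.fact    <- lm(response ~ x + factors)
lm.quant   <- lm(response ~ x + factors.quant)
lm.quant.1 <- lm(response ~ x + factors.quant.1)

# Fitted values agree across all three parameterisations
all.equal(fitted(lm.fact), fitted(lm.quant))
all.equal(fitted(lm.fact), fitted(lm.quant.1))

# Slope times the gap between the two codes reproduces the factor coefficient
coef(lm.quant)[["factors.quant"]]     * (29.9 - 18.8)
coef(lm.quant.1)[["factors.quant.1"]] * (38.9 - 16.9)
coef(lm.fact)[["factorsno"]]
```

The same relation holds in the output posted earlier in the thread: 1.2350 * 11.1 and 0.6231 * 22 both come to about 13.71, the reported factorsno coefficient.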
> On Aug 20, 2009, at 3:42 PM, [hidden email] wrote:
>
> In for the factor variable case and the numeric variable case there
> was no variation in the predictor variable within a level. So the
> predictions will all be the same within levels in each case. There
> will be differences in the coefficients arrived at to achieve that
> result, however.

I even tried

    > cor(response, factors)
    [1] 0.968241
    > cor(response, factors.quant)
    [1] 0.968241
    > cor(response, factors.quant.1)
    [1] 0.968241

If assigning values to factors does not change the curve fitting, one
may use factors.quant for the regression analysis if one wants to find
the curve patterns.
The coefficients are different since the models use different predictors.
If they are the same, then the curves fitted are different.

Can I rank factors.1 and factors.2 using
cor(response, factors.1) and cor(response, factors.2)?

Thanks,
On Aug 20, 2009, at 4:07 PM, [hidden email] wrote:

> I even tried
>
>> cor(response, factors)
> [1] 0.968241
>> cor(response, factors.quant)
> [1] 0.968241
>> cor(response, factors.quant.1)
> [1] 0.968241
>
> If assigning values to factors does not change curve-fitting,
> one may use factors.quant to do regression analysis if he wants to
> find the curve patterns.
> The coefficients are different since they use different predictors.
> If they are the same, then the curves fitted are different.

Try setting up with 3 factor levels and three discrete values for the
numeric predictor. The cor() function will continue to give meaningful
results for the numeric variable but not for the factor variable. The
interpretation of the coefficients from a model with three-level
factors may require further study on your part.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
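The three-level experiment suggested above could look like the following. This is a sketch under stated assumptions: the data, the seed, the group effects (0, 10, 12) and the names f3, codes.a and codes.b are all invented for illustration. A three-level factor gets two indicator coefficients, whereas a numeric coding gets a single slope and so forces the group effects to be proportional to the chosen codes.

```r
set.seed(2)  # assumption: arbitrary seed for reproducibility
x  <- 1:30
f3 <- gl(3, 10, 30, labels = c("low", "mid", "high"))  # hypothetical 3-level factor

# Hypothetical group effects that are not evenly spaced
effect   <- c(0, 10, 12)[as.integer(f3)]
response <- 0.8 * x + 18 + effect + rnorm(30)

# Two arbitrary numeric codings of the same three levels
codes.a <- c(1, 2, 3)[as.integer(f3)]
codes.b <- c(1, 5, 6)[as.integer(f3)]

fit.fact <- lm(response ~ x + f3)       # two indicator columns for f3
fit.a    <- lm(response ~ x + codes.a)  # one slope for the codes
fit.b    <- lm(response ~ x + codes.b)

length(coef(fit.fact))  # 4 coefficients
length(coef(fit.a))     # 3 coefficients

# With three levels the fits no longer coincide
isTRUE(all.equal(fitted(fit.fact), fitted(fit.a)))
isTRUE(all.equal(fitted(fit.a), fitted(fit.b)))

# The correlation now depends on the arbitrary codes chosen
cor(response, codes.a)
cor(response, codes.b)
```

Unlike the two-level case, the choice of numeric codes here changes both the fitted values and the correlation, which is why such codes should not stand in for a factor.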
Yes, when the number of levels is greater than or equal to 3.

> On Aug 20, 2009, at 4:07 PM, [hidden email] wrote:
>
> Try setting up with 3 factor levels and three discrete values for the
> numeric predictor. the cor() function will continue to give meaningful
> results for the numeric variable but not for the factor variable. The
> interpretation of the coefficients from a model with three level
> factors may require further study on your part.

That is not true.

Thanks,
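The question of which factors matter (Q1, and the later question about ranking factors with cor()) is left open in the thread. One common approach, offered here only as a hedged sketch and not as the participants' recommendation, is to fit all predictors together and test each term with an F test via drop1(). The data frame dat, its columns and the seed below are invented for illustration.

```r
set.seed(3)  # assumption: arbitrary seed
n <- 60
dat <- data.frame(
  x  = runif(n, 0, 10),
  f1 = gl(3, 20, n, labels = c("a", "b", "c")),         # hypothetical factor with a real effect
  f2 = factor(sample(c("u", "v"), n, replace = TRUE))   # hypothetical factor with no effect
)
dat$response <- 2 * dat$x + c(0, 5, 10)[as.integer(dat$f1)] + rnorm(n)

fit <- lm(response ~ x + f1 + f2, data = dat)

# F test for removing each term in turn, given the other terms
tab <- drop1(fit, test = "F")
tab
```

Each row of the table tests a whole factor at once, whatever its number of levels, which a correlation with arbitrary numeric codes cannot do. Whether a p-value ranking is appropriate for a given analysis is a statistical judgment beyond this sketch.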