Creating Dummy Variables in R

whitaker m. (mw1006)
Hi,
I am trying to create a set of dummy variables to use within a multiple linear regression, and I am unable to find the code for this in the manuals.

For example, I have:

                    Clarity
Price   Weight   IF   VVS1   VVS2
  500     8       1     0      0
 1000     5.2     0     0      1
  864     3       0     1      0
  340     2.6     0     0      1
   90     0.5     1     0      0
  450     2.3     0     1      0

Here price depends on weight (a single value in each observation) and clarity (split into three levels: IF, VVS1, VVS2).
I am having trouble telling the program that clarity is a set of 3 dummy variables and keep getting error messages. What is the correct way to do this?

Any help is greatly appreciated.
Matthew

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Odp: Creating Dummy Variables in R

PIKAL Petr
Hi

[hidden email] wrote on 16.12.2009 15:58:56:

> I am having trouble telling the program that clarity is a set of 3 dummy
> variables and keep getting error messages, what is the correct way?

Well, try to bribe it. Or ask it what would please it enough to break its resistance.

Seriously: what is the structure of your data in R? See

?str

What commands did you use for the regression?

I suppose

lm(Price~Weight+IF+VVS1+VVS2, data=your.data)

shall not complain if your.data is a data frame.

Regards
Petr
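
A minimal sketch of the check suggested above, assuming the six rows from the original message have been typed in as a data frame. The name your.data follows the post; the column layout is an assumption about how the poster stored the table.

```r
# Assumed layout: one row per diamond, one 0/1 column per clarity level
your.data <- data.frame(
  Price  = c(500, 1000, 864, 340, 90, 450),
  Weight = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  IF     = c(1, 0, 0, 0, 1, 0),
  VVS1   = c(0, 0, 1, 0, 0, 1),
  VVS2   = c(0, 1, 0, 1, 0, 0)
)

# str() prints the object type, its dimensions and the class of each column
str(your.data)

# The same information as a named vector, handy for a quick programmatic check
col_classes <- sapply(your.data, class)
print(col_classes)
```

If str() reports something other than a data frame (for example a character string), lm() has nothing to look the variables up in.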




Re: Creating Dummy Variables in R

Stephan Devriese
In reply to this post by whitaker m. (mw1006)
On 12/16/2009 03:58 PM, whitaker m. (mw1006) wrote:

> I am having trouble telling the program that clarity is a set of 3 dummy variables and keep getting error messages, what is the correct way?

Without an example of your code, it's a bit difficult. But it might be
easier to use one variable "clarity" with three possible values (IF,
VVS1, VVS2), defined as a factor.
lm(Price ~ Weight + Clarity) should then do the trick (unless you
explicitly want to use a different dummy coding than the default)

Stephan
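
One way to get from three 0/1 columns to the single factor described above can be sketched as follows. The data frame name wide and the helper vector dummy_cols are made up for illustration; the values are the six rows from the original message.

```r
# Hypothetical data frame holding the indicator columns from the question
wide <- data.frame(
  Price  = c(500, 1000, 864, 340, 90, 450),
  Weight = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  IF     = c(1, 0, 0, 0, 1, 0),
  VVS1   = c(0, 0, 1, 0, 0, 1),
  VVS2   = c(0, 1, 0, 1, 0, 0)
)

dummy_cols <- c("IF", "VVS1", "VVS2")

# max.col() gives, for each row, the position of the largest entry,
# i.e. the column that holds the 1
which_level <- max.col(as.matrix(wide[dummy_cols]), ties.method = "first")
wide$Clarity <- factor(dummy_cols[which_level], levels = dummy_cols)

# lm() now builds the dummy coding itself from the factor
fit <- lm(Price ~ Weight + Clarity, data = wide)
print(coef(fit))
```

This assumes every row has exactly one indicator set to 1; rows with none or several set would need to be checked before collapsing.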


Re: Creating Dummy Variables in R

Achim Zeileis
In reply to this post by whitaker m. (mw1006)
On Wed, 16 Dec 2009, whitaker m. (mw1006) wrote:

> I am having trouble telling the program that clarity is a set of 3 dummy variables and keep getting error messages, what is the correct way?

You should code the categorical variable "Clarity" as a "factor" so that R
knows that this is a categorical variable and can deal with it
appropriately in subsequent computations such as summary() or lm().

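Thus, I would recommend storing your data as

# factor() makes the coding explicit: since R 4.0.0, data.frame() no
# longer converts character vectors to factors by default
dat <- data.frame(
   Price = c(500, 1000, 864, 340, 90, 450),
   Weight = c(8, 5.2, 3, 2.6, 0.5, 2.3),
   Clarity = factor(c("IF", "VVS1", "VVS2")[c(1, 3, 2, 3, 1, 2)]))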

which yields, e.g.,

R> summary(dat)
      Price            Weight      Clarity
  Min.   :  90.0   Min.   :0.500   IF  :2
  1st Qu.: 367.5   1st Qu.:2.375   VVS1:2
  Median : 475.0   Median :2.800   VVS2:2
  Mean   : 540.7   Mean   :3.600
  3rd Qu.: 773.0   3rd Qu.:4.650
  Max.   :1000.0   Max.   :8.000

and then you can also do

R> lm(Price ~ Weight + Clarity, data = dat)

Call:
lm(formula = Price ~ Weight + Clarity, data = dat)

Coefficients:
(Intercept)       Weight  ClarityVVS1  ClarityVVS2
      -45.05        80.01       490.02       403.00

or if you wish to choose a different coding

R> lm(Price ~ 0 + Weight + Clarity, data = dat)

Call:
lm(formula = Price ~ 0 + Weight + Clarity, data = dat)

Coefficients:
      Weight    ClarityIF  ClarityVVS1  ClarityVVS2
       80.01       -45.05       444.97       357.95


Some further reading of introductory material on linear regression in R
would be useful. Also look at ?lm, ?factor, ?model.matrix, ?contrasts etc.

hth,
Z
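
To see the dummy columns that lm() builds behind the scenes, model.matrix() can be called on the same formula. This is a small sketch using the six observations from the thread; the object name X is arbitrary.

```r
dat <- data.frame(
  Price   = c(500, 1000, 864, 340, 90, 450),
  Weight  = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  Clarity = factor(c("IF", "VVS2", "VVS1", "VVS2", "IF", "VVS1"))
)

# Design matrix for the right-hand side of Price ~ Weight + Clarity.
# With the default treatment contrasts the first level (IF) is the
# reference, so only VVS1 and VVS2 get indicator columns.
X <- model.matrix(~ Weight + Clarity, data = dat)
print(X)

# Dropping the intercept gives one indicator column per level instead
X0 <- model.matrix(~ 0 + Weight + Clarity, data = dat)
print(colnames(X0))
```

The column names of X match the coefficient names printed by lm() in the message above.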


Re: Odp: Creating Dummy Variables in R

Nikhil Kaza-2
In reply to this post by PIKAL Petr
I don't think R will complain if you use the approach below. However,
together with the intercept, IF, VVS1 and VVS2 are linearly dependent.
It is better to use the factor approach and choose which level serves
as the reference.

Nikhil

On 16 Dec 2009, at 10:12AM, Petr PIKAL wrote:

> What commands did you use for the regression?
>
> I suppose
>
> lm(Price~Weight+IF+VVS1+VVS2, data=your.data)
>
> shall not complain if your.data is a data frame.
>
> Regards
> Petr
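
The linear dependence mentioned above can be checked directly. The sketch below assumes the indicator columns are stored as in the original message (the data frame name wide is made up): the three dummies add up to 1 in every row, which duplicates the intercept column, so lm() reports NA for one of them.

```r
wide <- data.frame(
  Price  = c(500, 1000, 864, 340, 90, 450),
  Weight = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  IF     = c(1, 0, 0, 0, 1, 0),
  VVS1   = c(0, 0, 1, 0, 0, 1),
  VVS2   = c(0, 1, 0, 1, 0, 0)
)

# Each diamond has exactly one clarity level, so the dummies sum to 1
row_sums <- with(wide, IF + VVS1 + VVS2)
print(row_sums)

# All k = 3 dummies plus an intercept: one coefficient cannot be estimated
full <- lm(Price ~ Weight + IF + VVS1 + VVS2, data = wide)
print(coef(full))

# Leaving out one dummy (here IF, which becomes the reference level)
# removes the redundancy without changing the fitted values
reduced <- lm(Price ~ Weight + VVS1 + VVS2, data = wide)
print(coef(reduced))
```

The reduced model is the same fit that the factor approach produces with IF as the reference level.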


Re: Creating Dummy Variables in R

Tom Fletcher
In reply to this post by whitaker m. (mw1006)
Is your variable Clarity a categorical variable with 4 levels, hence the
need for k-1 (3) dummies? Your error may be the result of creating k
instead of k-1 dummies, but I can't be sure from the example.

In R, you don't have to (unless you really want to) explicitly create
separate variables. You can use the built-in contrast functions.

See

?contr.treatment

which is dummy coding and is R's default for unordered factors. You can
specify which group is the reference group.

Alternatively, if you prefer effects coding, see

?contr.sum

There are others as well.

Tom Fletcher
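
The contrast functions named above can be inspected on their own. A short sketch, assuming a three-level clarity factor like the one in the original message (the variable names are illustrative):

```r
clarity <- factor(c("IF", "VVS2", "VVS1", "VVS2", "IF", "VVS1"),
                  levels = c("IF", "VVS1", "VVS2"))
lv <- levels(clarity)

# Dummy (treatment) coding: k - 1 columns, the first level is the reference
treat_default <- contr.treatment(lv)
print(treat_default)

# Same coding with the third level (VVS2) as the reference group
treat_vvs2 <- contr.treatment(lv, base = 3)
print(treat_vvs2)

# relevel() changes the reference by reordering the factor levels instead
clarity_ref <- relevel(clarity, ref = "VVS2")
print(levels(clarity_ref))

# Effects (sum-to-zero) coding
effects_coding <- contr.sum(lv)
print(effects_coding)
```

In each matrix the rows are the factor levels and the columns are the variables that enter the regression; the reference level is the row of zeros in the treatment codings.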



-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
On Behalf Of whitaker m. (mw1006)
Sent: Wednesday, December 16, 2009 8:59 AM
To: [hidden email]
Subject: [R] Creating Dummy Variables in R

I am having trouble telling the program that clarity is a set of 3 dummy
variables and keep getting error messages, what is the correct way?

Re: Odp: Creating Dummy Variables in R

whitaker m. (mw1006)
In reply to this post by Nikhil Kaza-2
I have a much larger dataset than in my original email (attached): price depends upon weight, Clarity (different levels IF-SI2), colour (levels D-L) and Cut (ideal-fair). I tried the regression command:

>diamond.lm<-lm(price~weight+IF+VVS1+VVS2+VS1+VS2+SI1+SI2+I1+I2+D+E+F+G+H+I+J+K+L+ideal+excellent+very.good+good+fair, data="Diamonds2.txt")

Error in eval(predvars, data, env) : invalid 'envir' argument

which led to the error message shown below the command. I have tried searching for this, and assumed it was down to having categorical variables within the data. Is this a correct assumption, or am I doing something else wrong? Apologies if this is a bit of a basic question!

Thanks again,
Matthew
________________________________________
From: Nikhil Kaza [[hidden email]]
Sent: Wednesday, December 16, 2009 4:14 PM
To: Petr PIKAL
Cc: whitaker m. (mw1006); [hidden email]
Subject: Re: [R] Odp:  Creating Dummy Variables in R

I don't think R will complain, if you use the approach below. However,
IF, VVS1 and VVS2 are linearly dependent.
Better use the factor approach and define which factor should be the
contrast

Nikhil


Diamonds2.txt (25K) Download Attachment

Re: Odp: Creating Dummy Variables in R

Nordlund, Dan (DSHS/RDA)
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of whitaker m. (mw1006)
> Sent: Wednesday, December 16, 2009 2:14 PM
> To: Nikhil Kaza; Petr PIKAL
> Cc: [hidden email]
> Subject: Re: [R] Odp: Creating Dummy Variables in R
>
> >diamond.lm<-
> lm(price~weight+IF+VVS1+VVS2+VS1+VS2+SI1+SI2+I1+I2+D+E+F+G+H+I+J+K
> +L+ideal+excellent+very.good+good+fair, data="Diamonds2.txt")
>
> Error in eval(predvars, data, env) : invalid 'envir' argument

You need to read your data from Diamonds2.txt into a data frame before running the lm() function.  What does your file Diamonds2.txt look like?

Dan
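
A minimal sketch of the read-then-fit sequence described above. Since the contents of Diamonds2.txt are not shown in the thread, the example writes the six rows from the first message to a temporary file and assumes a whitespace-separated layout with a header line; the real file may need different read.table() arguments.

```r
# Stand-in for the poster's file: six rows written to a temporary text file
sample_rows <- data.frame(
  Price   = c(500, 1000, 864, 340, 90, 450),
  Weight  = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  Clarity = c("IF", "VVS2", "VVS1", "VVS2", "IF", "VVS1")
)
path <- tempfile(fileext = ".txt")
write.table(sample_rows, path, row.names = FALSE, quote = FALSE)

# Passing the file name as 'data' fails, even with no dummy variables
# in the formula at all
bad <- tryCatch(lm(Price ~ Weight, data = path), error = function(e) e)
print(conditionMessage(bad))

# Reading the file first gives lm() a data frame to work with
Diamonds <- read.table(path, header = TRUE, stringsAsFactors = TRUE)
fit <- lm(Price ~ Weight + Clarity, data = Diamonds)
print(coef(fit))

unlink(path)
```

The failing call shows that the error comes from the data argument alone, independent of how the categorical variables are coded.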

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA  98504-5204


Re: Odp: Creating Dummy Variables in R

Peter Dalgaard
Nordlund, Dan (DSHS/RDA) wrote:

> You need to read your data from Diamonds2.txt into a dataframe first before running the lm() function.  What does your file Diamonds2.txt look like?

And, to put it more bluntly, he needs to study some introductory R text
rather more carefully to learn how the pieces fit together.

(Although I can see that the error may be cryptic to a beginner, I am at
a loss to explain what kind of leap of logic led him to believe that
dummy variables had _anything_ to do with it. For Heaven's sake, he must
be getting the same error if he leaves out the dummy variables!)

--
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907


Re: Odp: Creating Dummy Variables in R

Rolf Turner
In reply to this post by whitaker m. (mw1006)

On 17/12/2009, at 11:14 AM, whitaker m. (mw1006) wrote:

>> diamond.lm<-lm(price~weight+IF+VVS1+VVS2+VS1+VS2+SI1+SI2+I1+I2+D+E
>> +F+G+H+I+J+K+L+ideal+excellent+very.good+good+fair,
>> data="Diamonds2.txt")
>
> Error in eval(predvars, data, env) : invalid 'envir' argument

(a) You don't want the quote marks around the data argument.  That is the
source of the "invalid 'envir' argument" error.

(b) You are not using the power of R.  ***Don't*** create your own dummy
variables; let lm() do it for you.  Learn something about how R works, for
crying out loud.

Essentially you should be doing something like

        diamond.lm <- lm(price ~ weight + Clarity + colour + Cut, data = Diamonds2.txt)

where price, weight, Clarity, colour, and Cut are columns of the data frame
Diamonds2.txt.  The columns price and weight should be numeric vectors;
Clarity, colour, and Cut should be ***factors***.

It is slightly worrying that you refer to ``Diamonds2.txt''.  That ``.txt''
suffix would lead me to believe that ``Diamonds2.txt'' is a (text) file
containing your data.  If that is the case, this won't work.  The ``data''
argument to lm() must be an ***R object***.  You have to read the data file
into an R object before trying to use the data in a call to lm().
Something like

        # Note that you ***do*** want to quote the file name.  header = TRUE assumes
        # the first line holds column names; stringsAsFactors = TRUE yields factors.
        Diamonds <- read.table("Diamonds2.txt", header = TRUE, stringsAsFactors = TRUE)

Then

        diamond.lm <- lm(price ~ weight + Clarity + colour + Cut, data = Diamonds)

should do what you want.  The dummy variable encoding used will be
determined by the (first) value of options()$contrasts, which by default is
contr.treatment.

Read up on factors and contrasts.

        cheers,

                Rolf Turner
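
The contrasts setting mentioned above can be looked up and overridden as sketched below. The small data frame reuses the six rows from the first message as a stand-in for the full diamonds data, and the object names are illustrative.

```r
# Session-wide defaults: the first entry applies to unordered factors
print(getOption("contrasts"))

dat <- data.frame(
  Price   = c(500, 1000, 864, 340, 90, 450),
  Weight  = c(8, 5.2, 3, 2.6, 0.5, 2.3),
  Clarity = factor(c("IF", "VVS2", "VVS1", "VVS2", "IF", "VVS1"))
)

# With the defaults, lm() records which contrast it used for each factor
fit <- lm(Price ~ Weight + Clarity, data = dat)
print(fit$contrasts)
print(fit$xlevels)

# The coding can be changed for a single model via the 'contrasts' argument
fit_sum <- lm(Price ~ Weight + Clarity, data = dat,
              contrasts = list(Clarity = "contr.sum"))
print(coef(fit_sum))
```

Changing the coding changes how the coefficients are parameterised, not the fitted values, so both models describe the same regression.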
