Documentation examples for lm and glm

Thomas Yee
Hello,

something that has been on my mind for a decade or two has
been the examples for lm() and glm(). They encourage poor style
because of mismanagement of data frames. Also, having the
variables in a data frame means that predict()
is more likely to work properly.

For lm(), the variables should be put into a data frame.
As 2 vectors are assigned first in the general workspace they
should be deleted afterwards.
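
A minimal sketch of that suggestion. It assumes the plant-weight vectors `ctl` and `trt` as they appear in the ?lm example at the time of writing; the data frame name `plants` is made up for illustration:

```r
# Two vectors assigned in the workspace, as in the existing example
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)

# Collect them in a data frame ('plants' is a hypothetical name)
plants <- data.frame(
  weight = c(ctl, trt),
  group  = gl(2, 10, 20, labels = c("Ctl", "Trt"))
)

# Remove the loose vectors so they cannot be picked up by accident
rm(ctl, trt)

# The fit now refers only to columns of the data frame
lm.D9 <- lm(weight ~ group, data = plants)
coef(lm.D9)
```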

For glm(), the data frame d.AD is constructed but not used. Also,
its 3 components were assigned first in the general workspace, so they
float around dangerously afterwards like in the lm() example.
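
A possible shape for the glm() example, sketched under the assumption that the counts, outcome and treatment values are those of the ?glm example at the time of writing. Here d.AD is built without creating loose vectors and is then actually passed to glm():

```r
# Build the data frame directly; no 'counts', 'outcome' or 'treatment'
# objects are left behind in the workspace
d.AD <- data.frame(
  treatment = gl(3, 3),
  outcome   = gl(3, 1, 9),
  counts    = c(18, 17, 15, 20, 10, 20, 25, 13, 12)
)

# Use the data frame in the fit via data=
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(), data = d.AD)
summary(glm.D93)
```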

Rather than attaching improved .Rd files here, I have put them at
www.stat.auckland.ac.nz/~yee/Rdfiles
You are welcome to use them!

Best,

Thomas

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Documentation examples for lm and glm

bbolker

  Agree.  Or just create the data frame with those variables in it
directly ...
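
For illustration, creating the data frame directly could look roughly like this. The values are assumed to be the plant-weight data from the ?lm example, and the name `plants` is hypothetical:

```r
# All variables are created inside data.frame(), so nothing is left
# over in the workspace and no rm() is needed afterwards
plants <- data.frame(
  weight = c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14,
             4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69),
  group  = gl(2, 10, 20, labels = c("Ctl", "Trt"))
)

fit <- lm(weight ~ group, data = plants)
anova(fit)
```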

On 2018-12-13 3:26 p.m., Thomas Yee wrote:

> Hello,
>
> something that has been on my mind for a decade or two has
> been the examples for lm() and glm(). They encourage poor style
> because of mismanagement of data frames. Also, having the
> variables in a data frame means that predict()
> is more likely to work properly.
>
> For lm(), the variables should be put into a data frame.
> As 2 vectors are assigned first in the general workspace they
> should be deleted afterwards.
>
> For the glm(), the data frame d.AD is constructed but not used. Also,
> its 3 components were assigned first in the general workspace, so they
> float around dangerously afterwards like in the lm() example.
>
> Rather than attached improved .Rd files here, they are put at
> www.stat.auckland.ac.nz/~yee/Rdfiles
> You are welcome to use them!
>
> Best,
>
> Thomas

Re: Documentation examples for lm and glm

S Ellison-2
FWIW, before all the examples are changed to data frame variants, I think there's fairly good reason to have at least _one_ example that does _not_ place variables in a data frame.

The data argument in lm() is optional. And there is more than one way to manage data in a project. I personally don't much like lots of stray variables lurking about, but if those are the only variables out there and we can be sure they aren't affected by other code, it's hardly essential to create a data frame to hold something you already have.
Also, attach() is still part of R, for those folk who have a data frame but want to reference the contents across a wider range of functions without using with() a lot. lm() can reasonably omit the data argument there, too.

So while there are good reasons to use data frames, there are also good reasons to provide examples that don't.
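
To make the options concrete, here is a small sketch using the built-in `cars` data set rather than anything from the man pages. The object names (`stop_dist`, `speed_mph`, `fit_loose` and so on) are made up for illustration:

```r
# (1) Loose vectors in the workspace, no data argument
stop_dist <- cars$dist
speed_mph <- cars$speed
fit_loose <- lm(stop_dist ~ speed_mph)

# (2) Variables live in a data frame, reached through with()
fit_with <- with(cars, lm(dist ~ speed))

# (3) Variables live in a data frame, passed via data=
fit_data <- lm(dist ~ speed, data = cars)

# The estimates agree; only the coefficient names of (1) differ
all.equal(unname(coef(fit_loose)), unname(coef(fit_data)))
all.equal(coef(fit_with), coef(fit_data))
```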

Steve Ellison


> -----Original Message-----
> From: R-devel [mailto:[hidden email]] On Behalf Of Ben
> Bolker
> Sent: 13 December 2018 20:36
> To: [hidden email]
> Subject: Re: [Rd] Documentation examples for lm and glm
>
>
>   Agree.  Or just create the data frame with those variables in it
> directly ...

Re: Documentation examples for lm and glm

David Hugh-Jones-3
I would argue examples should encourage good practice. Beginners ought to
learn to keep data in data frames and not to overuse attach(). Experts can
do otherwise at their own risk, but they have less need of explicit
examples.

On Fri, 14 Dec 2018 at 14:51, S Ellison <[hidden email]> wrote:

> FWIW, before all the examples are changed to data frame variants, I think
> there's fairly good reason to have at least _one_ example that does _not_
> place variables in a data frame.
>
> The data argument in lm() is optional. And there is more than one way to
> manage data in a project. I personally don't much like lots of stray
> variables lurking about, but if those are the only variables out there and
> we can be sure they aren't affected by other code, it's hardly essential to
> create a data frame to hold something you already have.
> Also, attach() is still part of R, for those folk who have a data frame
> but want to reference the contents across a wider range of functions
> without using with() a lot. lm() can reasonably omit the data argument
> there, too.
>
> So while there are good reasons to use data frames, there are also good
> reasons to provide examples that don't.
>
> Steve Ellison

Re: Documentation examples for lm and glm

Achim Zeileis-4
A pragmatic solution could be to create a simple linear regression example
with variables in the global environment and then another example with a
data.frame.

The latter might be somewhat more complex, e.g., with several regressors
and/or mixed categorical and numeric covariates to illustrate how
regression and analysis of (co-)variance can be combined. I like to use
MASS's whiteside data for this:

data("whiteside", package = "MASS")
m1 <- lm(Gas ~ Temp, data = whiteside)
m2 <- lm(Gas ~ Insul + Temp, data = whiteside)
m3 <- lm(Gas ~ Insul * Temp, data = whiteside)
anova(m1, m2, m3)

Moreover, some binary response data.frame with a few covariates might be a
useful addition to "datasets", for example a more granular version of the
"Titanic" data (in addition to the 4-way table ?Titanic). Another
relatively straightforward data set, popular in econometrics and the social
sciences, is the "Mroz" data; see e.g. help("PSID1976", package = "AER").
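
Until such a data set exists, a rough stand-in can be sketched from the existing 4-way table. This is only an approximation of the idea, not the individual-level data proposed above; the names `titanic_df` and `fit` are made up:

```r
# Flatten the 4-way table into one row per cell, with a Freq column
titanic_df <- as.data.frame(Titanic)

# Logistic regression for survival, weighting each cell by its count
fit <- glm(Survived ~ Class + Sex + Age,
           family = binomial(), weights = Freq, data = titanic_df)
coef(fit)
```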

I would be happy to help with these if such additions were considered for
datasets/stats.


On Sat, 15 Dec 2018, David Hugh-Jones wrote:

> I would argue examples should encourage good practice. Beginners ought to
> learn to keep data in data frames and not to overuse attach(). Experts can
> do otherwise at their own risk, but they have less need of explicit
> examples.

Re: Documentation examples for lm and glm

frederik-2
I agree with Steve and Achim that we should keep some examples with no
data frame. That's Objectively Simpler, whether or not it leads to
clutter in the wrong hands. As Steve points out, we have attach()
which is an excellent language feature - not to mention with().

I would go even further and say that the examples that are in lm() now
should stay at the top. Because people may be used to referring to
them, and also because Historical Order is generally a good order in
which to learn things. However, if there is an important function
argument ("data=") not in the examples, then we should add examples
which use it. Likewise if there is a popular programming style
(putting things in a data frame). So let's do something along the
lines of what Thomas is requesting, but put it after the existing
documentation? Please?

On a bit of a tangent, I would like to see an example in lm() which
plots my data with a fitted line through it. I'm probably betraying my
ignorance here, but I was asked how to do this when showing R to a
friend and I thought it should be in lm(), after all it seems a bit
more basic than displaying a Normal Q-Q plot (whatever that is!
gasp...). Similarly for glm(). Perhaps all this can be accomplished
with merely doubling the size of the existing examples.
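
As a sketch of what such a plotting example might look like, here is one using the built-in `cars` data; this is not taken from the man page:

```r
# Fit a simple linear regression from a data frame
fit <- lm(dist ~ speed, data = cars)

# Scatterplot of the data, then the fitted line drawn through it
plot(dist ~ speed, data = cars)
abline(fit)
```

For glm(), the analogous step would be to draw lines() over predict(..., type = "response") on a grid of covariate values, since the fitted mean is generally not a straight line.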

Thanks.

Frederick

On Sat, Dec 15, 2018 at 02:15:52PM +0100, Achim Zeileis wrote:

>A pragmatic solution could be to create a simple linear regression
>example with variables in the global environment and then another
>example with a data.frame.
>
>The latter might be somewhat more complex, e.g., with several
>regressors and/or mixed categorical and numeric covariates to
>illustrate how regression and analysis of (co-)variance can be
>combined. I like to use MASS's whiteside data for this:
>
>data("whiteside", package = "MASS")
>m1 <- lm(Gas ~ Temp, data = whiteside)
>m2 <- lm(Gas ~ Insul + Temp, data = whiteside)
>m3 <- lm(Gas ~ Insul * Temp, data = whiteside)
>anova(m1, m2, m3)
>
>Moreover, some binary response data.frame with a few covariates might
>be a useful addition to "datasets". For example a more granular
>version of the "Titanic" data (in addition to the 4-way tabel
>?Titanic). Or another relatively straightforward data set, popular in
>econometrics and social sciences is the "Mroz" data, see e.g.,
>help("PSID1976", package = "AER").
>
>I would be happy to help with these if such additions were considered
>for datasets/stats.

Re: Documentation examples for lm and glm

Achim Zeileis-4
On Sat, 15 Dec 2018, [hidden email] wrote:

> I agree with Steve and Achim that we should keep some examples with no
> data frame. That's Objectively Simpler, whether or not it leads to
> clutter in the wrong hands. As Steve points out, we have attach()
> which is an excellent language feature - not to mention with().

Just for the record: Personally, I wouldn't recommend using lm() with
attach() or with() but would always encourage using data= instead.

In my previous e-mail I just wanted to point out that a pragmatic step for
the man page could be to keep one example without data= argument when
adding examples with data=.

> I would go even further and say that the examples that are in lm() now
> should stay at the top. Because people may be used to referring to
> them, and also because Historical Order is generally a good order in
> which to learn things. However, if there is an important function
> argument ("data=") not in the examples, then we should add examples
> which use it. Likewise if there is a popular programming style
> (putting things in a data frame). So let's do something along the
> lines of what Thomas is requesting, but put it after the existing
> documentation? Please?
>
> On a bit of a tangent, I would like to see an example in lm() which
> plots my data with a fitted line through it. I'm probably betraying my
> ignorance here, but I was asked how to do this when showing R to a
> friend and I thought it should be in lm(), after all it seems a bit
> more basic than displaying a Normal Q-Q plot (whatever that is!
> gasp...). Similarly for glm(). Perhaps all this can be accomplished
> with merely doubling the size of the existing examples.
>
> Thanks.
>
> Frederick

Re: Documentation examples for lm and glm

Thomas Yee
Thanks for the discussion. I do feel quite strongly that
the variables should always be a part of a data frame. Then
functions such as summary() and pairs() can operate on them all
simultaneously.... regression is only one part of the analysis. And
what if there are lots of variables? Have them all scattered
about the workspace? One of them could be easily overwritten.

The generic predict() will still work when lm() was not assigned
a data frame, but then the 'newdata' argument needs to be assigned
a data.frame. So this suggests that the original fit should have
used a data frame too.
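
A small sketch of the point about predict(), using the built-in `cars` data. The names `x`, `y`, `fit_ws` and `fit_df` are made up for illustration:

```r
# Fit from loose workspace vectors, without a data frame
x <- cars$speed
y <- cars$dist
fit_ws <- lm(y ~ x)

# Prediction at new values still requires a data frame, whose column
# name must match the variable name used in the formula
p_ws <- predict(fit_ws, newdata = data.frame(x = c(10, 20)))

# The same model fitted from a data frame
fit_df <- lm(dist ~ speed, data = cars)
p_df <- predict(fit_df, newdata = data.frame(speed = c(10, 20)))

# Both routes give the same predictions
all.equal(unname(p_ws), unname(p_df))
```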

BTW I believe attach() should be discouraged. Functions like
with() and within() are safer. Many users of attach() do not seem
to detach(), and subtle problems can arise with attach()---quite
dangerous really. The online help has a section called "Good
practice" which is good but I think it should go a little further
by actively discouraging its use in the first place.
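
One of the subtle problems can be sketched with a toy data frame. `df`, `ratio` and `df2` are made-up names, and this is only one illustration of the risk, not an exhaustive list:

```r
df <- data.frame(x = 1:3, y = c(2, 4, 6))

# with(): evaluate an expression using the columns; nothing is attached
ratio <- with(df, y / x)

# within(): returns a modified copy of the data frame
df2 <- within(df, z <- x + y)

# attach(): an assignment creates a separate global object and
# silently leaves the data frame unchanged
attach(df)
x <- x * 10     # new global 'x', which now masks the attached column
detach(df)
df$x            # still 1 2 3
rm(x)
```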

I do not wish to be contentious on all this... just encouraging
good practice that's all.

cheers
Thomas



On 17/12/18 12:26 PM, Achim Zeileis wrote:

> On Sat, 15 Dec 2018, [hidden email] wrote:
>
>> I agree with Steve and Achim that we should keep some examples with no
>> data frame. That's Objectively Simpler, whether or not it leads to
>> clutter in the wrong hands. As Steve points out, we have attach()
>> which is an excellent language feature - not to mention with().
>
> Just for the record: Personally, I wouldn't recommend using lm() with
> attach() or with() but would always encourage using data= instead.
>
> In my previous e-mail I just wanted to point out that a pragmatic step
> for the man page could be to keep one example without data= argument
> when adding examples with data=.
>
>> I would go even further and say that the examples that are in lm() now
>> should stay at the top. Because people may be used to referring to
>> them, and also because Historical Order is generally a good order in
>> which to learn things. However, if there is an important function
>> argument ("data=") not in the examples, then we should add examples
>> which use it. Likewise if there is a popular programming style
>> (putting things in a data frame). So let's do something along the
>> lines of what Thomas is requesting, but put it after the existing
>> documentation? Please?
>>
>> On a bit of a tangent, I would like to see an example in lm() which
>> plots my data with a fitted line through it. I'm probably betraying my
>> ignorance here, but I was asked how to do this when showing R to a
>> friend and I thought it should be in lm(), after all it seems a bit
>> more basic than displaying a Normal Q-Q plot (whatever that is!
>> gasp...). Similarly for glm(). Perhaps all this can be accomplished
>> with merely doubling the size of the existing examples.
>>
>> Thanks.
>>
>> Frederick
>>
>> On Sat, Dec 15, 2018 at 02:15:52PM +0100, Achim Zeileis wrote:
>>> A pragmatic solution could be to create a simple linear regression
>>> example with variables in the global environment and then another
>>> example with a data.frame.
>>>
>>> The latter might be somewhat more complex, e.g., with several
>>> regressors and/or mixed categorical and numeric covariates to
>>> illustrate how regression and analysis of (co-)variance can be
>>> combined. I like to use MASS's whiteside data for this:
>>>
>>> data("whiteside", package = "MASS")
>>> m1 <- lm(Gas ~ Temp, data = whiteside)
>>> m2 <- lm(Gas ~ Insul + Temp, data = whiteside)
>>> m3 <- lm(Gas ~ Insul * Temp, data = whiteside)
>>> anova(m1, m2, m3)
>>>
>>> Moreover, some binary response data.frame with a few covariates
>>> might be a useful addition to "datasets". For example a more
>>> granular version of the "Titanic" data (in addition to the 4-way
>>> tabel ?Titanic). Or another relatively straightforward data set,
>>> popular in econometrics and social sciences is the "Mroz" data, see
>>> e.g., help("PSID1976", package = "AER").
>>>
>>> I would be happy to help with these if such additions were
>>> considered for datasets/stats.
>>>
>>>
>>> On Sat, 15 Dec 2018, David Hugh-Jones wrote:
>>>
>>>> I would argue examples should encourage good practice. Beginners
>>>> ought to
>>>> learn to keep data in data frames and not to overuse attach().
>>>> Experts can
>>>> do otherwise at their own risk, but they have less need of explicit
>>>> examples.
>>>>
>>>> On Fri, 14 Dec 2018 at 14:51, S Ellison <[hidden email]>
>>>> wrote:
>>>>
>>>>> FWIW, before all the examples are changed to data frame variants,
>>>>> I think
>>>>> there's fairly good reason to have at least _one_ example that
>>>>> does _not_
>>>>> place variables in a data frame.
>>>>>
>>>>> The data argument in lm() is optional. And there is more than one
>>>>> way to
>>>>> manage data in a project. I personally don't much like lots of stray
>>>>> variables lurking about, but if those are the only variables out
>>>>> there and
>>>>> we can be sure they aren't affected by other code, it's hardly
>>>>> essential to
>>>>> create a data frame to hold something you already have.
>>>>> Also, attach() is still part of R, for those folk who have a data
>>>>> frame
>>>>> but want to reference the contents across a wider range of functions
>>>>> without using with() a lot. lm() can reasonably omit the data
>>>>> argument
>>>>> there, too.
>>>>>
>>>>> So while there are good reasons to use data frames, there are also
>>>>> good
>>>>> reasons to provide examples that don't.
>>>>>
>>>>> Steve Ellison


Re: Documentation examples for lm and glm

Martin Maechler
In reply to this post by David Hugh-Jones-3
>>>>> David Hugh-Jones
>>>>>     on Sat, 15 Dec 2018 08:47:28 +0100 writes:

    > I would argue examples should encourage good
    > practice. Beginners ought to learn to keep data in data
    > frames and not to overuse attach().

Note there's no attach() there in any of these examples!

    > otherwise at their own risk, but they have less need of
    > explicit examples.

The glm examples are nice insofar as they show both uses.

I agree the lm() example(s) are  "didactically misleading" by
not using data frames at all.

I disagree that only data frame examples should be shown.
If  lm()  is one of the first R functions a beginneR must use --
because they are in a basic stats class, say --  it may be
*better* didactically to focus on lm()  in the very first
example, and use data frames in a next one ...
... and instead of a next one, we have the pretty clear comment
     
  ### less simple examples in "See Also" above

I'm not convinced (but you can try more) we should change those
examples or add more there.

Martin
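
For illustration only, and not part of the original message: a minimal sketch of the two styles under discussion, using simulated data. The names x, y, d, fit_ws and fit_df are made up for this sketch and are not taken from the ?lm or ?glm help pages.

```r
set.seed(1)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)

## Style 1: variables live in the workspace, no data argument
fit_ws <- lm(y ~ x)

## Style 2: the same variables held in a data frame
d <- data.frame(x = x, y = y)
rm(x, y)                        # nothing left floating in the workspace
fit_df <- lm(y ~ x, data = d)

## Both fits give the same coefficients
all.equal(coef(fit_ws), coef(fit_df))
```

The two calls differ only in where lm() finds the variables, which is the point of contention in this thread.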



Re: Documentation examples for lm and glm

S Ellison-2
In reply to this post by Thomas Yee


> From: Thomas Yee [mailto:[hidden email]]
>
> Thanks for the discussion. I do feel quite strongly that
> the variables should always be a part of a data frame.

This seems pretty much a decision for R core, and I think it's useful to have raised the issue.

But I, er, feel strongly that strong feelings and 'always' are unsafe in a best practice argument.

First, other folk with different use-cases or work practice may see 'best practice' quite differently. So I would pretty much always expect exceptions.

Second, for examples of capability, there are too many exceptions in this instance. For example:
glm() can take a two-column matrix as a single response variable.
lm() can take a matrix as a response variable.
lm() can take a complete data frame as a predictor (see ?stackloss)

None of these work naturally if everything is in a data frame, and some won’t work at all.

Steve E
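
For illustration only, and not part of the original message: a rough sketch of the three capabilities listed above. The first two use simulated data, and every name in them (x, trials, succ, resp, Y, fit_bin, fit_mlm) is made up for this sketch. The third uses stack.loss and stack.x, which, as far as I can tell, ship in the datasets package alongside the stackloss data frame; note that stack.x is a numeric matrix.

```r
set.seed(3)
n <- 30
x <- rnorm(n)

## glm(): two-column matrix response (successes, failures)
trials <- 10
succ <- rbinom(n, size = trials, prob = plogis(0.5 * x))
resp <- cbind(succ, fail = trials - succ)
fit_bin <- glm(resp ~ x, family = binomial)

## lm(): matrix response, one regression per column
Y <- cbind(y1 = 1 + 2 * x + rnorm(n), y2 = -1 + 0.5 * x + rnorm(n))
fit_mlm <- lm(Y ~ x)
class(fit_mlm)     # "mlm" "lm"

## lm(): matrix predictor, in the style of the ?stackloss example
fit_stack <- lm(stack.loss ~ stack.x)
```

In each case the object on one side of the formula is a matrix sitting in the workspace rather than a column of a data frame.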




*******************************************************************
This email and any attachments are confidential. Any use, copying or
disclosure other than by the intended recipient is unauthorised. If
you have received this message in error, please notify the sender
immediately via +44(0)20 8943 7000 or notify [hidden email]
and delete this message and any copies from your computer and network.
LGC Limited. Registered in England 2991879.
Registered office: Queens Road, Teddington, Middlesex, TW11 0LY, UK

Re: Documentation examples for lm and glm

Fox, John
In reply to this post by David Hugh-Jones-3
Dear Martin,

I think that everyone agrees that it’s generally preferable to use the data argument to lm() and I have nothing significant to add to the substance of the discussion, but I think that it’s a mistake not to add to the current examples, for the following reasons:

(1) Relegating examples using the data argument to “see also” doesn’t suggest that using the argument is a best practice. Most users won’t bother to click the links.

(2) In my opinion, a new initial example using the data argument would more clearly suggest that this is normally the best option.

(3) I think that it would also be desirable to add a remark to the explanation of the data argument, something like, “Although the argument is optional, it's generally preferable to specify it explicitly.” And similarly on the help page for glm().

My two (or three) cents.

John

  -------------------------------------------------
  John Fox, Professor Emeritus
  McMaster University
  Hamilton, Ontario, Canada
  Web: http://socserv.mcmaster.ca/jfox
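
For illustration only, and not part of the original message: a minimal sketch of what an example using the data argument might look like, including the predict() behaviour raised at the start of the thread. The data are simulated and the names d, fit, new_d and pred are made up for this sketch.

```r
set.seed(2)
d <- data.frame(x = runif(25, 0, 10))
d$y <- 3 + 0.5 * d$x + rnorm(25)

## Fit with the data argument: the formula refers to column names only
fit <- lm(y ~ x, data = d)

## predict() looks up 'x' in newdata, so three new values give three predictions
new_d <- data.frame(x = c(2, 5, 8))
pred <- predict(fit, newdata = new_d)
length(pred)
```

By contrast, a fit written as lm(d$y ~ d$x) has no variable named x for predict() to match against newdata, which is one way the predict() problem can arise.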



Re: Documentation examples for lm and glm

Fox, John
In reply to this post by Thomas Yee
Dear Steve,

Since this relates as well to the message I posted a couple of minutes before yours, I agree that it’s possible to phrase “best practices” too categorically. In the current case, I believe that it’s reasonable to say that specifying the data argument is “generally” or “usually” the best option. That doesn’t rule out exceptions.

Best,
 John

  -------------------------------------------------
  John Fox, Professor Emeritus
  McMaster University
  Hamilton, Ontario, Canada
  Web: http://socserv.mcmaster.ca/jfox



Re: Documentation examples for lm and glm

Heinz Tuechler
In reply to this post by Fox, John
Dear All,

do you think that use of a data argument is best practice in the example
below?

regards,

Heinz

### trivial example
plotwithline <- function(x, y) {
     plot(x, y)
     abline(lm(y~x)) ## data argument?
}

set.seed(25)
df0 <- data.frame(x=rnorm(20), y=rnorm(20))

plotwithline(df0[['x']], df0[['y']])
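
For comparison only, and not part of the original message: a sketch of how the same helper might look if it took a formula and a data frame. The name plotwithline2 and its arguments are hypothetical, and nothing here is meant to suggest that this form is preferable to the one above.

```r
## Hypothetical variant: pass a formula and a data frame instead of two vectors
plotwithline2 <- function(formula, data) {
    fit <- lm(formula, data = data)
    plot(formula, data = data)
    abline(fit)
    invisible(fit)    # return the fit so the caller can inspect it
}

set.seed(25)
df0 <- data.frame(x = rnorm(20), y = rnorm(20))
```

It would be called as plotwithline2(y ~ x, data = df0).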





Re: Documentation examples for lm and glm

Fox, John
In reply to this post by Fox, John
Dear Heinz,

  ----------------------------------------------
> On Dec 17, 2018, at 10:19 AM, Heinz Tuechler <[hidden email]> wrote:
>
> Dear All,
>
> do you think that use of a data argument is best practice in the example below?

No, but it is *normally* or *usually* the best option, in my opinion.

Best,
 John



Re: Documentation examples for lm and glm

Heinz Tuechler
Dear John,

Fully agreed! In the global environment I always keep my
"data-variables" in a data.frame. However, when I look at a help page, I
like examples that start with the particular aspects of the function. It
is important to know whether a function offers a data argument, but I
don't need the first example to demonstrate the data argument each time
I look at the help.

best,
Heinz

