Conventions: Use of globals and main functions

R devel mailing list
In R scripts (as opposed to packages), even in reproducible scripts, it seems fairly conventional to use the global workspace as a sort of main function, and thus R scripts often populate the global environment with many variables, which may be mutated. Although this makes sense given R has historically been used interactively and this practice is common for scripting languages, this appears to disagree with the software-engineering principle of avoiding a mutating global state. Although this is just a rule of thumb, in R scripts, the frequent use of global variables is much more pronounced than in other languages.

On the other hand, in Python, it is common to use a main function (through the `def main():` and  `if __name__ == "__main__":` idioms). This is mentioned both in the documentation as well as in the writing of Python's main creator. Although this is more beneficial in Python than in R because Python code is structured into modules, which serve as both scripts and packages, whereas R separates these conceptually, a similar practice of creating a main function would help avoid the issues from mutating global state common to other languages and facilitate maintainability, especially for longer scripts.

Although many great R texts (Advanced R, Art of R Programming, etc.) caution against assignment in a parent enclosure (e.g., using `<<-`, or `assign`), I have not seen many promote the use of a main function and avoiding mutating global variables from top level.

Would it be a good idea to promote use of main functions and limiting global-state mutation for longer scripts and dedicated applications (not one-off scripts)? Should these practices be mentioned in the standard documentation?

This question was motivated largely by this discussion on Reddit: https://www.reddit.com/r/rstats/comments/cp3kva/is_mutating_global_state_acceptable_in_r/ . Apologies beforehand if any of these (partially subjective) assessments are in error.

Best,
CG

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Conventions: Use of globals and main functions

Gábor Csárdi
This is what I usually put in scripts:

if (is.null(sys.calls())) {
  main()
}

This is mostly equivalent to the Python idiom. If the script runs from
Rscript, then it will run main(). It also lets you source() the
script, and debug its functions, test them, etc. It works best if all
the code in the script is organized into functions.
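
For illustration, here is a minimal sketch of a script laid out this way. The helper summarise_values() and the default input are hypothetical and not part of the original message; only the guard at the end is taken from it.

```r
# Hypothetical helper: a pure function of its argument, no global state.
summarise_values <- function(x) {
  c(n = length(x), mean = mean(x))
}

# All top-level work lives here, so its variables stay local to the call.
main <- function(values = c(1, 2, 3)) {
  result <- summarise_values(values)
  print(result)
  invisible(result)
}

# Guard from the message above: sys.calls() is NULL only at top level
# (for example under Rscript); under source() the call stack is non-empty.
if (is.null(sys.calls())) {
  main()
}
```

Run with Rscript, the guard fires and main() prints the summary. After source(), nothing runs, and summarise_values() and main() can be called or tested directly.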

Gabor

On Sun, Aug 25, 2019 at 6:11 AM Cyclic Group Z_1 via R-devel
<[hidden email]> wrote:

> [...]

Re: Conventions: Use of globals and main functions

Duncan Murdoch-2
In reply to this post by R devel mailing list
On 25/08/2019 12:08 a.m., Cyclic Group Z_1 via R-devel wrote:
> [...]
>
> Would it be a good idea to promote use of main functions and limiting global-state mutation for longer scripts and dedicated applications (not one-off scripts)? Should these practices be mentioned in the standard documentation?

Lexical scoping means that all of the problems of global variables are
available to writers who use main().  You could treat the evaluation
frame of your main function exactly like the global workspace:  define
functions within it, read and modify local variables from those
functions, etc.
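
A small sketch of this point, with hypothetical names: the frame of main() below plays the role of the global workspace, and a nested function mutates it through `<<-` much as a top-level script would mutate a global.

```r
main <- function() {
  total <- 0                 # local to main(), but shared with add() below

  # Defined inside main(), so its enclosing environment is main()'s frame.
  add <- function(x) {
    total <<- total + x      # side effect on main()'s variable
    invisible(NULL)
  }

  add(1)
  add(2)
  total
}

result <- main()
print(result)  # 3
```

Nothing leaks into the global environment here, but add() is just as hard to reason about in isolation as a function that modifies a global variable.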

The benefit of using main() if you avoid defining all the other
functions within it is that other functions normally operate on their
arguments with few side effects.  You achieve this in R by putting those
other functions in packages, and running those functions in short
scripts.  That's how I've always recommended large projects be
organized.  You don't want a long script for anything, and you don't
want multiple source files unless they're in a package.

Duncan Murdoch

Re: Conventions: Use of globals and main functions

R devel mailing list
In reply to this post by Gábor Csárdi
This seems like a nice idiom; I've seen others use

    if(!interactive()){
        main()
    }

to a similar effect.

Best,
CG

On Sunday, August 25, 2019, 01:16:06 AM CDT, Gábor Csárdi <[hidden email]> wrote:
 
> [...]

Re: Conventions: Use of globals and main functions

R devel mailing list
In reply to this post by Duncan Murdoch-2


This is a fair point; structuring functions into packages is probably ultimately the gold standard for code organization in R. However, lexical scoping in R is really not much different than in other languages, such as Python, in which use of main functions and defining other named functions outside of main are encouraged. For example, in Scheme, from which R derives its scoping rules, the community generally organizes code with almost exclusively functions and few non-function global variables at top level. The common use of globals in R seems to be mostly a consequence of historical interactive use and, relatedly, an inherited practice from S.

It is true, though, that since anonymous functions (such as in lapply) play a large part in idiomatic R code, as you put it, "[l]exical scoping means that all of the problems of global variables are available to writers who use main()." Nevertheless, using a main function with other functions defined outside it seems like a good quick alternative that offers similar advantages to making a package when functions are tightly coupled to the script and the project may not be large or generalizable enough to warrant making a package.

Best,
CG

Re: Conventions: Use of globals and main functions

Duncan Murdoch-2
On 25/08/2019 7:09 p.m., Cyclic Group Z_1 wrote:
> [...]

I think the idea that making a package is too hard is just wrong.
Packages in R have lots of requirements, but nowadays there are tools
that make them easy.  Eleven years ago at UseR in Dortmund I wrote a
package during a 45 minute presentation, and things are much easier now.

If you make a complex project without putting most of the code into a
package, you don't have something that you will be able to modify in a
year or two, because you won't have proper documentation.

Scripts are for throwaways, not for anything worth keeping.

Duncan Murdoch

Re: Conventions: Use of globals and main functions

Gábor Csárdi
In reply to this post by R devel mailing list
That is unfortunately wrong, though. Whether the script runs as "main"
and whether R is in interactive mode are independent properties. I
guess most of the time it works, because _usually_ you run the whole
script (main()) in non-interactive mode, and source() it in
interactive mode, but this is not necessarily always the case, e.g.
you might want to source() in non-interactive mode to run some tests,
or use the functions of the script in another script, in which cases
you don't want to run main().
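
To see the difference, the following sketch writes a two-line file (a stand-in for a real script) and source()s it. The variable names are hypothetical; each line records what the corresponding guard would have decided.

```r
script <- tempfile(fileext = ".R")
writeLines(c(
  "sys_calls_guard   <- is.null(sys.calls())",  # guard from the earlier message
  "interactive_guard <- !interactive()"         # the alternative guard
), script)

# Source the file into its own environment, as a test runner might.
env <- new.env()
source(script, local = env)

print(env$sys_calls_guard)    # FALSE: the file was sourced, so main() is skipped
print(env$interactive_guard)  # TRUE in a non-interactive session, FALSE at a console
```

In a non-interactive session (Rscript, R CMD BATCH) the second guard would therefore run main() even though the file was only sourced.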

G.

On Sun, Aug 25, 2019 at 11:47 PM Cyclic Group Z_1
<[hidden email]> wrote:

>
> This seems like a nice idiom; I've seen others use
>     if(!interactive()){
>         main()
>     }
> to a similar effect.
>
> Best,
> CG
>
> On Sunday, August 25, 2019, 01:16:06 AM CDT, Gábor Csárdi <[hidden email]> wrote:
>
>
> This is what I usually put in scripts:
>
> if (is.null(sys.calls())) {
>   main()
> }
>

[...]

Re: Conventions: Use of globals and main functions

R devel mailing list
Right, I did not mean to imply that these tests are equivalent, only that both exclude execution of main() in some contexts.

Best,
CG

Re: Conventions: Use of globals and main functions

R devel mailing list
In reply to this post by Duncan Murdoch-2
Duncan Murdoch wrote:
>  Scripts are for throwaways, not for anything worth keeping.

I totally agree and have a tangentially relevant question about the <<-
operator.  Currently 'name <<- value' means to look up the environment
stack until you find 'name'  and (a) if you find 'name' in some frame bind
it to a new value in that frame and (b) if you do not find it make a new
entry for it in .GlobalEnv.
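
The two meanings can be seen in a short sketch (all names here are hypothetical, chosen only for illustration):

```r
# Meaning (a): 'count' already exists in an enclosing frame, so it is rebound there.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
counter()
counter()   # 'count' is now 2, stored in make_counter()'s frame

# Meaning (b): the name is not found anywhere, so .GlobalEnv gets a new entry.
set_flag <- function() {
  flag_created_by_b <<- TRUE
  invisible(NULL)
}
set_flag()
print(exists("flag_created_by_b", envir = globalenv(), inherits = FALSE))  # TRUE
rm("flag_created_by_b", envir = globalenv())                               # clean up
```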

Should R deprecate the second part of that and give an error if 'name' is
not already present in the environment stack?  This would catch misspelling
errors in functions that collect results from recursive calls.  E.g.,

collectStrings <- function(list) {
    strings <- character() # to be populated by .collect
    .collect <- function(x) {
        if (is.list(x)) {
            lapply(x, .collect)
        } else if (is.character(x)) {
            strings <<- c(strings, x)
        }
        # oops, would like to be told about this error
        misspelledStrings <<- c(strings, names(x))
        NULL
    }
    .collect(list)
    strings
}

This gives the incorrect:
> collectStrings(list(i="One", ii=list(a=1, b="Two")))
[1] "One" "Two"
> misspelledStrings
[1] "One" "Two" "i"   "ii"

instead of what we would get if 'misspelledStrings' were 'strings'.
> collectStrings(list(i="One", ii=list(a=1, b="Two")))
[1] "One" "Two" "a"   "b"   "i"   "ii"

If someone really wanted to assign into .GlobalEnv the assign() function is
available.

In S '<<-' only had meaning (b) and R added meaning (a).  Perhaps it is
time to drop meaning (b).  We could start by triggering a warning about it
if some environment variable were set, as is being done for non-scalar &&
and ||.

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Sun, Aug 25, 2019 at 5:09 PM Duncan Murdoch <[hidden email]>
wrote:

> [...]

Re: Conventions: Use of globals and main functions

Duncan Murdoch-2
On 26/08/2019 1:58 p.m., William Dunlap wrote:

> Duncan Murdoch wrote:
>  > Scripts are for throwaways, not for anything worth keeping.
>
> I totally agree and have a tangentially relevant question about the <<-
> operator.  Currently 'name <<- value' means to look up the environment
> stack until you find 'name'  and (a) if you find 'name' in some frame
> bind it to a new value in that frame and (b) if you do not find it make
> a new entry for it in .GlobalEnv.
>
> Should R deprecate the second part of that and give an error if 'name'
> is not already present in the environment stack?  This would catch
> misspelling errors in functions that collect results from recursive
> calls.  E.g.,

I like that suggestion.  Package tests have been complaining about
packages writing to .GlobalEnv for a while now, so there probably aren't
many instances of b) in CRAN packages; that change might be relatively
painless.

Duncan Murdoch

> [...]

Re: Conventions: Use of globals and main functions

Martin Maechler
>>>>> Duncan Murdoch
>>>>>     on Mon, 26 Aug 2019 14:19:36 -0400 writes:

    > On 26/08/2019 1:58 p.m., William Dunlap wrote:
    >> Duncan Murdoch wrote:
    >> > Scripts are for throwaways, not for anything worth keeping.
    >>
    >> I totally agree and have a tangentially relevant question about the <<-
    >> operator.  Currently 'name <<- value' means to look up the environment
    >> stack until you find 'name'  and (a) if you find 'name' in some frame
    >> bind it to a new value in that frame and (b) if you do not find it make
    >> a new entry for it in .GlobalEnv.
    >>
    >> Should R deprecate the second part of that and give an error if 'name'
    >> is not already present in the environment stack?  This would catch
    >> misspelling errors in functions that collect results from recursive
    >> calls.  E.g.,

    > I like that suggestion.  Package tests have been complaining about
    > packages writing to .GlobalEnv for a while now, so there probably aren't
    > many instances of b) in CRAN packages; that change might be relatively
    > painless.

    > Duncan Murdoch

I don't currently agree: AFAICS, there's no other case (in S or) R where an
assignment only works if there's no object with that name.

In addition: if I wanted such functionality, I'd rather have it in a
function that has several arguments, with this behavior
switchable via  <argname> = TRUE/FALSE , rather than in
`<<-`, which always has exactly 2 arguments.
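
One way such a switchable function might look, as a sketch only: assign_enclosing() and its create_global argument are hypothetical names, not an existing or proposed base R API. Like `<<-`, it starts searching in the enclosure of the calling frame.

```r
# Hypothetical helper: rebind 'name' in the nearest enclosing environment
# that already binds it; creating a global is opt-in via create_global.
assign_enclosing <- function(name, value,
                             envir = parent.env(parent.frame()),
                             create_global = FALSE) {
  env <- envir
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      assign(name, value, envir = env)
      return(invisible(value))
    }
    env <- parent.env(env)
  }
  if (!create_global) {
    stop("no existing binding for '", name, "'")
  }
  assign(name, value, envir = globalenv())
  invisible(value)
}

# Usage: collect values into a variable of the enclosing function.
collect <- function(xs) {
  out <- character()
  visit <- function(x) assign_enclosing("out", c(out, x))
  for (x in xs) visit(x)
  out
}
print(collect(c("a", "b")))  # "a" "b"
```

With the default create_global = FALSE, a misspelled name stops with an error instead of silently creating a global variable.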

[This is my personal opinion only; other R Core members may well
 think differently about this]

Martin

    >> [...]

Re: Conventions: Use of globals and main functions

Peter Meissner-3
In reply to this post by R devel mailing list
Hey,

I always found it a strength of R compared to many other languages that
simple things (running a script, doing something interactive, writing a
function, using lambdas, installing packages, getting help, ...) are very
very simple.

R is a commandline statistics program that happens to be a very elegant,
simple and consistent programming language too.

That being said, I think the main task of scripts is to get things done by
running them end to end in a fresh session. Now, it may very well happen
that a lot of stuff has to be done. Then splitting up scripts into
subscripts and sourcing them from a meta script is a straightforward
solution. It might also be that some functionality is put into functions to
be reused in other places. This can be done by putting those function
definitions into separate files. Then one can use source() wherever those
functions are needed. Now, putting code that runs things and code that
defines/provides functions into the same script is a bad idea. Using the
main() idiom described might prevent the problems stemming from
mixing function definitions and function execution. But it would also
encourage this mixing, which is, I think, a bad idea anyway.
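
A minimal sketch of the layout described above, with function definitions in one file and the code that runs things in another. The file contents and the helper name standardize() are hypothetical, and a temporary file stands in for a real path so that the example is self-contained:

```r
# Hypothetical layout: helper definitions live in their own file, and a
# separate run script sources them. A temporary file stands in for a real path.
helpers_file <- tempfile(fileext = ".R")
writeLines(c(
  "# definitions only: nothing is computed when this file is sourced",
  "standardize <- function(x) (x - mean(x)) / sd(x)"
), helpers_file)

# --- what the run script itself would contain ---
source(helpers_file)            # make the helper definitions available

heights <- c(150, 160, 170, 180, 190)
z <- standardize(heights)       # centered and scaled copy of heights
print(round(z, 2))
```

Keeping the sourced file free of top-level computation is what makes it safe to source from several run scripts.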

Therefore, I am against fostering a main()-idiom - it adds complexity and
encourages bad code structuring (putting application code and function
definition code into one file).

If one needs code to behave differently in interactive sessions than in
non-interactive sessions, if (interactive()) { } is one way to solve this.
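
As a small illustration of that branch, here is a sketch assuming a hypothetical helper report() that behaves differently depending on how R was started. Only interactive() comes from the message above; the rest is made up for the example:

```r
# Hypothetical helper: choose behaviour depending on how R was started.
report <- function(msg) {
  if (interactive()) {
    message("[interactive] ", msg)   # someone is at the console
  } else {
    cat(msg, "\n", sep = "")         # batch run, e.g. via Rscript
  }
  invisible(msg)                     # return the message without printing it
}

report("analysis finished")
```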

If more solid software development is needed, packages are the way to go.


Best, Peter



Re: Conventions: Use of globals and main functions

Abby Spurdle
In reply to this post by R devel mailing list
> this appears to disagree with the software-engineering principle of avoiding a mutating global state

I disagree.
In embedded systems engineering, for example, it's customary to use
global variables to represent ports.

Also, I note that the use of global variables is similar to using pen
and paper to do mathematics and statistics
(which is good).
Whether that's consistent with software engineering principles or not,
I don't know.

However, I partly agree with you.
Given that there's interest from various parties in running R in
various ways, it may be good to document some of the options
available.

"Running R" (in "R Installation and Administration") links to
"Appendix B Invoking R" (in "An Introduction to R").
However, these sections do not cover the topics in this thread.


Re: Conventions: Use of globals and main functions

Abby Spurdle
> "Running R" (in "R Installation and Administration") links to
> "Appendix B Invoking R" (in "An Introduction to R").
> However, these sections do not cover the topics in this thread.

Sorry, I made a mistake.
It is in the documentation (B.4 Scripting with R)
e.g.

(excerpts only)
R CMD BATCH "--args arg1 arg2" foo.R &
args <- commandArgs(TRUE)
Rscript foo.R arg1 arg2
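
To make the excerpt concrete, here is a sketch of what a script such as foo.R might contain. The body is an assumption for illustration, not taken from the manual; it relies only on commandArgs(trailingOnly = TRUE), which returns the arguments that follow the script name (or follow --args):

```r
# foo.R (hypothetical contents): read the trailing command-line arguments.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 0L) {
  cat("no arguments supplied\n")
} else {
  # Echo each argument with its position.
  for (i in seq_along(args)) {
    cat(sprintf("arg %d: %s\n", i, args[i]))
  }
}
```

The arguments always arrive as character strings, so numeric inputs need an explicit conversion such as as.numeric().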


Re: Conventions: Use of globals and main functions

Henrik Bengtsson-5
In reply to this post by Abby Spurdle
FWIW, one could imagine introducing a helper function global();

global <- function(expr) {
  eval(substitute(expr), envir = globalenv(), enclos = baseenv())
}

to make it explicit that any assignments (and evaluation in general)
take place in the global environment, e.g.

> local({ global(a <- 2) })
> a
[1] 2

That "looks" nicer than assign("a", 2, envir = globalenv()) and it's
safer than assuming a <<- 2 will "reach" the global environment.
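
A short sketch of why a <<- 2 cannot be assumed to reach the global environment: the operator assigns to the first enclosing environment in which the name already exists. The function names outer_fun() and bump() are hypothetical:

```r
outer_fun <- function() {
  counter <- 0                 # a local variable in the enclosing function
  bump <- function() {
    counter <<- counter + 1    # modifies outer_fun's `counter`, not a global
  }
  bump()
  counter
}

result <- outer_fun()
print(result)

# No `counter` was created in the global environment by the call above.
print(exists("counter", envir = globalenv(), inherits = FALSE))
```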

/Henrik


Re: Conventions: Use of globals and main functions

R devel mailing list
In reply to this post by Abby Spurdle
Definitely, I agree that global variables have a place in programming. They play an especially important role in low-level software, such as embedded programming, as you mentioned, and systems programming. I generally would disagree with anyone who says global variables should never be used, and they may be the best implementation option when something is "truly global."

However, in R scripting conventions, they are the default. I don't think it is controversial to say that in software engineering culture, there is a generally held principle that global variables should be minimized because they can be dangerous (granted, the original "Globals considered harmful" article is quite old, and many of its criticisms are not applicable to modern languages). I do think it is equally important, though, to understand when to break this rule.

I like your suggestion of documenting this as an alternative option, though it seems the general sentiment is against this, which I respect.

Best,
CG


Re: Conventions: Use of globals and main functions

R devel mailing list
In reply to this post by Peter Meissner-3
> That being said, I think the main task of scripts is to get things done by running them end to end in a fresh session. Now, it may very well happen that a lot of stuff has to be done. Then splitting up scripts into subscripts and sourcing them from a meta script is a straightforward solution. It might also be that some functionality is put into functions to be reused in other places. This can be done by putting those function definitions into separate files. Then one can use source() wherever those functions are needed. Now, putting code that runs things and code that defines/provides functions into the same script is a bad idea. Using the main() idiom described might prevent the problems stemming from mixing function definitions and function execution. But it would also encourage this mixing, which is, I think, a bad idea anyway.

I actually would agree entirely that files should not serve as both source files for reused functions and application code. The suggestion for a main() idiom is merely to reduce variable scope and bring R practices more in line with generally recommended programming practices, not so that scripts can act as packages/modules/libraries. When I compared R scripts containing main functions to packages, I only meant it in the sense that both help manage scope (the latter through package namespaces). Any other named functions besides main would be functions specifically tied to the script.
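
For readers who have not seen it, a main() idiom in an R script might look like the sketch below. The helper summarise_values() and the data are hypothetical, and the sys.nframe() == 0L guard is a commonly seen community idiom rather than something this thread proposes:

```r
# Hypothetical script-local helper, defined at top level.
summarise_values <- function(x) {
  c(mean = mean(x), sd = sd(x))
}

main <- function() {
  values <- c(2, 4, 4, 4, 5, 5, 7, 9)   # local to main(), not a global
  stats <- summarise_values(values)
  print(stats)
  invisible(stats)
}

# Run main() when the file is executed at top level (e.g. via Rscript),
# but not when it is source()d from another file.
if (sys.nframe() == 0L) {
  main()
}
```

With this structure, the only names left in the global environment are the two functions; values and stats disappear when main() returns.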

I do see your point, though, that this could result in bad practice, namely the usage mixing you described. 

Best,
CG


Re: Conventions: Use of globals and main functions

Peter Meissner-3
First, I think that thinking about best-practice advice and being able to
accommodate different usage scenarios is a good thing, despite my arguing
against introducing the main() idiom.

Let's have another turn on the global-environment-is-bad argument.

It has two parts:

(1) Glattering namespace. Glattering the namespace might become a problem
because you might end up having used all reasonable words already, so one
has to extend the space for names with new namespaces. For scripting, this
usually should be no problem since one can always create more space through
the usage of environments: put code into a function, put objects into
environments, or write a package. Glattering the namespace might also become
a problem if things get complex. If your code base gets larger, one name
might be overwritten by another by accident. This is a problem that can be
solved not by simply extending the namespace (more space) but by
structuring it: keeping related things together and hiding unused helpers,
e.g. by putting them in a function or an environment, or by writing a
package.

Now, if we put everything into main(), we have not solved much. Instead
of 100 objects glattering the global environment, we now have e.g. 5 objects
in the global environment and 95 objects in the main() function environment.

(2) Changing global state. A thing that is a little bit related to the
global environment is the idea of global state and the problems that arise
when changing global state. But the global environment in R is not the same
as global state. First, all normal things in R (except environments, R6
objects, data.tables) are passed by copy (never mind how it is implemented
under the hood). So when I assign a value to a new name, this will behave
as if I had made a copy; thus I simply do not care what happens to the value
of the original because my copy's value is independent. Next, it is
possible to misuse the global environment (or any parent environment) as
global state, either by explicitly using assign(..., ..., envir =
globalenv()) or by using the <<- operator. Also, one has access to objects
of the enclosing environment when e.g. executing code in a function
environment, but this is read-only by default. Although such misuse is
possible and it is done from time to time, this is not how things are done
99% of the time. The common practice, and I would say best practice also,
is to use pure functions that depend only on their inputs and do not change
anything except returning a value. Using pure functions prevents 99% of the
problems with global state, while using more namespaces only chops these
kinds of problems into smaller and thus more numerous problems.
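
A small sketch of the behaviour described in point (2): copy semantics on assignment, ordinary assignment inside a function staying local, and a pure function. The names try_to_change() and rescale() are hypothetical:

```r
x <- c(1, 2, 3)
y <- x            # behaves like a copy
y[1] <- 100       # modifying y leaves x untouched

# Ordinary assignment inside a function creates a local variable; the
# variable of the same name in the calling environment is not changed.
try_to_change <- function() {
  x <- c(0, 0, 0)
  invisible(NULL)
}
try_to_change()

# A pure function: the result depends only on its arguments.
rescale <- function(v, factor) v * factor

print(x)
print(y)
print(rescale(x, 2))
```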


Best, Peter


Re: Conventions: Use of globals and main functions

R devel mailing list
I appreciate the well-thought-out comments.

To your first point, I am not sure what "glattering" means precisely (a Google search revealed nothing useful), but I assume it means something to the effect of overfilling the main namespace with too many names. Per Norm Matloff's counterpoint in The Art of R Programming regarding this issue, this is mostly avoided by well-defined, (sufficiently) long names. Also, when a program is properly modularized, one generally wouldn't have this many objects at the same time unless the complexity of the program demands it. You can, for example, use named function scope outside main or anonymous functions to limit variable scope to the operations that need a given variable. Using main(), with any named functions closely tied to the script defined outside it, actually addresses this "glattering namespace" issue: if we treat the global scope as a main function instead of using a main() idiom, any functions that are defined in global scope will have all global variables within their search path. Alternatively, one can put all named functions in a package; in some cases, however, it will make more sense to keep a function defined within the script. Unless you never modularize your code into functions and flatten everything out into a common namespace, using main would be helpful to avoid namespace-glattering.

Maybe I'm missing something, but I'm not sure how namespace-glattering favors not using a main() idiom, since avoiding globals doesn't mean not structuring your code properly; it actually seems to favor using main(). Given any properly structured program (organizing functions as needed), the implementation that puts all variables into the global workspace (the same place as the top-level functions) will be less safe, since all functions will have all globals within their search path. (Unless, of course, every single function is put into a package.)
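
One way to read the suggestion about anonymous functions is sketched below. The data and names are hypothetical; the point is only that tmp exists for the duration of one operation and that nothing inside main() leaks to the top level:

```r
main <- function() {
  scores <- c(5, 3, 8, 1)

  # `tmp` exists only inside this immediately invoked anonymous function.
  ranked <- (function(v) {
    tmp <- sort(v, decreasing = TRUE)
    tmp[1:2]
  })(scores)

  ranked
}

top_two <- main()
print(top_two)
```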

To your second point, I agree that many of the issues associated with global state/environment are generally less problematic when using pure (or as pure as possible) functions. On a related note, lexically scoped functional languages (especially pure functional ones) generally encourage modularizing everything into functions, rather than having a lot of objects exposed to the top level (not to say that globals are not used, only that they are not the default choice). So the typical R way of doing this tends to disagree with how things are normally done in functional programming. Chopping our code into well-abstracted functions (and therefore namespaces) is the functional way to do things and helps to minimize the state to which any particular function has access. Organizing the functions we want to be pure so that they are not defined in the same environment in which they are called actually helps to ensure function purity in the input direction, since those functions will not have lexical-scope access to the caller's variables. (That is, you may have written an impure function without realizing it; organizing functions so they are not defined in the same environment as the one in which they are called helps to ensure purity.)
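
The "impure without realizing it" case can be sketched as follows. All names are hypothetical. In version A the forgotten argument is silently picked up from the global environment; in version B the same mistake surfaces as an error because the variable is local to main():

```r
# Version A: everything at top level. `scale_by` forgets to take `factor_a`
# as an argument, yet it runs, because it finds the global variable.
factor_a <- 10
scale_by <- function(x) x * factor_a      # silently depends on a global
res_a <- scale_by(2)

# Version B: the same mistake, but the variable lives inside main().
scale_by_b <- function(x) x * factor_b    # same bug
main <- function() {
  factor_b <- 10                          # local to main()
  # The lookup fails, so the error message is captured and returned.
  tryCatch(scale_by_b(2), error = function(e) conditionMessage(e))
}
res_b <- main()

print(res_a)
print(res_b)
```

Version B fails loudly, which is the sense in which defining functions outside the environment where they are called helps to enforce purity.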

Perhaps I am mistaken, but in either case, your points actually favor a main() idiom, unless you take using main() to mean using main() with extra bits (e.g., flattening your code structure).

Admittedly, putting every single function into a package and not having any named functions in your script generally addresses all of these issues. 

Best,
CG


Re: Conventions: Use of globals and main functions

Peter Meissner-3
The point is that there are several possible problems.

But.

On the one hand, they are not really problematic in my opinion (I do not
care if my function has potential access to objects outside of its
environment because this access is read-only at worst and it's not common
practice to use this potential anyway).
On the other hand, I am not sure what the main() idiom would actually bring
to the table other than allowing for the dual use of function definition and
function execution code in the same script, which we agreed is bad
practice.

Best, Peter
