looking for formula parser that allows coefficients

PaulJohnson32gmail
Can you point me at any packages that allow users to write a
formula with coefficients?

I want to write a data simulator that has a matrix X with lots
of columns, and then users can generate predictive models
by entering a formula that uses some of the variables, allowing
interactions, like

y ~ 2 + 1.1 * x1 + 3 * x3 + 0.1 * x1:x3 + 0.2 * x2:x2

Currently, in the rockchalk package, I have a function that simulates
data (genCorrelatedData2), but my interface to enter the beta
coefficients is poor.  I assumed the user would always enter 0's as
placeholders for the unused coefficients, and the intercept is
always first. The unnamed vector is too confusing.  I have them specify:

c(2, 1.1, 0, 3, 0, 0, 0.2, ...)

In the documentation I say (ridiculously) it is easy to figure out from
the examples, but it really isn't.
The function prints out the equation it thinks you intended; that's
minimal protection against user error, but still not very good:

dat <- genCorrelatedData2(N = 10, rho = 0.0,
          beta = c(1, 2, 1, 1, 0, 0.2, 0, 0, 0),
          means = c(0,0,0), sds = c(1,1,1), stde = 0)
[1] "The equation that was calculated was"
y = 1 + 2*x1 + 1*x2 + 1*x3
 + 0*x1*x1 + 0.2*x2*x1 + 0*x3*x1
 + 0*x1*x2 + 0*x2*x2 + 0*x3*x2
 + 0*x1*x3 + 0*x2*x3 + 0*x3*x3
 + N(0,0) random error

But still, it is not very good.

As I look at this now, I realize I expect just the vech, not the whole vector
of all interaction terms, so it is even more difficult than I thought to get the
correct input. Hence, I'd like to let the user write a formula.

The alternative for the user interface is to have named coefficients.
I can more or less easily allow a named vector for beta

beta = c("(Intercept)" = 1, "x1" = 2, "x2" = 1, "x3" = 1, "x2:x1" = 0.1)

I could build a formula from that.  That's not too bad. But I still think
it would be cool to allow formula input.
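
The named-vector interface described above can be sketched in a few lines. This is only an illustration, not rockchalk code: simFromBeta is a hypothetical helper, and it assumes X is a numeric matrix whose column names match the variable names used in names(beta). Terms joined by ":" are treated as products of the named columns, so the order within a name ("x2:x1" versus "x1:x2") does not matter.

```r
# Hypothetical helper (not part of rockchalk): compute the linear predictor
# from a named coefficient vector. Assumes colnames(X) match the names in beta.
simFromBeta <- function(X, beta) {
  y <- numeric(nrow(X))
  for (nm in names(beta)) {
    if (nm == "(Intercept)") {
      y <- y + beta[[nm]]
      next
    }
    # "x2:x1" -> c("x2", "x1"); a main effect is a single name
    vars <- strsplit(nm, ":", fixed = TRUE)[[1]]
    stopifnot(all(vars %in% colnames(X)))
    # row-wise product of the columns named in this term
    y <- y + beta[[nm]] * apply(X[, vars, drop = FALSE], 1, prod)
  }
  y
}

X <- matrix(1:20, 5, 4, dimnames = list(NULL, paste0("x", 1:4)))
beta <- c("(Intercept)" = 1, "x1" = 2, "x2" = 1, "x3" = 1, "x2:x1" = 0.1)
simFromBeta(X, beta)
```

Unused variables simply do not appear in beta, so no placeholder zeros are needed.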

Have you ever seen it done?
pj
--
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: looking for formula parser that allows coefficients

Fox, John
Dear Paul,

Is it possible that you're overthinking this? That is, do you really need an R model formula or just want to evaluate an arithmetic expression using the columns of X?

If the latter, the following approach may work for you:

> evalFormula <- function(X, expr){
+   if (is.null(colnames(X))) colnames(X) <- paste0("x", 1:ncol(X))
+   with(as.data.frame(X), eval(parse(text=expr)))
+ }

> X <- matrix(1:20, 5, 4)
> X
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

> evalFormula(X, '2 + 3*x1 + 4*x2 + 5*x3 + 6*x1*x2')
[1] 120 180 252 336 432

I hope that this helps,
 John

-----------------------------------------------------------------
John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/
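
A variant of the approach above could take a formula object instead of a string, which avoids parse(text=). This is a sketch under assumptions: evalRHS is a hypothetical name, the formula must be two-sided, and the right-hand side is evaluated as ordinary arithmetic. That last point matters: in this setting x1:x3 would be R's sequence operator, not an interaction, so products have to be written with *.

```r
# Hypothetical variant: evaluate the right-hand side of a two-sided formula
# as plain arithmetic on the columns of X (":" is NOT an interaction here).
evalRHS <- function(X, fo) {
  if (is.null(colnames(X))) colnames(X) <- paste0("x", seq_len(ncol(X)))
  # fo[[3]] is the right-hand side of y ~ rhs
  eval(fo[[3]], as.data.frame(X), environment(fo))
}

X <- matrix(1:20, 5, 4)
evalRHS(X, y ~ 2 + 3*x1 + 4*x2 + 5*x3 + 6*x1*x2)
```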




Re: looking for formula parser that allows coefficients

Gabor Grothendieck
In reply to this post by PaulJohnson32gmail
Some string manipulation can convert the formula to a named vector such as
the one shown at the end of your post.

library(gsubfn)

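# input
fo <- y ~ 2 - 1.1 * x1 + x3 - x1:x3 + 0.2 * x2:x2

# one term: optional sign, optional number, optional "*", optional name
pat <- "([+-])? *(\\d\\S*)? *\\*? *([[:alpha:]]\\S*)?"
ch <- format(fo[[3]])                     # right-hand side as a string
m <- matrix(strapplyc(ch, pat)[[1]], 3)   # rows: sign, coefficient, name
m <- m[, colSums(m != "") > 0]            # drop empty matches
m[2, m[2, ] == ""] <- 1                   # no number means coefficient 1
m[3, m[3, ] == ""] <- "(Intercept)"       # no name means the intercept
co <- as.numeric(paste0(m[1, ], m[2, ]))  # apply the sign to the number
v <- m[3, ]
setNames(co, v)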
## (Intercept)          x1          x3       x1:x3       x2:x2
##         2.0        -1.1         1.0        -1.0         0.2



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: looking for formula parser that allows coefficients

Gabor Grothendieck
Also here is a solution that uses formula processing rather than
string processing.
No packages are used.

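# helper: TRUE if e is the symbol named ch (e.g. the operator `*`)
isChar <- function(e, ch) identical(e, as.symbol(ch))

# walk the expression tree and return a named vector of coefficients
Parse <- function(e) {
  if (length(e) == 1) {
    # leaf: a number is a coefficient, a symbol is a term with coefficient 1
    if (is.numeric(e)) return(e)
    else setNames(1, as.character(e))
  } else {
    if (isChar(e[[1]], "*")) {
       # product: multiply the coefficients and join the names
       x1 <- Recall(e[[2]])
       x2 <- Recall(e[[3]])
       setNames(unname(x1 * x2), paste0(names(x1), names(x2)))
    } else if (isChar(e[[1]], "+")) c(Recall(e[[2]]), Recall(e[[3]]))
    else if (isChar(e[[1]], "-")) {
      # unary minus flips one term; binary minus flips the right operand
      if (length(e) == 2) -1 * Recall(e[[2]])
      else c(Recall(e[[2]]), -Recall(e[[3]]))
    } else if (isChar(e[[1]], ":")) setNames(1, paste(e[-1], collapse = ":"))
  }
}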

# test
fo <- y ~ 2 - 1.1 * x1 + x3 - x1:x3 + 0.2 * x2:x2
Parse(fo[[3]])

giving:

         x1    x3 x1:x3 x2:x2
  2.0  -1.1   1.0  -1.0   0.2
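
Note that in the output above the intercept has an empty name, and a term written twice in the formula would come back as two separate entries. A small post-processing step could label the intercept and add up repeats. The sketch below is an assumption about what a caller might want, not part of the solution above; tidyCoef is a hypothetical name, and the input vector is typed in by hand so the example runs on its own.

```r
# Hypothetical post-processing: name the intercept and sum repeated terms.
tidyCoef <- function(co) {
  nms <- names(co)
  if (is.null(nms)) nms <- rep("", length(co))
  nms[is.na(nms) | nms == ""] <- "(Intercept)"
  # keep terms in order of first appearance
  grp <- factor(nms, levels = unique(nms))
  vapply(split(unname(co), grp), sum, numeric(1))
}

# same shape as the vector shown above, plus a repeated x1 term
co <- c(2, x1 = -1.1, x3 = 1, "x1:x3" = -1, "x2:x2" = 0.2, x1 = 0.5)
tidyCoef(co)
```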
On Wed, Aug 22, 2018 at 11:50 AM Paul Johnson <[hidden email]> wrote:

>
> Thanks as usual.  I owe you more KU decorations soon.

Re: looking for formula parser that allows coefficients

Gabor Grothendieck
The isChar function used in Parse is:

  isChar <- function(e, ch) identical(e, as.symbol(ch))