vctrs: a type system for the tidyverse


hadley wickham
Hi all,

I wanted to share with you an experimental package that I’m currently
working on: vctrs, <https://github.com/r-lib/vctrs>. The motivation for
vctrs is to think deeply about the output “type” of functions like
`c()`, `ifelse()`, and `rbind()`, with an eye to implementing one
strategy throughout the tidyverse (i.e. all the functions listed at
<https://github.com/r-lib/vctrs#tidyverse-functions>). Because this is
going to be a big change, I thought it would be very useful to get
comments from a wide audience, so I’m reaching out to R-devel to get
your thoughts.

There is quite a lot already in the readme
(<https://github.com/r-lib/vctrs#vctrs>), so here I’ll try to motivate
vctrs as succinctly as possible by comparing `base::c()` to its
equivalent `vctrs::vec_c()`. I think the drawbacks of `c()` are well
known, but to refresh your memory, I’ve highlighted a few at
<https://github.com/r-lib/vctrs#compared-to-base-r>. I think they arise
because of two main challenges: `c()` has to both combine vectors *and*
strip attributes, and it only dispatches on the first argument.

The design of vctrs is largely driven by a pair of principles:

-   The type of `vec_c(x, y)` should be the same as `vec_c(y, x)`

-   The type of `vec_c(x, vec_c(y, z))` should be the same as
    `vec_c(vec_c(x, y), z)`

i.e. the type should be associative and commutative. I think these are
good principles because they make types simpler to understand and to
implement.
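
To make the commutativity point concrete, here is a small base R sketch (no vctrs required) of how `c()` picks its method from the first argument only, so that swapping the arguments changes the class of the result. The variable names are purely illustrative, and only the classes are compared, since the combined values depend on the R version.

```r
# c() dispatches on its first argument only, so argument order
# decides which method runs and therefore the class of the result.
d  <- Sys.Date()
dt <- Sys.time()

date_first <- class(c(d, dt))   # c.Date() is used
time_first <- class(c(dt, d))   # c.POSIXct() is used

# Base c() is not commutative in the class it returns
identical(date_first, time_first)
#> [1] FALSE
```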

Method dispatch for `vec_c()` is quite simple because associativity and
commutativity mean that we can determine the output type only by
considering a pair of inputs at a time. To this end, vctrs provides
`vec_type2()` which takes two inputs and returns their common type
(represented as a zero-length vector):

    str(vec_type2(integer(), double()))
    #>  num(0)

    str(vec_type2(factor("a"), factor("b")))
    #>  Factor w/ 2 levels "a","b":

    # NB: not all types have a common/unifying type
    str(vec_type2(Sys.Date(), factor("a")))
    #> Error: No common type for date and factor

(`vec_type2()` currently implements double dispatch through a combination
of S3 dispatch and if-else blocks, but this will change to a pure S3
approach in the near future.)
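
As a rough illustration of what double dispatch built purely from S3 could look like, the sketch below uses only base R. The names `common_type()`, `common_type_integer()`, and `common_type_double()` are hypothetical and this is not the vctrs implementation: the outer generic dispatches on the first argument, and each method then calls an inner generic that dispatches on the second.

```r
# Hypothetical toy generic, not part of vctrs.
common_type <- function(x, y) UseMethod("common_type")
common_type.default <- function(x, y) stop("No common type")

# First dispatch: on the class of x.
common_type.integer <- function(x, y) common_type_integer(y)
common_type.double  <- function(x, y) common_type_double(y)

# Second dispatch: on the class of y, one inner generic per class of x.
common_type_integer <- function(y) UseMethod("common_type_integer")
common_type_integer.integer <- function(y) integer()
common_type_integer.double  <- function(y) double()
common_type_integer.default <- function(y) stop("No common type")

common_type_double <- function(y) UseMethod("common_type_double")
common_type_double.integer <- function(y) double()
common_type_double.double  <- function(y) double()
common_type_double.default <- function(y) stop("No common type")

str(common_type(1L, 2.5))
#>  num(0)
str(common_type(2.5, 1L))
#>  num(0)
```

Keeping the table symmetric (the integer/double cell equals the double/integer cell) is what makes the result independent of argument order.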

To find the common type of multiple vectors, we can use `Reduce()`:

    vecs <- list(TRUE, 1:10, 1.5)

    type <- Reduce(vec_type2, vecs)
    str(type)
    #>  num(0)

There’s one other piece of the puzzle: casting one vector to another
type. That’s implemented by `vec_cast()` (which also uses double
dispatch):

    str(lapply(vecs, vec_cast, to = type))
    #> List of 3
    #>  $ : num 1
    #>  $ : num [1:10] 1 2 3 4 5 6 7 8 9 10
    #>  $ : num 1.5

All up, this means that we can implement the essence of `vec_c()` in
only a few lines:

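    vec_c2 <- function(...) {
      args <- list(...)
      # Common type of all inputs, found one pair at a time
      type <- Reduce(vec_type2, args)

      # Cast every input (not the type itself) to the common type
      cast <- lapply(args, vec_cast, to = type)
      unlist(cast, recursive = FALSE)
    }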

    vec_c(factor("a"), factor("b"))
    #> [1] a b
    #> Levels: a b

    vec_c(Sys.Date(), Sys.time())
    #> [1] "2018-08-06 00:00:00 CDT" "2018-08-06 11:20:32 CDT"

(The real implementation is a little more complex:
<https://github.com/r-lib/vctrs/blob/master/R/c.R>)
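
For readers without vctrs installed, the same reduce-then-cast shape can be tried out in base R. This is only a sketch under simplifying assumptions: `toy_type2()`, `toy_cast()`, and `toy_c()` are hypothetical helpers, they look only at `typeof()`, and they know about logical, integer, and double vectors and nothing else (so classed objects such as factors are not handled properly).

```r
# Hypothetical helpers, not part of vctrs: base types only.
toy_type2 <- function(x, y) {
  ranks <- c(logical = 1L, integer = 2L, double = 3L)
  tx <- typeof(x)
  ty <- typeof(y)
  if (!all(c(tx, ty) %in% names(ranks))) {
    stop("No common type for ", tx, " and ", ty)
  }
  # The common type is the higher-ranked of the two inputs
  winner <- names(ranks)[max(ranks[[tx]], ranks[[ty]])]
  vector(winner, length = 0)
}

toy_cast <- function(x, to) {
  storage.mode(x) <- typeof(to)
  x
}

toy_c <- function(...) {
  args <- list(...)
  type <- Reduce(toy_type2, args)            # pairwise common type
  cast <- lapply(args, toy_cast, to = type)  # cast each input to it
  unlist(cast, recursive = FALSE)
}

str(toy_c(TRUE, 1:3, 1.5))
#>  num [1:5] 1 1 2 3 1.5
```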

On top of this foundation, vctrs expands in a few different ways:

-   To consider the “type” of a data frame, and what the common type of
    two data frames should be. This leads to a natural implementation of
    `vec_rbind()` which includes all columns that appear in any input.

-   To create a new “list_of” type, a list where every element is of
    fixed type (enforced by `[<-`, `[[<-`, and `$<-`)

-   To think a little about the “shape” of a vector, and to consider
    recycling as part of the type system. (This thinking is not yet
    fully fleshed out.)
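
To give a feel for the first of these points, here is a minimal base R sketch of row-binding two data frames on the union of their columns. `rbind_union()` is a hypothetical helper written for this illustration, not the `vec_rbind()` implementation, and it leaves column type reconciliation to `rbind()` itself.

```r
# Hypothetical helper: bind rows, keeping every column that appears
# in either input and filling the gaps with NA.
rbind_union <- function(a, b) {
  cols <- union(names(a), names(b))
  fill_missing <- function(df) {
    for (nm in setdiff(cols, names(df))) {
      df[[nm]] <- rep(NA, nrow(df))
    }
    df[cols]
  }
  rbind(fill_missing(a), fill_missing(b))
}

df1 <- data.frame(x = 1:2)
df2 <- data.frame(y = "a", stringsAsFactors = FALSE)
out <- rbind_union(df1, df2)
names(out)
#> [1] "x" "y"
nrow(out)
#> [1] 3
```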

Thanks for making it to the bottom of this long email :) I would love to
hear your thoughts on vctrs. It’s something that I’ve been having a lot
of fun exploring, and I’d like to make sure it is as robust as possible
(and the motivations are as clear as possible) before we start using it
in other packages.

Hadley


--
http://hadley.nz

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: vctrs: a type system for the tidyverse

Gabe Becker
Hadley,

Looks interesting and like a fun project from what you said in the email (I
don't have time right now to dig deep into the readme). A few thoughts.

First off, you are using the word "type" throughout this email; you seem to
mean class (judging by your Date and factor examples, and the fact you
mention S3 dispatch) as opposed to type in the sense of what is returned by
R's typeof() function. I think it would be clearer if you called it class
throughout unless that isn't actually what you mean (in which case I would
have other questions...).

More thoughts inline.

On Mon, Aug 6, 2018 at 9:21 AM, Hadley Wickham <[hidden email]> wrote:

> Method dispatch for `vec_c()` is quite simple because associativity and
> commutativity mean that we can determine the output type only by
> considering a pair of inputs at a time. To this end, vctrs provides
> `vec_type2()` which takes two inputs and returns their common type
> (represented as zero length vector):
>
>     str(vec_type2(integer(), double()))
>     #>  num(0)
>
>     str(vec_type2(factor("a"), factor("b")))
>     #>  Factor w/ 2 levels "a","b":
>

What is the reasoning behind taking the union of the levels here? I'm not
sure that is actually the behavior I would want if I have a vector of
factors and I try to append some new data to it. I might want/expect to
retain the existing levels and get either NAs or an error if the new data
has (present) levels not in the first data. The behavior as above doesn't
seem in line with what I understand the purpose of factors to be (explicit
restriction of possible values).

I guess what I'm saying is that while I agree associativity is good for
most things, it doesn't seem like the right behavior to me in the case of
factors.
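
The stricter alternative described here (keep the existing levels, turn anything new into NA) can be sketched in base R as follows. The variable names are illustrative only, and this is not something vctrs is claimed to do.

```r
old <- factor(c("a", "b"))
new <- c("b", "c")            # "c" is not an existing level

# Append while keeping the original level set fixed:
# values outside levels(old) become NA.
strict <- factor(c(as.character(old), new), levels = levels(old))

levels(strict)
#> [1] "a" "b"
is.na(strict)
#> [1] FALSE FALSE FALSE  TRUE

# The error variant would check first, e.g.
# stopifnot(all(new %in% levels(old)))
```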

Also, while we're on factors, what does

vec_type2(factor("a"), "a")

return, character or factor with levels "a"?



>
>     # NB: not all types have a common/unifying type
>     str(vec_type2(Sys.Date(), factor("a")))
>     #> Error: No common type for date and factor
>

Why is this not a list? Do you have the additional constraint that vec_type2
must return the class of one of its operands? If so, what is the
justification for that? Are you not counting list as a "type of vector"?


> On top of this foundation, vctrs expands in a few different ways:
>
> -   To consider the “type” of a data frame, and what the common type of
>     two data frames should be. This leads to a natural implementation of
>     `vec_rbind()` which includes all columns that appear in any input.
>

I must admit I'm a bit surprised here. rbind is one of the few places that
immediately come to mind where R takes a fail-early-and-loud approach to
likely errors (as opposed to the more permissive "do something that could
be what they meant" approach of, e.g., out-of-bounds indexing). Are we sure
we want rbind to get less strict with respect to compatibility of the
data.frames being combined? Another "permissive" option would be to return
a data.frame which has only the intersection of the columns. There are
certainly times when that is what I want (rather than columns with tons of
NAs in them) and it would be convenient not to need to do the column
subsetting myself. This behavior would also meet your design goals of
associativity and commutativity.

I want to be clear, I think what you describe is a useful operation, if it
is what is intended, but perhaps a different name rather than calling it
rbind? Maybe vec_rcbind, to indicate that both rows and columns are
potentially being added to any given individual input.

Best,
~G


--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research


Re: vctrs: a type system for the tidyverse

hadley wickham
> First off, you are using the word "type" throughout this email; You seem to
> mean class (judging by your Date and factor examples, and the fact you
> mention S3 dispatch) as opposed to type in the sense of what is returned by
> R's  typeof() function. I think it would be clearer if you called it class
> throughout unless that isn't actually what you mean (in which case I would
> have other questions...)

I used "type" to hand-wave away the precise definition - it's not S3
class or base type (i.e. typeof()) but some hybrid of the two. I do
want to emphasise that it's a type system, not an OO system, in that
coercions are not defined by superclass/subclass relationships.

>> Method dispatch for `vec_c()` is quite simple because associativity and
>> commutativity mean that we can determine the output type only by
>> considering a pair of inputs at a time. To this end, vctrs provides
>> `vec_type2()` which takes two inputs and returns their common type
>> (represented as zero length vector):
>>
>>     str(vec_type2(integer(), double()))
>>     #>  num(0)
>>
>>     str(vec_type2(factor("a"), factor("b")))
>>     #>  Factor w/ 2 levels "a","b":
>
>
> What is the reasoning behind taking the union of the levels here? I'm not
> sure that is actually the behavior I would want if I have a vector of
> factors and I try to append some new data to it. I might want/ expect to
> retain the existing levels and get either NAs or an error if the new data
> has (present) levels not in the first data. The behavior as above doesn't
> seem in-line with what I understand the purpose of factors to be (explicit
> restriction of possible values).

Originally (like a week ago 😀), we threw an error if the factors
didn't have the same levels, and provided an optional coercion to
character. I decided that while correct (the factor levels are a
parameter of the type, and hence factors with different levels aren't
comparable), this fights too much against how people actually use
factors in practice. It also seems like base R is moving more in this
direction, i.e. in R 3.4 factor("a") == factor("b") is an error, whereas
in R 3.5 it returns FALSE.

I'm not wedded to the current approach, but it feels like the same
principle should apply in comparisons like x == y (even though == is
outside the scope of vctrs, ideally the underlying principles would be
robust enough to suggest what should happen).

> I guess what I'm saying is that while I agree associativity is good for most
> things, it doesn't seem like the right behavior to me in the case of
> factors.

I think associativity is such a strong and useful principle that it
may be worth making some sacrifices for factors. That said, my claim
of associativity and commutativity applies only to the type, not to the
values of the type: vec_c(fa, fb) and vec_c(fb, fa) both return factors,
but the levels are in different orders.
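
The ordering effect is easy to see with base R alone, assuming (as described earlier in the thread) that the combined levels are the union of the inputs' levels taken in argument order. The names `ab` and `ba` are illustrative.

```r
fa <- factor("a")
fb <- factor("b")

# Union of levels, taken in argument order
ab <- union(levels(fa), levels(fb))
ba <- union(levels(fb), levels(fa))

ab
#> [1] "a" "b"
ba
#> [1] "b" "a"

# Same set of levels, different order
setequal(ab, ba)
#> [1] TRUE
identical(ab, ba)
#> [1] FALSE
```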

> Also, while we're on factors, what does
>
> vec_type2(factor("a"), "a")
>
> return, character or factor with levels "a"?

Character. Coercing to factor would potentially lose too much
information. I think you could argue that this could be an error, but
again I feel like this would make the type system a little too strict
and cause extra friction for most uses.

>>     # NB: not all types have a common/unifying type
>>     str(vec_type2(Sys.Date(), factor("a")))
>>     #> Error: No common type for date and factor
>
>
> Why is this not a list? Do you have the additional restraint that vec_type2
> must return the class of one of its operands? If so, what is the
> justification of that? Are you not counting list as a "type of vector"?

You can always request a list, with `vec_type2(Sys.Date(),
factor("a"), .type = list())` - generally the philosophy is to not
make major changes to the type without explicit user input.

I can't currently fully articulate my reasoning for why some coercions
happen automatically, and why some don't. I think these decisions have
to be made somewhat on the basis of pragmatics, and what R users are
currently familiar with. You can see a visual summary of implicit
casts (arrows) + explicit casts (circles) at
https://github.com/r-lib/vctrs/blob/master/man/figures/combined.png.
This matrix must be symmetric, and I think it should be block
diagonal, but I don't otherwise know what the constraints are.

>> On top of this foundation, vctrs expands in a few different ways:
>>
>> -   To consider the “type” of a data frame, and what the common type of
>>     two data frames should be. This leads to a natural implementation of
>>     `vec_rbind()` which includes all columns that appear in any input.
>
>
> I must admit I'm a bit surprised here. rbind is one of the few places that
> immediately come to mind where R takes a fail early and loud approach to
> likely errors (as opposed to the more permissive "do something that could be
> what they meant" approach of, e.g., out-of-bounds indexing). Are we sure we
> want rbind to get less strict with respect to compatibility of the
> data.frames being combined?

Pragmatically, it's clearly needed for data analysis.

Also note that there are some inputs to rbind that lead to silent data loss:

rbind(data.frame(x = 1:3), c(1, 1000000))
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 1

So while it's pretty good in general, there are still a few
infelicities (in particular, I suspect R-core might be interested in
fixing this one).

> Another "permissive" option would be to return a
> data.frame which has only the intersection of the columns. There are
> certainly times when that is what I want (rather than columns with tons of
> NAs in them) and it would be convenient not to need to do the column
> subsetting myself. This behavior would also meet your design goals of
> associativity and commutativity.

Yes, I think that would make sense as an option and would be trivial
to implement (issue at https://github.com/r-lib/vctrs/issues/46).
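
As a rough idea of what the intersection option amounts to, here is a base R sketch. `rbind_intersect()` is a hypothetical helper for illustration, not a vctrs function and not the design tracked in the issue above.

```r
# Hypothetical helper: bind rows, keeping only the columns
# that are present in both inputs.
rbind_intersect <- function(a, b) {
  cols <- intersect(names(a), names(b))
  rbind(a[cols], b[cols])
}

df1 <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(x = 3L, z = TRUE)

out <- rbind_intersect(df1, df2)
names(out)
#> [1] "x"
out$x
#> [1] 1 2 3
```

Because `intersect()` yields the same set of names whichever input comes first, the set of output columns does not depend on argument order.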

Another thing I need to implement is the ability to specify the types
of some columns. Currently it's all or nothing:

vec_rbind(
  data.frame(x = F, y = 1),
  data.frame(x = 1L, y = 2),
  .type = data.frame(x = logical())
)

#>       x
#> 1 FALSE
#> 2  TRUE
#> Warning messages:
#> 1: Lossy conversion from data.frame to data.frame
#> Dropped variables: y
#> 2: Lossy conversion from data.frame to data.frame
#> Dropped variables: y

> I want to be clear, I think what you describe is a useful operation, if it
> is what is intended, but perhaps a different name rather than calling it
> rbind? maybe vec_rcbind to indicate that both rows and columns are being
> potentially added to any given individual input.

Sorry, I should have mentioned that this is unlikely to be the final
name. As well as the problem you mention, I think calling them
vec_cbind() and vec_rbind() over-emphasises the symmetry between the
two operations. cbind() and rbind() are symmetric for matrices, but
for data frames, rbind() is more about common types, and cbind() is
more about common shapes.

Thanks for your feedback, it's very useful!

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

hadley wickham
>> What is the reasoning behind taking the union of the levels here? I'm not
>> sure that is actually the behavior I would want if I have a vector of
>> factors and I try to append some new data to it. I might want/ expect to
>> retain the existing levels and get either NAs or an error if the new data
>> has (present) levels not in the first data. The behavior as above doesn't
>> seem in-line with what I understand the purpose of factors to be (explicit
>> restriction of possible values).
>
> Originally (like a week ago 😀), we threw an error if the factors
> didn't have the same level, and provided an optional coercion to
> character. I decided that while correct (the factor levels are a
> parameter of the type, and hence factors with different levels aren't
> comparable), that this fights too much against how people actually use
> factors in practice. It also seems like base R is moving more in this
> direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
> in R 3.5 it returns FALSE.

I now have a better argument, I think:

If you squint your brain a little, I think you can see that each set
of automatic coercions is about increasing resolution. Integers are
low resolution versions of doubles, and dates are low resolution
versions of date-times. Logicals are low resolution versions of
integers because there's a strong convention that `TRUE` and `FALSE`
can be used interchangeably with `1` and `0`.

But what is the resolution of a factor? We must take a somewhat
pragmatic approach because base R often converts character vectors to
factors, and we don't want to be burdensome to users. So we say that a
factor `x` has finer resolution than factor `y` if the levels of `y`
are contained in `x`. So to find the common type of two factors, we
take the union of the levels of each factor, giving a factor that has
finer resolution than both. Finally, you can think of a character
vector as a factor with every possible level, so factors and character
vectors are coercible.
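
The "finer resolution" rule for factors can be sketched in base R. `factor_type2()` and `factor_cast()` are hypothetical stand-ins, not the vctrs functions; the point is only that the union of the levels gives a type that both inputs can be cast to without producing any NA.

```r
# Hypothetical stand-ins for the factor/factor case only.
factor_type2 <- function(x, y) {
  # Zero-length factor whose levels are the union of both level sets
  factor(character(), levels = union(levels(x), levels(y)))
}

factor_cast <- function(x, to) {
  factor(as.character(x), levels = levels(to))
}

fa <- factor(c("a", "b"))
fb <- factor(c("b", "c"))

type <- factor_type2(fa, fb)
levels(type)
#> [1] "a" "b" "c"

# Neither input loses values when cast to the common type
anyNA(factor_cast(fa, type))
#> [1] FALSE
anyNA(factor_cast(fb, type))
#> [1] FALSE
```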

(extracted from the in-progress vignette explaining how to extend
vctrs to work with your own vectors, now that vctrs has been rewritten
to use double dispatch)

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

Martin Maechler
>>>>> Hadley Wickham
>>>>>     on Wed, 8 Aug 2018 09:34:42 -0500 writes:

    > I now have a better argument, I think:

    > If you squint your brain a little, I think you can see
    > that each set of automatic coercions is about increasing
    > resolution. Integers are low resolution versions of
    > doubles, and dates are low resolution versions of
    > date-times. Logicals are low resolution versions of
    > integers because there's a strong convention that `TRUE`
    > and `FALSE` can be used interchangeably with `1` and `0`.

    > But what is the resolution of a factor? We must take a
    > somewhat pragmatic approach because base R often converts
    > character vectors to factors, and we don't want to be
    > burdensome to users. So we say that a factor `x` has finer
    > resolution than factor `y` if the levels of `y` are
    > contained in `x`. So to find the common type of two
    > factors, we take the union of the levels of each factor,
    > giving a factor that has finer resolution than
    > both. Finally, you can think of a character vector as a
    > factor with every possible level, so factors and character
    > vectors are coercible.

    > (extracted from the in-progress vignette explaining how to
    > extend vctrs to work with your own vectors, now that vctrs
    > has been rewritten to use double dispatch)

I like this argumentation, and find it very nice indeed!
It confirms my own gut feeling which had led me to agree
with you, Hadley, that taking the union of all factor levels
should be done here.

As Gabe mentioned (and as you've explained), the term "type"
is really confusing here.  As you know, the R internals are all
about SEXPs, TYPEOF(), etc, and that's what the R level
typeof(.) also returns.  As you want to use something slightly
different, it should have a different name, ideally something not
existing yet in the R / S world, maybe 'kind'?

Martin



Re: vctrs: a type system for the tidyverse

Gabe Becker
In reply to this post by hadley wickham
Hadley,

Responses inline.

On Wed, Aug 8, 2018 at 7:34 AM, Hadley Wickham <[hidden email]> wrote:

> I now have a better argument, I think:
>
> If you squint your brain a little, I think you can see that each set
> of automatic coercions is about increasing resolution. Integers are
> low resolution versions of doubles, and dates are low resolution
> versions of date-times. Logicals are low resolution versions of
> integers because there's a strong convention that `TRUE` and `FALSE`
> can be used interchangeably with `1` and `0`.
>
> But what is the resolution of a factor? We must take a somewhat
> pragmatic approach because base R often converts character vectors to
> factors, and we don't want to be burdensome to users.


I don't know, I personally just don't buy this line of reasoning. Yes, you
can convert between characters and factors, but that doesn't make factors
"a special kind of character", which you seem to be implicitly arguing they
are. Fundamentally they are different objects with different purposes. As I
said in my previous email, the primary semantic purpose of factors is value
restriction. You don't WANT to increase the set of levels when your set of
values has already been carefully curated. Certainly not automagically.


> So we say that a
> factor `x` has finer resolution than factor `y` if the levels of `y`
> are contained in `x`. So to find the common type of two factors, we
> take the union of the levels of each factor, giving a factor that has
> finer resolution than both.


I'm not so sure. I think a more useful definition of resolution may be that
it is about increasing the precision of information. In that case, a factor
with 4 levels each of which is present has a *higher* resolution than the
same data with additional-but-absent levels on the factor object.  Now that
may be different when the new levels are not absent, but my point is
that it's not clear to me that resolution is a useful way of talking about
factors.


> Finally, you can think of a character
> vector as a factor with every possible level, so factors and character
> vectors are coercible.
>



If users want unrestricted character type behavior, then IMHO they should
just be using characters, and it's quite easy for them to do so in any case
I can easily think of where they have somehow gotten their hands on a
factor. If, however, they want a factor, it must be - I imagine - because
they actually want the semantics and behavior *specific* to factors.

Best,
~G


>
> (extracted from the in-progress vignette explaining how to extend
> vctrs to work with your own vctrs, now that vctrs has been rewritten
> to use double dispatch)
>
> Hadley
>
> --
> http://hadley.nz
>



--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: vctrs: a type system for the tidyverse

Gabe Becker
In reply to this post by Martin Maechler
Actually, I sent that too quickly, I should have let it stew a bit more.
I've changed my mind about the resolution argument I was trying to make.
There is more information, technically speaking, in the factor with empty
levels. I'm still not convinced that it's the right behavior, personally. It
may just be me though, since Martin seems on board. Mostly I'm just very
wary of taking away the thing about factors that makes them fundamentally
not characters, and removing the effectiveness of the level restriction, in
practice, does that.

Best,
~G

--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research


Re: vctrs: a type system for the tidyverse

hadley wickham
In reply to this post by Martin Maechler
>     > I now have a better argument, I think:
>
>     > If you squint your brain a little, I think you can see
>     > that each set of automatic coercions is about increasing
>     > resolution. Integers are low resolution versions of
>     > doubles, and dates are low resolution versions of
>     > date-times. Logicals are low resolution version of
>     > integers because there's a strong convention that `TRUE`
>     > and `FALSE` can be used interchangeably with `1` and `0`.
>
>     > But what is the resolution of a factor? We must take a
>     > somewhat pragmatic approach because base R often converts
>     > character vectors to factors, and we don't want to be
>     > burdensome to users. So we say that a factor `x` has finer
>     > resolution than factor `y` if the levels of `y` are
>     > contained in `x`. So to find the common type of two
>     > factors, we take the union of the levels of each factor,
>     > given a factor that has finer resolution than
>     > both. Finally, you can think of a character vector as a
>     > factor with every possible level, so factors and character
>     > vectors are coercible.
>
>     > (extracted from the in-progress vignette explaining how to
>     > extend vctrs to work with your own vctrs, now that vctrs
>     > has been rewritten to use double dispatch)
>
> I like this argumentation, and find it very nice indeed!
> It confirms my own gut feeling which had led me to agreeing
> with you, Hadley, that taking the union of all factor levels
> should be done here.

That's great to hear :)

> As Gabe mentioned (and you've explained about) the term "type"
> is really confusing here.  As you know, the R internals are all
> about SEXPs, TYPEOF(), etc, and that's what the R level
> typeof(.) also returns.  As you want to use something slightly
> different, it should be different naming, ideally something not
> existing yet in the R / S world, maybe 'kind' ?

Agreed - I've been using type in the sense of "type system"
(particularly as it relates to algebraic data types), but that's not
obvious from the current presentation, and as you note, is easily confused
with existing notions of type in R. I like your suggestion of kind,
but I think it might be possible to just talk about classes, and
instead emphasise that while the components of the system are classes
(and indeed it's implemented using S3), the coercion/casting
relationships do not strictly follow the subclass/superclass
relationships.

A good motivating example is now ordered vs factor - I don't think you
can say that ordered or factor have greater resolution than the other
so:

vec_c(factor("a"), ordered("a"))
#> Error: No common type for factor and ordered

This is not what you'd expect from an _object_ system since ordered is
a subclass of factor.
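
The subclass relationship mentioned here can be checked in base R. A minimal sketch of the object-system side only (it does not exercise vctrs):

```r
f <- factor("a")
o <- ordered("a")

class(f)                # "factor"
class(o)                # "ordered" "factor"
inherits(o, "factor")   # TRUE: ordered is an S3 subclass of factor
is.factor(o)            # TRUE
is.ordered(f)           # FALSE
```

So S3 dispatch treats an ordered factor as a factor, while the coercion rules described above deliberately do not.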

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

hadley wickham
In reply to this post by Gabe Becker
>> So we say that a
>> factor `x` has finer resolution than factor `y` if the levels of `y`
>> are contained in `x`. So to find the common type of two factors, we
>> take the union of the levels of each factor, given a factor that has
>> finer resolution than both.
>
> I'm not so sure. I think a more useful definition of resolution may be
> that it is about increasing the precision of information. In that case,
> a factor with 4 levels each of which is present has a higher resolution
> than the same data with additional-but-absent levels on the factor object.
> Now that may be different when the new levels are not absent, but
> my point is that it's not clear to me that resolution is a useful way of
> talking about factors.

An alternative way of framing factors is that they're about tracking
possible values, particularly possible values that don't exist in the
data that you have. Thinking about factors in that way makes unioning
the levels more natural.
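
A small base R illustration of "tracking possible values that don't exist in the data", with made-up survey answers:

```r
# Made-up survey answers; nobody picked "sometimes".
answers <- factor(c("never", "always", "never"),
                  levels = c("never", "sometimes", "always"))

counts <- table(answers)
as.integer(counts["sometimes"])   # 0: the level is tracked although absent

# With a plain character vector the unobserved value simply disappears.
names(table(as.character(answers)))   # "always" "never"
```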

> If users want unrestricted character type behavior, then IMHO they should
> just be using characters, and it's quite easy for them to do so in any case
> I can easily think of where they have somehow gotten their hands on a factor.
> If, however, they want a factor, it must be - I imagine - because they actually
> want the semantics and behavior specific to factors.

I think this is true in the tidyverse, which will never give you a
factor unless you explicitly ask for one, but the default in base R
(at least as soon as a data frame is involved) is to turn character
vectors into factors.

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

Iñaki Úcar
In reply to this post by Gabe Becker
On Wed, 8 Aug 2018 at 19:23, Gabe Becker (<[hidden email]>) wrote:
>
> Actually, I sent that too quickly, I should have let it stew a bit more.
> I've changed my mind about the resolution argument I was trying to make.
> There is more information, technically speaking, in the factor with empty
> levels. I'm still not convinced that it's the right behavior, personally. It
> may just be me though, since Martin seems on board. Mostly I'm just very
> wary of taking away the thing about factors that makes them fundamentally
> not characters, and removing the effectiveness of the level restriction, in
> practice, does that.

For what it's worth, I always thought about factors as fundamentally
characters, but with restrictions: a subspace of all possible strings.
And I'd say that a non-negligible number of R users may think about
them in a similar way.

In fact, if you search "concatenation factors", you'll see that back
in 2008 somebody asked on R-help [1] because he wanted to do exactly
what Hadley is describing (i.e., concatenation as character with
levels as a union of the levels), and he was surprised because...
well, the behaviour of c.factor is quite surprising if you don't read
the manual.

BTW, the solution proposed was unlist(list(fct1, fct2)).
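
A quick sketch of that workaround in base R, with made-up factors. `unlist()` on a list of factors is documented to return a factor whose levels are the union of the input level sets.

```r
fct1 <- factor(c("a", "b"))
fct2 <- factor(c("b", "c"))

combined <- unlist(list(fct1, fct2))

is.factor(combined)      # TRUE
levels(combined)         # "a" "b" "c"
as.character(combined)   # "a" "b" "b" "c"
```

What `c(fct1, fct2)` returns depends on the R version (in the versions current at the time of this thread it gives the underlying integer codes), so it is left out of the sketch.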

[1] https://www.mail-archive.com/r-help@.../msg38360.html

Iñaki


Re: vctrs: a type system for the tidyverse

Joris FA Meys
I sent this to Iñaki personally by mistake. Thank you for notifying me.

On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <[hidden email]> wrote:

>
> For what it's worth, I always thought about factors as fundamentally
> characters, but with restrictions: a subspace of all possible strings.
> And I'd say that a non-negligible number of R users may think about
> them in a similar way.
>

That idea has been a common source of bugs and the most important reason
why I always explain to my students that factors are a special kind of
numeric (integer), not character. Especially people coming from SPSS see
immediately the link with categorical variables in that way, and understand
that a factor is a modeling aid rather than an alternative for characters.
It is a categorical variable and a more readable way of representing a set
of dummy variables.
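
A minimal base R sketch of that view, using a made-up `dose` variable: the factor is stored as integer codes plus a level table, and `model.matrix()` expands it into dummy columns (this assumes the default treatment contrasts).

```r
dose <- factor(c("low", "mid", "high", "mid"),
               levels = c("low", "mid", "high"))

typeof(dose)       # "integer"
as.integer(dose)   # 1 2 3 2
levels(dose)       # "low" "mid" "high"

# With the default treatment contrasts, the first level is the reference
# and every other level gets its own 0/1 column.
mm <- model.matrix(~ dose)
colnames(mm)       # "(Intercept)" "dosemid" "dosehigh"
```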

I do agree that some of the factor behaviour is confusing at best, but that
doesn't change the appropriate use and meaning of factors as categorical
variables.

Even more, I oppose the ideas that:

1) factors with different levels should be concatenated.

2) when combining factors, the union of the levels would somehow be a good
choice.

Factors with different levels are variables with different information, not
more or less information. If one factor codes low and high and another
codes low, mid and high, you can't say whether mid in one factor would be
low or high in the first one. The second has a higher resolution, and
that's exactly the reason why they should NOT be combined. Different levels
indicate a different grouping, and hence that data should never be used as
one set of dummy variables in any model.

Even when combining factors, the union of levels only makes sense to me if
there's no overlap between levels of both factors. In all other cases, a
researcher will need to determine whether levels with the same label do
mean the same thing in both factors, and that's not guaranteed. And when
we're talking about a factor with a higher resolution and a lower resolution, the
correct thing to do modelwise is to recode one of the factors so they have
the same resolution and every level the same definition before you merge
that data.
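
As a sketch of such a recoding step in base R: the data are made up, and the decision that "mid" belongs with "low" is an assumption for the example, exactly the kind of call the researcher has to make.

```r
coarse <- factor(c("low", "high"), levels = c("low", "high"))
fine   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# Analyst's decision (an assumption for this example): "mid" counts as "low".
recoded <- fine
levels(recoded) <- list(low = c("low", "mid"), high = "high")

identical(levels(recoded), levels(coarse))   # TRUE
as.character(recoded)                        # "low" "low" "high"

# Both factors now share one level set, so they can be merged explicitly.
merged <- factor(c(as.character(coarse), as.character(recoded)),
                 levels = levels(coarse))
```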

So imho the combination of two factors with different levels (or even
levels in a different order) should give an error. Which R currently
doesn't throw, so I get there's room for improvement.
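
The strict behaviour argued for here could look roughly like the following in base R. `combine_strict()` is a hypothetical helper written for this illustration; it is not a base R or vctrs function.

```r
# Hypothetical helper: refuse to combine factors unless the level sets
# (including their order) are identical.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels (or a different level order)")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

a <- factor(c("low", "high"), levels = c("low", "high"))
b <- factor("high", levels = c("low", "high"))
d <- factor("high", levels = c("high", "low"))   # same labels, other order

ok  <- combine_strict(a, b)                    # values: low high high
res <- try(combine_strict(a, d), silent = TRUE)
inherits(res, "try-error")                     # TRUE
```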

Cheers
Joris
--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)


Re: vctrs: a type system for the tidyverse

hadley wickham
On Thu, Aug 9, 2018 at 3:57 AM Joris Meys <[hidden email]> wrote:

>
>  I sent this to  Iñaki personally by mistake. Thank you for notifying me.
>
> On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <[hidden email]> wrote:
>
> >
> > For what it's worth, I always thought about factors as fundamentally
> > characters, but with restrictions: a subspace of all possible strings.
> > And I'd say that a non-negligible number of R users may think about
> > them in a similar way.
> >
>
> That idea has been a common source of bugs and the most important reason
> why I always explain to my students that factors are a special kind of
> numeric (integer), not character. Especially people coming from SPSS see
> immediately the link with categorical variables in that way, and understand
> that a factor is a modeling aid rather than an alternative for characters.
> It is a categorical variable and a more readable way of representing a set
> of dummy variables.
>
> I do agree that some of the factor behaviour is confusing at best, but that
> doesn't change the appropriate use and meaning of factors as categorical
> variables.
>
> Even more, I oppose the ideas that:
>
> 1) factors with different levels should be concatenated.
>
> 2) when combining factors, the union of the levels would somehow be a good
> choice.
>
> Factors with different levels are variables with different information, not
> more or less information. If one factor codes low and high and another
> codes low, mid and high, you can't say whether mid in one factor would be
> low or high in the first one. The second has a higher resolution, and
> that's exactly the reason why they should NOT be combined. Different levels
> indicate a different grouping, and hence that data should never be used as
> one set of dummy variables in any model.
>
> Even when combining factors, the union of levels only makes sense to me if
> there's no overlap between levels of both factors. In all other cases, a
> researcher will need to determine whether levels with the same label do
> mean the same thing in both factors, and that's not guaranteed. And when
> we're talking about a factor with a higher resolution and a lower resolution, the
> correct thing to do modelwise is to recode one of the factors so they have
> the same resolution and every level the same definition before you merge
> that data.
>
> So imho the combination of two factors with different levels (or even
> levels in a different order) should give an error. Which R currently
> doesn't throw, so I get there's room for improvement.

I 100% agree with you, and this is the behaviour that vctrs used to
have and dplyr currently has (at least in bind_rows()). But
pragmatically, my experience with dplyr is that people find this
behaviour confusing and unhelpful. And when I played the full
expression of this behaviour in vctrs, I found that it forced me to
think about the levels of factors more than I'd otherwise like to: it
made me think like a programmer, not like a data analyst. So in an
ideal world, yes, I think factors would have stricter behaviour, but
my sense is that imposing this strictness now will be onerous to most
analysts.

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

Dirk Eddelbuettel
In reply to this post by hadley wickham

On 8 August 2018 at 12:40, Hadley Wickham wrote:
| I think this is true in the tidyverse, which will never give you a
| factor unless you explicitly ask for one, but the default in base R
| (at least as soon as a data frame is involved) is to turn character
| vectors into factors.

False. Base R does what the option stringsAsFactors tells it to. Whereas your
incorrect statement implies unconditional behaviour.  
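
To illustrate the distinction with a minimal sketch: `data.frame()` takes a `stringsAsFactors` argument, and passing it explicitly removes any dependence on the session default (which, in the R versions current at the time of this thread, is read from the option of the same name).

```r
x <- c("a", "b")

df_chr <- data.frame(x = x, stringsAsFactors = FALSE)
df_fct <- data.frame(x = x, stringsAsFactors = TRUE)

class(df_chr$x)   # "character"
class(df_fct$x)   # "factor"
```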

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: vctrs: a type system for the tidyverse

Joris FA Meys
In reply to this post by hadley wickham
Hi Hadley,

my point actually came from a data analyst point of view. A character
variable is something used for extra information, eg the "any other ideas?"
field of a questionnaire. A categorical variable is a variable describing
categories defined by the researcher. If it is made clear that a factor is
the object type needed for a categorical variable, there is no confusion.
All my students get it. But I agree that in many cases people are taught
that a factor is somehow related to character variables. And that does not
make sense from a data analyst point of view if you think about variables
as continuous, ordinal and nominal in a model context.

So I don't think adding more confusing behaviour and pitfalls is a solution
to something that's essentially a misunderstanding. It's something that's
only solved by explaining it correctly imho.

Cheers
Joris

On Thu, Aug 9, 2018 at 2:36 PM Hadley Wickham <[hidden email]> wrote:

>
> I 100% agree with you, and this is the behaviour that vctrs used to
> have and dplyr currently has (at least in bind_rows()). But
> pragmatically, my experience with dplyr is that people find this
> behaviour confusing and unhelpful. And when I played the full
> expression of this behaviour in vctrs, I found that it forced me to
> think about the levels of factors more than I'd otherwise like to: it
> made me think like a programmer, not like a data analyst. So in an
> ideal world, yes, I think factors would have stricter behaviour, but
> my sense is that imposing this strictness now will be onerous to most
> analysts.
>
> Hadley
>
> --
> http://hadley.nz
>


--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: vctrs: a type system for the tidyverse

hadley wickham
On Thu, Aug 9, 2018 at 7:54 AM Joris Meys <[hidden email]> wrote:
>
> Hi Hadley,
>
> my point actually came from a data analyst point of view. A character variable is something used for extra information, eg the "any other ideas?" field of a questionnaire. A categorical variable is a variable describing categories defined by the researcher. If it is made clear that a factor is the object type needed for a categorical variable, there is no confusion. All my students get it. But I agree that in many cases people are taught that a factor is somehow related to character variables. And that does not make sense from a data analyst point of view if you think about variables as continuous, ordinal and nominal in a model context.
>
> So I don't think adding more confusing behaviour and pitfalls is a solution to something that's essentially a misunderstanding. It's something that's only solved by explaining it correctly imho.

I agree with your definition of character and factor variables. It's
an important distinction, and I agree that the blurring of factors and
characters is generally undesirable. However, the merits of respecting
R's existing behaviour, and Martin Mächler's support, mean that I'm
not going to change vctrs's approach at this point in time. That said, I
hear from you and Gabe that this is an important issue, and I'll
definitely keep it in mind as I solicit further feedback from users.

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

hadley wickham
In reply to this post by hadley wickham
> > As Gabe mentioned (and you've explained about) the term "type"
> > is really confusing here.  As you know, the R internals are all
> > about SEXPs, TYPEOF(), etc, and that's what the R level
> > typeof(.) also returns.  As you want to use something slightly
> > different, it should be different naming, ideally something not
> > existing yet in the R / S world, maybe 'kind' ?
>
> Agreed - I've been using type in the sense of "type system"
> (particularly as it related to algebraic data types), but that's not
> obvious from the current presentation, and as you note, is confusing
> with existing notions of type in R. I like your suggestion of kind,
> but I think it might be possible to just talk about classes, and
> instead emphasise that while the components of the system are classes
> (and indeed it's implemented using S3), the coercion/casting
> relationships do not strictly follow the subclass/superclass
> relationships.

I've taken another pass through (the first part of) the readme
(https://github.com/r-lib/vctrs#vctrs), and I'm now confident that I
can avoid using "type" by itself, and instead always use it in a
compound phrase (like type system) to avoid confusion. That leaves the
`.type` argument to many vctrs functions. I'm considering changing it to
.prototype, because what you actually give it is a zero-length vector
of the class you want, i.e. a prototype of the desired output. What do
you think of prototype as a name?
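
For readers unfamiliar with the idea, a zero-length vector in base R still carries its class and attributes, which is what makes it usable as a prototype. A small sketch (base R only; it does not call any vctrs function):

```r
# Zero-length vectors that still describe the desired output.
p_int  <- integer()
p_fct  <- factor(character(), levels = c("a", "b"))
p_date <- Sys.Date()[0]

length(p_fct)    # 0
levels(p_fct)    # "a" "b"
class(p_date)    # "Date"
length(p_date)   # 0
typeof(p_int)    # "integer"
```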

Do you have any thoughts on good names for distinguishing vectors without
a class (i.e. logical, integer, double, ...) from vectors with a class
(e.g. factors, dates, etc.)? I've been thinking bare vector and S3
vector (leaving room to later think about S4 vectors). Do those sound
reasonable to you?
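
The distinction being named here is visible in base R through the class attribute. A minimal sketch:

```r
bare <- 1:3
s3   <- factor(c("a", "b"))

is.object(bare)                # FALSE
is.null(attr(bare, "class"))   # TRUE: no class attribute
class(bare)                    # "integer" (implicit class)

is.object(s3)                  # TRUE
attr(s3, "class")              # "factor"
is.object(Sys.Date())          # TRUE
```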

Hadley

--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

R devel mailing list
> I'm now confident that I
> can avoid using "type" by itself, and instead always use it in a
> compound phrase (like type system) to avoid confusion. That leaves the
> `.type` argument to many vctrs functions. I'm considering changing it to
> .prototype, because what you actually give it is a zero-length vector
> of the class you want, i.e. a prototype of the desired output. What do
> you think of prototype as a name?


The term “type system” in computer science is used in very different ways.
What the note describes is not a type system, but rather a set of
coercions used by a small number of functions in one package.

Typically it refers to a set of rules (either statically enforced
by the compiler or dynamically enforced by the runtime) that ensure
that some particular category of errors can be caught by the
language.

There is none of that here.

My suggestion would be to avoid “type system”.


"The short-term goal of vctrs is to develop a type system for vectors which will help reason about functions that combine different types of input (e.g. c(), ifelse(), rbind()). The vctrs type system encompasses base vectors (e.g. logical, numeric, character, list), S3 vectors (e.g. factor, ordered, Date, POSIXct), and data frames; and can be extended to deal with S3 vectors defined in other packages, as described in vignette("extending-vctrs")."

==>

The short-term goal of vctrs is to specify the behavior of functions that combine different types of vectors (e.g. c(), ifelse(), rbind()). The specification encompasses base vectors (e.g. logical, numeric, character, list), S3 vectors (e.g. factor, ordered, Date, POSIXct), and data frames; and can be extended to deal with S3 vectors defined in other packages, as described in vignette("extending-vctrs").

and so on.

-j

Re: vctrs: a type system for the tidyverse

hadley wickham
On Thu, Aug 9, 2018 at 4:26 PM jan Vitek <[hidden email]> wrote:

>
> > I'm now confident that I
> > can avoid using "type" by itself, and instead always use it in a
> > compound phrase (like type system) to avoid confusion. That leaves the
> > `.type` argument to many vctrs functions. I'm considering changing it to
> > .prototype, because what you actually give it is a zero-length vector
> > of the class you want, i.e. a prototype of the desired output. What do
> > you think of prototype as a name?
>
>
> The term “type system” in computer science is used in very different ways.
> What the note describes is not a type system, but rather a set of
> coercions used by a small number of functions in one package.
>
> Typically it refers to a set of rules (either statically enforced
> by the compiler or dynamically enforced by the runtime) that ensure
> that some particular category of errors can be caught by the
> language.
>
> There is none of that here.

I think there's a bit of that flavour here:

vec_c(factor("a"), Sys.Date())
#> Error: No common type for factor and date

This isn't a type system imposed by the language, but I don't think
that's a reason not to call it a type system.
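
To make that flavour concrete, here is a deliberately crude sketch in base R of a combine function that refuses mismatched inputs instead of coercing them. `strict_c()` is a hypothetical helper for illustration only; it is not part of vctrs, and its rule (both inputs must have identical class vectors) is an assumption that is far stricter than what `vec_c()` aims for:

```r
# Hypothetical helper, not part of vctrs: combine two vectors only when
# their class vectors are identical, and fail loudly otherwise.
strict_c <- function(x, y) {
  if (!identical(class(x), class(y))) {
    stop("No common type for ", class(x)[1], " and ", class(y)[1],
         call. = FALSE)
  }
  c(x, y)
}

# Same class on both sides: the inputs are combined as usual.
strict_c(1L, 2L)

# Different classes: the error is caught here so the script keeps running.
tryCatch(
  strict_c(factor("a"), Sys.Date()),
  error = function(e) conditionMessage(e)
)
```

The point is only that the failure is raised by a rule about the inputs' classes, before any values are combined.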

That said, I agree that calling it a type system is currently
overselling it, and I have made your proposed change to the README
(and added a very long-term goal of making a type system that could be
applied using (e.g.) annotations).

> “The short-term goal of vctrs is to develop a type system for vectors which will help reason about functions that combine different types of input (e.g. c(), ifelse(), rbind()). The vctrs type system encompasses base vectors (e.g. logical, numeric, character, list), S3 vectors (e.g. factor, ordered, Date, POSIXct), and data frames; and can be extended to deal with S3 vectors defined in other packages, as described in vignette("extending-vctrs").”
>
> ==>
>
> The short-term goal of vctrs is to specify the behavior of functions that combine different types of vectors (e.g. c(), ifelse(), rbind()). The specification encompasses base vectors (e.g. logical, numeric, character, list), S3 vectors (e.g. factor, ordered, Date, POSIXct), and data frames; and can be extended to deal with S3 vectors defined in other packages, as described in vignette("extending-vctrs").

Thanks for the nice wording!

Hadley


--
http://hadley.nz


Re: vctrs: a type system for the tidyverse

luke-tierney
Some ideas from the 'numeric tower' notion in scheme/lisp might also
be useful.
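
As a rough illustration of how a tower could carry over, here is a minimal base R sketch. The ordering in `tower` and the helper `common_type()` are assumptions made up for this example, not anything defined by vctrs or by Scheme: each type can be promoted to any type further along, and the common type of two inputs is whichever sits higher.

```r
# Hypothetical tower of atomic types, from least to most general.
tower <- c("logical", "integer", "double", "complex")

# Hypothetical helper: the common type is the higher of the two positions.
common_type <- function(x, y) {
  rank <- match(c(typeof(x), typeof(y)), tower)
  if (anyNA(rank)) {
    stop("type not in tower", call. = FALSE)
  }
  tower[max(rank)]
}

common_type(TRUE, 1L)  # "integer"
common_type(2.5, 1L)   # "double"
common_type(1L, 2.5)   # "double"
```

Because the result is a maximum over a linear order, the rule is commutative and associative by construction. Types that do not fit on a single line (factors and dates, say) are exactly where a tower stops being enough.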

Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: vctrs: a type system for the tidyverse

R devel mailing list
In reply to this post by hadley wickham

>
> I think there's a bit of that flavour here:
>
> vec_c(factor("a"), Sys.Date())
> #> Error: No common type for factor and date
>
> This isn't a type system imposed by the language, but I don't think
> that's a reason not to call it a type system.

All I am saying is that without a clear definition of the class of
errors being prevented, CS folks would not think of it as a type system.

In Java, the type system guarantees that you will not have a Method Not
Understood Error.  In ML, the type system ensures that all operations are
applied to the data types that they are defined for.

Also, a type system gives a guarantee over all programs. Here the guarantee
only applies to certain functions.


That said, it would be interesting to describe precisely which errors
ought to be prevented in R.

The discussion of what happens when you merge two vectors with different
factor levels is really interesting in that respect, as it suggests that
there are non-trivial issues that need to be worked out.
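
One of those issues can be seen with a small base R sketch. The rule used here (take the union of the levels, in order of first appearance) and the helper `combine_factors()` are assumptions chosen for illustration, not necessarily what vctrs does:

```r
f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c"))

# Hypothetical rule: the result keeps the union of both level sets,
# ordered by first appearance across the two inputs.
combine_factors <- function(x, y) {
  lv <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = lv)
}

levels(combine_factors(f1, f2))  # "a" "b" "c"
levels(combine_factors(f2, f1))  # "b" "c" "a"
```

Under this rule the level order of the result depends on the order of the arguments, so the combination is commutative only if level order is not considered part of the type.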


-j
