Hi all,

I wanted to give you an update on vctrs (<https://vctrs.r-lib.org/>)

since I last brought it up here in August. The biggest change is that I now

have a much clearer idea of what vctrs is! I’ll summarise that here,

and point you to the documentation if you’re interested in learning

more. I’m planning on submitting vctrs to CRAN in the near future, but

it’s very much a 0.1.0 release and I expect it to continue to evolve as

more people try it out and give me feedback. I’d love to hear your

thoughts!

vctrs has three main goals:

- To define and motivate `vec_size()` and `vec_type()` as alternatives

to `length()` and `class()`.

- To define type- and size-stability, useful tools for analysing

function interfaces.

- To make it easier to create new S3 vector classes.

## Size and prototype

`vec_size()` was motivated by my desire to have a function that captures

the number of “observations” in a vector. This is particularly important

for data frames because it’s useful to have a function such that

`f(data.frame(x))` equals `f(x)`. No base function has this property:

`NROW()` comes closest, but because it’s defined in terms of `length()`

for dimensionless objects, it always returns a number, even for types

that can’t go in a data frame, e.g. `data.frame(mean)` errors even

though `NROW(mean)` is `1`.

``` r

vec_size(1:10)

#> [1] 10

vec_size(as.POSIXlt(Sys.time() + 1:10))

#> [1] 10

vec_size(data.frame(x = 1:10))

#> [1] 10

vec_size(array(dim = c(10, 4, 1)))

#> [1] 10

vec_size(mean)

#> Error: `x` is a not a vector

```
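The gap between `NROW()` and what can actually go in a data frame can be checked with base R alone. This is only a sketch: `fits_in_data_frame()` is a hypothetical helper introduced for illustration and is not part of vctrs.

``` r
# Hypothetical helper (not part of vctrs): TRUE if `x` can be passed to
# data.frame(), FALSE if data.frame() signals an error.
fits_in_data_frame <- function(x) {
  tryCatch(
    {
      data.frame(x)
      TRUE
    },
    error = function(e) FALSE
  )
}

NROW(1:10)                   # 10
NROW(data.frame(x = 1:10))   # 10
NROW(mean)                   # 1, because length() of a function is 1

fits_in_data_frame(1:10)     # TRUE
fits_in_data_frame(mean)     # FALSE
```

So `NROW()` reports one "observation" for an object that cannot be a column at all, which is the case `vec_size()` turns into an error.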

`vec_size()` is paired with `vec_slice()` for subsetting, i.e.

`vec_slice()` is to `vec_size()` as `[` is to `length()`;

`vec_slice(data.frame(x), i)` equals `data.frame(vec_slice(x, i))`

(modulo variable/row names).

(I plan to make `vec_size()` and `vec_slice()` generic in the next

release, providing another point of differentiation from `NROW()`.)
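The slicing invariant above can be sketched in base R. `slice_obs()` is a hypothetical stand-in written for this example; it is not how `vec_slice()` is implemented, and it only handles plain vectors and two-dimensional objects.

``` r
# Hypothetical stand-in for vec_slice() (not the vctrs implementation):
# index rows if `x` has two dimensions, otherwise index elements.
slice_obs <- function(x, i) {
  d <- dim(x)
  if (is.null(d)) {
    x[i]
  } else if (length(d) == 2) {
    x[i, , drop = FALSE]
  } else {
    stop("this sketch only handles vectors and 2-d objects")
  }
}

x <- c(10, 20, 30, 40)
i <- c(1, 3)

a <- slice_obs(data.frame(x), i)   # slice the data frame
b <- data.frame(slice_obs(x, i))   # slice the vector, then wrap it

# Same observations either way; only the variable and row names differ
identical(a[[1]], b[[1]])          # TRUE
NROW(a) == NROW(b)                 # TRUE
```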

Complementary to the size of a vector is its prototype, a

zero-observation slice of the vector. You can compute this using

`vec_type()`, but because many classes don’t have an informative print

method for a zero-length vector, I also provide `vec_ptype()` which

prints a brief summary. As well as the class, the prototype also

captures important attributes:

``` r

vec_ptype(1:10)

#> Prototype: integer

vec_ptype(array(1:40, dim = c(10, 4, 1)))

#> Prototype: integer[,4,1]

vec_ptype(Sys.time())

#> Prototype: datetime<local>

vec_ptype(data.frame(x = 1:10, y = letters[1:10]))

#> Prototype: data.frame<

#> x: integer

#> y: factor<5e105>

#> >

```

`vec_size()` and `vec_type()` are accompanied by functions that either

find or enforce a common size (using modified recycling rules) or common

type (by reducing a double-dispatching `vec_type2()` that determines the

common type from a pair of types).
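To make "reducing a pairwise function" concrete, here is a minimal base R sketch. `common_type2()` and `common_type()` are hypothetical names, and the three-level hierarchy is an assumption chosen for the example; the real `vec_type2()` uses double dispatch and covers far more types.

``` r
# Assumed toy hierarchy, for illustration only
type_order <- c("logical", "integer", "double")

# Hypothetical pairwise rule (not vctrs's vec_type2()): the common type of
# two types is whichever sits higher in the hierarchy.
common_type2 <- function(x, y) {
  stopifnot(x %in% type_order, y %in% type_order)
  type_order[max(match(x, type_order), match(y, type_order))]
}

# The common type of many inputs is the pairwise rule reduced over
# the type of each input.
common_type <- function(...) {
  Reduce(common_type2, lapply(list(...), typeof))
}

common_type(TRUE, 1L, 2.5)   # "double"
common_type(2.5, 1L, TRUE)   # "double"
common_type(1L, TRUE)        # "integer"
```

Because the pairwise rule in this sketch is symmetric, the result does not depend on the order of the inputs.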

You can read more about `vec_size()` and `vec_type()` at

<https://vctrs.r-lib.org/articles/type-size.html>.

## Stability

The definitions of size and prototype are motivated by my experiences

doing code review. I find that I can often spot problems by running R

code in my head. Obviously my mental R interpreter is much simpler than

the real interpreter, but it seems to focus on prototypes and sizes, and

I’m suspicious of code where I can’t easily predict the class of every

new variable.

This leads me to two definitions. A function is **type-stable** iff:

- You can predict the output type knowing only the input types.

- The order of arguments in `...` does not affect the output type.

Similarly, a function is **size-stable** iff:

- You can predict the output size knowing only the input sizes, or

there is a single numeric input that specifies the output size.

For example, `ifelse()` is type-unstable because the output type can be

different even when the input types are the same:

``` r

vec_ptype(ifelse(NA, 1L, 1L))

#> Prototype: logical

vec_ptype(ifelse(FALSE, 1L, 1L))

#> Prototype: integer

```
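Another base R illustration of the first condition, offered as a sketch rather than an exhaustive analysis: `sapply()` returns a different type for an empty list, whereas `vapply()` makes the caller declare the output type up front.

``` r
x_empty <- list()
x_full  <- list(1:3, 1:5)

# Same input type (a list), different output types
typeof(sapply(x_empty, length))   # "list"
typeof(sapply(x_full, length))    # "integer"

# vapply() fixes the output type, so it does not depend on the values
typeof(vapply(x_empty, length, integer(1)))   # "integer"
typeof(vapply(x_full, length, integer(1)))    # "integer"
```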

Size-stability is generally not a useful tool for analysing base R functions

because the definition is a bit too far away from base conventions. The

analogously defined length-stability is a bit better, but the definition

of length for non-vectors means that complete length-stability is rare.

For example, while `length(c(x, y))` usually equals

`length(x) + length(y)`, it does not hold for all possible inputs:

``` r

length(globalenv())

#> [1] 0

length(mean)

#> [1] 1

length(c(mean, globalenv()))

#> [1] 2

```

(I don’t mean to pick on base here; the tidyverse also has many

functions that violate these principles, but I wanted to stick to

functions that all readers would be familiar with.)

Type- and size-stable functions are desirable because they make it

possible to reason about code without knowing the precise values

involved. Of course, not all functions should be type- or size-stable: R

would be incredibly limited if you could predict the type or size of

`[[` and `read.csv()` without knowing the specific inputs! But where

possible, I think using type- and size-stable functions makes code

easier to reason about and hence more likely to be bug free.

You can read more about size- and type-stability at

<https://vctrs.r-lib.org/articles/stability.html>. This vignette

includes a detailed analysis of `c()` and a type- and size-stable

alternative called `vec_c()`.

## New vector types

Finally, vctrs provides `new_vctr()` and `new_rcrd()` to make it easier

to define new classes, following the conventions that I’ve found

helpful, including writing a constructor function that enforces the

types of the underlying vector and its attributes (more details at

<https://adv-r.hadley.nz/s3.html>). vctrs also makes life easier by

implementing many base generics in terms of a small set of primitives:

- At the simplest level, `print()` and `str()` are defined in terms of

`format()`. `as.data.frame()` is implemented using the standard

approach used for factor, POSIXct, Date etc.

- `[[` and `[` use `NextMethod()` dispatch to the underlying base

function, then restore attributes with `vec_restore()`. I’m not sure

what the base equivalent of `vec_restore()` is, but it makes

subclassing easier, as described in

<https://adv-r.hadley.nz/s3.html#s3-subclassing>.

- `==`, `!=`, `unique()`, `anyDuplicated()`, and `is.na()` are defined

in terms of `vec_proxy_equal()`. `<`, `<=`, `>=`, `>`, `min()`,

`max()`, `median()`, `quantile()`, and `xtfrm()` methods are defined

in terms of `vec_proxy_compare()`. More details + examples at

<https://vctrs.r-lib.org/articles/s3-vector.html#equality-and-comparison>.

- `+`, `-`, `/`, `*`, `^`, `%%`, `%/%`, `!`, `&`, and `|` operators

are defined in terms of a double-dispatching `vec_arith()`.

Mathematical functions including the Summary group generics, the

Math group generics, and a handful of others are defined using

`vec_math()`. More details at

<https://vctrs.r-lib.org/articles/s3-vector.html#arithmetic>.

These generics make creating a new vector more rewarding more quickly:

you can easily sketch out the big picture before going back and filling

in all the methods that make your class unique. More details at

<https://vctrs.r-lib.org/articles/s3-vector.html>.
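For readers who want to see the conventions mentioned above without installing anything, here is a base R sketch of a constructor that enforces the underlying type, a `print()` defined in terms of `format()`, and a `[` method that calls `NextMethod()` and then rebuilds the class. The `percent` class and `new_percent()` are hypothetical names for this example; the sketch does not use `new_vctr()` or `vec_restore()`, which do this work for you.

``` r
# Hypothetical class for illustration; vctrs is not used here.
new_percent <- function(x = double()) {
  stopifnot(is.double(x))            # enforce the underlying type
  structure(x, class = "percent")
}

format.percent <- function(x, ...) {
  paste0(format(unclass(x) * 100, ...), "%")
}

# print() is defined in terms of format()
print.percent <- function(x, ...) {
  print(format(x, ...), quote = FALSE)
  invisible(x)
}

# `[` delegates to the base method, then rebuilds the class by hand
`[.percent` <- function(x, i, ...) {
  new_percent(NextMethod())
}

p <- new_percent(c(0.1, 0.25, 0.5))
class(p[2:3])   # "percent"
print(p)
```

Every further generic (`[[`, `==`, `+`, and so on) would need the same hand-written treatment here, which is the repetition the vctrs primitives are meant to remove.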

Hadley

--

http://hadley.nz