Function 'factor' issues

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Function 'factor' issues

R devel mailing list
In R devel, function 'factor' has been changed, allowing and merging duplicated 'labels'.

Issue 1: Handling of specified 'labels' without duplicates is slower than before.
Example:
x <- rep(1:26, 40000)
system.time(factor(x, levels=1:26, labels=letters))

Function 'factor' is already rather slow because of conversion to character. Please don't add slowdown.

Issue 2: While default 'labels' is 'levels', not specifying 'labels' may be different from specifying 'labels' to be the same as 'levels'.

Example 1:
as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
is different from
as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA), exclude = NULL))

File reg-tests-1d.R indicates that 'factor' behavior with NA is slightly changed, for the better. NA entry (because it is unmatched to 'levels' argument or is in 'exclude') is absorbed into NA in "levels" attribute (comes from 'labels' argument), if any. The issue is that it happens only when 'labels' is specified.

Function 'factor' could use match(xlevs, nlevs)[f]. It doesn't match NA to NA level. When 'f' is long enough, longer than 'xlevs', it is faster than match(xlevs[f], nlevs).

Example 2:
With
levs <- c("A","A")  ,
factor(levs, levels=levs)
gives error, but
factor(levs, levels=levs, labels=levs)
doesn't.

Note: In theory, if function 'factor' merged duplicated 'labels' in all cases, at least in
factor(c(sqrt(2)^2, 2))  ,
function 'factor' could do matching on original 'x' (without conversion to character), as in R before version 2.10.0. If function 'factor' did it,
factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
could take sqrt(2)^2 and 2 as distinct.


Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.

The case of as.factor(x) for
x <- as.data.frame(character(0))
in tests/isas-tests.Rout.save reveals that 'order' on data frame is strange.

x <- as.data.frame(character(0))
y <- unique(x)
length(y)  # 1
length(order(y))  # 0
length(as.character(y))  # 1

order(y) is not as long as as.character(y).

Another example:
length(mtcars)  # 11
length(order(mtcars))  # 352

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Function 'factor' issues

Martin Maechler
>>>>> Suharto Anggono Suharto Anggono via R-devel <[hidden email]>
>>>>>     on Sun, 15 Oct 2017 16:03:48 +0000 writes:

    > In R devel, function 'factor' has been changed, allowing and merging duplicated 'labels'.

Indeed.  That had been asked for and discussed a bit on this
list from June 14 to June 23, starting at
   https://stat.ethz.ch/pipermail/r-devel/2017-June/074451.html

    > Issue 1: Handling of specified 'labels' without duplicates is slower than before.
    > Example:
    > x <- rep(1:26, 40000)
    > system.time(factor(x, levels=1:26, labels=letters))

    > Function 'factor' is already rather slow because of conversion to character. Please don't add slowdown.

Indeed, I doo see a ~ 20%  performance loss for the example
above, and I may get to look into this.
However, in R-devel there have been important internal
changes (ALTREP additions) some of which are currently giving
some performance losses in some cases (but they have the
potential to give big performance _gains_ e.g. for simple
indexing into large vectors which may apply here !).
For factor(), these C level "ALTREP" changes may not be the reason at
all for the slow down;
I may find time to investigate further.

{{ For the ALTREP-change slowdowns I've noticed in some
   indexing/subset operations, we'll definitely have time to look into
   before R-devel is going to be released next spring... and as mentioned,
   these operations may even become considerably faster *thanks*
   to ALTREP ... }}

    > Issue 2: While default 'labels' is 'levels', not specifying 'labels' may be different from specifying 'labels' to be the same as 'levels'.

    > Example 1:
    > as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
    > is different from
    > as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA), exclude = NULL))

You are right.  But this is not so exceptional and part of the new feature of
'labels' allowing to "fix up" things in such cases.  While it
would be nice if this was not the case the same phenomenon
happens in other functions as well because of lazy evaluation.
I think I had noticed that already and at the time found
"not easy" to work around.
(There are many aspects about changing such important base functions:
 1. not breaking back compatibility ((unless in rare
    border cases, where we are sure it's worth))
 2. Keeping code relatively transparent
 3. Keep the semantics "simple" to document and as intuitive as possible
)

    > File reg-tests-1d.R indicates that 'factor' behavior with NA is slightly changed, for the better. NA entry (because it is unmatched to 'levels' argument or is in 'exclude') is absorbed into NA in "levels" attribute (comes from 'labels' argument), if any. The issue is that it happens only when 'labels' is specified.

I'm not sure anymore, but I think I had noticed that also in
June, considered to change it and found that such a changed
factor() would be too different from what it has "always been".
So, yes, IIRC, this current behavior is on purpose, if only for back compatibility.


    > Function 'factor' could use match(xlevs, nlevs)[f]. It doesn't match NA to NA level. When 'f' is long enough, longer than 'xlevs', it is faster than match(xlevs[f], nlevs).

    > Example 2:
    > With
    > levs <- c("A","A")  ,
    > factor(levs, levels=levs)
    > gives error, but
    > factor(levs, levels=levs, labels=levs)
    > doesn't.

yes, again that is a consequence of what you said above (before
'Example 1')

    > Note: In theory, if function 'factor' merged duplicated 'labels' in all cases, at least in
    > factor(c(sqrt(2)^2, 2))  ,
    > function 'factor' could do matching on original 'x' (without conversion to character), as in R before version 2.10.0. If function 'factor' did it,
    > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
    > could take sqrt(2)^2 and 2 as distinct.

Well, that may be interesting.. but I doubt if that's somewhere
we should go, easily, because  factor() has been documented to do
what it does now (with very slightly rounding such numbers via as.character(.))
and hence such a change would typically lead to much work for
too many people.

I do see that indeed the  as.character(.) inside factor() takes
most of the CPU time used in largish factor() examples [as your
first], and indeed, for the case of integer 'x', we really could
be much faster in factor construction.  

    > Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.

This has been by a change on purpose --- well documented as new
feature in NEWS --- to allow using *methods* for order(),
i.e. for the workhorse of order, xtfrm()  so that factor(OB)
works for more general objects OB.


    > The case of as.factor(x) for
    > x <- as.data.frame(character(0))
    > in tests/isas-tests.Rout.save reveals that 'order' on data frame is strange.

    > x <- as.data.frame(character(0))
    > y <- unique(x)
    > length(y)  # 1
    > length(order(y))  # 0
    > length(as.character(y))  # 1

    > order(y) is not as long as as.character(y).

    > Another example:
    > length(mtcars)  # 11
    > length(order(mtcars))  # 352

I agree that  order(<data.frame>) may look a bit strange;
I've spent more than an hour into looking into it, and making it
[actually,  rank(<data.frame>,..) ]
an error, but ended up finding much evidence that there's too
much related code, sometimes even in base R which assumes that a
numeric data frame behaves the same as a numeric matrix.

And also, if you carefully read the help files, of
  order(),
  xtfrm(),
  rank()

there's always mentioned that these work for R object 'x'
basically as long as   x[!is.na(x)]   returns a "nice"
(typically atomic) vector .. which is the case for such data frames.

The consequence, that  in R-devel, currently

    factor(mtcars)

just "works",  is indeed unexpected or even "shocking", and I
still don't know what the most elegant and reasonable way would
be to make this an error -- as it used to be when  sort.list()
was used instead of order().  I'd find it ugly (and even more
time consuming!) if factor() itself would have to check its
argument and signal an error for a data.frame.

The relevant call tree is

  factor() -> order() -> xtfrm() -> xtfrm.default() -> rank()

and as I said, rank(x,*) works when  x[!is.na(x)]  is an atomic
"numeric-like" vector  which is the case for a numeric data
frame such as 'mtcars'.

Martin Maechler
ETH Zurich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Function 'factor' issues

Gabriel Becker
Martin, Suharto, et al.,


On Wed, Oct 18, 2017 at 9:54 AM, Martin Maechler <[hidden email]
> wrote:

> *<snip>*
>
>     > Note: In theory, if function 'factor' merged duplicated 'labels' in
> all cases, at least in
>     > factor(c(sqrt(2)^2, 2))  ,
>     > function 'factor' could do matching on original 'x' (without
> conversion to character), as in R before version 2.10.0. If function
> 'factor' did it,
>     > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels =
> c("sqrt(2)^2", "2"))
>     > could take sqrt(2)^2 and 2 as distinct.
>
> Well, that may be interesting.. but I doubt if that's somewhere
> we should go, easily, because  factor() has been documented to do
> what it does now (with very slightly rounding such numbers via
> as.character(.))
> and hence such a change would typically lead to much work for
> too many people.
>
> I do see that indeed the  as.character(.) inside factor() takes
> most of the CPU time used in largish factor() examples [as your
> first], and indeed, for the case of integer 'x', we really could
> be much faster in factor construction.
>

Indeed; the ALTREP framework already has an alternative string (character
vector) implementation which defers conversion from another vector type.
Luke implemented it to drastically speed up the creation of default row
labels (1:n, I believe) on design matrices within lm/glm, under the
assumption that no one is ever going to look at the design matrix row
labels most of the time, and there's no reason to pay the cost of creating
them until they do.

In principle, we could look at doing the same for levels of a factor.

Furthermore, while I haven't put in the hooks yet to utilize them yet, even
in my local copy, ALTREP classes are allowed to include a custom unique
method, so in cases where

unique(as.character(x)) == as.character(unique(x))

we could avoid the conversion even when calling unique() (at the R level)
on such a deferred vector.

This would include, I think, the case where inferring levels from the
vector. I believe the point Martin made earlier about as.character
potentially doing some rounding means that the required identity would hold
when factors are being generated from integer vectors but would not be
guaranteed when factors are generated from (non-integer) numeric vectors.

Best,
~G


>
>     > Another thing: Function 'factor' in R devel uses 'order' instead of
> 'sort.list'.
>
> This has been by a change on purpose --- well documented as new
> feature in NEWS --- to allow using *methods* for order(),
> i.e. for the workhorse of order, xtfrm()  so that factor(OB)
> works for more general objects OB.
>
>
>     > The case of as.factor(x) for
>     > x <- as.data.frame(character(0))
>     > in tests/isas-tests.Rout.save reveals that 'order' on data frame is
> strange.
>
>     > x <- as.data.frame(character(0))
>     > y <- unique(x)
>     > length(y)  # 1
>     > length(order(y))  # 0
>     > length(as.character(y))  # 1
>
>     > order(y) is not as long as as.character(y).
>
>     > Another example:
>     > length(mtcars)  # 11
>     > length(order(mtcars))  # 352
>
> I agree that  order(<data.frame>) may look a bit strange;
> I've spent more than an hour into looking into it, and making it
> [actually,  rank(<data.frame>,..) ]
> an error, but ended up finding much evidence that there's too
> much related code, sometimes even in base R which assumes that a
> numeric data frame behaves the same as a numeric matrix.
>
> And also, if you carefully read the help files, of
>   order(),
>   xtfrm(),
>   rank()
>
> there's always mentioned that these work for R object 'x'
> basically as long as   x[!is.na(x)]   returns a "nice"
> (typically atomic) vector .. which is the case for such data frames.
>
> The consequence, that  in R-devel, currently
>
>     factor(mtcars)
>
> just "works",  is indeed unexpected or even "shocking", and I
> still don't know what the most elegant and reasonable way would
> be to make this an error -- as it used to be when  sort.list()
> was used instead of order().  I'd find it ugly (and even more
> time consuming!) if factor() itself would have to check its
> argument and signal an error for a data.frame.
>
> The relevant call tree is
>
>   factor() -> order() -> xtfrm() -> xtfrm.default() -> rank()
>
> and as I said, rank(x,*) works when  x[!is.na(x)]  is an atomic
> "numeric-like" vector  which is the case for a numeric data
> frame such as 'mtcars'.
>
> Martin Maechler
> ETH Zurich
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Gabriel Becker, PhD
Scientist (Bioinformatics)
Genentech Research

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Function 'factor' issues

R devel mailing list
In reply to this post by R devel mailing list
My idea (like in https://bugs.r-project.org/bugzilla/attachment.cgi?id=1540 ):
- For remapping, use
f <- match(xlevs, nlevs)[f]
instead of
f <- match(xlevs[f], nlevs)
(I have mentioned it).
- Remap only if length(nlevs) differs from length(xlevs) .


On use of 'order' in function 'factor' in R devel, factor.Rd still says 'sort.list' in "Details" section.

My comments on the part of "Details" section:
- Sortable 'x' is needed only when 'levels' is not specified.
- Complete requirements for properly working factor(x) in R devel: 'as.character', 'order', 'unique' corresponding to '['. Take data frame and "Surv" object (package survival) as examples.

--------------------------------------------
On Wed, 18/10/17, Martin Maechler <[hidden email]> wrote:

 Subject: Re: [Rd] Function 'factor' issues

 Cc: [hidden email]
 Date: Wednesday, 18 October, 2017, 11:54 PM

>>>>> Suharto Anggono Suharto Anggono via R-devel <[hidden email]>
>>>>>     on Sun, 15 Oct 2017 16:03:48 +0000 writes:

  
    > In R devel, function 'factor' has been changed, allowing and merging duplicated 'labels'.

Indeed.  That had been asked for and discussed a bit on this
list from June 14 to June 23, starting at
   https://stat.ethz.ch/pipermail/r-devel/2017-June/074451.html

    > Issue 1: Handling of specified 'labels' without duplicates is slower than before.
    > Example:
    > x <- rep(1:26, 40000)
    > system.time(factor(x, levels=1:26, labels=letters))

    > Function 'factor' is already rather slow because of conversion to character. Please don't add slowdown.

Indeed, I doo see a ~ 20%  performance loss for the example
above, and I may get to look into this.
However, in R-devel there have been important internal
changes (ALTREP additions) some of which are currently giving
some performance losses in some cases (but they have the
potential to give big performance _gains_ e.g. for simple
indexing into large vectors which may apply here !).
For factor(), these C level "ALTREP" changes may not be the reason at
all for the slow down;
I may find time to investigate further.

{{ For the ALTREP-change slowdowns I've noticed in some
   indexing/subset operations, we'll definitely have time to look into
   before R-devel is going to be released next spring... and as mentioned,
   these operations may even become considerably faster *thanks*
   to ALTREP ... }}

    > Issue 2: While default 'labels' is 'levels', not specifying 'labels' may be different from specifying 'labels' to be the same as 'levels'.

    > Example 1:
    > as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
    > is different from
    > as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA), exclude = NULL))

You are right.  But this is not so exceptional and part of the new feature of
'labels' allowing to "fix up" things in such cases.  While it
would be nice if this was not the case the same phenomenon
happens in other functions as well because of lazy evaluation.
I think I had noticed that already and at the time found
"not easy" to work around.
(There are many aspects about changing such important base functions:
 1. not breaking back compatibility ((unless in rare
    border cases, where we are sure it's worth))
 2. Keeping code relatively transparent
 3. Keep the semantics "simple" to document and as intuitive as possible
)

    > File reg-tests-1d.R indicates that 'factor' behavior with NA is slightly changed, for the better. NA entry (because it is unmatched to 'levels' argument or is in 'exclude') is absorbed into NA in "levels" attribute (comes from 'labels' argument), if any. The issue is that it happens only when 'labels' is specified.

I'm not sure anymore, but I think I had noticed that also in
June, considered to change it and found that such a changed
factor() would be too different from what it has "always been".
So, yes, IIRC, this current behavior is on purpose, if only for back compatibility.


    > Function 'factor' could use match(xlevs, nlevs)[f]. It doesn't match NA to NA level. When 'f' is long enough, longer than 'xlevs', it is faster than match(xlevs[f], nlevs).

    > Example 2:
    > With
    > levs <- c("A","A")  ,
    > factor(levs, levels=levs)
    > gives error, but
    > factor(levs, levels=levs, labels=levs)
    > doesn't.

yes, again that is a consequence of what you said above (before
'Example 1')

    > Note: In theory, if function 'factor' merged duplicated 'labels' in all cases, at least in
    > factor(c(sqrt(2)^2, 2))  ,
    > function 'factor' could do matching on original 'x' (without conversion to character), as in R before version 2.10.0. If function 'factor' did it,
    > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
    > could take sqrt(2)^2 and 2 as distinct.

Well, that may be interesting.. but I doubt if that's somewhere
we should go, easily, because  factor() has been documented to do
what it does now (with very slightly rounding such numbers via as.character(.))
and hence such a change would typically lead to much work for
too many people.

I do see that indeed the  as.character(.) inside factor() takes
most of the CPU time used in largish factor() examples [as your
first], and indeed, for the case of integer 'x', we really could
be much faster in factor construction.  

    > Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.

This has been by a change on purpose --- well documented as new
feature in NEWS --- to allow using *methods* for order(),
i.e. for the workhorse of order, xtfrm()  so that factor(OB)
works for more general objects OB.


    > The case of as.factor(x) for
    > x <- as.data.frame(character(0))
    > in tests/isas-tests.Rout.save reveals that 'order' on data frame is strange.

    > x <- as.data.frame(character(0))
    > y <- unique(x)
    > length(y)  # 1
    > length(order(y))  # 0
    > length(as.character(y))  # 1

    > order(y) is not as long as as.character(y).

    > Another example:
    > length(mtcars)  # 11
    > length(order(mtcars))  # 352

I agree that  order(<data.frame>) may look a bit strange;
I've spent more than an hour into looking into it, and making it
[actually,  rank(<data.frame>,..) ]
an error, but ended up finding much evidence that there's too
much related code, sometimes even in base R which assumes that a
numeric data frame behaves the same as a numeric matrix.

And also, if you carefully read the help files, of
  order(),
  xtfrm(),
  rank()

there's always mentioned that these work for R object 'x'
basically as long as   x[!is.na(x)]   returns a "nice"
(typically atomic) vector .. which is the case for such data frames.

The consequence, that  in R-devel, currently

    factor(mtcars)

just "works",  is indeed unexpected or even "shocking", and I
still don't know what the most elegant and reasonable way would
be to make this an error -- as it used to be when  sort.list()
was used instead of order().  I'd find it ugly (and even more
time consuming!) if factor() itself would have to check its
argument and signal an error for a data.frame.

The relevant call tree is

  factor() -> order() -> xtfrm() -> xtfrm.default() -> rank()

and as I said, rank(x,*) works when  x[!is.na(x)]  is an atomic
"numeric-like" vector  which is the case for a numeric data
frame such as 'mtcars'.

Martin Maechler
ETH Zurich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel