duplicated factor labels.

PaulJohnson32gmail
Dear R devel

I've been wondering about this for a while. I am sorry to ask for your
time, but can one of you help me understand this?

This concerns duplicated labels, not levels, in the factor function.

I think it is hard to understand why factor() fails, but a later
levels<- assignment does not:

> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels,  :
  factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4

If the latter use of levels() causes a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?

That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, it probably
causes lots of other problems I don't understand; that's why I'm
writing to you now.)

In the factor function, the class of f is assigned *after* `levels<-` is called:

    levels(f) <- ## nl == nL or 1
    if (nl == nL) as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")

At that point, f is a plain integer vector, so the assignment goes to the primitive `levels<-`:

> `levels<-`
function (x, value)  .Primitive("levels<-")

That's what generates the error.  I don't understand well what
.Primitive means here. I need to walk past that detail.

Suppose I revise the factor function to put the class(f) line before
the levels() assignment. Then `levels<-.factor` is called and all seems well.
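
A minimal sketch of the dispatch just described, assuming a current R session: `levels<-` is a primitive that dispatches internally on the class of its first argument, so once an object carries class "factor" the S3 method `levels<-.factor` handles the assignment and merges repeated values. The vector `f` is made up for illustration.

```r
# `levels<-` itself is a primitive; for factors it dispatches to the
# S3 method `levels<-.factor` defined in base.
stopifnot(is.primitive(`levels<-`))
stopifnot(is.function(base::`levels<-.factor`))

f <- factor(c("a", "b", "c", "a"))   # levels: a, b, c
levels(f) <- c("x", "x", "y")        # a -> x, b -> x, c -> y
print(f)
print(levels(f))
```

The two old levels mapped to "x" end up as one level, which is the behaviour the reordering of class() and levels() relies on.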

factor <- function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
                     as.character(labels)
                 else paste0(labels, seq_along(levels))
    f
}

> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"

That's a "good" answer for me.

But I broke your function. I eliminated the check for duplicated levels.

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1    4    <NA> <NA> <NA> <NA>
Levels: 1 4

Rather than have factor return the "duplicated levels" error when
there are duplicated values in labels, I wonder why it is not better
to have a check for duplicated levels directly. For example, insert a
new else in this stanza

    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    } else { ## next is the new part
        levels <- unique(levels)
    }

That will cause an error when there are duplicated levels because
there are more labels than levels:

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
  invalid 'labels'; length 6 should be 1 or 2

So, in conclusion, if levels() can work after creating a factor, I
wish an equivalent labels argument would be accepted by factor(). What is your
opinion?

pj
--
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: duplicated factor labels.

Martin Maechler
>>>>> Paul Johnson <[hidden email]>
>>>>>     on Wed, 14 Jun 2017 19:00:11 -0500 writes:

    > Dear R devel
    > I've been wondering about this for a while. I am sorry to ask for your
    > time, but can one of you help me understand this?

    > This concerns duplicated labels, not levels, in the factor function.

    > I think it is hard to understand that factor() fails, but levels()
    > after does not

    >> x <- 1:6
    >> xlevels <- 1:6
    >> xlabels <- c(1, NA, NA, 4, 4, 4)
    >> y <- factor(x, levels = xlevels, labels = xlabels)
    > Error in `levels<-`(`*tmp*`, value = if (nl == nL)
    > as.character(labels) else paste0(labels,  :
    > factor level [3] is duplicated
    >> y <- factor(x, levels = xlevels)
    >> levels(y) <- xlabels
    >> y
    > [1] 1    <NA> <NA> 4    4    4
    > Levels: 1 4

    > If the latter use of levels() causes a good, expected result, couldn't
    > factor(..., labels = xlabels) be made to the same thing?

I may misunderstand, but I think you are confusing 'labels' and 'levels'
here (and you are not alone in this!), mostly because R's
factor() function treats them as arguments in a way that can be
confusing (but I don't think we'd want to change that; it's
been documented and in use for > 25 years in S, S+ and R).

Note that after the above,

> dput(y)
structure(c(1L, NA, NA, 2L, 2L, 2L), .Label = c("1", "4"), class = "factor")

and that of course _is_ a valid factor .. which you can easily
get directly via e.g.

> identical(y, factor(c(1,NA,NA,4,4,4)))
[1] TRUE

or also  via

> identical(y, factor(c("1",NA,NA,"4","4","4")))
[1] TRUE

I really don't see a need for a change of factor().
It should remain as simple as possible (but not simpler :-).

Martin


Re: duplicated factor labels.

Joris FA Meys
To extend on Martin's explanation:

In factor(), levels are the unique input values and labels the unique
output values. So the function levels() actually displays the labels.
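
A small illustration of that distinction, using made-up data: `levels` names the values expected in the input, `labels` names what the resulting factor shows, and levels() on the result returns the labels.

```r
# Input codes "m"/"f" are the levels; "Male"/"Female" are the labels.
x <- c("m", "f", "m")
f <- factor(x, levels = c("m", "f"), labels = c("Male", "Female"))

print(levels(f))        # the labels, not the original input values
print(as.character(f))
```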

Cheers
Joris


On 15 Jun 2017 17:15, "Martin Maechler" <[hidden email]> wrote:

> [...]

Re: duplicated factor labels.

PaulJohnson32gmail
On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <[hidden email]> wrote:
> To extend on Martin's explanation:
>
> In factor(), levels are the unique input values and labels the unique output
> values. So the function levels() actually displays the labels.
>

Dear Joris

I think we agree. Currently, factor insists both levels and labels be unique.

I wish that it would accept nonunique labels. I also understand it
is impractical to change this now in base R.

I don't think I succeeded in explaining why this would be nicer.
Here's another example. Fairly often, we see input data like

x <- c("Male", "Man", "male", "Man", "Female")

The first four represent the same value.  I'd like to go in one step
to a new factor variable with enumerated types "Male" and "Female".
This fails

xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
        labels = c("Male", "Male", "Male", "Female"))

Instead, we need 2 steps.

xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
levels(xf) <- c("Male", "Male", "Male", "Female")

I think it is quirky that `levels<-.factor` allows the duplicated
labels, whereas factor does not.
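
For what it's worth, the two steps can be collapsed into one expression by calling the replacement function directly. This is only a sketch of one possible spelling; it still goes through `levels<-.factor`, so it does not change what factor() itself accepts.

```r
x <- c("Male", "Man", "male", "Man", "Female")

# Call the replacement function `levels<-` as an ordinary function:
# first argument is the factor, second the new (possibly repeated) labels.
xf <- `levels<-`(
  factor(x, levels = c("Male", "Man", "male", "Female")),
  c("Male", "Male", "Male", "Female")
)
print(xf)
```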

I wrote a function rockchalk::combineLevels to simplify combining
levels, but most of the students here like plyr::mapvalues to do it.
The use of levels() can be tricky because one must enumerate all
values, not just the ones being changed.
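
To illustrate the "only name the ones being changed" idea in base R, here is a rough sketch. The helper `recode_levels()` and its `map` argument are hypothetical names made up for this example; this is not how rockchalk::combineLevels or plyr::mapvalues are implemented.

```r
# Hypothetical helper: rename only the levels listed in `map`
# (a named character vector, old level -> new level); all other
# levels are left untouched.
recode_levels <- function(f, map) {
  lv <- levels(f)
  hit <- lv %in% names(map)
  lv[hit] <- unname(map[lv[hit]])
  levels(f) <- lv   # repeated entries in lv are merged by `levels<-.factor`
  f
}

xf <- factor(c("Male", "Man", "male", "Man", "Female"))
xf <- recode_levels(xf, c(Man = "Male", male = "Male"))
print(xf)
```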

But I do understand Martin's point. It's been this way 25 years, it
won't change. :)


Re: duplicated factor labels.

Joris FA Meys
Hi Paul,

Now I see what you're getting at. I misread your original mail completely.
So we definitely agree, and wholeheartedly even.

The use case you just gave is definitely in my top 5 of frustrations about
R. I would like to be able to assign the same label to multiple levels
without having to use e.g. dplyr::recode_factor() or some other vectorized
switch statement to recode all data first.

I understand "it's been like that 25 years", but I've looked hard to find a
use case where adding this behaviour would invalidate existing code and
couldn't come up with anything.

So I add my (totally insignificant) vote for adding the possibility of
assigning the same label to multiple levels in factor() itself.

Cheers and thank you for bringing this up!


On Fri, Jun 16, 2017 at 6:02 PM, Paul Johnson <[hidden email]> wrote:

> [...]


--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: duplicated factor labels.

Martin Maechler
In reply to this post by PaulJohnson32gmail
>>>>> Paul Johnson <[hidden email]>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    > [...]

    > But I do understand Martin's point. It's been this way 25 years, it
    > won't change. :)

Well.. the above is a bit out of context.

Your first example really did not make a point to me (and Joris)
and I showed that you could use even two different simple factor() calls to
produce what you wanted
        yc <- factor(c("1",NA,NA,"4","4","4"))
        yn <- factor(c( 1, NA,NA, 4,  4,  4))

Your new example is indeed  much more convincing !

(Note though that the two steps that are needed can be written
 more shortly

The  "been this way 25 years"  is a reason to be very
cautious(*) with changes, but not a reason for no changes!

(*) Indeed as some of you have noted we really should not "break behavior".
    This means to me we cannot accept a change there which gives
    an error or a different result in cases the old behavior gave a valid factor.

I'm looking at a possible change currently
[not promising that a change will happen ...]


Martin


Re: duplicated factor labels.

Martin Maechler
>>>>> Martin Maechler <[hidden email]>
>>>>>     on Thu, 22 Jun 2017 11:43:59 +0200 writes:

>>>>> Paul Johnson <[hidden email]>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    > [...]

In the end, I've liked the change (after 2-3 iterations), and
have now been brave enough to commit it to R-devel (svn 72845).

With the change, I had to disable one of our own regression
checks (tests/reg-tests-1b.R, line 726):

The following is now (in R-devel -> R 3.5.0) valid:

   > factor(1:2, labels = c("A","A"))
   [1] A A
   Levels: A
   >

I wonder how many CRAN package checks will "break" from
this (my guess is in the order of a dozen), but I hope
that these breakages will be benign, e.g., similar to the above
case where before an error was expected via tools :: assertError(.)
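
A hedged sketch of what this means for the earlier "Male"/"Female" use case: on R >= 3.5.0 the one-step call with repeated labels should work directly, while on older versions the two-step form is still needed. The version guard is only there so that the snippet runs on either.

```r
x <- c("Male", "Man", "male", "Man", "Female")
lev <- c("Male", "Man", "male", "Female")
lab <- c("Male", "Male", "Male", "Female")

xf <- if (getRversion() >= "3.5.0") {
  # one step: repeated labels map several levels to one
  factor(x, levels = lev, labels = lab)
} else {
  # older R: build the factor first, then merge via `levels<-.factor`
  tmp <- factor(x, levels = lev)
  levels(tmp) <- lab
  tmp
}
print(xf)
```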

Martin


Re: duplicated factor labels.

Peter Dalgaard-2
Hmm, the danger in this is that duplicated factor levels _used_ to be allowed (i.e. multiple codes with the same level). Disallowing it is what broke read.spss() on some files, because SPSS's concept of value labels is not 1-to-1 with factors.

Reallowing it with different semantics could be premature. I mean, if we hadn't had the "forbidden" step, read.spss() could have changed behaviour unnoticed. So what if there is code relying on duplicate factor levels, which hasn't been run for some time?

-pd

> On 23 Jun 2017, at 10:42 , Martin Maechler <[hidden email]> wrote:
> [...]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: duplicated factor labels.

Uwe Ligges-3


On 23.06.2017 11:51, peter dalgaard wrote:
> Hmm, the danger in this is that duplicated factor levels _used_ to be allowed (i.e. multiple codes with the same level). Disallowing it is what broke read.spss() on some files, because SPSS's concept of value labels is not 1-to-1 with factors.
>
> Reallowing it with different semantics could be premature. I mean, if we hadn't had the "forbidden" step, read.spss() could have changed behaviour unnoticed. So what if there is code relying on duplicate factor levels, which hasn't been run for some time?

Indeed.

The read.spss code now allows for two things: one is to do what Martin
implemented, the other is to keep the labels separated and rename
them to be unique (the latter is the default now; explanation follows
below).

Quite often we found something like the following example in SPSS files,
translated into R speak:

factor(c(1,3,2,4,5,3,1), levels=1:5, labels=c("Strongly disagree",
"Disagree", "Neither agree nor disagree", "Agree", "Agree"))

where the last is a simple copy and paste error and should be "Strongly
agree".

I had the chance to look at > 1300 SPSS files our consulting center
collected during the last 20 years, and in several hundred cases we found
such a problem that was a copy & paste error and simply wrong.
Only in < 5 cases was condensing several levels into one appropriate,
hence we decided to keep duplicated levels by changing the names as the
default.
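
To make the "rename them to be unique" idea concrete, here is a rough base-R sketch using make.unique(). It only illustrates the principle; it is not the read.spss code, and the separator string is an arbitrary choice for this example.

```r
labs <- c("Strongly disagree", "Disagree", "Neither agree nor disagree",
          "Agree", "Agree")   # last label is the copy & paste error

# Keep all five codes distinct by making the repeated label unique.
uniq <- make.unique(labs, sep = "_duplicated_")
f <- factor(c(1, 3, 2, 4, 5, 3, 1), levels = 1:5, labels = uniq)

print(levels(f))
print(table(f))
```

With unique names the suspicious fifth category stays visible in the data instead of being silently merged into "Agree".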

Based on this experience I'd propose not to touch factor() but rather add a
function that easily allows for this reduction, if we do not have that
already.
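
One existing base-R route to such a reduction is the named-list form of the levels replacement, where each list element names a new level and holds the old levels to merge into it. A short sketch with the data from earlier in the thread:

```r
xf <- factor(c("Male", "Man", "male", "Man", "Female"))

# Each list name is a new level; its value lists the old levels it absorbs.
levels(xf) <- list(Male = c("Male", "Man", "male"), Female = "Female")
print(xf)
```

Note that with this form every old level that should survive has to appear somewhere in the list.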

Best,
Uwe






Re: duplicated factor labels.

Martin Maechler
In reply to this post by Peter Dalgaard-2
>>>>> peter dalgaard <[hidden email]>
>>>>>     on Fri, 23 Jun 2017 11:51:05 +0200 writes:

    > Hmm, the danger in this is that duplicated factor levels _used_ to be allowed (i.e. multiple codes with the same level). Disallowing it is what broke read.spss() on some files, because SPSS's concept of value labels is not 1-to-1 with factors.
    > Reallowing it with different semantics could be premature. I mean, if we hadn't had the "forbidden" step, read.spss() could have changed behaviour unnoticed. So what if there is code relying on duplicate factor levels, which hasn't been run for some time?

Good point... but I think we should be relatively safe .. unless
"some time" is ca. 8 years :

We have had a warning for these for ca. 7.5 years, namely from
R version 2.10.0 (2009-10-26)    up to
R version 3.3.3 (2017-03-06) -- "Another Canoe"

   > factor(1:2, labels = c("A","A"))
   [1] A A
   Levels: A A
   Warning message:
   In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
   duplicated levels in factors are deprecated

   > x <- c("Male", "Man", "male", "Man", "Female")
   > ## the new "direct" way:
   > xf <- factor(x, levels = c("Male", "Man",  "male", "Female"),
   +                 labels = c("Male", "Male", "Male", "Female"))
   Warning message:
   In `levels<-`(`*tmp*`, value = c("Male", "Male", "Male", "Female" :
     duplicated levels will not be allowed in factors anymore
   > xf
   [1] Male   Male   Male   Male   Female
   Levels: Male Male Male Female
   >

which gave a result somewhat similar to the new R-devel
result.  I would argue the new result should be fine....

Yes, if unwise people used  suppressWarnings(.) around their
code, they may be surprised now.... but that's what you get if
you suppress warnings without enough thought, no ?
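
For readers following along, here is a minimal sketch (not part of the original message) that compares the two-step route with the one-step call. It assumes only base R. The tryCatch() wrapper is there because, per this thread, some releases before R 3.5.0 reject duplicated labels in factor(); the variable names are illustrative.

```r
x   <- c("Male", "Man", "male", "Man", "Female")
lev <- c("Male", "Man", "male", "Female")
lab <- c("Male", "Male", "Male", "Female")

# Two-step route: build the factor, then relabel via `levels<-.factor`,
# which merges codes that end up with the same label.
two_step <- factor(x, levels = lev)
levels(two_step) <- lab

# One-step route: may fail on older R, so catch the error and return NULL.
one_step <- tryCatch(factor(x, levels = lev, labels = lab),
                     error = function(e) NULL)

print(levels(two_step))
if (is.null(one_step)) {
  message("factor() rejected duplicated labels on this R version")
} else {
  # Compare the values as strings, which is independent of the R version
  print(identical(as.character(one_step), as.character(two_step)))
}
```

Comparing as.character() values rather than the factor objects keeps the check meaningful on releases that only warned and kept the duplicated levels.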



    > -pd

    >> On 23 Jun 2017, at 10:42 , Martin Maechler <[hidden email]> wrote:
    >>
    >>>>>>> Martin Maechler <[hidden email]>
    >>>>>>> on Thu, 22 Jun 2017 11:43:59 +0200 writes:
    >>
    >>>>>>> Paul Johnson <[hidden email]>
    >>>>>>> on Fri, 16 Jun 2017 11:02:34 -0500 writes:
    >>
    >>>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <[hidden email]> wrote:
    >>>>> To extend on Martin's explanation:
    >>>>>
    >>>>> In factor(), levels are the unique input values and labels the unique output
    >>>>> values. So the function levels() actually displays the labels.
    >>>>>
    >>
    >>>> Dear Joris
    >>
    >>>> I think we agree. Currently, factor insists both levels and labels be unique.
    >>
    >>>> I wish that it would accept nonunique labels. I also understand it
    >>>> is impractical to change this now in base R.
    >>
    >>>> I don't think I succeeded in explaining why this would be nicer.
    >>>> Here's another example. Fairly often, we see input data like
    >>
    >>>> x <- c("Male", "Man", "male", "Man", "Female")
    >>
    >>>> The first four represent the same value.  I'd like to go in one step
    >>>> to a new factor variable with enumerated types "Male" and "Female".
    >>>> This fails
    >>
    >>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
    >>>> labels = c("Male", "Male", "Male", "Female"))
    >>
    >>>> Instead, we need 2 steps.
    >>
    >>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    >>>> levels(xf) <- c("Male", "Male", "Male", "Female")
    >>
    >>>> I think it is quirky that `levels<-.factor` allows the duplicated
    >>>> labels, whereas factor does not.
    >>
    >>>> I wrote a function rockchalk::combineLevels to simplify combining
    >>>> levels, but most of the students here like plyr::mapvalues to do it.
    >>>> The use of levels() can be tricky because one must enumerate all
    >>>> values, not just the ones being changed.
    >>
    >>>> But I do understand Martin's point. It's been this way 25 years, it
    >>>> won't change. :).
    >>
    >>> Well.. the above is a bit out of context.
    >>
    >>> Your first example really did not make a point to me (and Joris)
    >>> and I showed that you could use even two different simple factor() calls to
    >>> produce what you wanted
    >>> yc <- factor(c("1",NA,NA,"4","4","4"))
    >>> yn <- factor(c( 1, NA,NA, 4,  4,  4))
    >>
    >>> Your new example is indeed  much more convincing !
    >>
    >>> (Note though that the two steps that are needed can be written
    >>> more shortly
    >>
    >>> The  "been this way 25 years"  is a reason to be very
    >>> cautious(*) with changes, but not a reason for no changes!
    >>
    >>> (*) Indeed as some of you have noted we really should not "break behavior".
    >>> This means to me we cannot accept a change there which gives
    >>> an error or a different result in cases the old behavior gave a valid factor.
    >>
    >>> I'm looking at a possible change currently
    >>> [not promising that a change will happen ...]
    >>
    >> In the end, I've liked the change (after 2-3 iterations), and
    >> have now been brave enough to commit it to R-devel (svn 72845).
    >>
    >> With the change, I had to disable one of our own regression
    >> checks (tests/reg-tests-1b.R, line 726):
    >>
    >> The following is now (in R-devel -> R 3.5.0) valid:
    >>
    >>> factor(1:2, labels = c("A","A"))
    >> [1] A A
    >> Levels: A
    >>>
    >>
    >> I wonder how many CRAN package checks will "break" from
    >> this (my guess is in the order of a dozen), but I hope
    >> that these breakages will be benign, e.g., similar to the above
    >> case where before an error was expected via tools :: assertError(.)
    >>
    >> Martin
    >>
    >> ______________________________________________
    >> [hidden email] mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

    > --
    > Peter Dalgaard, Professor,
    > Center for Statistics, Copenhagen Business School
    > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
    > Phone: (+45)38153501
    > Office: A 4.23
    > Email: [hidden email]  Priv: [hidden email]


Re: duplicated factor labels.

Joris FA Meys
In reply to this post by Uwe Ligges-3
On Fri, Jun 23, 2017 at 2:20 PM, Uwe Ligges <[hidden email]> wrote:

>
>
>
> I had the chance to look at > 1300 SPSS files our consulting center
> collected during the last 20 years, and in several hundred cases we found
> such a problem that was a copy & paste error and simply wrong.
> Only in < 5 cases condensing several levels into one was appropriate,
> hence we decided to keep duplicated levels by changing the names as the
> default.
>

I understand where you're coming from. I know from personal experience
exactly how much this is a pain in the ass, but I also have to group
different labels in fewer categories in about every data set I get from
clients or students. Especially when things come from surveys with 30
different education categories etc.

So I would argue that checking for duplicate labels is a task for
read.spss() and can be added as an extra check if necessary. But I
personally don't see the fact that clients regularly mess up SPSS files as
enough of an argument to not change the behaviour of factor().
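
A minimal sketch of what such an extra check could look like, using Uwe's Likert example. The helper name find_duplicated_labels is hypothetical and is not part of read.spss() or the foreign package; it only reports labels attached to more than one code, so the user can decide whether they are a copy & paste error or an intended merge.

```r
# Hypothetical helper (an assumption for illustration, not an existing API):
# return every label that occurs more than once.
find_duplicated_labels <- function(labels) {
  labels <- as.character(labels)
  unique(labels[duplicated(labels)])
}

likert <- c("Strongly disagree", "Disagree", "Neither agree nor disagree",
            "Agree", "Agree")

dups <- find_duplicated_labels(likert)
if (length(dups) > 0L) {
  message("Labels used for more than one code: ",
          paste(dups, collapse = ", "))
}
```

Running the check before calling factor() keeps the decision with the person reading the data, which is the point being argued here.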


> Based on this experience I'd propose not to touch factor but rather add a
> function that easily allows for this reduction, if we do not have that
> already.
>

There are already functions that do this, like the tidyverse
dplyr::recode_factor() function. It's rather trivial to do with
logical operators and indices, and I have my own "recode" function so I
don't have to rely on any package or retype the same construct over and
over again with different values.
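
To make the "logical operators and indices" approach concrete, here is a small base-R sketch. The function recode_levels and its named-vector interface are assumptions made for illustration; this is not my actual "recode" function, and not the dplyr one either.

```r
# Hypothetical helper: collapse factor levels using a named character
# vector of the form c(old_level = "new_level").
recode_levels <- function(f, map) {
  stopifnot(is.factor(f), is.character(map), !is.null(names(map)))
  lv  <- levels(f)
  hit <- match(lv, names(map))           # which levels are being renamed
  lv[!is.na(hit)] <- map[hit[!is.na(hit)]]
  levels(f) <- lv                        # `levels<-.factor` merges duplicates
  f
}

f <- factor(c("Male", "Man", "male", "Man", "Female"))
g <- recode_levels(f, c(Man = "Male", male = "Male"))
print(table(g))
```

Only the levels that change have to be named; everything else passes through untouched.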

But a clean and logical way to recode/group different levels when
constructing the factor would, at least for me, be very convenient. But
I'm just a guy and I'm not writing the code, so in the end it's up to you
guys.

Cheers
Joris
--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: duplicated factor labels.

PaulJohnson32gmail
In reply to this post by Uwe Ligges-3
On Fri, Jun 23, 2017 at 7:20 AM, Uwe Ligges
<[hidden email]> wrote:

>
>
> On 23.06.2017 11:51, peter dalgaard wrote:
>>
>> Hmm, the danger in this is that duplicated factor levels _used_ to be
>> allowed (i.e. multiple codes with the same level). Disallowing it is what
>> broke read.spss() on some files, because SPSS's concept of value labels is
>> not 1-to-1 with factors.
>>
>> Reallowing it with different semantics could be premature. I mean, if we
>> hadn't had the "forbidden" step, read.spss() could have changed behaviour
>> unnoticed. So what if there is code relying on duplicate factor levels,
>> which hasn't been run for some time?
>
>
> Indeed.
>
> The read.spss code now allows for two things, one is to do what Martin
> implemented, the other one is to keep the labels separated and rename them
> to be unique (the latter is the default now, explanation follows below).
>
> Quite often we found something like the following example in SPSS files,
> translated into R speak:
>
> factor(c(1,3,2,4,5,3,1), levels=1:5, labels=c("Strongly disagree",
> "Disagree", "Neither agree nor disagree", "Agree", "Agree"))
>
> where the last is a simple copy and paste error and should be "Strongly
> agree".
>
> I had the chance to look at > 1300 SPSS files our consulting center
> collected during the last 20 years, and in several hundred cases we found
> such a problem that was a copy & paste error and simply wrong.
> Only in < 5 cases condensing several levels into one was appropriate, hence
> we decided to keep duplicated levels by changing the names as the default.
>
> Based on this experience I'd propose not to touch factor but rather add a
> function that easily allows for this reduction, if we do not have that
> already.
>
> Best,
> Uwe
>
>
If the factor function stays the way it was, I have a suggestion. Uwe
suggests R should add a function to facilitate reduction of labels.

There is a function named "mapvalues" in H Wickham's plyr package and
it works well to combine levels.  It is a generally useful recoding
function, works with integers and characters as well. It also seems to
work with numeric variables.  It is pure R and does not carry along
any external dependencies. No need for Rcpp, %>% or anything
else.

Of the things we have tried with the average users who come and go,
mapvalues is the most understandable/successful. It is more convenient
than levels()<- because users need not name all existing levels.

I suggest you consider putting that function in R base.

I'm pasting in the code to save you the trouble of looking it up. I
thought recursion  to re-code factors was clever.  I don't entirely
understand how it can work on double precision floats, it is relying
on match for that.


#' Replace specified values with new values, in a vector or factor.
#'
#' Items in \code{x} that match items in \code{from} will be replaced by
#' items in \code{to}, matched by position. For example, items in \code{x} that
#' match the first element in \code{from} will be replaced by the first
#' element of \code{to}.
#'
#' If \code{x} is a factor, the matching levels of the factor will be
#' replaced with the new values.
#'
#' The related \code{revalue} function works only on character vectors
#' and factors, but this function works on vectors of any type and factors.
#'
#' @param x the factor or vector to modify
#' @param from a vector of the items to replace
#' @param to a vector of replacement values
#' @param warn_missing print a message if any of the old values are
#'   not actually present in \code{x}
#'
#' @seealso \code{\link{revalue}} to do the same thing but with a single
#'   named vector instead of two separate vectors.
#' @export
#' @examples
#' x <- c("a", "b", "c")
#' mapvalues(x, c("a", "c"), c("A", "C"))
#'
#' # Works on factors
#' y <- factor(c("a", "b", "c", "a"))
#' mapvalues(y, c("a", "c"), c("A", "C"))
#'
#' # Works on numeric vectors
#' z <- c(1, 4, 5, 9)
#' mapvalues(z, from = c(1, 5, 9), to = c(10, 50, 90))
mapvalues <- function(x, from, to, warn_missing = TRUE) {
  if (length(from) != length(to)) {
    stop("`from` and `to` vectors are not the same length.")
  }
  if (!is.atomic(x)) {
    stop("`x` must be an atomic vector.")
  }

  if (is.factor(x)) {
    # If x is a factor, call self but operate on the levels
    levels(x) <- mapvalues(levels(x), from, to, warn_missing)
    return(x)
  }

  mapidx <- match(x, from)
  mapidxNA  <- is.na(mapidx)

  # index of items in `from` that were found in `x`
  from_found <- sort(unique(mapidx))
  if (warn_missing && length(from_found) != length(from)) {
    message("The following `from` values were not present in `x`: ",
      paste(from[!(seq_along(from) %in% from_found)], collapse = ", "))
  }

  x[!mapidxNA] <- to[mapidx[!mapidxNA]]
  x
}
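
On the double precision question, a small sketch of what the match() call above does with numeric input may help. As far as I can tell, match() compares doubles for exact equality, so values typed in literally are found, while values that come out of floating-point arithmetic may not be. The variable names below are just for illustration.

```r
z   <- c(1, 4, 5, 9)
idx <- match(z, c(1, 5, 9))
print(idx)   # position of each element of z in `from`, NA where absent

# A computed value that prints as 0.3 but is not the same double as 0.3
computed <- 0.1 + 0.2
print(match(computed, 0.3))               # no exact match
print(isTRUE(all.equal(computed, 0.3)))   # tolerance-based comparison
```

So mapvalues is fine for codes entered as literals (1, 5, 9), but recoding values that result from arithmetic would call for rounding first or a tolerance-based comparison.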

In the rockchalk package, I wrote a function called combineLevels that
is careful with ordinal variables and only allows adjacent values to
be combined. I'm not suggesting you go that far with this simple
piece.




--
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.
