# duplicated factor labels.

12 messages

## duplicated factor labels.


## Re: duplicated factor labels.

>>>>> Paul Johnson <[hidden email]>
>>>>>     on Wed, 14 Jun 2017 19:00:11 -0500 writes:

    > Dear R devel
    > I've been wondering about this for a while. I am sorry to ask for your
    > time, but can one of you help me understand this?

    > This concerns duplicated labels, not levels, in the factor function.

    > I think it is hard to understand that factor() fails, but levels()
    > after does not

    >> x <- 1:6
    >> xlevels <- 1:6
    >> xlabels <- c(1, NA, NA, 4, 4, 4)
    >> y <- factor(x, levels = xlevels, labels = xlabels)
    > Error in `levels<-`(`*tmp*`, value = if (nl == nL)
    > as.character(labels) else paste0(labels,  :
    > factor level [3] is duplicated
    >> y <- factor(x, levels = xlevels)
    >> levels(y) <- xlabels
    >> y
    > [1] 1    <NA> <NA> 4    4    4
    > Levels: 1 4

    > If the latter use of levels() causes a good, expected result, couldn't
    > factor(..., labels = xlabels) be made to do the same thing?

I may misunderstand, but I think you are confusing 'labels' and
'levels' here (and you are not alone in this!), mostly because
R's factor() function treats them as arguments in a way that can be
confusing (but I don't think we'd want to change that; it's been
documented and in use for > 25 years in S, S+, R).

Note that after the above,

    > dput(y)
    structure(c(1L, NA, NA, 2L, 2L, 2L), .Label = c("1", "4"), class = "factor")

and that of course _is_ a valid factor, which you can easily get
directly via e.g.

    > identical(y, factor(c(1,NA,NA,4,4,4)))
    [1] TRUE

or also via

    > identical(y, factor(c("1",NA,NA,"4","4","4")))
    [1] TRUE

I really don't see a need for a change of factor().
It should remain as simple as possible (but not simpler :-).

Martin

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
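
The two-step route from this exchange can be reproduced with a short base-R sketch. It only uses the calls quoted in the messages (`factor()`, `levels<-`, `identical()`); the name `y_direct` is introduced here for illustration. The one-step `factor(x, levels = xlevels, labels = xlabels)` call is deliberately left out, because its behaviour depends on the R version.

```r
# Inputs from the original question.
x <- 1:6
xlevels <- 1:6
xlabels <- c(1, NA, NA, 4, 4, 4)

# Two-step route: build the factor first, then relabel its levels.
# `levels<-` on a factor drops NA labels and merges duplicated ones.
y <- factor(x, levels = xlevels)
levels(y) <- xlabels

# The same factor built directly from the desired values
# (y_direct is an illustrative name, not from the thread).
y_direct <- factor(c(1, NA, NA, 4, 4, 4))

print(levels(y))               # "1" "4"
print(as.integer(y))           # 1 NA NA 2 2 2
print(identical(y, y_direct))  # TRUE
```

The integer codes show what happened: the three inputs relabelled `4` now share code 2, and the two inputs relabelled `NA` have no code at all.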

## Re: duplicated factor labels.

To extend on Martin's explanation:

In factor(), levels are the unique input values and labels the unique output
values. So the function levels() actually displays the labels.

Cheers
Joris

On 15 Jun 2017 17:15, "Martin Maechler" <[hidden email]> wrote:

> I may misunderstand, but I think you are confusing 'labels' and
> 'levels' here (and you are not alone in this!), mostly because
> R's factor() function treats them as arguments in a way that can be
> confusing (but I don't think we'd want to change that; it's been
> documented and in use for > 25 years in S, S+, R).
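
A small sketch of the distinction described above, assuming made-up input values (`"lo"`, `"hi"`) and labels (`"Low"`, `"High"`) that do not come from the thread: the `levels` argument names the values expected in the input, the `labels` argument names what they are called in the result, and `levels()` on the finished factor returns the latter.

```r
# Illustrative data; the values and labels are assumptions for this sketch.
x <- c("lo", "hi", "lo")

f <- factor(x, levels = c("lo", "hi"), labels = c("Low", "High"))

print(levels(f))        # "Low" "High"  (the labels, not the input values)
print(as.integer(f))    # 1 2 1         (codes index into levels(f))
print(as.character(f))  # "Low" "High" "Low"
```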

## Re: duplicated factor labels.

On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <[hidden email]> wrote:
> To extend on Martin's explanation:
>
> In factor(), levels are the unique input values and labels the unique output
> values. So the function levels() actually displays the labels.
>

Dear Joris

I think we agree. Currently, factor insists both levels and labels be unique.

I wish that it would accept nonunique labels. I also understand it
is impractical to change this now in base R.

I don't think I succeeded in explaining why this would be nicer.
Here's another example. Fairly often, we see input data like

    x <- c("Male", "Man", "male", "Man", "Female")

The first four represent the same value. I'd like to go in one step
to a new factor variable with enumerated types "Male" and "Female".
This fails

    xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
                 labels = c("Male", "Male", "Male", "Female"))

Instead, we need 2 steps.

    xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    levels(xf) <- c("Male", "Male", "Male", "Female")

I think it is quirky that `levels<-.factor` allows the duplicated
labels, whereas factor does not.

I wrote a function rockchalk::combineLevels to simplify combining
levels, but most of the students here like plyr::mapvalues to do it.
The use of levels() can be tricky because one must enumerate all
values, not just the ones being changed.

But I do understand Martin's point. It's been this way 25 years, it
won't change. :).

> Cheers
> Joris

--
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.
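
For reference, here is the two-step approach from this message as a runnable sketch, together with the list form of `levels<-` described in `?levels`, which groups the old levels under each new name. The variable names `in_levels` and `xg` are introduced here for illustration and do not appear in the thread.

```r
x <- c("Male", "Man", "male", "Man", "Female")
in_levels <- c("Male", "Man", "male", "Female")  # illustrative name

# Two steps, as in the message: one new label per old level, in order.
xf <- factor(x, levels = in_levels)
levels(xf) <- c("Male", "Male", "Male", "Female")

# Same merge written with the list form: new name = old levels to collect.
xg <- factor(x, levels = in_levels)
levels(xg) <- list(Male = c("Male", "Man", "male"), Female = "Female")

print(levels(xf))      # "Male" "Female"
print(as.integer(xf))  # 1 1 1 1 2
print(levels(xg))      # "Male" "Female"
```

The list form reads more like a mapping, but it still has to mention every level that should survive, so it does not remove the enumeration burden noted above.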

## Re: duplicated factor labels.

Hi Paul,

Now I see what you're getting at. I misread your original mail completely.
So we definitely agree, and wholeheartedly even. The use case you just gave
is definitely in my top 5 of frustrations about R. I would like to be able
to assign the same label to multiple levels without having to use e.g.
dplyr::recode_factor() or some other vectorized switch statement to recode
all data first.

I understand "it's been like that 25 years", but I've looked hard to find a
use case where adding this behaviour would invalidate existing code and
couldn't come up with something. So I add my (totally insignificant) vote
for adding the possibility of assigning the same label to multiple levels in
factor() itself.

Cheers and thank you for bringing this up!

On Fri, Jun 16, 2017 at 6:02 PM, Paul Johnson <[hidden email]> wrote:

> Here's another example. Fairly often, we see input data like
>
> x <- c("Male", "Man", "male", "Man", "Female")
>
> The first four represent the same value.  I'd like to go in one step
> to a new factor variable with enumerated types "Male" and "Female".

--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
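
The "recode all data first" workaround mentioned above can be sketched in base R with a named lookup vector, which plays the role of a vectorized switch statement. This is only one possible illustration, not the approach used by `dplyr::recode_factor()`; the names `lookup` and `recoded` are assumptions made for this sketch.

```r
x <- c("Male", "Man", "male", "Man", "Female")

# Named character vector: names are the raw values, elements the targets.
lookup <- c(Male = "Male", Man = "Male", male = "Male", Female = "Female")

# Indexing by name recodes every element of x in one vectorized step.
recoded <- unname(lookup[x])

# Only now build the factor, with unique levels.
xf <- factor(recoded, levels = c("Male", "Female"))

print(as.integer(xf))         # 1 1 1 1 2
print(as.integer(table(xf)))  # 4 1
```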

## Re: duplicated factor labels.

>>>>> Paul Johnson <[hidden email]>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    > But I do understand Martin's point. It's been this way 25 years, it
    > won't change. :).

Well.. the above is a bit out of context.

Your first example really did not make a point to me (and Joris)
and I showed that you could use even two different simple factor() calls to
produce what you wanted

        yc <- factor(c("1",NA,NA,"4","4","4"))
        yn <- factor(c( 1, NA,NA, 4,  4,  4))

Your new example is indeed much more convincing!

(Note though that the two steps that are needed can be written
 more shortly

The "been this way 25 years" is a reason to be very
cautious(*) with changes, but not a reason for no changes!

(*) Indeed as some of you have noted we really should not "break behavior".
    This means to me we cannot accept a change there which gives
    an error or a different result in cases the old behavior gave a valid factor.

I'm looking at a possible change currently
[not promising that a change will happen ...]

Martin

## Re: duplicated factor labels.

>>>>> Martin Maechler <[hidden email]>
>>>>>     on Thu, 22 Jun 2017 11:43:59 +0200 writes:

    > I'm looking at a possible change currently
    > [not promising that a change will happen ...]

In the end, I've liked the change (after 2-3 iterations), and
have now been brave enough to commit it to R-devel (svn 72845).

With the change, I had to disable one of our own regression
checks (tests/reg-tests-1b.R, line 726):

The following is now (in R-devel -> R 3.5.0) valid:

    > factor(1:2, labels = c("A","A"))
    [1] A A
    Levels: A
    >

I wonder how many CRAN package checks will "break" from
this (my guess is in the order of a dozen), but I hope
that these breakages will be benign, e.g., similar to the above
case where before an error was expected via tools :: assertError(.)

Martin
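
Because the behaviour described in this message depends on the R version, a cautious way to try it is to wrap the call in `tryCatch()`. The sketch below assumes nothing beyond base R; the name `res` is illustrative. On a version that includes the change the call should return a one-level factor, while on an older version it should signal the "duplicated" error quoted at the start of the thread.

```r
# Attempt the call from the message and capture an error instead of stopping.
res <- tryCatch(
  factor(1:2, labels = c("A", "A")),
  error = function(e) e
)

if (inherits(res, "error")) {
  # Older behaviour: duplicated labels are rejected.
  message("factor() rejected duplicated labels: ", conditionMessage(res))
} else {
  # Newer behaviour: both inputs map to the single level "A".
  print(levels(res))
  print(as.integer(res))
}
```

Package tests that previously asserted an error here, for example via `tools::assertError()`, are the kind of benign breakage the message anticipates.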

## Re: duplicated factor labels.

Hmm, the danger in this is that duplicated factor levels _used_ to be allowed
(i.e. multiple codes with the same level). Disallowing it is what broke
read.spss() on some files, because SPSS's concept of value labels is not
1-to-1 with factors.

Reallowing it with different semantics could be premature. I mean, if we
hadn't had the "forbidden" step, read.spss() could have changed behaviour
unnoticed. So what if there is code relying on duplicate factor levels, which
hasn't been run for some time?

-pd

> On 23 Jun 2017, at 10:42 , Martin Maechler <[hidden email]> wrote:
>
> In the end, I've liked the change (after 2-3 iterations), and
> have now been brave enough to commit it to R-devel (svn 72845).
>
> The following is now (in R-devel -> R 3.5.0) valid:
>
>> factor(1:2, labels = c("A","A"))
>   [1] A A
>   Levels: A

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]
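
One reading of the "different semantics" concern is that merging is lossy: once two codes share a label, the factor stores a single code for both, so the original distinction cannot be recovered from it. The sketch below illustrates that with `levels<-`, which behaves this way regardless of the `factor()` change; the data and the names `codes` and `f` are invented for illustration and are not taken from any SPSS file.

```r
# Three distinct input codes, two of which will receive the same label.
codes <- c(1, 2, 3, 2)
f <- factor(codes, levels = 1:3)

# Relabel with a duplicate: codes 1 and 2 are merged into one level.
levels(f) <- c("low", "low", "high")

print(nlevels(f))     # 2
print(as.integer(f))  # 1 1 2 1  (original codes 1 and 2 are indistinguishable)
```

Code written for the older behaviour, where several codes could carry the same level text while staying distinct, would see a different result here without any error being raised.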

## Re: duplicated factor labels.
