|
Folks:
Over the years, many people -- including some who I would consider real expeRts -- have criticized factors and advocated the use (sometimes exclusively) of character vectors instead. I would just like to point out that, for me, factors provide one feature that I find to be very convenient: ordering of levels. **

As an example, suppose one has a character vector of labels "small", "medium", and "large". Then most R functions (e.g. tapply()) will display results involving this vector in alphabetical order, which I think most would view as undesirable. By converting to a factor with levels in the logical order, displays will automatically be "logical." For example:

> x <- sample(c("small","medium","large"),12,rep=TRUE)
> table(x)
x
 large medium  small
     2      3      7
> y <- factor(x,lev=c("small","medium","large")) ##ordered() also would do, but is not necessary for this
> table(y)
y
 small medium  large
     7      3      2

Naturally, this is just my opinion, and I understand why lots of smart people find factors irritating (at least!). So contrary opinions cheerily welcomed. But perhaps these comments might be helpful to those who have been "bitten" by factors or just wonder what all the fuss is about.

** Another advantage is reduced storage space, I believe. Please correct if wrong.

Cheers,
Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
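The same ordering carries over to tapply(), which the post above mentions. What follows is only a minimal sketch with made-up data; the `size` and `weight` vectors are illustrative assumptions and do not come from the thread.

```r
# Hypothetical data: six observations with a size label and a measurement.
size   <- c("small", "large", "medium", "small", "large", "medium")
weight <- c(1.0, 9.0, 5.0, 2.0, 11.0, 6.0)

# Grouping by a character vector: groups come out in sorted order.
by_chr <- tapply(weight, size, mean)

# Grouping by a factor: groups follow the order given in `levels`.
size_f <- factor(size, levels = c("small", "medium", "large"))
by_fac <- tapply(weight, size_f, mean)

print(names(by_chr))
print(names(by_fac))
```

Only the grouping variable changes between the two calls; the group means are the same, just reported in a different order.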
|
I second Bert's opinion: factors can be confusing, but they have some quite nice features that cannot easily be mimicked by plain character vectors. I find the ability to manipulate their levels extremely useful.
> fac<-factor(sample(letters[1:5], 20, replace=TRUE))
> fac
 [1] e e d d e e c e a e a e b b d e c c d b
Levels: a b c d e
> levels(fac)[2:4]<- "new.level"
> fac
 [1] e         e         new.level new.level e         e         new.level
 [8] e         a         e         a         e         new.level new.level
[15] new.level e         new.level new.level new.level new.level
Levels: a new.level e
>

Regards
Petr

________________________________________
From: [hidden email] [[hidden email]] on behalf of Bert Gunter [[hidden email]]
Sent: 17 August 2012 19:32
To: [hidden email]
Subject: [R] Opinion: Why I find factors convenient to use |
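For readers who want to reproduce the level-merging idiom without random sampling, here is a small deterministic sketch. The input vector is an assumption chosen for illustration; the idiom itself is the one shown in the post above.

```r
# Deterministic variant of the example above (no sampling).
fac <- factor(c("a", "b", "c", "d", "e", "b", "e"))

# Assigning one label to several positions of levels() merges those levels.
levels(fac)[2:4] <- "new.level"

print(levels(fac))
print(table(fac))
```

Doing the same with a character vector would need an explicit lookup such as `x[x %in% c("b", "c", "d")] <- "new.level"`, which touches every element rather than just the level table.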
|
In reply to this post by Bert Gunter
I don't know if my recent post on this prompted your post, but I don't see much to argue with in your discussion. I find factors to be useful for managing display and some kinds of analysis.
However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay conversion until late in the game... but usually I do eventually convert in most cases.
---------------------------------------------------------------------------
Jeff Newmiller
DCN:<[hidden email]>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity. |
|
Hi,
On Fri, Aug 17, 2012 at 1:58 PM, Jeff Newmiller <[hidden email]> wrote:
> I don't know if my recent post on this prompted your post, but I don't see much to argue with in your discussion. I find factors to be useful for managing display and some kinds of analysis.
>
> However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay conversion until late in the game... but usually I do eventually convert in most cases.

Agreed here -- I actually haven't been tuned into any such recent conversation (if there was one), but if I were a gambling man, I'd bet that the majority of the problems people have with factors can probably be boiled down to the fact that the default value for stringsAsFactors is TRUE.

I like factors -- that said, I am annoyed by them at times, but I still like them.

Also, Bert mentioned that he thinks they save space over characters -- I believe that this is no longer true, but I'm not certain.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact |
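To make the stringsAsFactors point concrete, the sketch below reads the same inline CSV text twice. The column names and values are assumptions for illustration only. The thread describes R 2.15.1, where the default was TRUE; in R 4.0.0 and later the default is FALSE, so passing the argument explicitly keeps the result independent of the R version.

```r
# Hypothetical CSV content, supplied inline so the sketch is self-contained.
csv <- "id,size\n1,small\n2,large\n3,medium\n"

# Passing stringsAsFactors explicitly avoids depending on the default.
as_fac <- read.csv(text = csv, stringsAsFactors = TRUE)
as_chr <- read.csv(text = csv, stringsAsFactors = FALSE)

print(class(as_fac$size))
print(class(as_chr$size))
```

The numeric `id` column is unaffected by the setting; only columns that would otherwise be character are converted.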
|
Steve, et al.:

Yes, if object.size() is to be believed, you're right:

> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
> y <- factor(x)
> object.size(x)
40120 bytes
> object.size(y)
40336 bytes

I stand (happily) corrected.

-- Bert |
|
Hello,
No, factors may use less memory. System dependent?

> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
> y <- factor(x)
> object.size(x)
80184 bytes
> object.size(y)
40576 bytes
>
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252  LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rcapture_1.2-0 xts_0.8-0      zoo_1.7-7

loaded via a namespace (and not attached):
[1] chron_2.3-39   fortunes_1.4-2 grid_2.15.1    lattice_0.20-6 tools_2.15.1

And I agree with what Steve said, stringsAsFactors = FALSE saves hours of debugging time.

Rui Barradas

On 17-08-2012 19:19, Bert Gunter wrote:
> Steve, et. al:
>
> Yes, if object.size() is to be believed, you're right:
>
>> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
>> y <- factor(x)
>> object.size(x)
> 40120 bytes
>> object.size(y)
> 40336 bytes
>
> I stand (happily) corrected.
>
> -- Bert |
|
On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <[hidden email]> wrote:
> Hello,
>
> No, factors may use less memory. System dependent?

I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on 64-bit Windows and Linux installations, but Bert's result on a 32-bit Linux machine.

Peter

>
>> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
>> y <- factor(x)
>> object.size(x)
> 80184 bytes
>> object.size(y)
> 40576 bytes |
|
Hello,
On 17-08-2012 20:27, Bert Gunter wrote:
> ... so it may be just the way object.size() counts in the two cases, right?

Or maybe the way character vectors and factors are coded (64-bit Windows 7 or Ubuntu 12.04). 80k for the character vector seems to be 8 * 1e4 for pointers plus room for the strings themselves, and 40k for the factor seems more like 32-bit ints * 1e4 in consecutive memory locations. I confess to being too lazy to go check the sources, but if this is the case then it's another point for factors: they are indeed more efficient memory-wise. And 64-bit OSs are going to be used more and more; processors aren't getting worse.

There is also the statistical side of it. Factors are the natural way of coding nominal or categorical variables. The small/medium/large example is a good one. Or seasons: we like to see Fall or Autumn after Spring and Summer, not before. (By the way, does anyone know why M/F?) And this has nothing to do with the usefulness of characters; I like persons' names to be names, alphabetic.

I've also made a simple check: apparently, character vectors are kept as a vector of pointers and a vector of unique strings. If we change one of the strings, even to something smaller that occupies fewer bytes, object.size will report an increase in size. Try x[1] <- "a" and see the new size of x. It's bigger, and the number of pointers to strings is the same.

For 32- and 64-bit Windows 7 and for 64-bit Ubuntu 12.04, R was:

> R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas |
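The size comparison and the x[1] <- "a" check described above can be rerun with a short script. This is a sketch only: the vector is built with rep() instead of sample() so that it is deterministic, and no byte counts are quoted because, as the thread shows, they differ between 32-bit and 64-bit builds.

```r
# Deterministic stand-in for the sampled vector used in the thread.
x <- rep(c("small", "medium", "large"), length.out = 1e4)
y <- factor(x)

size_chr <- object.size(x)
size_fac <- object.size(y)
print(size_chr)
print(size_fac)

# The check suggested above: introducing a new, shorter string still grows
# the reported size, because one more distinct string has to be counted.
x2 <- x
x2[1] <- "a"
print(object.size(x2))
```

On a 64-bit build the factor would be expected to come out smaller, in line with the pointer-versus-integer explanation given in the post; on a 32-bit build the two sizes are close.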
|
In reply to this post by Rui Barradas
On Fri, Aug 17, 2012 at 07:34:35PM +0100, Rui Barradas wrote:
> ...
> And I agree with what Steve said, stringsAsFactors = FALSE saves hours
> of debugging time.

Hi.

I use stringsAsFactors = FALSE quite frequently. If there is a discussion on R-devel about whether this should be the default, I would support it.

Factors are very useful and sometimes necessary, but they are hard to manipulate. As Jeff Newmiller said, it is a good strategy to prepare the data as character type and convert them to a factor when they are complete. Users should know how to use factors; however, the strategy "convert to a factor eventually" is more consistent with not having stringsAsFactors = TRUE as the default.

Petr Savicky. |
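The "prepare as character, convert to a factor when complete" strategy might look like the sketch below. The raw labels are invented for illustration, and trimws() assumes R 3.2.0 or later, which is newer than the R version used in the thread.

```r
# Hypothetical raw labels with inconsistent case and stray whitespace.
raw <- c(" Small", "large ", "MEDIUM", "small", "Large")

# Clean up while the data are still character ...
clean <- tolower(trimws(raw))

# ... and convert once the set of values is settled.
size <- factor(clean, levels = c("small", "medium", "large"))
print(table(size))

# Converting too early would have kept every spelling as its own level.
print(nlevels(factor(raw)))
```

Listing the levels explicitly in the final factor() call also acts as a check: any value that was not cleaned into one of the three labels becomes NA and is easy to spot.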
|
In reply to this post by Bert Gunter
On 08/18/2012 03:32 AM, Bert Gunter wrote:
> Folks:
> ...
> So contrary opinions
> cheerily welcomed. But perhaps these comments might be helpful to
> those who have been "bitten" by factors or just wonder what all the
> fuss is about.
>
I tend to use stringsAsFactors=FALSE quite a bit, as I am often manipulating character strings, and that

Error in strsplit(bugga, "") : non-character argument

is so annoying. Almost as annoying as printing out a list of selected cases with some of the fields turning up as integers rather than the strings I expected. That said, I often convert the results to factors so that some other function will work properly. So I must express my gratitude for motivating me to add

options(stringsAsFactors=FALSE)

to that wonderful .First function that makes my life a little happier every day.

Jim |
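Both annoyances mentioned above are easy to reproduce. The sketch below uses a made-up two-element factor; the error is captured with tryCatch() so that the script keeps running.

```r
f <- factor(c("ab", "cd"))

# strsplit() refuses a factor; capture the error message instead of stopping.
msg <- tryCatch(strsplit(f, ""), error = function(e) conditionMessage(e))
print(msg)

# Converting back to character first works as expected.
parts <- strsplit(as.character(f), "")
print(parts)

# as.integer() exposes the underlying codes that sometimes leak into output.
print(as.integer(f))
```

Wrapping the argument in as.character() is harmless when the input is already character, so it can be used defensively in functions that may receive either type.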
|
In reply to this post by Bert Gunter
> -----Original Message-----
> Over the years, many people -- including some who I would
> consider real expeRts -- have criticized factors and
> advocated the use (sometimes exclusively) of character
> vectors instead.

Exclusive use of character vectors is not going to do the job.

The concept of a factor is fundamental to a lot of statistics; a programming environment that does not implement factors and their associated special behaviour is probably not a statistical programming language.

Special behaviours I have in mind include:
- Level order can be arbitrarily specified for display purposes
- A control level can be intentionally chosen for contrasts
- the option of "ordered" factors (for example, for polr and the like)

So I think the language does and will require a 'factor' type in one form or another.

_When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion. And on the whole, it seems to me that character input gets used as factor data so much of the time when it is used at all that the default stringsAsFactors=TRUE setting seems the more sensible default.

S Ellison |
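Two of the special behaviours listed above, choosing a control level and using ordered factors, can be sketched in a few lines. The group labels are hypothetical; relevel() is the base R function for moving a level to the front.

```r
# Hypothetical treatment labels; "placebo" sorts last alphabetically.
grp <- factor(c("drugA", "placebo", "drugB", "placebo", "drugA"))
print(levels(grp))

# relevel() moves the chosen level to the front, so the default treatment
# contrasts use it as the baseline.
grp <- relevel(grp, ref = "placebo")
print(levels(grp))

# An ordered factor supports comparisons that follow the level order.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
print(size < "large")
```

Neither operation has a direct counterpart for a plain character vector, where `<` compares strings alphabetically.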
|
Hello,
On 20-08-2012 12:30, S Ellison wrote:
> ...
> and on the whole, it seems to me that character input gets used as factor data so much of the time when it is used at all that the default stringsAsFactors=TRUE setting seems the more sensible default.

I disagree with this last point. Just think of the number of questions to this list about, say, dates. When read from a file using one of the forms of read.table, they usually cause problems. Unless the user is an experienced one, in which case he/she might not have a question to ask. Besides, the default TRUE contradicts "stick with character early and convert to factor a bit later", in both the "early" and the "later".

Changing the default behavior of a widely used function from one version of R to the next is a different matter. What about all the code in use? Maybe it's better to leave it be.

Rui Barradas |
|
Whether I use stringsAsFactors=FALSE or stringsAsFactors=TRUE tends to depend on where my data are coming from. If the data are coming from our Oracle databases (well controlled data), I import them with stringsAsFactors=TRUE and everything is great. If the data are given to me by a fellow in the form of an Excel spreadsheet, I have a good cry and then set stringsAsFactors=FALSE. Regardless, before I get to analyzing the data, I convert them all to factors. I imagine people's preferences for the default setting are strongly tied to the quality of the data with which they tend to work.
I would prefer the default argument be left as it is, however. Mostly because 1) I feel like it assumes you are importing data for analysis and not for data management; and more importantly 2) Changing the default would mean I have to change the way I approach data import--and I don't like to change.

Benjamin Nutter | Biostatistician | Quantitative Health Sciences
Cleveland Clinic | 9500 Euclid Ave. | Cleveland, OH 44195 | (216) 445-1365

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Rui Barradas
Sent: Monday, August 20, 2012 8:03 AM
To: S Ellison
Cc: r-help
Subject: Re: [R] Opinion: Why I find factors convenient to use |
|
In reply to this post by Rui Barradas
Hi
> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of Rui Barradas
> Sent: Monday, August 20, 2012 2:03 PM
> To: S Ellison
> Cc: r-help
> Subject: Re: [R] Opinion: Why I find factors convenient to use
>
> I disagree with this last point. Just think of the number of questions
> to this list about, say, dates. When read from file using one of the
> forms of read.table, they usually cause problems.

Hm. I may be wrong, but most confusion comes from:

My numbers are not read as numbers and when I try to convert them by as.numeric they are changed and scrambled to integers. What can I do?

Personally I do not find factors too confusing; they behave almost the same as character vectors.

ch<-sample(letters[1:4], 20, replace=T)
ff<-factor(ch)
ch[ch=="b"]
[1] "b" "b" "b" "b" "b" "b" "b"
ff[ff=="b"]
[1] b b b b b b b
Levels: a b c d
paste(ch,1:5)
 [1] "b 1" "d 2" "d 3" "c 4" "d 5" "c 1" "b 2" "b 3" "c 4" "d 5" "b 1" "c 2"
[13] "b 3" "c 4" "b 5" "c 1" "c 2" "c 3" "b 4" "a 5"
paste(ff,1:5)
 [1] "b 1" "d 2" "d 3" "c 4" "d 5" "c 1" "b 2" "b 3" "c 4" "d 5" "b 1" "c 2"
[13] "b 3" "c 4" "b 5" "c 1" "c 2" "c 3" "b 4" "a 5"

ddch<-c("2000-05-05", "2001-05-05")
ddf<-as.factor(ddch)
str(as.Date(ddch))
 Date[1:2], format: "2000-05-05" "2001-05-05"
str(as.Date(ddf))
 Date[1:2], format: "2000-05-05" "2001-05-05"

The only problem is when you want to add some values to factors or to concatenate by c(some factor, some values); you need to do a character conversion like this:

my.c <- function(x, ...) {
  x.f <- as.character(x)
  if (is.factor(x)) res <- as.factor(c(x.f, ...)) else res <- c(x,...)
  res
}

But e.g. merge works fine:

ffx <- factor("x")
str(merge(data.frame(ff), data.frame(ffx), by.x="ff", by.y="ffx", all=T))
'data.frame': 21 obs. of 1 variable:
 $ ff: Factor w/ 5 levels "a","b","c","d",..: 1 1 2 2 2 2 2 2 3 3 ...

So for me personally the default read.table stringsAsFactors=TRUE is better, as I have some code working with factors and without checking. |
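The "numbers scrambled to integers" confusion described above comes from as.numeric() returning a factor's internal codes. A minimal sketch with made-up values:

```r
# A numeric column that was read in as a factor (hypothetical values).
f <- factor(c("10", "2.5", "10", "300"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(f)
print(codes)

# Going through the labels recovers the numbers.
vals <- as.numeric(as.character(f))
print(vals)

# Equivalent, converting each level only once.
vals2 <- as.numeric(levels(f))[f]
print(vals2)
```

The second form converts only the distinct levels and then indexes by the codes, which can be cheaper for long vectors with few distinct values.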
|
In reply to this post by nutterb
>> on the whole, it seems to me that character
>> input gets used as factor data so much of the time when it is
>> used at all that the default stringsAsFactors=TRUE setting
>> seems the more sensible default.
>
> I disagree with this last point. Just think of the number of
> questions to this list about, say, dates.

Mileage on this issue is likely to vary within and between useRs. For a more than anecdotal answer to the question of whether 'on the whole', character input gets used as factor data, one would have to construct a more careful survey than this.

S Ellison |
