Opinion: Why I find factors convenient to use

Bert Gunter
Folks:

Over the years, many people -- including some who I would consider
real expeRts -- have criticized factors and advocated the use
(sometimes exclusively) of character vectors instead. I would just
like to point out that, for me, factors provide one feature that I
find to be very convenient: ordering of levels. **

As an example, suppose one has a character vector of labels "small",
"medium", and "large". Then most R functions (e.g. tapply()) will
display results involving this vector in alphabetical order, which I
think most would view as undesirable. By converting to a factor with
levels in the logical order, displays will automatically be "logical."
For example:

> x <- sample(c("small","medium","large"), 12, replace=TRUE)
> table(x)
x
 large medium  small
     2      3      7
> y <- factor(x, levels=c("small","medium","large")) ## ordered() also would do, but is not necessary for this
> table(y)
y
 small medium  large
     7      3      2
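
The same effect can be sketched with tapply(), which is mentioned above. This is only an illustration: the `weight` values and variable names are invented, not taken from the post.

```r
# Hypothetical data: the sizes and weights are invented for illustration.
size   <- c("small", "large", "medium", "small", "large", "medium")
weight <- c(1, 9, 4, 2, 11, 6)

# Grouping by a character vector: groups come out in alphabetical order.
by_chr <- tapply(weight, size, mean)
names(by_chr)

# Grouping by a factor with explicit levels: groups follow the level order.
size_f <- factor(size, levels = c("small", "medium", "large"))
by_fac <- tapply(weight, size_f, mean)
names(by_fac)
```

Only the grouping variable changes between the two calls; the values being summarised are identical.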

Naturally, this is just my opinion, and I understand why lots of smart
people find factors irritating (at least!). So contrary opinions
cheerily welcomed. But perhaps these comments might be helpful to
those who have been "bitten" by factors or just wonder what all the
fuss is about.

** Another advantage is reduced storage space, I believe. Please
correct if wrong.

Cheers,
Bert

--

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Opinion: Why I find factors convenient to use

PIKAL Petr
I second Bert's opinion: factors can be confusing, but they have some quite nice features that cannot be easily mimicked by plain character vectors. I find the ability to manipulate their levels extremely useful.

> fac<-factor(sample(letters[1:5], 20, replace=TRUE))
> fac
 [1] e e d d e e c e a e a e b b d e c c d b
Levels: a b c d e
> levels(fac)[2:4]<- "new.level"
> fac
 [1] e         e         new.level new.level e         e         new.level
 [8] e         a         e         a         e         new.level new.level
[15] new.level e         new.level new.level new.level new.level
Levels: a new.level e
>
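
A related way to merge levels, shown here only as a sketch with made-up data, is to assign a named list to levels(). It is an alternative spelling of the indexed assignment above, not something the post itself uses.

```r
# Small illustrative factor; the values are arbitrary.
fac <- factor(c("e", "d", "c", "a", "b", "e"))

# Assigning a named list to levels() maps old levels onto new ones;
# the order of the list sets the new level order.
levels(fac) <- list(a = "a", new.level = c("b", "c", "d"), e = "e")
levels(fac)
table(fac)
```

The list form spells out every mapping explicitly, which can be easier to read when several groups are merged at once.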

Regards
Petr



Re: Opinion: Why I find factors convenient to use

Jeff Newmiller
In reply to this post by Bert Gunter
I don't know if my recent post on this prompted your post, but I don't see much to argue with in your discussion. I find factors to be useful for managing display and some kinds of analysis.  

However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay conversion until late in the game... but usually I do eventually convert in most cases.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.




Re: Opinion: Why I find factors convenient to use

Steve Lianoglou-6
Hi,

On Fri, Aug 17, 2012 at 1:58 PM, Jeff Newmiller
<[hidden email]> wrote:
> I don't know if my recent post on this prompted your post, but I don't see much to argue with in your discussion. I find factors to be useful for managing display and some kinds of analysis.
>
> However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay conversion until late in the game... but usually I do eventually convert in most cases.

Agreed here -- I actually haven't been tuned into any such recent
conversation (if there was one), but if I were a gambling man, I'd bet
that the majority of the problems people have with factors can
probably be boiled down to the fact that the default value for
stringsAsFactors is TRUE.
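
One classic way this bites, sketched with an invented `id` column: numeric-looking strings that silently became a factor. The argument is set explicitly here rather than relying on the default, which differs between R versions (it became FALSE in R 4.0.0).

```r
# Illustrative data frame; the column name and values are invented.
raw <- data.frame(id = c("10", "20", "5"), stringsAsFactors = TRUE)
class(raw$id)

# as.numeric() on a factor returns the internal level codes, not the labels.
as.numeric(raw$id)

# Going through character first recovers the intended numbers.
as.numeric(as.character(raw$id))

# With stringsAsFactors = FALSE the column stays character.
chr <- data.frame(id = c("10", "20", "5"), stringsAsFactors = FALSE)
class(chr$id)
as.numeric(chr$id)
```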

I like factors -- that said, I am annoyed by them at times, but I
still like them.

Also, Bert mentioned that he thinks they save space over characters --
I believe that this is no longer true, but I'm not certain.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: Opinion: Why I find factors convenient to use

Bert Gunter
Steve, et al.:

Yes, if object.size() is to be believed, you're right:

> x <- sample(c("small","medium","large"), 1e4, replace=TRUE)
> y <- factor(x)
> object.size(x)
40120 bytes
> object.size(y)
40336 bytes

I stand (happily) corrected.

-- Bert


Re: Opinion: Why I find factors convenient to use

Rui Barradas
Hello,

No, factors may use less memory. System dependent?

 > x <-sample(c("small","medium","large"),1e4,rep=TRUE)
 > y <- factor(x)
 > object.size(x)
80184 bytes
 > object.size(y)
40576 bytes
 >
 > sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

other attached packages:
[1] Rcapture_1.2-0 xts_0.8-0      zoo_1.7-7

loaded via a namespace (and not attached):
[1] chron_2.3-39   fortunes_1.4-2 grid_2.15.1    lattice_0.20-6 tools_2.15.1


And I agree with what Steve said, stringsAsFactors = FALSE saves hours
of debugging time.

Rui Barradas


Re: Opinion: Why I find factors convenient to use

plangfelder
On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <[hidden email]> wrote:
> Hello,
>
> No, factors may use less memory. System dependent?

I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on
64-bit Windows and Linux installations, but Bert's result on a 32-bit
Linux machine.

Peter


Re: Opinion: Why I find factors convenient to use

Rui Barradas
Hello,

On 17-08-2012 20:27, Bert Gunter wrote:
> ... so it may be just the way object.size() counts in the two cases, right?

Or maybe the way character vectors and factors are coded.
On 64-bit Windows 7 or Ubuntu 12.04, 80k for the character vector seems to
be 8 * 1e4 bytes for pointers plus room for the strings themselves, and 40k
for the factor seems more like 32-bit ints * 1e4 in consecutive memory
locations. I confess to being too lazy to go check the sources, but if
this is the case then it's another point in favour of factors: they are
indeed more efficient memory-wise.
And 64-bit OSs are going to be used more and more; processors aren't
getting worse.

There is also the statistical side of it. Factors are the natural way of
coding nominal or categorical variables. The small/medium/large example
is a good one. Or seasons: we like to see Fall or Autumn after Spring
and Summer, not before. (btw, does anyone know why M/F?) And this has
nothing to do with the usefulness of characters; I like persons' names
to be names, alphabetic.

I've also made a simple check. Apparently, character vectors are kept as
a vector of pointers and a vector of unique strings. If we change one of
the strings, even to something smaller that occupies fewer bytes,
object.size will report an increase in size. Try x[1] <- "a" and see the
new size of x. It's bigger, while the number of pointers to strings is the
same.
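
The check described above can be sketched as follows. The seed and the names `y` and `x2` are arbitrary choices for this illustration, and the byte counts will vary with platform and R version, so none are shown.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)
y <- factor(x)

# A factor is stored as integer codes plus a "levels" attribute.
typeof(y)

# Reported sizes; these depend on the build (32-bit vs. 64-bit).
object.size(x)
object.size(y)

# Replace one element with a shorter string that is new to the vector.
x2 <- x
x2[1] <- "a"

# Difference in reported size, in bytes.
as.numeric(object.size(x2)) - as.numeric(object.size(x))
```

The difference is positive because object.size() counts each distinct string in the vector once, and "a" adds a fourth distinct string.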

For 32 and 64 bit Windows 7 and for 64 bit ubuntu 12.04, R was:
 > R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas


Re: Opinion: Why I find factors convenient to use

Petr Savicky
In reply to this post by Rui Barradas
On Fri, Aug 17, 2012 at 07:34:35PM +0100, Rui Barradas wrote:

> And I agree with what Steve said, stringsAsFactors = FALSE saves hours
> of debugging time.

Hi.

I use stringsAsFactors = FALSE quite frequently. If there is a discussion
on R-devel about whether this should be the default, I would support it.

Factors are very useful and sometimes necessary, but they are hard to manipulate.
As Jeff Newmiller said, it is a good strategy to prepare the data as character
type and convert them to a factor once they are complete. Users should know how
to use factors; however, the strategy "convert to a factor eventually" is
more consistent with not having stringsAsFactors = TRUE as the default.
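
A minimal sketch of the "character first, factor last" strategy, assuming a small invented data frame with messy labels. The column names and the cleaning steps are illustrative only.

```r
# Hypothetical raw input with inconsistent labels, kept as character.
raw <- data.frame(
  size  = c("small", " Medium", "LARGE", "small ", "medium"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Clean while the column is still character: trim whitespace, fix case.
raw$size <- tolower(trimws(raw$size))

# Convert once the values are complete, choosing the level order explicitly.
raw$size <- factor(raw$size, levels = c("small", "medium", "large"))
table(raw$size)
```

Had the column been a factor from the start, each cleaning step would have had to go through levels() or back through as.character().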

Petr Savicky.


Re: Opinion: Why I find factors convenient to use

Jim Lemon
In reply to this post by Bert Gunter
On 08/18/2012 03:32 AM, Bert Gunter wrote:
> Folks:
> ...
> So contrary opinions
> cheerily welcomed. But perhaps these comments might be helpful to
> those who have been "bitten" by factors or just wonder what all the
> fuss is about.
>
I tend to use stringsAsFactors=FALSE quite a bit, as I am often
manipulating character strings, and that

Error in strsplit(bugga, "") : non-character argument

is so annoying. Almost as annoying as printing out a list of selected
cases with some of the fields turning up as integers rather than the
strings I expected. That said, I often convert the results to factors so
that some other function will work properly. So I must express my
gratitude for motivating me to add

options(stringsAsFactors=FALSE)

to that wonderful .First function that makes my life a little happier
every day.
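
Both annoyances can be reproduced in a few lines. This is a sketch with a made-up two-element factor, reusing the name `bugga` from the error message above.

```r
bugga <- factor(c("ab", "cd"))

# strsplit() rejects factors; capture the error message instead of stopping.
msg <- tryCatch(strsplit(bugga, ""), error = function(e) conditionMessage(e))
msg

# Converting back to character first gives the expected result.
parts <- strsplit(as.character(bugga), "")
parts

# cat() on a factor prints the integer codes rather than the labels.
cat(bugga, "\n")
cat(as.character(bugga), "\n")
```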

Jim


Re: Opinion: Why I find factors convenient to use

S Ellison-2
In reply to this post by Bert Gunter
 

> -----Original Message-----
> Over the years, many people -- including some who I would
> consider real expeRts -- have criticized factors and
> advocated the use (sometimes exclusively) of character
> vectors instead.

Exclusive use of character vectors is not going to do the job.

The concept of a factor is fundamental to a lot of statistics; a programming environment that does not implement factors and their associated special behaviour is probably not a statistical programming language.

Special behaviours I have in mind include:
- Level order can be arbitrarily specified for display purposes
- A control level can be intentionally chosen for contrasts
- the option of "ordered" factors (for example, for polr and the like)
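
The behaviours in this list can be sketched briefly. The group labels below are invented, and the contrasts shown assume R's default treatment contrasts for unordered factors.

```r
group <- factor(c("drug_a", "placebo", "drug_b", "placebo", "drug_a"))
levels(group)   # alphabetical by default, so "drug_a" would be the baseline

# Choose the control level as the baseline for treatment contrasts.
group <- relevel(group, ref = "placebo")
levels(group)
contrasts(group)   # the baseline row is all zeros

# Ordered factors support comparisons between levels.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size < "large"
```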

So I think the language does and will require a 'factor' type in one form or another.

 _When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion. And on the whole, it seems to me that character input, when it is used at all, gets used as factor data so much of the time that stringsAsFactors=TRUE seems the more sensible default.

S Ellison


Re: Opinion: Why I find factors convenient to use

Rui Barradas
Hello,

On 20-08-2012 12:30, S Ellison wrote:

>   _When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion. And on the whole, it seems to me that character input, when it is used at all, gets used as factor data so much of the time that stringsAsFactors=TRUE seems the more sensible default.

I disagree with this last point. Just think of the number of questions
to this list about, say, dates. When read from file using one of the
forms of read.table, they usually cause problems, unless the user is an
experienced one, in which case he/she might not have a question to ask.
Besides, the default TRUE contradicts "stick with character
early and convert to factor a bit later", with both "early" and "later".
A different matter is having a heavily used function's default behavior
change from one version of R to the next. What about all the code in
use? Maybe it's better to leave it be.
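
The date problem can be sketched with a small inline CSV. The data and the day/month/year format are assumptions made for this illustration.

```r
# Invented CSV text with dates in day/month/year form.
csv <- "date,value
20/08/2012,1
17/08/2012,2
01/09/2012,3"

# Read as factor: the levels sort as text, not chronologically.
as_fac <- read.csv(text = csv, stringsAsFactors = TRUE)
class(as_fac$date)
levels(as_fac$date)

# Read as character, then convert explicitly with the known format.
as_chr <- read.csv(text = csv, stringsAsFactors = FALSE)
as_chr$date <- as.Date(as_chr$date, format = "%d/%m/%Y")
sort(as_chr$date)
```

In the factor version the September date becomes the first level, ahead of both August dates.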

Rui Barradas


Re: Opinion: Why I find factors convenient to use

nutterb
Whether I use stringsAsFactors=FALSE or stringsAsFactors=TRUE tends to depend on where my data are coming from.  If the data are coming from our Oracle databases (well controlled data), I import them with stringsAsFactors=TRUE and everything is great.  If the data are given to me by a fellow in the form of an Excel spreadsheet, I have a good cry and then set stringsAsFactors=FALSE.  Regardless, before I get to analyzing the data, I convert them all to factors.  I imagine people's preferences for the default setting are strongly tied to the quality of the data with which they tend to work.

I would prefer the default argument be left as it is, however.  Mostly because
1) I feel like it assumes you are importing data for analysis and not for data management; and more importantly
2) Changing the default would mean I have to change the way I approach data import--and I don't like to change.

  Benjamin Nutter |  Biostatistician     |  Quantitative Health Sciences
  Cleveland Clinic    |  9500 Euclid Ave.  |  Cleveland, OH 44195  | (216) 445-1365


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Rui Barradas
Sent: Monday, August 20, 2012 8:03 AM
To: S Ellison
Cc: r-help
Subject: Re: [R] Opinion: Why I find factors convenient to use

Hello,

Em 20-08-2012 12:30, S Ellison escreveu:

>  
>
>> -----Original Message-----
>> Over the years, many people -- including some who I would consider
>> real expeRts -- have criticized factors and advocated the use
>> (sometimes exclusively) of character vectors instead.
> Exclusive use of character vectors is not going to do the job.
>
> The concept of a factor is fundamental to a lot of statistics; a programming environment that does not implement factors and their associated special behaviour is probably not a statistical programming language.
>
> Special behaviours I have in mind include:
> - Level order can be arbitrarily specified for display purposes
> - A control level can be intentionally chosen for contrasts
> - the option of "ordered" factors (for example, for polr and the like)
>
> So I think the language does and will require a 'factor' type in one form or another.
>
>   _When_ you decide to convert a character input to a factor is, of course, up to the user,and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion. and on the whole, it seems to me that character input gets used as factor data so much of the time when it is used at all that the default stringsAsFactors=TRUE setting seems the more sensible default.

I disagree with this last point. Just think of the number of questions to this list about, say, dates. When read from file using one of the forms of read.table, they usually cause problems. Unless the user is an experienced one, in which case he/she might not have a question to ask.
Besides, the default TRUE is contradictory with "stick with character early and convert to factor a bit later". With both "early" and "later".
A different thing is to have a very used function's default behavior change from one version of R to the next one. What about all the code in use? Maybe it's better to leave it be.

Rui Barradas

>
> S Ellison
>
> *******************************************************************
> This email and any attachments are confidential. Any
> use...{{dropped:8}}
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: Opinion: Why I find factors convenient to use

PIKAL Petr
In reply to this post by Rui Barradas
Hi

> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of
> Rui Barradas
> Sent: Monday, August 20, 2012 2:03 PM
> To: S Ellison
> Cc: r-help
> Subject: Re: [R] Opinion: Why I find factors convenient to use
>
> Hello,
>
> On 20-08-2012 12:30, S Ellison wrote:
> >
> >
> >> -----Original Message-----
> >> Over the years, many people -- including some who I would consider
> >> real expeRts -- have criticized factors and advocated the use
> >> (sometimes exclusively) of character vectors instead.
> > Exclusive use of character vectors is not going to do the job.
> >
> > The concept of a factor is fundamental to a lot of statistics; a
> > programming environment that does not implement factors and their
> > associated special behaviour is probably not a statistical programming
> > language.
> >
> > Special behaviours I have in mind include:
> > - Level order can be arbitrarily specified for display purposes
> > - A control level can be intentionally chosen for contrasts
> > - The option of "ordered" factors (for example, for polr and the like)
> >
> > So I think the language does and will require a 'factor' type in one
> > form or another.
> >
> > _When_ you decide to convert a character input to a factor is, of
> > course, up to the user, and for cleanup it's very often better to stick
> > with character early and convert to factor a bit later. But personally,
> > I think that there is sufficient control over the coding of data to
> > allow user discretion, and on the whole, it seems to me that character
> > input gets used as factor data so much of the time when it is used at
> > all that stringsAsFactors=TRUE seems the more sensible default.
>
> I disagree with this last point. Just think of the number of questions
> to this list about, say, dates. When read from file using one of the
> forms of read.table, they usually cause problems. Unless the user is an

Hm. I may be wrong, but most confusion comes from questions like:

My numbers are not read as numbers, and when I try to convert them with as.numeric they are changed and scrambled into integers. What can I do?
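
A minimal sketch of the situation that question describes, assuming a numeric column ended up as a factor when it was read in. The vector `raw` and its values are made up for illustration. Calling as.numeric() directly on a factor returns the internal integer level codes, not the printed values; converting through character first recovers the numbers.

```r
# Hypothetical input: numeric-looking strings that ended up as a factor
raw <- c("10.5", "3", "200", "3")
f <- factor(raw)

# as.numeric() on a factor returns the integer level codes,
# not the values that are printed
codes <- as.numeric(f)

# Going through character first recovers the actual values
vals <- as.numeric(as.character(f))

# An equivalent idiom: convert the levels once, then index by the factor
vals2 <- as.numeric(levels(f))[f]

print(codes)
print(vals)
```

The second idiom converts only the distinct levels rather than every element, which can matter for long vectors with few levels.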

Personally, I do not find factors too confusing; they behave almost the same as character vectors.

ch <- sample(letters[1:4], 20, replace=TRUE)
ff <- factor(ch)

ch[ch=="b"]
[1] "b" "b" "b" "b" "b" "b" "b"
ff[ff=="b"]
[1] b b b b b b b
Levels: a b c d

paste(ch,1:5)
 [1] "b 1" "d 2" "d 3" "c 4" "d 5" "c 1" "b 2" "b 3" "c 4" "d 5" "b 1" "c 2"
[13] "b 3" "c 4" "b 5" "c 1" "c 2" "c 3" "b 4" "a 5"
paste(ff,1:5)
 [1] "b 1" "d 2" "d 3" "c 4" "d 5" "c 1" "b 2" "b 3" "c 4" "d 5" "b 1" "c 2"
[13] "b 3" "c 4" "b 5" "c 1" "c 2" "c 3" "b 4" "a 5"

ddch<-c("2000-05-05", "2001-05-05")
ddf<-as.factor(ddch)
str(as.Date(ddch))
 Date[1:2], format: "2000-05-05" "2001-05-05"
str(as.Date(ddf))
 Date[1:2], format: "2000-05-05" "2001-05-05"

The only problem arises when you want to add values to a factor, or to concatenate with c(some factor, some values); then you need to go through a character conversion, like this:

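my.c <- function(x, ...) {
  # If x is a factor, concatenate its labels (not its integer codes)
  # with the extra values and turn the result back into a factor.
  if (is.factor(x)) {
    as.factor(c(as.character(x), ...))
  } else {
    c(x, ...)
  }
}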
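
A short usage sketch for the my.c helper above, reproduced here so the example runs on its own. The factor `size` and the variant `my.c.keep` are hypothetical, added only for illustration. One thing the sketch tries to show: because my.c rebuilds the factor with as.factor(), the levels come back in sorted order, so a deliberately chosen level order is lost. The variant keeps the original levels and appends any new labels at the end.

```r
# my.c as in the post (reproduced so the example is self-contained)
my.c <- function(x, ...) {
  if (is.factor(x)) as.factor(c(as.character(x), ...)) else c(x, ...)
}

# Hypothetical factor with a deliberate, non-alphabetical level order
size <- factor(c("small", "large"), levels = c("small", "medium", "large"))

combined <- my.c(size, "medium", "tiny")
print(combined)          # a factor holding the four labels
print(levels(combined))  # levels re-derived by as.factor(), i.e. sorted

# Hypothetical variant: keep the original level order, append new labels
my.c.keep <- function(x, ...) {
  new <- c(as.character(x), ...)
  factor(new, levels = union(levels(x), new))
}

kept <- my.c.keep(size, "medium", "tiny")
print(levels(kept))
```

Which of the two is preferable depends on whether the level order carries meaning in the analysis at hand.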

But merge, for example, works fine:
ffx <- factor("x")

str(merge(data.frame(ff), data.frame(ffx), by.x="ff", by.y="ffx", all=TRUE))
'data.frame':   21 obs. of  1 variable:
 $ ff: Factor w/ 5 levels "a","b","c","d",..: 1 1 2 2 2 2 2 2 3 3 ...

So for me personally, the read.table default stringsAsFactors=TRUE is better, as I have some code that works with factors without checking for them.
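
Since the thread turns on what the default should be, code that depends on either behaviour can state its choice explicitly instead of relying on the default. A minimal sketch, assuming a small made-up table supplied through the text argument of read.table(). Note that R 4.0.0 later changed the default to stringsAsFactors = FALSE, so the explicit argument is what keeps such code stable across versions.

```r
# Made-up data for illustration
txt <- "size value
small 1
large 2
medium 3"

# Ask for factors explicitly instead of relying on the default
d.f <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# Ask for character vectors explicitly
d.c <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

str(d.f)
str(d.c)

# "Stick with character early, convert later": choose the level order
# deliberately at the point of conversion
d.c$size <- factor(d.c$size, levels = c("small", "medium", "large"))
print(table(d.c$size))
```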


> experienced one, in which case he/she might not have a question to ask.
> Besides, the default TRUE is contradictory with "stick with character
> early and convert to factor a bit later". With both "early" and
> "later".
> It is a different matter to have a heavily used function's default
> behavior change from one version of R to the next. What about all the
> code in use? Maybe it's better to leave it be.
>
> Rui Barradas

Re: Opinion: Why I find factors convenient to use

S Ellison-2
In reply to this post by nutterb
>> on the whole, it seems to me that character
>> input gets used as factor data so much of the time when it is
>> used at all that the default stringsAsFactors=TRUE setting
>> seems the more sensible default.
>
> I disagree with this last point. Just think of the number of
> questions to this list about, say, dates.

Mileage on this issue is likely to vary within and between useRs.

For a more than anecdotal answer to the question of whether 'on the whole', character input gets used as factor data, one would have to construct a more careful survey than this.

S Ellison
