Default for bin limits in hist()


R devel mailing list
Hello all.

I noticed that hist() builds its bins with “right = TRUE” by default, so each bin is closed on the right and open on the left.

I think “right=FALSE” would be more consistent with usual definitions of lower and upper limits for bins in applied statistics, and I suggest that you consider making it the default for hist().

For example, I generated the following frequency distribution for duration of hospitalization with a script in R specifying the cuts to be “right = FALSE” (from an exercise in Bernard Rosner’s Fundamentals of Biostatistics book).  

bin        number   proportion
[0,5)         5        0.20
[5,10)       12        0.48
[10,15)       6        0.24
[15,20)       1        0.04
[20,25)       0        0.00
[25,30]       1        0.04

Because the limits on the right are “open”, the actual boundaries for each bin are 0-4, 5-9, 10-14, … and so on, with the exception of the last bin, which is closed on the right. This format is in agreement with usual practice and recommendations. Actually, it is compatible with the process described by Rosner in his book (“from y inclusive to y exclusive”).
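
For anyone who wants to reproduce a table like the one above, here is a minimal sketch using cut() and table(). The vector `los` is hypothetical: it was constructed only so that the counts match those quoted in this post, and it is not the data from Rosner's exercise. The bin limits at multiples of 5 are also an assumption.

```r
# Hypothetical durations of hospitalization in days (NOT Rosner's data),
# built only to reproduce the counts quoted in the post.
los <- rep(c(3, 5, 8, 10, 12, 17, 30), times = c(5, 4, 8, 1, 5, 1, 1))
breaks <- seq(0, 30, by = 5)  # assumed bin limits: 0, 5, ..., 30

# right = FALSE gives [a, b) intervals; include.lowest = TRUE closes the
# last interval on the right so that the maximum (30) is not dropped.
bins <- cut(los, breaks = breaks, right = FALSE, include.lowest = TRUE)
freq <- table(bins)

tab <- data.frame(number = as.vector(freq),
                  proportion = as.vector(freq) / length(los),
                  row.names = names(freq))
print(tab)
```

With include.lowest = TRUE the last label comes out as [25,30], which matches the closed last bin in the table.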

If I use R to generate a histogram with 6 bins, I get the following:

[histogram; see attachments]

… which actually presents the histogram of the frequency distribution obtained when the “right” parameter is set to “TRUE”:


bin        number   proportion
[0,5]         9        0.36
(5,10]        9        0.36
(10,15]       5        0.20
(15,20]       1        0.04
(20,25]       0        0.00
(25,30]       1        0.04

In this case, the real limits of the bins are 0-5, 6-10, 11-15, … and so on.
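
The right-closed counts can be checked without drawing the plot. This is only a sketch: `los` is a hypothetical vector built to match the counts in this post (not the original data), and the explicit breaks at multiples of 5 are assumed, since the original command is not shown.

```r
# Hypothetical durations in days, constructed to match the quoted counts.
los <- rep(c(3, 5, 8, 10, 12, 17, 30), times = c(5, 4, 8, 1, 5, 1, 1))

# hist() defaults: right = TRUE and include.lowest = TRUE, i.e. (a, b] bins
# with the first bin also closed on the left.
h <- hist(los, breaks = seq(0, 30, by = 5), plot = FALSE)
print(h$counts)
```

Stays of exactly 5 days are counted in the first bin here, and stays of exactly 10 days in the second.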

If I edit the histogram command by adding “right = FALSE”, I get the histogram for my original frequency distribution. Compare bins 1 and 2 in both distributions and histograms.
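
A sketch of that comparison, again with a hypothetical `los` vector chosen to reproduce the counts in this post and with assumed breaks at multiples of 5:

```r
# Hypothetical durations in days, constructed to match the quoted counts.
los <- rep(c(3, 5, 8, 10, 12, 17, 30), times = c(5, 4, 8, 1, 5, 1, 1))
breaks <- seq(0, 30, by = 5)

h_default <- hist(los, breaks = breaks, plot = FALSE)                 # (a, b] bins
h_left    <- hist(los, breaks = breaks, right = FALSE, plot = FALSE)  # [a, b) bins

# One row per setting, one column per bin.
counts <- rbind(right_true = h_default$counts, right_false = h_left$counts)
colnames(counts) <- paste("bin", seq_len(ncol(counts)))
print(counts)
```

Only the values that fall exactly on a limit (5 and 10 in this vector) move between bins, which is why the two rows differ in the first three columns only. Dropping plot = FALSE draws the corresponding histograms.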

The value of the “right” argument may be a matter of preference, but I think most users of R would benefit from bins with limits that are closed on the left and open on the right, and therefore from having “right = FALSE” as the default for hist().

I am aware I am writing from the limited perspective of my own field (epidemiology and biostatistics), but there are plenty of examples that show the need to consider changing the default. Here are just a few:

https://www.statcan.gc.ca/eng/concepts/definitions/age2

https://seer.cancer.gov/stdpopulations/stdpop.19ages.html

https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-hinc/hinc-01.html


Thank you.

José

José G. Conde, MD, MPH
Professor, School of Medicine
Director, CentIT2
UPR Medical Sciences Campus

Tel  (787) 763-9401 Fax (787) 758-5206

Email: [hidden email]

URL: http://rcmi.rcm.upr.edu


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Attachments: histogram1.pdf (5K), Histogram2.pdf (5K)