I would like a real-life example of a data set which one might think to model by a binomial distribution, but which is substantially underdispersed. I.e. a sample X = {X_1, X_2, ..., X_N} where each X_i is an integer between 0 and n (n known a priori) such that var(X) << mean(X)*(1 - mean(X)/n).

Does anyone know of any such examples? Do any exist? I've done a perfunctory web search, and had a look at "A Handbook of Small Data Sets" by Hand, Daly, Lunn, et al., and drawn a blank.

I've seen on the web some references to underdispersed "pseudo-Poisson" data, but not to underdispersed "pseudo-binomial" data. And of course there's lots of *over* dispersed stuff. But that's not what I want.

I can *simulate* data sets of the sort that I am looking for (so far the only ideas I've had for doing this are pretty simplistic and artificial) but I'd like to get my hands on a *real* example, if possible.

Grateful for any pointers/suggestions.

cheers,

Rolf Turner

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
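The criterion in the post above can be tried on any vector of counts with a few lines of R. This is only a minimal sketch: the helper name dispersion.ratio is an illustrative assumption, not something from the original post. It returns var(X) divided by mean(X)*(1 - mean(X)/n), so values well below 1 would indicate underdispersion relative to a binomial with the same mean.

```r
# Hypothetical helper (name and interface are assumptions for illustration):
# ratio of the sample variance to the variance that a binomial(n, p) would
# have, with p estimated by mean(x)/n.
dispersion.ratio <- function(x, n) {
    stopifnot(all(x >= 0), all(x <= n))
    m <- mean(x)
    var(x) / (m * (1 - m / n))
}

# Sanity check on genuinely binomial counts: the ratio should be close to 1.
set.seed(1)
x.binom <- rbinom(5000, size = 20, prob = 0.3)
ratio.binom <- dispersion.ratio(x.binom, n = 20)
print(ratio.binom)
```

A ratio rather than a TRUE/FALSE answer is used here because the question asks for var(X) to be much smaller than the binomial value, not merely smaller.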
On Wed, 24 Mar 2021 18:45:01 -0700 <Name suppressed to protect the innocent> wrote:

> "X = {X_1, X_2, ..., X_N} where each X_i
> is an integer between 0 and n (n known a priori)"
>
> That is a multinomial, not a binomial distribution. A binomial
> distribution can have only two values, success or failure.
>
> What have I misunderstood?

And then, following up:

> Oh, I think I get what you mean -- you are drawing repeated samples
> from a binomial with n trials and you are counting the number of
> successes for each.

Yes. Exactly. Sorry if my post was unclear.

cheers,

Rolf Turner
In reply to this post by Rolf Turner
I haven't checked this, but I guess that the number of students that *pass* a particular exam/subject, per semester would be like that.

e.g.
Let's say you have a course in maximum likelihood, that's taught once per year to 3rd year students, and a few postgrads.
You could count the number of passes, each year.

If you assume a near-constant probability of passing in each exam/semester, then I would assume it would follow the distribution that you're requesting.

If there is a significant change in the number of students (let's say that fewer and fewer students study maximum likelihood because they would rather study "advanced" R programming for "data science" with "large data"), then you might be able to apply some sort of discrete-scaling transformation to the number of passes each semester.
This would allow you to pretend that the number of people studying maximum likelihood is the same, and no one is studying other apparently more important subjects.
On Fri, 26 Mar 2021 13:41:00 +1300 Abby Spurdle <[hidden email]> wrote:

> I haven't checked this, but I guess that the number of students that
> *pass* a particular exam/subject, per semester would be like that.
>
> e.g.
> Let's say you have a course in maximum likelihood, that's taught once
> per year to 3rd year students, and a few postgrads.
> You could count the number of passes, each year.
>
> If you assume a near-constant probability of passing in each
> exam/semester: Then I would assume it would follow the distribution
> that you're requesting.

<SNIP>

Thanks Abby. I've experimented (simulated) a wee bit and found that if I keep the numbers of students (undergrad and grad) exactly constant, then the results are underdispersed. However if the numbers are allowed to vary then the results are overdispersed.

It seems that the universe is very reluctant to produce underdispersed pseudo-binomial data!

cheers,

Rolf
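The constant-numbers case described above can be sketched as follows. The group sizes and pass probabilities are made-up values, not the settings of the actual simulation. With fixed numbers of undergrads and postgrads who pass with different probabilities, the total number of passes is a sum of two binomials, and its variance n1*p1*(1 - p1) + n2*p2*(1 - p2) cannot exceed the n*p*(1 - p) of a single binomial with the same mean, since the difference is the sum of n_k*(p_k - p)^2.

```r
set.seed(2021)

# Assumed (illustrative) class composition and pass probabilities.
n.ug <- 30;  p.ug <- 0.50    # undergrads
n.pg <- 20;  p.pg <- 0.95    # postgrads
n <- n.ug + n.pg             # total class size, held exactly constant
nyears <- 10000              # number of simulated exam sittings

# Total passes per sitting: sum of two independent binomial counts.
passes <- rbinom(nyears, n.ug, p.ug) + rbinom(nyears, n.pg, p.pg)

# Compare the sample variance with the binomial variance for the same mean.
m <- mean(passes)
var.binom <- m * (1 - m / n)
ratio.fixed <- var(passes) / var.binom

# Theoretical variance of the sum of the two binomials.
var.theory <- n.ug * p.ug * (1 - p.ug) + n.pg * p.pg * (1 - p.pg)

print(c(sample.var = var(passes), theory.var = var.theory,
        binom.var = var.binom, ratio = ratio.fixed))
```

The underdispersion here comes entirely from the heterogeneity in pass probabilities; how far the ratio falls below 1 depends on how different the two groups are.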
On 25/03/2021 10:25 p.m., Rolf Turner wrote:
> Thanks Abby. I've experimented (simulated) a wee bit and found
> that if I keep the numbers of students (undergrad and grad) exactly
> constant, then the results are underdispersed. However if the
> numbers are allowed to vary then the results are overdispersed.
>
> It seems that the universe is very reluctant to produce underdispersed
> pseudo-binomial data!

I'd expect underdispersion to happen in competitive situations: if subject A succeeds, that makes it less likely that other subjects will also succeed. An extreme case is a contest winner. With some contests there will always be one winner (a little too underdispersed for you, probably), but others allow a small amount of variation. For example, sports events that allow ties. This page

https://en.wikipedia.org/wiki/List_of_ties_for_medals_at_the_Olympics

seems to indicate that speed skating had a lot of ties up until 1980.

Duncan Murdoch
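A toy version of the contest idea, with invented numbers rather than real Olympic results: suppose every event has 8 competitors and awards one gold medal, except that with some small probability a tie produces two golds. The number of competitors, the tie probability and the variable names below are all assumptions made for illustration.

```r
set.seed(1)

n.competitors <- 8      # assumed field size per event (the binomial "n")
n.events <- 5000        # number of simulated events
tie.prob <- 0.05        # assumed probability that an event ends in a tie

# Gold medals per event: always 1, plus 1 more when there is a tie.
golds <- 1L + rbinom(n.events, size = 1, prob = tie.prob)

# Sample variance relative to the binomial variance with the same mean.
m <- mean(golds)
ratio.gold <- var(golds) / (m * (1 - m / n.competitors))
print(ratio.gold)
```

Because one competitor winning almost rules out the others, the count hardly varies at all, which is the strong negative dependence described above.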
In reply to this post by Rolf Turner
Hi Rolf,
Let's say we have a course called Corgiology 101, with a single moderated exam.
And let's say the moderators transform initial exam scores, such that there are fixed percentages of pass rates and A grades.

Rather than count the number of passes, we can count the number of "jumps".
That is, the number of people that pass the corgiology exam after moderation, that would not have passed without moderation.

I've created a function to test for underdispersion, based on your expression.
(I hope I got it right).

Then I've gone on to create simulations, using both constant and nonconstant class sizes.
The nonconstant simulations apply an (approx) discrete scaling transformation, referred to previously.

We can see from the examples that there are a lot of these jumps.
And more importantly, they appear to be underdispersed.

----code----
PASS.SCORE <- 0.5
A.SCORE <- 0.8

#target parameters
PASS.RATE <- 0.8
A.RATE <- 0.2
#unmoderated parameters
UNMOD.MEAN.SCORE <- 0.65
UNMOD.SD.SCORE <- 0.075

NCLASSES <- 2000
NSTUD.CONST <- 200
NSTUD.NONCONST.LIMS <- c (50, 800)

sim.njump <- function (nstud, mean0=UNMOD.MEAN.SCORE, sd0=UNMOD.SD.SCORE,
    pass.score=PASS.SCORE, a.score=A.SCORE,
    pass.rate=PASS.RATE, a.rate=A.RATE)
{   x <- rnorm (nstud, mean0, sd0)
    q <- quantile (x, 1 - c (pass.rate, a.rate), names=FALSE)
    dq <- diff (q)
    q <- (a.score - pass.score) / dq * q
    y <- pass.score - q [1] + (a.score - pass.score) / dq * x
    sum (x < a.score & y >= a.score)
}

sim.nclasses <- function (nclasses, nstud, nstud.std)
{   nstud <- rep_len (nstud, nclasses)
    njump <- integer (nclasses)
    for (i in 1:nclasses)
        njump [i] <- sim.njump (nstud [i])
    if (missing (nstud.std) )
        njump
    else
        round (nstud.std / nstud * njump)
}

is.under <- function (x, n)
    var (x) < mean (x) * (1 - mean (x) / n)

njump.hom <- sim.nclasses (NCLASSES, NSTUD.CONST)
nstud <- round (runif (NCLASSES, NSTUD.NONCONST.LIMS [1],
    NSTUD.NONCONST.LIMS [2]) )
njump.het <- sim.nclasses (NCLASSES, nstud, NSTUD.CONST)

under.hom <- is.under (njump.hom, NSTUD.CONST)
under.het <- is.under (njump.het, NSTUD.CONST)
main.hom <- paste0 ("const class size (under=", under.hom, ")")
main.het <- paste0 ("diff class sizes (under=", under.het, ")")

p0 <- par (mfrow = c (2, 1) )
hist (njump.hom, main=main.hom)
hist (njump.het, main=main.het)
par (p0)
----code----

best,
B.
Sorry.
I just realized, after posting, that the "n" value in the dispersion calculation isn't correct.
I'll have to revisit the simulation, tomorrow.
Further to yesterday's posts:
I think the "n" value would be the maximum possible number of jumps, not the number of students.
In theory, the minimum possible number is zero, so the distributions are more binomial-like than they look.

Also, there was a mistake in my comments.
The jump is from non-A-grade to A-grade, not non-pass to pass.
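One possible reading of this correction, offered as an assumption rather than as the revised simulation: after moderation, roughly a fraction A.RATE of the class ends up with an A grade, so the number of jumps in a class of nstud students should not exceed about ceiling(A.RATE * nstud). The sketch below repeats the constant-class-size simulation in compact form (the moderated score is written in an algebraically equivalent way) and compares the jump counts against both choices of "n". The names one.class, n.max and disp.ratio are hypothetical.

```r
set.seed(42)

# Same parameter values as in the simulation posted earlier in the thread.
nstud <- 200
pass.score <- 0.5;  a.score <- 0.8
pass.rate <- 0.8;   a.rate <- 0.2

# Jumps (non-A before moderation, A after) for one simulated class.
one.class <- function() {
    x <- rnorm(nstud, 0.65, 0.075)
    q <- quantile(x, 1 - c(pass.rate, a.rate), names = FALSE)
    slope <- (a.score - pass.score) / diff(q)
    y <- pass.score + slope * (x - q[1])   # moderated scores
    sum(x < a.score & y >= a.score)
}
njump <- replicate(2000, one.class())

# Assumed upper bound on the number of jumps per class.
n.max <- ceiling(a.rate * nstud)

# Variance relative to a binomial with the same mean, for a given n.
disp.ratio <- function(x, n) {
    m <- mean(x)
    var(x) / (m * (1 - m / n))
}

print(range(njump))
print(c(ratio.nstud = disp.ratio(njump, nstud),
        ratio.nmax  = disp.ratio(njump, n.max)))
```

The two ratios can differ a great deal, so whether the jump counts still look underdispersed depends on which value of n is taken as the known upper bound.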