estimation problem

Dániel Kehl
Dear List-members,

I have a problem where I need to estimate the mean, or the sum, of a
population that for some reason contains a huge number of zeros.
I cannot share the real data, but I constructed a toy example as follows:

N1 <- 100000
N2 <- 3000
x1 <- rep(0,N1)
x2 <- rnorm(N2,300,100)
x <- c(x1,x2)

n <- 1000

x_sample <- sample(x,n,replace=FALSE)

I want to estimate the sum of x based on x_sample, knowing only the total
population size N = N1 + N2, not N1 and N2 separately.
The sample mean has a huge standard deviation, so I am looking for a better
estimator.
I was thinking about trimmed means (or "left-trimmed" means, as my numbers
are all non-negative) or something similar,
but if I calculate a trimmed mean, I do not know N2 to multiply it by.
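
For reference, the baseline being discussed can be sketched as follows. This is a minimal illustration, assuming simple random sampling without replacement; the seed and the names total_hat and se_hat are not from the post and are only there to make the example reproducible and readable.

```r
# Toy population from the post (seed added only for reproducibility)
set.seed(1)
N1 <- 100000
N2 <- 3000
x <- c(rep(0, N1), rnorm(N2, 300, 100))
N <- length(x)
n <- 1000
x_sample <- sample(x, n, replace = FALSE)

# Expansion estimator of the population total: N times the sample mean
total_hat <- N * mean(x_sample)

# Estimated standard error, including the finite population correction
se_hat <- N * sqrt((1 - n / N) * var(x_sample) / n)

c(estimate = total_hat, se = se_hat, true_total = sum(x))
```

Only N and the sample enter the estimate, so it does not require knowing N1 or N2.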

Do you have any ideas, or could you give me some insight?

Thanks a lot,
Daniel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: estimation problem

Jeff Newmiller
Although you have provided R code to illustrate your problem, it is fundamentally a statistics theory question, and it belongs somewhere else, such as stats.stackexchange.com.

When you post there, I recommend that you spend more effort identifying why the zeros are present. If they are indicators of unknown values, the situation is very different from one where the zeros are valid members of the population.
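
The distinction matters numerically. A small sketch, using made-up data (90 zeros and 10 nonzero values, chosen purely for illustration), shows how the estimate changes depending on which interpretation is assumed:

```r
set.seed(2)
y <- c(rep(0, 90), rnorm(10, 300, 100))   # illustrative data, not the poster's

# Interpretation 1: zeros are valid observations and stay in the mean
mean_valid <- mean(y)

# Interpretation 2: zeros stand for unknown values, so recode them as NA
y_missing <- replace(y, y == 0, NA)
mean_missing <- mean(y_missing, na.rm = TRUE)

c(zeros_valid = mean_valid, zeros_missing = mean_missing)
```

With 90% zeros the two means differ by a factor of ten, so the choice has to be settled before any estimator is compared.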
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

Re: estimation problem

Dániel Kehl
Dear Jeff,

Thank you for the response.
Of course I know this is a theory question; still, I hope to get some
comments on it
(if somebody has already dealt with similar problems, they might suggest a
package, and that would not take longer than saying this is a theoretical
question).
The values are counts, so a 0 means that the case does not have this item.
These are real zeros, and they are valid members of the population.

thanks,
daniel

On 2012.05.03 at 16:42, Jeff Newmiller wrote:

> Although you have provided R code to illustrate your problem, it is
> fundamentally a statistics theory question [...]


Re: estimation problem

Petr Savicky
On Thu, May 03, 2012 at 03:08:00PM +0200, Kehl Dániel wrote:

> [...]
> I want to estimate the sum of x based on x_sample (not knowing N1 and N2
> but their sum (N) only).
> The sample mean has a huge standard deviation; I am looking for a better
> estimator.

Hi.

I do not know the exact answer, but let me offer the following observation.
If the question is redefined as estimating the mean of the nonzero numbers,
then an estimate is mean(x_sample[x_sample != 0]). Its standard deviation in
your situation may be estimated as

  res <- rep(NA_real_, times = 1000)
  for (i in seq_along(res)) {
      x_sample <- sample(x, n, replace = FALSE)
      res[i] <- mean(x_sample[x_sample != 0])
  }
  sd(res)

  [1] 18.72677 # this varies with the seed a bit

The observation is that this cannot be improved much, since the estimate
is based on a very small sample. The average size of the sample of nonzero
values is N2/(N1+N2)*n = 29.1. So, the standard deviation should be
something close to 100/sqrt(29.1) = 18.5376.
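
A related point can be checked directly: estimating N2 from the sample and multiplying by the mean of the nonzero values gives exactly the plain expansion estimator, so that decomposition alone cannot reduce the variance. The sketch below is illustrative only; the seed and the names N2_hat, total_two_part, and share_from_count are assumptions, and the variance formula uses the toy model's values (mean 300, sd 100) while ignoring the finite population correction.

```r
set.seed(3)
N1 <- 100000
N2 <- 3000
x <- c(rep(0, N1), rnorm(N2, 300, 100))
N <- length(x)
n <- 1000
x_sample <- sample(x, n, replace = FALSE)

nonzero <- x_sample[x_sample != 0]
k <- length(nonzero)                      # nonzero values in the sample
N2_hat <- N * k / n                       # estimated number of nonzero units
total_two_part <- N2_hat * mean(nonzero)  # estimated N2 times nonzero mean
total_plain <- N * mean(x_sample)         # plain expansion estimator
all.equal(total_two_part, total_plain)    # the two estimators coincide

# Approximate variance of the sample mean, split into two parts:
# spread of the nonzero values, and uncertainty in how many are sampled
p <- N2 / N
mu <- 300
sigma <- 100
var_values <- p * sigma^2 / n
var_count <- p * (1 - p) * mu^2 / n
sd_total <- N * sqrt(var_values + var_count)
share_from_count <- var_count / (var_values + var_count)
c(sd_total = sd_total, share_from_count = share_from_count)
```

Under these assumptions roughly 90% of the variance comes from not knowing how many nonzero values the population holds, which is consistent with the observation that the small number of sampled nonzero values is the limiting factor.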

Petr Savicky.


Re: estimation problem

Dániel Kehl
Dear Petr,

Thank you for your input.
I experimented with (probably somewhat biased) truncated means,
as in the following code.
How I arrived at 225 as the truncation limit is a good question. :)

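reps <- 1000                   # number of simulated samples
N1 <- 100000                   # number of zeros in the population
N2 <- 30000                    # number of nonzero values
N <- N1 + N2
x1 <- rep(0, N1)
x2 <- rnorm(N2, 300, 100)
x <- c(x1, x2)

n <- 1000

est_mean <- numeric(reps)      # plain expansion estimator
est_trunc <- numeric(reps)     # truncated estimator
for (i in seq_len(reps)) {
   # sort in decreasing order so the largest values come first
   x_sample <- sort(sample(x, n, replace = FALSE), decreasing = TRUE)
   x_trunc <- x_sample[1:225]  # keep only the 225 largest values
   est_mean[i] <- mean(x_sample) * N
   est_trunc[i] <- sum(x_trunc) / n * N
   }

sum(x2)                        # true population total
mean(est_mean)
mean(est_trunc)
sd(est_mean)
sd(est_trunc)
sd(est_trunc) / sd(est_mean)   # ratio of the standard deviations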
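
Since the truncated estimator may trade variance for bias, comparing standard deviations alone can be misleading. One possible way to put the two estimators on the same footing is the root mean squared error, sketched below. The seed and the names cutoff, est, bias, and rmse are assumptions made for this illustration.

```r
set.seed(4)
N1 <- 100000
N2 <- 30000
x <- c(rep(0, N1), rnorm(N2, 300, 100))
N <- length(x)
n <- 1000
reps <- 1000
cutoff <- 225                  # truncation limit used in the post

# Each replicate returns both estimates; the result is a 2 x reps matrix
est <- replicate(reps, {
  s <- sort(sample(x, n, replace = FALSE), decreasing = TRUE)
  c(plain = mean(s) * N, truncated = sum(s[seq_len(cutoff)]) / n * N)
})

truth <- sum(x)
bias <- rowMeans(est) - truth
rmse <- sqrt(rowMeans((est - truth)^2))
rbind(bias = bias, sd = apply(est, 1, sd), rmse = rmse)
```

Whenever a sample contains more than 225 nonzero values, the truncation discards the smallest of them, so a downward bias is to be expected; whether the lower variance compensates for it depends on a cutoff that in practice would have to be chosen without knowing N2.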


Best,
daniel

On 2012.05.03 at 17:45, Petr Savicky wrote:

> The observation is that this cannot be improved much, since the estimate
> is based on a very small sample. [...]


Re: estimation problem

Petr Savicky
On Fri, May 04, 2012 at 07:43:32PM +0200, Kehl Dániel wrote:

> I tried to experiment with (probably somewhat biased) truncated means
> like in the following code.
> [...]

Dear Daniel.

Thank you for your reply.

In the original question, you used the parameters

  N1 <- 100000
  N2 <- 3000

and now the parameters

  N1 <- 100000
  N2 <- 30000

My remark was that with the original parameters, the sample contains only
29.1 nonzero elements on average. Now it contains 230.8 nonzero elements
on average, which is considerably better.
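
The two averages follow from n * N2 / (N1 + N2). A short sketch reproduces them and, as an addition not made in the thread, also gives the hypergeometric standard deviation of the nonzero count under sampling without replacement:

```r
n <- 1000
N1 <- 100000
N2 <- c(original = 3000, revised = 30000)
N <- N1 + N2

# Expected number of nonzero values in a sample of size n
expected_nonzero <- n * N2 / N

# Hypergeometric sd of that count (sampling without replacement)
sd_nonzero <- sqrt(n * (N2 / N) * (N1 / N) * (N - n) / (N - 1))

round(rbind(expected_nonzero, sd_nonzero), 1)
# expected_nonzero is 29.1 for the original and 230.8 for the revised setup
```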

Discussion of the use of the truncated mean is probably a question for
other members of the list. I do not consider myself an expert on this.

Best, Petr.


Re: estimation problem

David Winsemius

On May 4, 2012, at 4:22 PM, Petr Savicky wrote:

> [...]
> Discussion of the use of the truncated mean is probably a question to
> other members of the list. I do not feel to be an expert on this.

My experience is that Petr is better than I am at much of R, but so far
in this thread I have not seen any mention of methods that are designed
for data with large numbers of zeros. There is a very informative review
of R techniques and packages for such data by Achim Zeileis and others.
The same material was published in the Journal of Statistical Software
and as a vignette in one of the contributed packages:

www.jstatsoft.org/v27/i08/paper
cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf

I don't have this information memorized, but I generally find a Google
search for "count r zeileis" to be highly effective. I've just noticed
that the second author, Kleiber, has also put up useful material on that
topic for web searchers to use.
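
As a rough idea of what those methods look like, the sketch below fits an intercept-only zero-inflated Poisson model with zeroinfl() from the pscl package. The data are simulated counts invented for this illustration (90% structural zeros, Poisson mean 5), not the poster's data, and the fit is skipped if pscl is not installed. Note that this is model-based inference; whether it helps with estimating a finite-population total is a separate question.

```r
set.seed(5)
n <- 1000
is_zero <- rbinom(n, 1, 0.9)                  # assumed 90% structural zeros
y <- ifelse(is_zero == 1, 0L, rpois(n, 5))    # illustrative counts
d <- data.frame(y = y)

if (requireNamespace("pscl", quietly = TRUE)) {
  # Count part and zero-inflation part both contain only an intercept
  fit <- pscl::zeroinfl(y ~ 1 | 1, data = d, dist = "poisson")
  cf <- coef(fit)
  pi_hat <- plogis(cf[["zero_(Intercept)"]])      # estimated zero probability
  lambda_hat <- exp(cf[["count_(Intercept)"]])    # estimated Poisson mean
  print(c(pi_hat = pi_hat,
          lambda_hat = lambda_hat,
          model_mean = (1 - pi_hat) * lambda_hat,
          sample_mean = mean(y)))
} else {
  message("Package 'pscl' is not installed; skipping the model fit.")
}
```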

David Winsemius, MD
West Hartford, CT
