random truncation

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

random truncation

Spencer Graves-4
Hello:


       What do you suggest I do about modeling random truncation?


        I have data on a variable Y in strata S[0], S[1], ..., S[n],
where Y is always observed in S[0] but is less often observed in the
other strata.  I assume that the probability of observing Y is a
monotonically increasing function of Y and a monotonically decreasing
function of d[i] = the distance from S[0] to S[i].


       There is a section on "random truncation" in the Wikipedia
article on "Truncated distribution".[1]  It would be nice if I had an R
package that would make it relatively easy to model the truncation as a
function of "d" and / or publication that described someone doing it in
R.  (I also have a couple of other variables that influence the
distribution of Y.)


       Thanks,
       Spencer Graves


[1] https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Abby Spurdle
> It would be nice if I had an R
> package that would make it relatively easy to model the truncation as a
> function of "d"

I suspect that R has everything you need, already.
However, I suspect you may need to reformulate your question to find what
you need.

> I assume that the probability of observing Y is a
> monotonically increasing function of Y

That's like flipping a coin, and saying the probability of observing heads
is a monotonically increasing function of heads.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Bert Gunter-2
Did you search on e.g. "model truncation" at rseek.org? Several packages
came up that appear to deal with truncated data, though I have no clue
whether in the way you specify.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Jul 12, 2019 at 4:26 PM Abby Spurdle <[hidden email]> wrote:

> > It would be nice if I had an R
> > package that would make it relatively easy to model the truncation as a
> > function of "d"
>
> I suspect that R has everything you need, already.
> However, I suspect you may need to reformulate your question to find what
> you need.
>
> > I assume that the probability of observing Y is a
> > monotonically increasing function of Y
>
> That's like flipping a coin, and saying the probability of observing heads
> is a monotonically increasing function of heads.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Rolf Turner
In reply to this post by Spencer Graves-4

On 13/07/19 10:54 AM, Spencer Graves wrote:

> Hello:
>
>
>        What do you suggest I do about modeling random truncation?

Good question!  Probably the best answer is "Give up and go to the pub!" :-)

But seriously, there is a package DTDA on CRAN which purports to analyse
randomly truncated data.

>
>         I have data on a variable Y in strata S[0], S[1], ..., S[n],
> where Y is always observed in S[0] but is less often observed in the
> other strata.  I assume that the probability of observing Y is a
> monotonically increasing function of Y and a monotonically decreasing
> function of d[i] = the distance from S[0] to S[i].
>
>
>        There is a section on "random truncation" in the Wikipedia
> article on "Truncated distribution".[1]  It would be nice if I had an R
> package that would make it relatively easy to model the truncation as a
> function of "d" and / or publication that described someone doing it in
> R.  (I also have a couple of other variables that influence the
> distribution of Y.)
>
>
>        Thanks,
>        Spencer Graves
>
>
> [1] https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation

I'd just like to add that many many years ago, probably back before you
were born, I struggled mightily for a while with random truncation,
modelling the truncation variable to have a density that was uniform on
some interval [a,b].

I took the underlying ("untruncated") variable to have a Gaussian
distribution N(mu,sigma^2).

The distribution of the randomly truncated variable has thus four
parameters: a, b, mu and sigma.  I was able to write down the likelihood
and attempted to maximise it, but this was a nightmare since the partial
derivatives of the log likelihood are undefined for b=x_i where x_1,
..., x_n are the i.i.d. observations from the randomly truncated
distribution.  I fought with this for a long while, could not get
estimates based on simulated data to agree at all well with the true
values, and eventually chucked it all in.

This all happened back before there was R, or even S!!!

Let us hope that the authors of DTDA are far clever than I.  (Not at all
unlikely!)

cheers,

Rolf

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Abby Spurdle
> The distribution of the randomly truncated variable has thus four
> parameters: a, b, mu and sigma.  I was able to write down the likelihood
> and attempted to maximise it

I read the Wikipedia article more carefully.
The formula is relatively simple, and is based on the application of Bayes
Theorem.
If one doesn't want to work out the integral, numerical methods can be used.

However, the problem needs to be defined *precisely* first.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Rolf Turner

On 13/07/19 3:31 PM, Abby Spurdle wrote:

>  > The distribution of the randomly truncated variable has thus four
>  > parameters: a, b, mu and sigma.  I was able to write down the likelihood
>  > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be used.

I am mystified as to what you are saying here.  What integral?
As I said, I could write down the likelihood.  Explicitly.  Maximising
it (numerically *of course*) turned out to be problematic.  Back then.
Modern optimisers might help.  Presumably the DTDA package has a
reasonable procedure.  (I haven't looked at it.)

> However, the problem needs to be defined *precisely* first.

Again I have no idea of what you are driving at here.  The concept of
"random truncation" is quite precisely defined.

cheers,

Rolf

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Spencer Graves-4
In reply to this post by Abby Spurdle


On 2019-07-12 22:31, Abby Spurdle wrote:

> > The distribution of the randomly truncated variable has thus four
> > parameters: a, b, mu and sigma.  I was able to write down the likelihood
> > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be
> used.
>
> However, the problem needs to be defined *precisely* first.
>

Correct:  In my case, I confess I hadn't thought this through completely
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me
to the "truncreg" and "DTDA" packages, neither of which seemed to be
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.


       I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and
f are normally distributed with standard deviations s and t,
respectively, i = 1:n.  I want the total of all the Y's, including those
truncated (and not observed).


       I think the likelihood is the product of the density of Y[i]
given x[i] and given that Y[i] is actually observed.  The latter can be
further written as follows:


             (x[i]'b+e)>(z[i]c+f)
iff
             (x[i]'b-z[i]'c)>(f-e),


Therefore, the probability that Y[i] is observed is


             Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))


where Phi is the cdf of the standard normal.


       And then the likelihood for observation i can be written as follows:


             f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) /
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2).


       We may not be able to estimate "c" in this, because if one of the
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we
still cannot estimate "c".)  However, if "b" is estimable, ignoring the
truncation, then we can estimate "b", "s" and "t" given "c".


       And then the desired total of all the Y's, observed and
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.


       This likelihood is simple enough, it can be easily programmed in
R and maximized over variations in "b", "s" and "t" given "c".  I can
get starting values for "b" and "s" from "lm", ignoring the truncation. 
And I can first fit the model assuming t = s, then test whether it's
different using likelihood ratio.  And I can try to estimate "c", but I
should probably use values I can estimate from other sources until I'm
comfortable with the estimates I get for "b", "s" and "t" given an
assumed value for "c".


       Comments?
       Thanks so much.
       Spencer Graves

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Abby Spurdle
In reply to this post by Rolf Turner
> What integral?

What do you mean "What integral?"...
The integral on the Wikipedia page.
(The same page referenced in the earlier posts).

https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation
https://wikimedia.org/api/rest_v1/media/math/render/svg/93717ffcd3bfa2a60d825bd71b5375ad888ceb97

Seems quite obvious to me...
Did you read the Wikipedia page?

> Maximising
> it (numerically *of course*) turned out to be problematic.  Back then.
> Modern optimisers might help.

Numerical methods have been around for a while...

> > However, the problem needs to be defined *precisely* first.

> Again I have no idea of what you are driving at here.  The concept of
> "random truncation" is quite precisely defined.

I wasn't referring the problem of "random truncation".
I was referring to the original post.
In particular, how Y, S[i] and d[i] are defined.

Maybe, I should be more precise when suggesting others be more precise.
(My bad).

I note that Spencer has posted again.
I will read the new post, later, when I get some more time.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Abby Spurdle
In reply to this post by Spencer Graves-4
Firstly, we don't really need all your working.
Just the problem you want solve.

However, I'm still having difficulty understanding this.

> I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and
> f are normally distributed with standard deviations s and t,
> respectively, i = 1:n.  I want the total of all the Y's, including those
> truncated (and not observed).

> (x[i]'b+e)>(z[i]c+f)
> (x[i]'b-z[i]'c)>(f-e),

(1) What does the tick (or single quote) mean?
(2) What do you mean by "X[i]" and "observing Y[i]"?
(3) Are e and f random variables, or vectors of (observed) errors?
My intuitive understanding of (2), in the given context, would be that X
and Y are vectors of observations.
However, if e and f are random variables, wouldn't that would make Y[i] a
random variable too? In which case I'm not sure what you mean by
"observing".
Conversely, if e and f are vectors of (observed) errors, shouldn't they be
indexed (for consistency)? And in which case, there's no random component
in your expressions.
(4) Is lower case "x[i]" the same as upper case "X[i]", and what's "z[i]"?
(5) Is the mean of e and f, zero?
(6) So, you want to estimate b and c?
And wouldn't that make this a parameter estimation problem?
(In which case, you don't necessary need to model any distributions).

> Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))
> where Phi is the cdf of the standard normal.

This implies that "Y[i]" is a is binary (true or false) random variable.

Do you mean Pr (Y <= y) where Y is a random variable and y is a possible
value (from Y's sample space), that Y can be less than or equal to?
You can replace y with y[i] if you want, but the principle is the same.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Abby Spurdle
Note my last response is probably off-topic.
I just wanted to highlight the need for defining problems in a way that
computers (and R programmers) can understand.

On Sun, Jul 14, 2019 at 1:13 PM Abby Spurdle <[hidden email]> wrote:
>
> Firstly, we don't really need all your working.
> Just the problem you want solve.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: random truncation

Spencer Graves-4
In reply to this post by Abby Spurdle
PLEASE EXCUSE:  This discussion has diverged from R into discussing the
precise assumptions seemingly descriptive of an application that drove
the initial post to this thread.  A reply by Abby Spurdle seemed to  me
to raise questions, whose answers may not be intelligible without
material snipped from Spurdle's reply.  I wish to thank Spurdle from the
reply and apologize to those who feel this is an abuse of this list.  I
trust that those in the latter category will please not bother to read
further.  For anyone still interested in this problem, below please find
my earlier analysis with corrections that attempt to respond to
Spurdle's most recent concerns.  Thanks, Spencer Graves


On 2019-07-12 22:31, Abby Spurdle wrote:

> > The distribution of the randomly truncated variable has thus four
> > parameters: a, b, mu and sigma.  I was able to write down the likelihood
> > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be
> used.
>
> However, the problem needs to be defined *precisely* first.
>

Correct:  In my case, I confess I hadn't thought this through completely
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me
to the "truncreg" and "DTDA" packages, neither of which seemed to be
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.


       I'm observing a random variable Y[i] = (x[i]'b+e[i]) given
Y[i]>(z[i]'c+f[i]) where the tick mark (') denotes transpose of a
vector, and e and f are normally distributed with mean 0 and standard
deviations s and t, respectively, i = 1:n.  Thus, Y[i] follows a
truncated normal distribution with mean x[i]'b and standard deviation s,
with the truncation condition being that Y[i]>(z[i]'c+f[i]).


       I want the total of all the Y's from the untruncated
distribution, i.e., including those truncated (and not observed).


       I think the likelihood is the product of the density of Y[i]
given x[i] and given that Y[i] is actually observed.  By substituting
Y[i] = (x[i]'b+e[i]) into the truncation condition Y[i]>(z[i]'c+f[i]),
we get the following:


             (x[i]'b+e)>(z[i]c+f).


This occurs if and only if:


             (x[i]'b-z[i]'c)>(f-e),


Therefore, the probability that Y[i] is observed (and not truncated) is


             Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))


where Phi is the cdf of the standard normal.


       And then the likelihood for observation i can be written as follows:


             f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) /
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)).


       We may not be able to estimate "c" in this, because if one of the
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we
still cannot estimate "c".)  However, if "b" is estimable, ignoring the
truncation, then we can estimate "b", "s" and "t" given "c".


       And then the desired total of all the Y's, observed and
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.


       This likelihood is simple enough, it can be easily programmed in
R and maximized over variations in "b", "s" and "t" given "c".  I can
get starting values for "b" and "s" from "lm", ignoring the truncation. 
And I can first fit the model assuming t = s, then test whether it's
different using likelihood ratio.  And I can try to estimate "c", but I
should probably use values I can estimate from other sources until I'm
comfortable with the estimates I get for "b", "s" and "t" given an
assumed value for "c".


       Comments?
       Thanks so much.
       Spencer Graves

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.