Fitting Distribution Mixs with Incomplete Data

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Fitting Distribution Mixs with Incomplete Data

Lorenzo Isella
Dear All,
Suppose you are given some data to fit to a mixture of Gaussian
distributions (a weighted sum of three distributions).
You can do that using e.g. the mix package
(http://www.math.mcmaster.ca/peter/mix/mix.html)
which gave me very good results on some data I generated artificially.
See e.g. the code:

# R code used to fit a sum of three normal distributions
rm(list=ls())
library(mixdist)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70

for (i in 1:three)
{
 mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
plot(density(mysample,kern="gaussian"),lwd=2,col=300)
## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)
#x[ ,1]<-mysample[]
#x[ ,2]<-1
y<-as.data.frame(mysample)
mixgroup(y)
z<-mixgroup(y,breaks=100)
plot(z)
#ini<-read.table("gaupar.TXT")
ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
mymix<-mix(z,ini,"norm")
summary(mymix)
plot(mymix)
print("So far so good")


However, you can be in a situation in which you cannot sample the
distribution that easily (e.g. you do not have stock quotations
below/above a certain price) and then the same tool does not work that
well. Consider for instance the same case as above but removing all
the data below 15 and above 65:

# R code used to fit a sum of three normal distributions
rm(list=ls())
library(mixdist)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70

for (i in 1:three)
{
 mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
##  now I am going to remove some data and see if I can still
meaningfully optimize the
## distribution
N3=3*NN
mycount=0
for (i in 1:N3)
{if (mysample[i]>22 & mysample[i]<67)
{
mycount=mycount+1
if (mycount ==1 )
{
mysample2<-cbind(mysample[i])
}
else
{
mysample2<-cbind(mysample2,mysample[i])
}

}

}

plot(density(mysample,kern="gaussian"),lwd=2,col=300)
## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)
#x[ ,1]<-mysample[]
#x[ ,2]<-1
# y<-as.data.frame(mysample)
mysample3<-mysample2[1:15000]
y<-as.data.frame(mysample3)
mixgroup(y)
z<-mixgroup(y,breaks=100)
plot(z)
#ini<-read.table("gaupar.TXT")
ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
mymix<-mix(z,ini,"norm")
summary(mymix)
plot(mymix)
print("So far so good")


Then the mixdist completely misses the means of the 2 external modes.
I am sure that, apart from the combination of three Gaussians, this
situation cannot be "new" at all.
I am thinking about some iteration technique (e.g. you fit the data as
well as you can with mixdist to start with, then according to your
fitted parameters you generate the "missing"  results, then you re-fit
the whole new set of data (real ones and those freshly generated) and
so on hoping to get some convergent results).
But it seems a bit cumbersome (furthermore, how many of them should I
generate?) plus I am sure that solving this problem from scratch is
re-inventing the wheel.
Sorry for the long email.
Kind Regards

Lorenzo Isella

_______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
Reply | Threaded
Open this post in threaded view
|

Re: [R-sig-finance] Fitting Distribution Mixs with Incomplete Data

p.kostov
Why don't you try to use the mmlcr package, where you can specify cnorm
(censored normal) with the appropriate min and max (known) censoring points.

Hope this helps
Philip

On May 18 2006, Lorenzo Isella wrote:

> Dear All,
> Suppose you are given some data to fit to a mixture of Gaussian
> distributions (a weighted sum of three distributions).
> You can do that using e.g. the mix package
> (http://www.math.mcmaster.ca/peter/mix/mix.html)
> which gave me very good results on some data I generated artificially.
> See e.g. the code:
>
> # R code used to fit a sum of three normal distributions
> rm(list=ls())
> library(mixdist)
> set.seed(123)
> NN<-10000
> three<-3 #number of gaussian sequences I want to generate
> mysample<-matrix(nrow=NN,ncol=three)
> sdvec<-seq(1:three)
> sdvec[1]<-4
> sdvec[2]<-3
> sdvec[3]<-3.5
> meanvec<-seq(1:three)
> meanvec[1]<-20
> meanvec[2]<-50
> meanvec[3]<-70
>
> for (i in 1:three)
> {
>  mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
> }
> dim(mysample)<-c(3*NN,1)
> plot(density(mysample,kern="gaussian"),lwd=2,col=300)
> ## The commented lines are parts of the 1st version of this script
>
> #x<-matrix(ncol=2,nrow=three*NN)
> #x[ ,1]<-mysample[]
> #x[ ,2]<-1
> y<-as.data.frame(mysample)
> mixgroup(y)
> z<-mixgroup(y,breaks=100)
> plot(z)
> #ini<-read.table("gaupar.TXT")
> ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
> mymix<-mix(z,ini,"norm")
> summary(mymix)
> plot(mymix)
> print("So far so good")
>
>
> However, you can be in a situation in which you cannot sample the
> distribution that easily (e.g. you do not have stock quotations
> below/above a certain price) and then the same tool does not work that
> well. Consider for instance the same case as above but removing all
> the data below 15 and above 65:
>
> # R code used to fit a sum of three normal distributions
> rm(list=ls())
> library(mixdist)
> set.seed(123)
> NN<-10000
> three<-3 #number of gaussian sequences I want to generate
> mysample<-matrix(nrow=NN,ncol=three)
> sdvec<-seq(1:three)
> sdvec[1]<-4
> sdvec[2]<-3
> sdvec[3]<-3.5
> meanvec<-seq(1:three)
> meanvec[1]<-20
> meanvec[2]<-50
> meanvec[3]<-70
>
> for (i in 1:three)
> {
>  mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
> }
> dim(mysample)<-c(3*NN,1)
> ##  now I am going to remove some data and see if I can still
> meaningfully optimize the
> ## distribution
> N3=3*NN
> mycount=0
> for (i in 1:N3)
> {if (mysample[i]>22 & mysample[i]<67)
> {
> mycount=mycount+1
> if (mycount ==1 )
> {
> mysample2<-cbind(mysample[i])
> }
> else
> {
> mysample2<-cbind(mysample2,mysample[i])
> }
>
> }
>
> }
>
> plot(density(mysample,kern="gaussian"),lwd=2,col=300)
> ## The commented lines are parts of the 1st version of this script
>
> #x<-matrix(ncol=2,nrow=three*NN)
> #x[ ,1]<-mysample[]
> #x[ ,2]<-1
> # y<-as.data.frame(mysample)
> mysample3<-mysample2[1:15000]
> y<-as.data.frame(mysample3)
> mixgroup(y)
> z<-mixgroup(y,breaks=100)
> plot(z)
> #ini<-read.table("gaupar.TXT")
> ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
> mymix<-mix(z,ini,"norm")
> summary(mymix)
> plot(mymix)
> print("So far so good")
>
>
> Then the mixdist completely misses the means of the 2 external modes.
> I am sure that, apart from the combination of three Gaussians, this
> situation cannot be "new" at all.
> I am thinking about some iteration technique (e.g. you fit the data as
> well as you can with mixdist to start with, then according to your
> fitted parameters you generate the "missing"  results, then you re-fit
> the whole new set of data (real ones and those freshly generated) and
> so on hoping to get some convergent results).
> But it seems a bit cumbersome (furthermore, how many of them should I
> generate?) plus I am sure that solving this problem from scratch is
> re-inventing the wheel.
> Sorry for the long email.
> Kind Regards
>
> Lorenzo Isella
>
> _______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-finance
>

_______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance