Dear All,

Suppose you are given some data to fit to a mixture of Gaussian

distributions (a weighted sum of three distributions).

You can do that using e.g. the mix package

(

http://www.math.mcmaster.ca/peter/mix/mix.html)

which gave me very good results on some data I generated artificially.

See e.g. the code:

# R code used to fit a sum of three normal distributions

rm(list=ls())

library(mixdist)

set.seed(123)

NN<-10000

three<-3 #number of gaussian sequences I want to generate

mysample<-matrix(nrow=NN,ncol=three)

sdvec<-seq(1:three)

sdvec[1]<-4

sdvec[2]<-3

sdvec[3]<-3.5

meanvec<-seq(1:three)

meanvec[1]<-20

meanvec[2]<-50

meanvec[3]<-70

for (i in 1:three)

{

mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])

}

dim(mysample)<-c(3*NN,1)

plot(density(mysample,kern="gaussian"),lwd=2,col=300)

## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)

#x[ ,1]<-mysample[]

#x[ ,2]<-1

y<-as.data.frame(mysample)

mixgroup(y)

z<-mixgroup(y,breaks=100)

plot(z)

#ini<-read.table("gaupar.TXT")

ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))

mymix<-mix(z,ini,"norm")

summary(mymix)

plot(mymix)

print("So far so good")

However, you can be in a situation in which you cannot sample the

distribution that easily (e.g. you do not have stock quotations

below/above a certain price) and then the same tool does not work that

well. Consider for instance the same case as above but removing all

the data below 15 and above 65:

# R code used to fit a sum of three normal distributions

rm(list=ls())

library(mixdist)

set.seed(123)

NN<-10000

three<-3 #number of gaussian sequences I want to generate

mysample<-matrix(nrow=NN,ncol=three)

sdvec<-seq(1:three)

sdvec[1]<-4

sdvec[2]<-3

sdvec[3]<-3.5

meanvec<-seq(1:three)

meanvec[1]<-20

meanvec[2]<-50

meanvec[3]<-70

for (i in 1:three)

{

mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])

}

dim(mysample)<-c(3*NN,1)

## now I am going to remove some data and see if I can still

meaningfully optimize the

## distribution

N3=3*NN

mycount=0

for (i in 1:N3)

{if (mysample[i]>22 & mysample[i]<67)

{

mycount=mycount+1

if (mycount ==1 )

{

mysample2<-cbind(mysample[i])

}

else

{

mysample2<-cbind(mysample2,mysample[i])

}

}

}

plot(density(mysample,kern="gaussian"),lwd=2,col=300)

## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)

#x[ ,1]<-mysample[]

#x[ ,2]<-1

# y<-as.data.frame(mysample)

mysample3<-mysample2[1:15000]

y<-as.data.frame(mysample3)

mixgroup(y)

z<-mixgroup(y,breaks=100)

plot(z)

#ini<-read.table("gaupar.TXT")

ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))

mymix<-mix(z,ini,"norm")

summary(mymix)

plot(mymix)

print("So far so good")

Then the mixdist completely misses the means of the 2 external modes.

I am sure that, apart from the combination of three Gaussians, this

situation cannot be "new" at all.

I am thinking about some iteration technique (e.g. you fit the data as

well as you can with mixdist to start with, then according to your

fitted parameters you generate the "missing" results, then you re-fit

the whole new set of data (real ones and those freshly generated) and

so on hoping to get some convergent results).

But it seems a bit cumbersome (furthermore, how many of them should I

generate?) plus I am sure that solving this problem from scratch is

re-inventing the wheel.

Sorry for the long email.

Kind Regards

Lorenzo Isella

_______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance