Varying results of sammon(), for the same data set

Varying results of sammon(), for the same data set

Ole Edsberg
Hello,

I have a data set on which I run the sammon algorithm as follows:

library(MASS)
data = read.table('problemforr.dat')
y = cmdscale(data, add=TRUE)
s = sammon(data, y$points)

(In case it should be relevant, I make the data available at
http://idi.ntnu.no/~edsberg/problemforr.dat)

With R 2.2.1 on Debian Sid I always get one of two solutions (stress
1.74288 after 10 iterations or stress 1.33629 after 9 iterations). I
always get the same result within the same R session, even if I read
the data again. With R 2.2.0 on SunOS 5.9 I always get the same result
(stress 0.13186 after 74 iterations).

I understand that the sammon algorithm is very sensitive to even tiny
variations in the starting point, but the observed behaviour seems
strange to me. Difference between machines could perhaps be explained
by floating point portability issues, but not difference on the same
machine, and not the fact that I get the same result within the same R
session.

I read in the documentation
(http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/sammon.html)
that "Further, since the configuration is only determined up to
rotations and reflections (by convention the centroid is at the
origin), the result can vary considerably from machine to machine."
This doesn't make sense to me. If the data and the algorithm are the
same, the result should be the same. What differences between machines
do they refer to here? Floating point issues?

I must admit that I am a beginner, both in R and in statistics. I'm
very curious about the cause of this strangeness. Does anybody have an
explanation?

Best Regards,

Ole Edsberg

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: Varying results of sammon(), for the same data set

Brian Ripley
On Mon, 30 Jan 2006, Ole Edsberg wrote:

> Hello,
>
> I have a data set on which I run the sammon algorithm as follows:
>
> library(MASS)
> data = read.table('problemforr.dat')

Hmm.  This is a data frame of 387 rows and 387 columns and Euclidean
distance is used.  Squeezing 387 dims (and PCA shows these points as well
spread in almost all those dimensions) to 2 is not a well-posed problem,
and you should welcome the plurality of answers found.
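
A minimal sketch of how such a plurality of answers can be observed: run
sammon() from several random starting configurations and compare the
final stress values. The built-in eurodist data set stands in for the
poster's data here, and the random starts, their scale (sd = 1000) and
the seed are assumptions made purely for illustration.

```r
library(MASS)

# Stand-in data (assumption): eurodist ships with R and is a "dist"
# object of road distances between 21 European cities.
d <- eurodist
n <- attr(d, "Size")

set.seed(1)  # only so that the random starts themselves are repeatable

stress <- sapply(1:5, function(i) {
  # Hypothetical random starting configuration in 2 dimensions,
  # scaled to roughly the magnitude of the distances.
  start <- matrix(rnorm(n * 2, sd = 1000), nrow = n, ncol = 2)
  sammon(d, y = start, trace = FALSE)$stress
})

# One final stress value per start; unequal values point to runs that
# ended in different local minima.
print(round(stress, 5))
```

If the printed values differ, the starts converged to different local
minima, which is the behaviour to expect on an ill-posed problem.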

> y = cmdscale(data, add=TRUE)
> s = sammon(data, y$points)
>
> (In case it should be relevant, I make the data available at
> http://idi.ntnu.no/~edsberg/problemforr.dat)
>
> With R 2.2.1 on Debian Sid I always get one of two solutions (stress
> 1.74288 after 10 iterations or stress 1.33629 after 9 iterations). I
> always get the same result within the same R session, even if I read
> the data again. With R 2.2.0 on SunOS 5.9 I always get the same result
> (stress 0.13186 after 74 iterations).

Note that your subject line attributes this to sammon, but it could also
be due to cmdscale.
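
One possible way to tell the two steps apart, sketched under
assumptions: compute the cmdscale() configuration once, store it exactly,
and feed the stored copy to sammon() in every session or on every
machine. If the results still differ, the variation comes from sammon()
and not from cmdscale(). The eurodist data set and the temporary file
are stand-ins, not part of the original report.

```r
library(MASS)

d <- eurodist                      # stand-in for the poster's distances

# Step 1: classical scaling, done once.
init <- cmdscale(d, k = 2)

# Store the starting configuration; saveRDS() keeps the doubles exactly.
f <- tempfile(fileext = ".rds")    # hypothetical location
saveRDS(init, f)

# Step 2 (possibly in another session or on another machine):
# reload the stored configuration and start sammon() from it.
init2 <- readRDS(f)
fit <- sammon(d, y = init2, trace = FALSE)

# TRUE means both runs start from bit-identical coordinates.
print(identical(init, init2))
print(fit$stress)
```

Comparing fit$stress across sessions that all start from the same stored
file isolates the sammon() step.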

On AMD64 Linux I get

> s = sammon(data, y$points)
Initial stress        : 2.21024
stress after  10 iters: 1.22268, magic = 0.092
stress after  20 iters: 0.48801, magic = 0.009
stress after  30 iters: 0.35007, magic = 0.020
stress after  40 iters: 0.24377, magic = 0.045
stress after  50 iters: 0.17343, magic = 0.021
stress after  60 iters: 0.14944, magic = 0.048
stress after  70 iters: 0.12810, magic = 0.022
stress after  80 iters: 0.12423, magic = 0.010
stress after  90 iters: 0.12191, magic = 0.118
stress after 100 iters: 0.11986, magic = 0.500

That large reduction in `magic' indicates the algorithm is having
problems.  Without optimization (used for valgrind) I got the solution you
quoted for Solaris 9.

However, on all four systems I tried (AMD64 FC3 Linux, i686 FC3 Linux,
Solaris and Windows), the results differed between systems but were
repeatable on each system.  I even ran under valgrind to be sure that no
uninitialized areas were used (on FC3).

> I understand that the sammon algorithm is very sensitive to even tiny
> variations in the starting point, but the observed behaviour seems
> strange to me. Difference between machines could perhaps be explained
> by floating point portability issues, but not difference on the same
> machine, and not the fact that I get the same result within the same R
> session.

No, but then that is not reproducible, and has never been reported before.
If for example different BLAS libraries get selected on different runs
this would explain it.  Or it could be a Debian-Sid-specific bug in a
shared library or compiler.

> I read in the documentation
> (http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/sammon.html)
> that "Further, since the configuration is only determined up to
> rotations and reflections (by convention the centroid is at the origin),
> the result can vary considerably from machine to machine." This doesn't
> make sense to me.

Note that is addressing a separate issue.  For a given minimized stress
there are multiple solutions which can be transformed into each other, and
the help file is warning you of that.  There are also (in general)
multiple local minima.
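
The first point can be illustrated with a small sketch: rotating or
reflecting a configuration changes every coordinate but leaves all
interpoint distances, and hence the stress, unchanged. The random
points, the rotation angle and the variable names below are arbitrary
choices for illustration only.

```r
set.seed(42)

# Arbitrary 2-D configuration of 10 points, centred at the origin.
X <- matrix(rnorm(20), ncol = 2)
X <- sweep(X, 2, colMeans(X))

# Rotation by 60 degrees and reflection in the first axis.
theta <- pi / 3
rot  <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
flip <- diag(c(1, -1))

X_rot  <- X %*% rot
X_flip <- X %*% flip

# The coordinates differ ...
print(isTRUE(all.equal(X, X_rot)))

# ... but the interpoint distances agree up to rounding error.
print(all.equal(as.vector(dist(X)), as.vector(dist(X_rot))))
print(all.equal(as.vector(dist(X)), as.vector(dist(X_flip))))
```

Since the stress depends on the configuration only through these
distances, all three configurations are equally good solutions.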

> If the data and the algorithm are the same, the result should be the
> same.

Depending what you mean by 'algorithm', this is what the subject of
numerical analysis is about.  I take it you are familiar with J. H.
Wilkinson's classic work on the Algebraic Eigenvalue Problem?

> What differences between machines do they refer to here? Floating
> point issues?

Any difference in the CPU/FPU or compiler or run-time environment
(including all the dynamically linked support libraries).  Just changing
the optimization level of the compiler changes the assembler-level
algorithm used, and can often affect the answer of e.g. an eigenvalue
calculation.  Rounding errors depend on whether (and when)
extended-precision registers are used and the exact order of the
calculations, since computer arithmetic is neither associative nor
distributive.
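
A small sketch of why the order of the calculations matters, assuming
ordinary IEEE 754 double-precision arithmetic (platforms that keep
intermediate results in extended-precision registers may behave
differently, which is exactly the point): the same three numbers summed
in two different orders give two different doubles.

```r
# Same three numbers, two different groupings.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

# Show all significant digits of both results.
print(c(a, b), digits = 17)

# Exact comparison: the two sums are not the same double.
print(a == b)

# Comparison with a tolerance: equal up to rounding error.
print(isTRUE(all.equal(a, b)))
```

A difference in the last bit like this one is harmless in itself, but an
iterative optimizer started near a boundary between two basins of
attraction can amplify it into a different local minimum.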

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: Varying results of sammon(), for the same data set

Ole Edsberg
Thanks a lot for the explanation and the advice! It looks like I
should find something else to do with the data. I'm afraid that my
knowledge of numerical analysis is very limited, and I haven't read
Wilkinson's book.

Best Regards,

Ole Edsberg
