Apologies for this question being off-topic (OT) for
R, though I expect
this list might be the best place on the net to ask.
In brief, the question is: what classification method
can one use if the features are histograms?
I have a classification problem, and believe that a histogram
of the distribution of some values may be the best
"feature" to use.
To make the mail shorter, here's a simpler example.
Try to classify a person as, e.g., drunk or not, given a histogram
of their driving speed.
In the training phase, we have a table whose rows
contain the driver,
whether they are drunk, and a sample of driving speed.
From this one can build separate histograms of driving
speed for drunk/non-drunk drivers.
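
A minimal sketch of this training step in Python, assuming fixed-width bins. The table contents, the 10-unit bin width, and the names `rows`, `bin_index` and `class_histograms` are all made up for illustration:

```python
from collections import defaultdict

# Hypothetical training table: (driver, is_drunk, speed). Values are invented.
rows = [
    ("a", True, 42.0), ("a", True, 71.0), ("a", True, 55.0),
    ("b", False, 58.0), ("b", False, 61.0), ("c", False, 60.0),
]

# Assumed binning: fixed-width bins of 10 speed units, starting at 0.
BIN_WIDTH = 10.0
N_BINS = 12

def bin_index(speed):
    """Map a speed to a bin index, clamping out-of-range values to the end bins."""
    return min(max(int(speed // BIN_WIDTH), 0), N_BINS - 1)

def class_histograms(table):
    """Return {is_drunk: list of normalised bin frequencies}, pooling all drivers."""
    counts = defaultdict(lambda: [0] * N_BINS)
    for _driver, is_drunk, speed in table:
        counts[is_drunk][bin_index(speed)] += 1
    return {label: [c / sum(h) for c in h] for label, h in counts.items()}

hists = class_histograms(rows)
```

Here `hists[True]` and `hists[False]` are the drunk and non-drunk histograms, each summing to 1.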
(In my actual application, I have several such
histogram features, and they
are visibly different; they are also ranked now by
pdf-distance measures such as KL).
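
For concreteness, the KL ranking could look roughly like the sketch below. The symmetrised form, the `eps` floor for empty bins, and the feature names and values are assumptions for illustration, not my actual data:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two histograms over the same bins.

    eps is an assumed floor that keeps empty bins from producing log(0).
    """
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    sp, sq = sum(p), sum(q)  # renormalise after flooring
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def symmetric_kl(p, q):
    """KL is not symmetric, so sum both directions (one common choice)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical features: name -> (drunk histogram, non-drunk histogram).
features = {
    "braking": ([0.50, 0.50], [0.45, 0.55]),
    "speed": ([0.30, 0.20, 0.25, 0.25], [0.05, 0.60, 0.30, 0.05]),
}

# Most discriminative feature first.
ranked = sorted(features, key=lambda name: symmetric_kl(*features[name]), reverse=True)
```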
Now, how to classify...
Given a single speed, its probability can be evaluated
under the two classes,
but a single speed sample is not going to be reliable
in this problem.
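
The single-speed evaluation could be sketched as follows. The bin edges, the counts, the add-one smoothing and the 50/50 prior are all assumptions made up for the example:

```python
# Hypothetical per-class bin counts over the same 4 speed bins.
counts = {
    "drunk": [30, 20, 25, 25],
    "sober": [5, 60, 30, 5],
}
edges = [40.0, 60.0, 80.0]  # assumed interior bin edges

def bin_index(speed):
    """Index of the bin containing `speed` (number of edges at or below it)."""
    return sum(speed >= e for e in edges)

def likelihood(label, speed, alpha=1.0):
    """P(bin of speed | class), with add-alpha smoothing so empty bins are not zero."""
    h = counts[label]
    return (h[bin_index(speed)] + alpha) / (sum(h) + alpha * len(h))

def posterior_drunk(speed, prior_drunk=0.5):
    """Bayes' rule for one observation; the prior is an assumption."""
    pd = likelihood("drunk", speed) * prior_drunk
    ps = likelihood("sober", speed) * (1.0 - prior_drunk)
    return pd / (pd + ps)
```

With only one observation the posterior simply mirrors the ratio of the two bin heights, which is why a single sample is so weak here.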
Suppose instead that the _distribution_ of speeds is available.
We have a driver, and a distribution of their speeds
over time. A histogram
can be built. What to do with this histogram?...
Is there a standard classifier that can deal with this?
- The test histogram could be compared to each
of the training histograms with the Chi^2 measure
(a sum of squared Gaussian deviations), and a
probability obtained from this?
- Alternatively, consider training histograms with n
bins as points
in n-dimensional space, and use Euclidean closeness in
that space.
This may not generalize to more than one such
histogram feature though....
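
To make the two ideas above concrete, here is a rough Python sketch. It compares a test histogram against one pooled reference histogram per class (a simplification of comparing against every training histogram) and picks the nearer class under either measure. The reference probabilities, test counts and function names are invented. Turning the Chi^2 statistic into a probability would additionally need a chi-square tail function (for example `scipy.stats.chi2.sf`), which is left out here.

```python
import math

# Hypothetical class reference histograms (bin probabilities, all assumed > 0).
class_probs = {
    "drunk": [0.30, 0.20, 0.25, 0.25],
    "sober": [0.05, 0.60, 0.30, 0.05],
}

def pearson_chi2(observed_counts, probs):
    """Pearson statistic: sum over bins of (observed - expected)^2 / expected."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed_counts, probs))

def euclidean(observed_counts, probs):
    """Euclidean distance between normalised histograms, as points in n-dim space."""
    n = sum(observed_counts)
    return math.sqrt(sum((o / n - p) ** 2 for o, p in zip(observed_counts, probs)))

def classify(observed_counts, distance):
    """Return the class whose reference histogram is closest under `distance`."""
    return min(class_probs, key=lambda label: distance(observed_counts, class_probs[label]))

test_counts = [2, 11, 6, 1]  # one driver's speeds, already binned
label_chi2 = classify(test_counts, pearson_chi2)
label_euclid = classify(test_counts, euclidean)
```

With several histogram features, one possible (untested) extension is to sum the per-feature distances before taking the minimum.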
Thanks for any thoughts.
(Also thanks for the replies to my recent question