[Insight-developers] Draft for the ITK statistical modelling module

Fri, 12 Oct 2001 09:37:03 -0700 (PDT)

Thank you for your comments.

--- "Miller, James V (CRD)" <millerjv@crd.ge.com> wrote:
> I looked through the writeup. Here are my comments
> in no
> particular order:
> 
> *	I think you also need CDFs.  CDFs are used more in
> decision criteria than PDFs.

I agree. Then the question is whether the CDF and PDF
of a distribution should be member functions of a
distributional model object or separate functions
(function objects). I think that if users frequently
work with both functions for a task, then creating a
distributional model object with both functions in it
might be nice. Otherwise, I think they should be
separate. Each of them has its strength in answering
different kinds of questions. For example, a CDF is
ideal for answering what the probability is of a
feature vector having a value less than some value. A
PDF is a good tool for visually examining the spread
of a distribution, if it is used with plotting tools.
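
To make the two options concrete, here is a minimal C++ sketch for a
univariate Gaussian. All names (GaussianPDF, GaussianCDF,
GaussianModel, EvaluatePDF, EvaluateCDF) are placeholders invented
for this illustration and are not existing ITK classes. The first two
types are separate function objects; the third is one possible shape
for a distributional model object that bundles both.

```cpp
#include <cassert>
#include <cmath>

// Option 1: separate function objects (hypothetical names).
struct GaussianPDF
{
  double mean;
  double sigma;

  // Density of N(mean, sigma^2) at x.
  double operator()(double x) const
  {
    const double pi = 3.14159265358979323846;
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
  }
};

struct GaussianCDF
{
  double mean;
  double sigma;

  // P(X <= x) for X ~ N(mean, sigma^2), via the complementary error function.
  double operator()(double x) const
  {
    return 0.5 * std::erfc(-(x - mean) / (sigma * std::sqrt(2.0)));
  }
};

// Option 2: one distributional model object exposing both functions.
class GaussianModel
{
public:
  GaussianModel(double mean, double sigma)
    : m_PDF{ mean, sigma }, m_CDF{ mean, sigma }
  {
    assert(sigma > 0.0);
  }

  double EvaluatePDF(double x) const { return m_PDF(x); }
  double EvaluateCDF(double x) const { return m_CDF(x); }

private:
  GaussianPDF m_PDF;
  GaussianCDF m_CDF;
};
```

With this layout the model object is only a thin wrapper, so a user
who needs just the CDF for a decision criterion could use GaussianCDF
directly.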

> *	I don't like the FeatureDomain name (or
> FeatureSpace or Features). I have never heard the
> FeatureDomain term used before in statistics. I tend
> to use "feature" as a term for something that is
> extracted or estimated from data. I think you are
> mixing the concepts of a "measurement", a "sample",
> a "population", a "sample from a population", a
> "subsample", and an "estimated pmf (histogram)".  I
> suggest you identify how these concepts are each
> supposed to be used, and which concepts need to be
> represented as objects.

I totally agree with you that "FeatureDomain" is the
most unfamiliar, exotic term.

FeatureDomain classes are intended to be generic data
storage classes. They can store a whole population
(for example, all pixels in an image), a sample (if
you mean a sample generated by simulation), a sample
from a population (the typical definition of a
sample), or the values of random variables. In this
context, I think a measurement is a feature element.
Conceptually, each class is a set of elements, where
each element pairs an "InstanceIdentifier" (itk::Index
or unsigned long) with a feature vector (a set of
measurement values). Hmm, then do we have to call it
something like DataSet? Please help me.

The "InstanceIdentifier" of each element (instance) in
a "FeatureDomain" is for later use with other data
storage classes, such as Label objects, which store
class labels.
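
As a rough illustration of that structure, the sketch below stores
elements keyed by an identifier and keeps labels in a separate
container that shares the same keys. The names DataSet, LabelSet,
AddElement, GetFeatureVector, and ExtractDimension are hypothetical,
and unsigned long stands in for the identifier type; none of this is
taken from the actual module.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Assumption for this sketch: identifiers are plain unsigned longs.
using InstanceIdentifier = unsigned long;
using FeatureVector = std::vector<double>;

// Hypothetical generic storage: a set of (identifier, feature vector) elements.
class DataSet
{
public:
  explicit DataSet(std::size_t dimension) : m_Dimension(dimension) {}

  // Stores (or overwrites) the feature vector of one instance.
  void AddElement(InstanceIdentifier id, const FeatureVector & features)
  {
    assert(features.size() == m_Dimension);
    m_Elements[id] = features;
  }

  std::size_t Size() const { return m_Elements.size(); }

  // Throws std::out_of_range if the identifier is unknown.
  const FeatureVector & GetFeatureVector(InstanceIdentifier id) const
  {
    return m_Elements.at(id);
  }

  // Collects the values of one feature dimension, in identifier order.
  std::vector<double> ExtractDimension(std::size_t dimension) const
  {
    assert(dimension < m_Dimension);
    std::vector<double> values;
    values.reserve(m_Elements.size());
    for (const auto & element : m_Elements)
    {
      values.push_back(element.second[dimension]);
    }
    return values;
  }

private:
  std::size_t m_Dimension;
  std::map<InstanceIdentifier, FeatureVector> m_Elements;
};

// Class labels live in a separate container keyed by the same identifier.
using LabelSet = std::map<InstanceIdentifier, int>;
```

Because both containers use the same key, a label can be looked up
for any stored instance without the storage class knowing about
labels at all.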

Histogram classes are estimated pmfs. They can also be
thought of as special cases of statistical data
storage objects with a data reduction mechanism.
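
A minimal sketch of that idea, assuming a one-dimensional histogram
with equal-width bins: each measurement is reduced to a bin count,
and the relative frequency of a bin serves as the estimated pmf.
Histogram1D, AddMeasurement, and GetProbability are hypothetical
names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; not an ITK API.
class Histogram1D
{
public:
  // Equal-width bins covering the half-open range [min, max).
  Histogram1D(double min, double max, std::size_t numberOfBins)
    : m_Min(min), m_Max(max), m_Counts(numberOfBins, 0UL), m_Total(0UL)
  {
    assert(min < max);
    assert(numberOfBins > 0);
  }

  // Data reduction: only the bin count is kept, not the measurement itself.
  void AddMeasurement(double value)
  {
    if (!(value >= m_Min && value < m_Max))
    {
      return; // out-of-range (or NaN) measurements are ignored in this sketch
    }
    std::size_t bin = static_cast<std::size_t>(
      (value - m_Min) / (m_Max - m_Min) * static_cast<double>(m_Counts.size()));
    if (bin >= m_Counts.size())
    {
      bin = m_Counts.size() - 1; // guard against floating-point round-up
    }
    ++m_Counts[bin];
    ++m_Total;
  }

  // Relative frequency of a bin, i.e. the estimated pmf at that bin.
  double GetProbability(std::size_t bin) const
  {
    if (m_Total == 0)
    {
      return 0.0;
    }
    return static_cast<double>(m_Counts.at(bin)) / static_cast<double>(m_Total);
  }

private:
  double m_Min;
  double m_Max;
  std::vector<unsigned long> m_Counts;
  unsigned long m_Total;
};
```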

> Some suggestions for the
> underlying concepts
> 
> *	RandomVariable 
> *	FunctionOfARandomVariable

Each dimension of a feature vector of a
"FeatureDomain" instance can be a random variable.

> *	Sample
> 
> *	This is a confusing concept because you can have a
> single sample of a random variable, or you
> can take a "sample" which is a number of samplings of
> a random variable. You want to make sure there is
> no confusion about what you mean by sample in the
> toolkit.
> *	Perhaps the solution is to have a "Sample" and a
> "Measurement".  A "Measurement" is a single
> sample of a random variable and a "Sample" is a
> collection of "Measurement"s.  I guess in your
> current terminology a "Measurement" refers to your
> "Observation".
> 

I think the closest thing to your "Sample" is
"FeatureDomain". I think the "FeatureDomain" class
should have methods to extract each feature dimension
(a "Measurement") from a "FeatureDomain". The current
"Sample" object is meaningless in terms of statistical
analysis. It is only a workspace that binds together
different types of statistical objects, such as
"FeatureDomain" and "Label" objects. Since "sample" is
an important concept and already has its own meaning,
the name should be changed. WorkSpace?
StatisticalObjectHolder?

> *	Measurement (see above)

> *	Density
I think you mean PDF. Am I right?

> *	Distribution (typically a CDF).
See the above comments on CDF and PDF.

> *	Parameter (population parameter)
> *	Estimate (something calculated from a collection
> of rv's. If the estimate is of something
> other than a parameter of the distribution, then it
> is hard to distinguish from a function of a
> random variable).

These parameters and estimates are real numbers. So I
think we don't need any special objects.

> *	Test - Student-t, Chi-squared, Hotelling T^2, F

I haven't started working on tests yet.

> *	Table 
> 
> *	In the context of representing the area under a
> CDF.

I believe the most common use of the term "Table" is
cross tabulation (for example, two-way tables). Given
this usage, my "Table" class could be a source of
confusion because it doesn't comply with that meaning.
I need names!!!

> 
> *	NearestNeighborFinder - I would call this a
> NearestNeighborLocator.

Could you explain why? Is the term "nearest neighbor
locator" more common?

> *	FastRandomUnitNormalVariateGenerator
> 
> *	I still do not like this name.  If this is the
> proposed algorithm for generating samples from a
> standardized gaussian distribution, then I propose
> we just call it NormalVariateGenerator.

There is more than one algorithm to do the job.
Therefore, I think we need a name a little more
specific than "NormalVariateGenerator" to leave room
for different algorithms. However, I don't like
"FastRandomUnitNormalVariateGenerator" either. It has
lots of general words in it, but it doesn't seem to be
any more specific than "NormalVariateGenerator".

Since the generator is a revised version of C. S.
Wallace's algorithm, how about
RevisedWallaceNormalVariateGenerator or just
WallaceNormalVariateGenerator?

> *	However, I think this "algorithm" should really
> just be a method on a "Density" or
> "Distribution".  Every "Distribution" should be able
> to generate random variables.  The simplest way
> to do this is to draw a sample from a uniform
> distribution and run that value through the inverse
> CDF of the specified Distribution. If a CDF cannot be
> inverted (easily), you can do a binary search for
> the x value that generates the specified CDF value.
> This could be the default implementation in the
> "Density" and "Distribution" classes.

I think there is another option for random sample
generation. A "general sample generator" can accept
an inverse CDF as input and generate samples as
output. You may call it a functional process model or
a filter model. Special random sample generators would
then share a common input/output interface with the
general one. If we have "Distribution" classes, then
these generators will be parts of the "Distribution"
classes.
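
Both ideas in this exchange, inverting the CDF by binary search and a
general generator that takes an inverse CDF as input, can be sketched
together. This is only an illustration under stated assumptions: the
names RealFunction, UnitExponentialCDF, InvertCDFByBisection,
InverseCDFSampleGenerator, and GetVariate are made up here, the
search bracket must be supplied by the caller, and std::mt19937 from
the C++ standard library is used as the uniform source purely for
convenience.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>
#include <utility>

using RealFunction = std::function<double(double)>;

// Example CDF used for illustration: exponential distribution with rate 1.
inline double UnitExponentialCDF(double x)
{
  return x <= 0.0 ? 0.0 : 1.0 - std::exp(-x);
}

// Finds x in [lower, upper] with cdf(x) close to p by bisection.
// Assumes cdf is non-decreasing and the bracket contains the answer.
inline double InvertCDFByBisection(const RealFunction & cdf, double p,
                                   double lower, double upper,
                                   int iterations = 200)
{
  assert(p >= 0.0 && p <= 1.0);
  assert(lower < upper);
  for (int i = 0; i < iterations; ++i)
  {
    const double middle = 0.5 * (lower + upper);
    if (cdf(middle) < p)
    {
      lower = middle; // target lies to the right of the midpoint
    }
    else
    {
      upper = middle; // target lies at or to the left of the midpoint
    }
  }
  return 0.5 * (lower + upper);
}

// General generator: inverse CDF in, variates out.
class InverseCDFSampleGenerator
{
public:
  InverseCDFSampleGenerator(RealFunction inverseCDF, unsigned int seed)
    : m_InverseCDF(std::move(inverseCDF)), m_Engine(seed), m_Uniform(0.0, 1.0)
  {}

  // Draws u uniformly from [0, 1) and maps it through the inverse CDF.
  double GetVariate() { return m_InverseCDF(m_Uniform(m_Engine)); }

private:
  RealFunction m_InverseCDF;
  std::mt19937 m_Engine;
  std::uniform_real_distribution<double> m_Uniform;
};
```

A specialized generator, such as a fast normal variate generator,
could expose the same GetVariate() interface and skip the numerical
inversion entirely.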

> *	Along these lines, a Distribution should be able
> to answer "test" questions. i.e. What is the
> probability of a random variable having a value less
> than or equal to "blah"
> 

Thank you again, Jim. Your comments gave me many
things to think over and helped me to understand the
problems. I don't think I addressed all the issues you
raised. I will try again :)

To many fellow ITK developers, the development of the
ITK statistical modelling module has been kind of
invisible up to this point. I will let you know what's
going on with the module development as frequently as
possible. So please join in the discussion and help
me.

Thank you,

=====
Jisung Kim
bahrahm@yahoo.com
106 Mason Farm Rd.
129 Radiology Research Lab., CB# 7515
Univ. of North Carolina at Chapel Hill
Chapel Hill, NC 27599-7515
