[Insight-users] ITK SPSA optimizer: StateOfConvergence question and SetScales bug.
Zachary Pincus
zpincus at stanford.edu
Mon Jun 27 03:34:34 EDT 2005
Hello folks,
In preparation for using the SPSA optimizer in my code, I've been
reviewing the SPSA implementation that you guys contributed to ITK
earlier this year. First off, thank you for that, and also thanks for
making the code so easy to read and well-commented!
I've got a couple of questions about this implementation, if I might
trouble you:
(1) The convergence test is a bit unfamiliar to me. Since it's not what
is described in Spall's papers on SPSA, reading them didn't help
clarify matters.
In general, I'm familiar with convergence tests that look at how much
the value of the objective function has changed, or how much the
parameters have changed, or how close the gradient magnitude is to
zero. However, the "StateOfConvergence" test in the SPSA class is
less clear to me. Optimization terminates once StateOfConvergence
drops below some threshold, where StateOfConvergence starts at zero
and is updated at each iteration as:

StateOfConvergence <- DecayRate * (StateOfConvergence
                                   + LearningRate * GradientMagnitude)
Could anyone give some guidance as to why this criterion was chosen
over simpler tests such as those mentioned above, and how to choose
the threshold and decay-rate parameters appropriately? (Naively, it
seems that if the gradient starts large and then goes to zero,
optimization could continue for some time as the "state of
convergence" value decays, even though the optimum has already been
found. Likewise, if the decay rate is too small, the state could
fall below the threshold and terminate optimization prematurely.
Thus, choosing the right decay rate seems very important, but it is
not clear how best to do so!)
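For reference, here is the termination rule written out as a minimal
sketch; the function and variable names below are my own paraphrase,
not the ITK source ("gain" stands for the learning rate a_k at the
current iteration, "gradMag" for the gradient magnitude):

    // Sketch of the SPSA termination test as I read it.
    bool UpdateAndTestConvergence(double& state, double decayRate,
                                  double gain, double gradMag,
                                  double tolerance)
    {
      // state starts at zero and accumulates a decayed sum of the
      // sizes of the steps taken so far:
      state = decayRate * (state + gain * gradMag);
      return state < tolerance;  // terminate once this sum is small
    }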
(2) I think there might be a bug in how the Scales parameter is used
to scale the perturbations and the gradient value. I've read the
email thread from a few months ago concerning this, but I believe
the implementation still gets the scaling slightly wrong.
The logic for dealing with the scaled delta (perturbation) values,
as commented at the bottom of the ComputeGradient() method in
itkSPSAOptimizer.cxx (lines 414-451), assumes that scaling is
accomplished by dividing by sqrt(Scales). This sqrt(Scales) term
eventually ends up in a numerator, so to correct for that (i.e. to
put the scaling term back in the denominator where it belongs), the
comment suggests an additional division by Scales. In the actual
code, however, the scaling is done (as it should be!) by dividing by
Scales, yet the correction remains a single additional division by
Scales, when by the comment's own logic it should be a division by
Scales^2. As a result, the scaling factor winds up being canceled
out entirely, instead of being transferred from the numerator to the
denominator as a division by Scales^2 would properly do.
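Concretely, the fix I have in mind looks something like the sketch
below. This is not a drop-in patch for ComputeGradient(), and the
names are my own; delta[j] is the raw +/-1 perturbation, scales[j]
the user-supplied scale, so the scaled perturbation actually used
for sampling is delta[j] / scales[j]:

    #include <cstddef>
    #include <vector>

    std::vector<double> ComputeScaledGradient(
        const std::vector<double>& delta,
        const std::vector<double>& scales,
        double valuediff)
    {
      std::vector<double> gradient(delta.size());
      for (std::size_t j = 0; j < delta.size(); ++j)
      {
        const double scaledDelta = delta[j] / scales[j];
        // Dividing by scaledDelta alone leaves scales[j] in the
        // numerator; dividing by scales[j]^2 as well moves it into
        // the denominator, giving valuediff * delta[j] / scales[j].
        gradient[j] = valuediff / (scaledDelta * scales[j] * scales[j]);
      }
      return gradient;
    }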
If you agree that this is a problem, I'll happily fix the code (and
update the comments) and commit it to the ITK repository. (After filing
a bug report, of course, to keep the ITK bookkeepers happy.) If you
still have questions, please refer to the detailed example at the
end of this email, where I work through the logic and show why I
think there's a bug in the implementation.
Thanks for your time,
Zach Pincus
Department of Biochemistry and Program in Biomedical Informatics
Stanford University School of Medicine
EXAMPLE OF ISSUE IN SPSA SCALING
Consider the case where the scales are set to [1, 10]. That is, the
first dimension (say it's translation) should be searched more "widely"
than the second dimension (say it's rotation). Now assume that the
"direction" or delta chosen for a given preturbation is [1, -1]. Once
we scale this delta to delta' by dividing by the scales (as in the
code), we get delta' = [1, -0.1].
Now say that the current parameters are [0,0] and that the current
value of the objective is 0. Say for simplicity that c_k is 1. Now, we
sample our objective at theta+, which is [1, -0.1] and get a value of
1, and sample at theta-, which is [-1, 0.1], and get a value of -1.
Here's a picture of the parameter space, with the values of the
objective at the current position, and at theta+ and theta-:
          f(theta-) |
           /        |      f(current)
         -1         |       /
    ----------------0----------------
                    |  1
                    |   \
                    |    f(theta+)
Now, valuediff is calculated as (f(theta+) - f(theta-)) / (2 * c_k),
which in this case is equal to 1.
So we calculate the gradient as valuediff / delta', as in the code.
This gives us a gradient of [1, -10]. Now, this is clearly wrong, as
you guys noted. So we apply the correction, which is implemented in the
code as dividing by the scales. So our new gradient is [1, -1].
But, this is *still* wrong! When we update to the new position, we will
move from [0, 0] to [-1, 1], assuming that a_k = 1 and that we're
minimizing the function. That's better than moving to [-1, 10] as it
would be had we not divided by the scales, but it's still (a) not
respecting the scales, and (b) not moving in the direction that was
originally sampled. What we need to do is divide our gradient of [1,
-10] by the square of the scales, giving (as desired) a new gradient of
[1, -0.1], and an update step to [-1, 0.1] which *is* in the direction
of the perturbation.
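As a sanity check, here is a throwaway snippet (my own code, not
ITK's) that computes all three versions of the gradient for the
numbers above:

    #include <cstdio>

    // scales = [1, 10], raw delta = [1, -1], c_k = 1, valuediff = 1.
    int main()
    {
      const double scales[2] = { 1.0, 10.0 };
      const double delta[2]  = { 1.0, -1.0 };
      const double valuediff = 1.0;
      for (int j = 0; j < 2; ++j)
      {
        const double scaledDelta = delta[j] / scales[j];
        const double raw      = valuediff / scaledDelta;        // [1, -10]
        const double current  = raw / scales[j];                // [1, -1]
        const double proposed = raw / (scales[j] * scales[j]);  // [1, -0.1]
        std::printf("j=%d raw=%g current-code=%g proposed=%g\n",
                    j, raw, current, proposed);
      }
      return 0;
    }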
Note that this is basically equivalent to the suggestion in Spall's
"Implementation of the Simultaneous Perturbation Algorithm for
Stochastic Optimization", footnote 1, which suggests: "In cases
where the elements of theta have very different magnitudes, it may
be desirable to use a matrix scaling of the gain a_k, if prior
information is available on the relative magnitudes."
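To spell out the connection as I see it: with the raw +/-1 delta and
the corrected division by Scales^2, the update step becomes

    step_j = -a_k * valuediff * delta_j / Scales_j
           = -(a_k / Scales_j) * (valuediff * delta_j)

i.e. the ordinary SPSA gradient estimate valuediff * delta_j driven
by a per-parameter (diagonal-matrix) gain of a_k / Scales_j, which
is precisely the kind of gain scaling the footnote describes.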