[Insight-users] ITK SPSA optimizer: StateOfConvergence question and SetScales bug.
Zachary Pincus
zpincus at stanford.edu
Mon Jun 27 03:34:34 EDT 2005
Hello folks,
In preparation for using the SPSA optimizer in my code, I've been
reviewing the SPSA implementation that you guys contributed to ITK
earlier this year. First off, thank you for that, and also thanks for
making the code so easy to read and well-commented!
I've got a couple of questions about this implementation, if I might
trouble you:
(1) The convergence test is a bit unfamiliar to me. Since it's not what
is described in Spall's papers on SPSA, reading them didn't help
clarify matters.
In general, I'm familiar with convergence tests that look at how much
the value of the objective function has changed, or how much the
parameters have changed, or how close the gradient magnitude is to
zero. However, the "StateOfConvergence" test in the SPSA class is
less clear to me. Optimization terminates once StateOfConvergence
drops below some threshold, where StateOfConvergence starts at zero
and is updated at each iteration as:

StateOfConvergence <- DecayRate * (StateOfConvergence
                                   + LearningRate * GradientMagnitude)
Could anyone give some guidance as to why this criterion was chosen
over simpler tests such as those mentioned above, and how to choose
the threshold and decay-rate parameters appropriately? (Naively, it
seems that if the gradient starts large and then goes to zero,
optimization could continue for some time as the "state of
convergence" value decays, even though the optimum has already been
found. Likewise, if the decay rate is too small, the state could
fall below the threshold and terminate optimization prematurely.
Thus, choosing the right decay rate seems very important, but it is
not clear how best to do so!)
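For reference, here is the termination rule written out as a minimal
sketch; the function and variable names below are my own paraphrase,
not the ITK source ("gain" stands for the learning rate a_k at the
current iteration, "gradMag" for the gradient magnitude):

    // Sketch of the SPSA termination test as I read it.
    bool UpdateAndTestConvergence(double& state, double decayRate,
                                  double gain, double gradMag,
                                  double tolerance)
    {
      // state starts at zero and accumulates a decayed sum of the
      // sizes of the steps taken so far:
      state = decayRate * (state + gain * gradMag);
      return state < tolerance;  // terminate once this sum is small
    }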
(2) I think there might be a bug in how the Scales parameter is used
to scale the perturbations and the gradient value. I've read the
email thread from a few months ago concerning this, but I believe
the implementation still gets the scaling slightly wrong.
The logic for dealing with the scaled delta (perturbation) values,
as commented at the bottom of the ComputeGradient() method in
itkSPSAOptimizer.cxx (lines 414-451), assumes that scaling is
accomplished by dividing by sqrt(Scales). This sqrt(Scales) term
eventually ends up in a numerator, so to correct for that (i.e. to
put the scaling term back in the denominator where it belongs), the
comment suggests an additional division by Scales. In the actual
code, however, the scaling is done (as it should be!) by dividing by
Scales, yet the correction remains a single additional division by
Scales, when by the comment's own logic it should be a division by
Scales^2. As a result, the scaling factor winds up being canceled
out entirely, instead of being transferred from the numerator to the
denominator as a division by Scales^2 would properly do.
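Concretely, the fix I have in mind looks something like the sketch
below. This is not a drop-in patch for ComputeGradient(), and the
names are my own; delta[j] is the raw +/-1 perturbation, scales[j]
the user-supplied scale, so the scaled perturbation actually used
for sampling is delta[j] / scales[j]:

    #include <cstddef>
    #include <vector>

    std::vector<double> ComputeScaledGradient(
        const std::vector<double>& delta,
        const std::vector<double>& scales,
        double valuediff)
    {
      std::vector<double> gradient(delta.size());
      for (std::size_t j = 0; j < delta.size(); ++j)
      {
        const double scaledDelta = delta[j] / scales[j];
        // Dividing by scaledDelta alone leaves scales[j] in the
        // numerator; dividing by scales[j]^2 as well moves it into
        // the denominator, giving valuediff * delta[j] / scales[j].
        gradient[j] = valuediff / (scaledDelta * scales[j] * scales[j]);
      }
      return gradient;
    }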
If you agree that this is a problem, I'll happily fix the code (and
update the comments) and commit it to the ITK repository. (After filing
a bug report, of course, to keep the ITK bookkeepers happy.) If you
still have questions, please refer to the detailed example at the
end of this email, where I work through the logic and show why I
think there's a bug in the implementation.
Thanks for your time,
Zach Pincus
Department of Biochemistry and Program in Biomedical Informatics
Stanford University School of Medicine
EXAMPLE OF ISSUE IN SPSA SCALING
Consider the case where the scales are set to [1, 10]. That is, the
first dimension (say it's translation) should be searched more "widely"
than the second dimension (say it's rotation). Now assume that the
"direction" or delta chosen for a given preturbation is [1, -1]. Once
we scale this delta to delta' by dividing by the scales (as in the
code), we get delta' = [1, -0.1].
Now say that the current parameters are [0,0] and that the current
value of the objective is 0. Say for simplicity that c_k is 1. Now, we
sample our objective at theta+, which is [1, -0.1] and get a value of
1, and sample at theta-, which is [-1, 0.1], and get a value of -1.
Here's a picture of the parameter space, with the values of the
objective at the current position, and at theta+ and theta-:
          f(theta-) |
           /        |      f(current)
         -1         |       /
    ----------------0----------------
                    |  1
                    |   \
                    |    f(theta+)
Now, valuediff is calculated as (f(theta+) - f(theta-)) / (2 * c_k),
which in this case is equal to 1.
So we calculate the gradient as valuediff / delta', as in the code.
This gives us a gradient of [1, -10]. Now, this is clearly wrong, as
you guys noted. So we apply the correction, which is implemented in the
code as dividing by the scales. So our new gradient is [1, -1].
But, this is *still* wrong! When we update to the new position, we will
move from [0, 0] to [-1, 1], assuming that a_k = 1 and that we're
minimizing the function. That's better than moving to [-1, 10] as it
would be had we not divided by the scales, but it's still (a) not
respecting the scales, and (b) not moving in the direction that was
originally sampled. What we need to do is divide our gradient of [1,
-10] by the square of the scales, giving (as desired) a new gradient of
[1, -0.1], and an update step to [-1, 0.1] which *is* in the direction
of the perturbation.
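As a sanity check, here is a throwaway snippet (my own code, not
ITK's) that computes all three versions of the gradient for the
numbers above:

    #include <cstdio>

    // scales = [1, 10], raw delta = [1, -1], c_k = 1, valuediff = 1.
    int main()
    {
      const double scales[2] = { 1.0, 10.0 };
      const double delta[2]  = { 1.0, -1.0 };
      const double valuediff = 1.0;
      for (int j = 0; j < 2; ++j)
      {
        const double scaledDelta = delta[j] / scales[j];
        const double raw      = valuediff / scaledDelta;        // [1, -10]
        const double current  = raw / scales[j];                // [1, -1]
        const double proposed = raw / (scales[j] * scales[j]);  // [1, -0.1]
        std::printf("j=%d raw=%g current-code=%g proposed=%g\n",
                    j, raw, current, proposed);
      }
      return 0;
    }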
Note that this is basically equivalent to the suggestion in Spall's
"Implementation of the Simultaneous Perturbation Algorithm for
Stochastic Optimization", footnote 1, which suggests: "In cases
where the elements of theta have very different magnitudes, it may
be desirable to use a matrix scaling of the gain a_k, if prior
information is available on the relative magnitudes."
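To spell out the connection as I see it: with the raw +/-1 delta and
the corrected division by Scales^2, the update step becomes

    step_j = -a_k * valuediff * delta_j / Scales_j
           = -(a_k / Scales_j) * (valuediff * delta_j)

i.e. the ordinary SPSA gradient estimate valuediff * delta_j driven
by a per-parameter (diagonal-matrix) gain of a_k / Scales_j, which
is precisely the kind of gain scaling the footnote describes.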