10.1 Parameter Adaptation

   Consider an algorithm y=A(f,x) which takes in a guess x at the solution to a certain problem f
and outputs a (hopefully better) guess y at the solution. Assume that it is easy to compute and
compare the quality Q(x) of guess x and the quality Q(y) of guess y. Assume also that A contains
some parameter p (which may be a numerical value, a vector of numerical values, etc.), so that
we may write y=A(f,x,p). Then, for a given set S of problems f whose solutions all lie in some
set R, there may be some value p which maximizes the average over all f in S of the average
over all x in R of Q(A(f,x,p)) – Q(x). Such a value of p will be called optimal for S.
   The determination of the optimal value of p for a given S can be a formidable optimization
problem, even in the case where S has only one element. In practice, since one rarely possesses a
priori information as to the performance of an algorithm under different parameter values, one is
required to assess the performance of an algorithm with respect to different parameter values in a
real-time fashion, as the algorithm operates. For instance, a common technique in numerical
analysis is to try p=a for (say) fifty passes of A, then p=b for fifty passes of A, and then adopt the
value that seems to be more effective on a semi-permanent basis. Our goal here is a more general
approach.
   Assume that A has been applied to various members of S from various guesses x, with various
values of p. Let U denote the nx2 matrix whose i’th row is (fi,xi), and let P denote the nx1 vector
whose i’th entry is (pi), where fi, xi and pi are the values of f, x and p to which the i’th pass of A
was applied. Let I denote the nx1 vector whose i’th entry is Q(A(fi,xi,pi))-Q(xi). The crux of
adaptation is finding a connection between parameter values and performance; in terms of these
matrices this implies that what one seeks is a function C(X,Y) such that %C(U,P)-I% is small,
for some norm % %.
   So: once one has by some means determined C which thus relates U and I, then what? The
overall object of the adaptation (and of A itself) is to maximize the size of I (specifically, the
most relevant measure of size would seem to be the l1 norm, according to which the norm of a
vector is the sum of the absolute values of its entries). Thus one seeks to maximize the function
C(X,Y) with respect to Y.
PARAMETER ADAPTATION AS A BANDIT PROBLEM
   The problem here is that one must balance three tasks: experimenting with p so as to locate an
accurate C, experimenting with P so as to locate a maximum of C with respect to Y, and at each
stage implementing the what seems on the basis of current knowledge most appropriate p, so as
to get the best answer out of A. This sort of predicament, in which one must balance
experimentalvariation with use of the best results found through past experimentation, is known
as a "bandit problem" (Gittins, 1989). The reason for the name is the following question: given a
"two-armed bandit", a slot machine with two handles such that pulling each handle gives a
possibly different payoff, according to what strategy should one distribute pulls among the two
handles? If after a hundred pulls, the first handle seems to pay off twice as well, how much more
should one pull the second handle just in case this observation is a fluke?
   To be more precise, the bandit problem associated with adaptation of parameters is as follows.
In practice, one would seek to optimize C(X,Y) with respect to Y by varying Y about the current
optimal value according to some probability distribution. The problem is: what probability
distribution? One could, of course, seek to determine this adaptively, but this leads to a regress:
how does one solve the bandit problem associated with this adaptation?
Kaynak: A New Mathematical Model of Mind
belgesi-962