Toward A General Induction Algorithm

The strengthened Peirce’s principle is only a beginning. It is a basic philosophical assumption
which ensures the possibility of intelligence through pattern recognition. All the standard
mathematical methods for predicting the future are based on probability theory. In this section I will show that it is indeed possible to combine prediction based on pattern with prediction based on probability. Most of the ideas here are completely straightforward, and the reader who prefers to avoid mathematical formalism will lose little by skimming over this section.

From the abstract philosophical point of view, this approach may be controversial; it assumes
that elementary probability theory works, and probability is a troublesome concept. This
difficulty will be discussed in Chapter 9, in a different context. For now, let us simply apply the theory of probability without worrying about justifications.

Let P(X) denote the probability that the proposition X is true; let P(X%E) denote the
probability that X is true given that the proposition E is true. Recall that P(X%E)=P(XE)/P(E),
unless P(E)=0. Let us consider events Yi of the form "Pattern Pj is present in the environment to intensity Kl over time period(t,v)", where j and K are integers, and set pi=P(Yi). If there are n patterns and O%K<M, then there will be N=nM events Yi. Essentially, the task of making a
coherent model may be summarized as follows. Where t denotes the present time, let Qs denote
the proposition "Patterns Qs1, Qs2,…, Qsn have been observed during the interval [t-s,t] with
respective intensities Ks1,…,Ksn". The question is: what is P(Yi%Qs), for each s? Strictly
speaking, this is a different problem for each s. However, the knowledge generated in solving it
for, say, s=1000, would obviously be useful in solving it for s=1100, 2000, or 10000. In general, the larger s is, the harder the problem is, since there will be more Qj.

The estimation of the probabilities P(Yi%Qs) is a very difficult problem. Probably there will
always be a need for rough heuristics. But still, a more detailed analysis can lend some insight
into what is involved. {Qsj} is the set of patterns observed in the time interval [t-s,t]. Let us assume that all the possible combinations of elements of {Qsj} have been listed in some specified order, and that {Qsj(k,1), Qsj(k,2),…} refers to the k’th one. Then for any v, and any u between s+v and t, we may introduce a set of propositions, {Ruvk}, which includes all propositions of the form "Qsj(k,1) and Qsj(k,2) and … have been observed during (u-v,u) with respective intensities Ksj(k,1), Ksj(k,2),…." And, finally, we may define Ps(Ruvk) as the result of applying Algorithm 3.1 to the set of patterns referred to by Ruvk [in the notation of that algorithm, let Qsj(k,1)=(y1,z1), etc.].

According to the strengthened Peirce’s principle, this indicates how likely it is for the k’th
conjunction of patterns to occur.

So, in this framework, estimating P(Yi%Qs) comes down to first using the Ps(Ruvk) to estimate
a set of most likely future scenarios, a set of subsets of {Qsi} which are intense enough and
mutually consistent enough to be plausible; and second, using the Ps(Ruvk) to estimate the
P(Yi%Qsj) for those Qsj in this future scenario.
The second part, estimating the P(Yi%Qsj), is relatively tractable; it doesn’t require all the
Ps(Ruvk), but only the relatively miniscule subset involving conjunctions of the form PjQl, where Ql is an element of some Qsj and Pj is the pattern to which the proposition Yi refers. In practice an intelligent system would probably have to forego the testing of all the Yi, and for those Yi which it selected to test it would have to ignore some of the conjunctions PjQl. For those Yi and Qsj selected, it would simply have to use the formula P(Yi%Qsj)=P(YiQsj)/P(Qsj).
Unfortunately, the task of choosing a set of most likely futures, is far, far less palatable. This is a very difficult optimization problem, and what’s worse, in general it refers to the entire set of Ps(Ruvk). The array {Ruvk} is so large that no conceivable intelligence could ever compute it for a situation involving a realistic number of patterns. So in order for intelligent prediction to be possible, some sort of very rough approximation method is required.

Let’s assume — for reasons that will become clear in later chapters — that certain patterns in Qs have been labeled as important. By what process doesn’t matter right now. Define the
prominence of a pattern as the product of its intensity with its importance. Then one rough
approximation algorithm is asfollows. Consider perhaps seven different values of s — e.g. a
second, a minute, an hour, a day, a week, a year, a lifetime — and seven different values of v. For each of these values of s, start with, say, the seven most prominent patterns, and consider all the various combinations of them. Estimate for each of these 896 combinations the appropriate average over all u of Ps(Ruvk), where v runs through all seven values. For each v retain the seven which yield the highest numbers (there could be some repetition in this set if the same pattern were prominent on two different time scales). Then, for each v, for each of these seven patterns P, consider all the combinations of P with the six next most prominent patterns in Qs (where s corresponds to the time scale on which P was selected as prominent). For each of these combinations, compute the appropriate average over all u of Ps(Ruvk) (where v corresponds to the time scale according to which the initial element of Ruvk was retained). Select the seven which yield the highest numbers. Then, for each v, for each of these seven patterns P, consider all the combinations of P with the next most prominent patterns in Qs….
Obviously, the numbers could be tinkered with. Perhaps they could be selected more
intelligently. But the basic idea should be clear: with luck, you can accrete a reasonably likely projection on any future time scale v by starting with an intense, important pattern and
progressively adding less and less intense and important patterns onto it. At each stage a number of possible additions are considered, and those which contradict least are retained. This is a very crude methodology. It relies essentially on the precept that more intense habits can be expected to continue with greater certainty — without this assumption, one might as well start with randomly selected "important" patterns as with intense ones.
The moral of all this is that if the philosophical problem of induction is unsolvable, the
general practical problem of induction is formidable. By proposing the tendency to take habits
as a fundamental law, we have analyzed induction in terms of pattern recognition. By proposing
that more intense patterns have more propensity to continue we have given a clue as to how to go
about resolving the contradictory predictions which the tendency to take habits inevitably gives
rise to. But this is nothing more than a general precept, a hint as to how to go about doing things: even if, on average, the more intense of a set of contradictory patterns prevails, it is folly to invariably assume that the more intense pattern one has perceived will prevail. To go beyond mere philosophical maxims requires a general framework for talking about the probabilities of various combinations of patterns, and this leads into a maze of sets and indices. It is extremely difficult to use past experience to tell which of two contradictory patterns will prevail.

But this task is necessary for intelligence: what we are talking about is no less than building a plausible model of the future on the basis of the past. Hopefully the homeliness of the necessary formalism does not obscure the very basic nature of the questions being addressed. Although induction is generally recognized to be essential to mentality, there has been very little work done on the general practical problem of induction. Philosophers have tirelessly debatedthe fine points of arguments for and against the justifiability and possibility of induction; and computer scientists have made a great deal of progress studying induction in limited contexts (Blum and Blum, 1975; Daley and Smith, 1986). But, cumbersome as the ideas of this section imply a hybrid pattern-theoretic/probabilistic method of predicting, on the basis of patterns recognized in the world in the past, the probability that a given pattern X will occur with intensity K in the future on a time scale [t,t+v], where t is the present.

belgesi-923

Belgeci