Universidade de São Paulo

Keywords: weighting functions, probabilistic biases, adaptive probability theory.

If you are sure about the stated probability values and you are completely rational, you should simply assign utilities to each monetary value and choose the gamble that provides the largest expected utility. And, as a matter of fact, the laboratory experiments that revealed the failures in human reasoning provided exact values, with no uncertainty attached to them. Therefore, if humans were perfect Bayesian statisticians, the subjects faced with those experiments should have treated those values as if they were known for sure. But, from the point of view of an intuitive Bayesian, or from the point of view of everyday life, there is no such thing as a probability observation that does not carry some degree of uncertainty. Even values from probabilistic models based on some symmetry of the problem depend, in a more complete analysis, on the assumption that the symmetry does hold. If it does not, the value could be different and, therefore, even though the uncertainty might be small, we would still not be completely sure about the probability value.

Assuming there is uncertainty, what you need to do is obtain your posterior estimate of the chances, given the stated gamble probabilities. Here, the probabilities you were told are actually the data you have about the problem. And, as long as you were not in a laboratory, it is very likely they were obtained as observed frequencies, as proposed by the evolutionary psychologists. That is, what you understand is that you are being told that, in the observed sample, a specific result happened 85% of the times it was checked. With that information in mind, you must decide what posterior value you will use. The best answer would certainly involve a hierarchical model over the possible ways that frequency was observed and many integrations over all the nuisance parameters (parameters you are not interested in).
You should also consider whether all observations were made under the same circumstances, whether there is any kind of correlation between their results, and so on. But all those calculations impose a cost on your mind, and it might be a good idea to accept simpler estimates that work reasonably well most of the time. You are looking for a good heuristic, one that is simple and efficient and that gives you correct answers most of the time (or, at least, answers close enough). That is the basic idea behind Adaptive Probability Theory (Martins, 2005). Our minds, from evolution or learning, are built to work with probability values as if they were uncertain and to make decisions compatible with that possibility. APT does not claim we are aware of that; it only says that our common sense is built in a way that mimics a Bayesian inference over a complex, uncertain problem. If you hear a probability value, it is reasonable to assume that the value was obtained from a frequency observation. In that case, the natural place to look for a simple heuristic is to treat the problem as one of independent, identical observations. The problem then has a binomial likelihood, and its solution would be straightforward if not for one missing piece of information: you were told the frequency, but not the sample size n. Therefore, you must use some prior opinion about n. In the full Bayesian problem, that means that n is a nuisance parameter: while inference about p is what is desired, the posterior distribution also depends on n, and the final result must be integrated over n. The likelihood that an observed frequency o, corresponding to the observation of s = no successes, is reported is given by

Gamble A       | Gamble B
85% to win 100 | 95% to win 100
15% to win 50  | 5% to win 7
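Under the assumption of linear utility (u(x) = x), the expected values of the two gambles above are easy to compute. This short sketch (variable names are ours, for illustration) makes the comparison explicit:

```python
# Expected value of each gamble, assuming linear utility u(x) = x.
# Gamble A: 85% chance of 100, 15% chance of 50.
# Gamble B: 95% chance of 100, 5% chance of 7.
ev_a = 0.85 * 100 + 0.15 * 50
ev_b = 0.95 * 100 + 0.05 * 7

# B has the larger expected value (about 95.35 versus 92.5), so a fully
# rational agent who trusts the stated probabilities should choose B.
```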

$$f(o \mid p, n) = \binom{n}{no}\, p^{no}\, (1-p)^{n(1-o)} \qquad (1)$$
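Equation 1 can be evaluated directly. In this sketch (the function name is ours), the reported frequency is converted back into a whole number of successes before the binomial probability is computed:

```python
from math import comb

def likelihood(o, p, n):
    """Probability of reporting frequency o = s/n in n Bernoulli(p) trials (Equation 1)."""
    s = round(n * o)  # number of observed successes implied by the frequency
    return comb(n, s) * p**s * (1 - p) ** (n - s)
```

Summed over all possible frequencies s/n for a fixed n, these values add up to one, as a binomial distribution should.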

In order to integrate over n, a prior for it is required. However, the problem is actually more complex than that, since it is reasonable that our opinion about n should depend on the value of o. That happens because, if o=0.5, it is far more likely that n=2 than if o=0.001, when it makes sense to assume that at least 1,000 observations were made. And we should also consider that extreme probabilities are more subject to error. In real life, outside the realm of science, people rarely, if ever, have access to large samples to draw their conclusions from. For the problem of detecting correlates, there is some evidence that using small samples can be a good heuristic (Kareev, 1997). In other words, when dealing with extreme probabilities, we should also include the possibility that the sample size was actually smaller and the reported frequency is wrong. The correct prior f(n,o), therefore, can be very difficult to describe and, for the complete answer, hierarchical models including probabilities of error are needed. However, such a complicated, complete model is not what we are looking for. A good heuristic should be reasonably fast to use and should not depend on too many details of the model. Therefore, it makes sense to look for reasonable average values of n and simply assume that value for the inference process. Given a fixed value of n, it is easy to obtain a posterior distribution. The likelihood in Equation 1 is a binomial likelihood, and the easiest way to obtain inferences when dealing with binomial likelihoods is to assume a Beta distribution for the prior. Given a Beta prior with parameters a and b, the posterior distribution will also be a Beta distribution, with parameters a+s and b+n−s. The average of a random variable that follows a Beta distribution with parameters a and b has a simple form, a/(a+b). That means we can obtain a simple posterior average for the probability p, given the observed frequency o, w(o)=E[p|o]:

$$w(o) = E[p \mid o] = \frac{a + no}{a + b + n} \qquad (2)$$

which is a straight line if n is a constant (independent of o), but different from and less steep than w(o)=o. For a non-informative prior distribution, which corresponds to the choice a=1 and b=1, Equation 2 can be written in the traditional form (1+s)/(2+n) (Laplace's rule). However, a fixed sample size, equal for all values of p, does not make much sense in the regions closer to certainty, and n must somehow increase as we approach those regions (o → 0 or o → 1). The easiest way to model n is to suppose that the event was observed at least once (and that, at least once, it was not observed). That is, if o is small, choose an observed number of successes s=1 (some other value would serve, but we should remember that humans tend to think with small samples, as observed by Kareev et al., 1997). If o is closer to 1, take s=n−1. That is, the sample size will be given by n=1/t, where t=min(o, 1−o), and we have that w(o)=2o/(2o+1) for o < 0.5 and w(o)=1/(3−2o) for o > 0.5. By calculating the ratio

$$\frac{w(x)\,w(cy)}{w(y)\,w(cx)}$$

it is easy to show that the common-ratio effect holds, meaning that the curves are subproportional, both for o < 0.5 and for o > 0.5. Estimating the above fraction shows that w(x)/w(y) < w(cx)/w(cy), for c < 1, exactly when x < y. The curve w(o) can be observed in the Figure, where it is compared to a curve proposed by Prelec (2000) as a parameterization that fits the data observed in the experiments reasonably well. A few comments are needed here. For most values of o, the predicted value of n will not be an integer, as one would reasonably expect it to be. If o=0.4, we have n=2.5, an odd and absurd sample size if taken literally. One could propose using a sample size of n=5 in this case, but that would mean a discontinuous weighting function. More than that, for overly precise values, such as 0.499, it would force n to be as large as 1,000. However, it must be noted that, in the original Bayesian model, n is not supposed to be an exact value, but the average value obtained after it is integrated out (remember it is a nuisance parameter). As an average value, there is nothing wrong with non-integer numbers. Also, it is necessary to remember that this is a proposed heuristic: it is not the exact Bayesian solution to the problem, but an approximation to it. In the case of o=0.499, it is reasonable to assume that people would interpret it as basically 50%. In that sense, what the proposed behavior for n says is that, around 50%, the sample size is estimated to be around n=2; around o=0.33, n is approximately 3; and so on.
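The pieces above can be combined into a short numerical sketch (function names are ours): the posterior mean of Equation 2, the sample-size guess n = 1/min(o, 1−o), and the resulting weighting function, which reproduces the a=b=1 values reported later in the text (0.375 for o=0.3 and 0.625 for o=0.7).

```python
def posterior_mean(o, n, a=1.0, b=1.0):
    """Equation 2: E[p | o] for a Beta(a, b) prior and s = n*o successes in n trials."""
    return (a + n * o) / (a + b + n)

def sample_size(o):
    """Heuristic sample-size guess n = 1/min(o, 1 - o)."""
    return 1.0 / min(o, 1.0 - o)

def w(o):
    """APT weighting function: Equation 2 evaluated at the heuristic sample size."""
    return posterior_mean(o, sample_size(o))
```

For example, w(0.3) ≈ 0.375 and w(0.7) ≈ 0.625, while sample_size(0.5) = 2 and sample_size(0.33) ≈ 3, matching the estimates discussed above.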

That is, for real problems where only conditional independence exists, the result is not the correct 25% of the situation where p is known to be 0.5 with certainty. Of course, if the uncertainty in the prior were smaller, the result would become closer to 25%. Furthermore, if the conditional independence assumption is also dropped, the predicted results can become even closer to the observed behavior. In many situations, especially when little is known about a system, even conditional independence might be too strong an assumption. Suppose, for example, that our ancestors needed to evaluate the probability of finding predators at the river they used to get water from. If a rational man had a uniform prior distribution (a=b=1, with average a/(a+b)=1/2) for the chance the predator would be there and, after that, a single observation was made, one hour earlier, in which the predator was actually seen, the posterior would have a=2 and b=1; that is, the average probability would now be 2/3. However, if he wanted to return to the river only one hour later, the events would not really be conditionally independent, as the same predator might still be there. The existence of correlation between the observations implies that the earlier sighting of the predator should increase the probability of observing it there again. Real problems are not as simple as complete independence would suggest. Therefore, a good heuristic is not one that simply multiplies probabilities. When probabilistic values are not known for sure, true independence does not exist, only conditional independence remains, and the heuristic should model that. If our heuristics are also built to include the possibility of dependent events, they might be more successful for real problems. However, they would fail more seriously in the usual laboratory experiments.
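A quick calculation shows both effects described above. For two conditionally independent events sharing an uncertain chance p ~ Beta(a, b), the probability that both occur is E[p²], which exceeds E[p]² = 0.25 under a uniform prior; and a single predator sighting updates a uniform prior to a Beta(2, 1), with mean 2/3. The function and variable names below are ours, for illustration:

```python
def conjunction_mean(a, b):
    """E[p^2] for p ~ Beta(a, b): the chance that two conditionally
    independent events both happen when their common chance p is uncertain."""
    return a * (a + 1) / ((a + b) * (a + b + 1))

# Uniform prior (a = b = 1): E[p^2] = 1/3, noticeably above the "correct" 0.25.
both = conjunction_mean(1, 1)

# Predator example: uniform Beta(1, 1) prior plus one sighting gives Beta(2, 1).
a_post, b_post = 1 + 1, 1  # one success, zero failures
predator_mean = a_post / (a_post + b_post)  # 2/3
```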
This means that the observed estimate for the conjunctive example in Cohen, around 45%, can be at least partially explained as someone trying to make inferences when independence, or even conditional independence, does not necessarily hold. It is important to keep in mind that our ancestors had to deal with a world they did not know how to describe and model as well as we do nowadays. It would make sense for a successful heuristic to include learning about the systems it was applied to. The notion of independent sampling for similar events might not be natural in many cases, and our minds might be better equipped by not assuming independence. When faced with the same situation repeatedly, not only can the previous result be used as evidence about the next ones, but some covariance between the results may also have existed, and this might be the origin of the conjunctive-events bias. Of course, this does not mean that people are aware of that, nor that our minds perform all the analysis proposed here. The actual calculations can be performed following different guidelines. All that is actually required is that, most of the time, they provide answers that are close to the correct ones.
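To illustrate how dropping conditional independence pushes the conjunction further up, consider a sketch that is entirely ours, with a hypothetical residual correlation rho: given p, each event occurs with probability p and the pair has correlation rho, so P(both | p) = p² + rho·p(1−p). Averaged over a uniform prior, this gives 1/3 + rho/6, which approaches the observed 45% for sizeable correlations.

```python
def correlated_conjunction(rho, a=1.0, b=1.0):
    """E[p^2 + rho * p * (1 - p)] for p ~ Beta(a, b): probability that two
    events with residual correlation rho (given p) both occur.
    The parameter rho is hypothetical, chosen only for illustration."""
    n = a + b
    e_p2 = a * (a + 1) / (n * (n + 1))   # E[p^2]
    e_p1p = a * b / (n * (n + 1))        # E[p(1 - p)]
    return e_p2 + rho * e_p1p

# With no correlation: 1/3.  With rho = 0.7: 1/3 + 0.7/6 = 0.45,
# close to the roughly 45% reported for the conjunctive experiments.
```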

             | a=b=1 | a=1 and b=e-1
o=0.3, γ=1   | 0.375 | 0.330
o=0.3, γ=0.3 | 0.416 | 0.344
o=0.7, γ=1   | 0.625 | 0.551
o=0.7, γ=0.3 | 0.584 | 0.483
