
Score with scoring rules

Filed in Encyclopedia, Ideas, R, Research News, Tools


We have all been there. You are running an experiment in which you would like participants to tell you what they believe. In particular, you’d like them to tell you what they believe to be the probability that an event will occur.

Normally, you would ask them. But come on, this is 2009. Are you going to leave yourself exposed to the slings and arrows of experimental economists? You need to give your participants an incentive to tell you what they really believe, right?

Enter the scoring rule. You pay off the subjects based on the accuracy of the probabilities they state. You do this by observing some outcome (let’s say “rain”) and you pay a lot of money to the people who assigned a high probability to it raining and you pay a little money (or even impose a fine upon) those who assigned a low probability to it raining. A so-called “proper” scoring rule is one in which people will do the best for themselves if they state what they truly believe to be the case.

Three popular proper scoring rules are the Spherical, Quadratic, and Logarithmic. Let’s see how they work.

Suppose in your experimental task you give people the title of a movie, and they have to guess what year the movie was released.  You tell them at the outset that the movie was released between 1980 and 1999: that’s 20 years. So you have these 20 categories (years) and you want people to assign a probability to each year. Afterwards, you will pay them out based on the actual year the movie was released and the probability they assigned to that year.

Let r be the vector of 20 probabilities, and r_1 could be the probability they assign to 1980 being the year of release, and r_2 the probability that it was 1981, so on through r_20 for 1999’s probability. Naturally, all the r’s add up to one, as probabilities like to do. Now, let r_i be the probability they assign to the year which turns out to be correct.

Under the Spherical scoring rule, their payout would be r_i / (r*r)^.5, where r*r is the dot product of r with itself, that is, the sum of the squared probabilities.

Under the Quadratic scoring rule, the payout would be 2*r_i - r*r

Under the Logarithmic scoring rule, the payout would be ln(r_i)
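For readers who prefer code to formulas, here is a minimal R sketch of the three rules as written above. The function names and the example forecast (0.62 on 1985, 0.02 on every other year) are made up for illustration; this is not the code behind the simulation.

```r
# r: vector of stated probabilities (sums to 1)
# i: index of the category that turned out to be correct
spherical_score   <- function(r, i) r[i] / sqrt(sum(r^2))
quadratic_score   <- function(r, i) 2 * r[i] - sum(r^2)
logarithmic_score <- function(r, i) log(r[i])

# Hypothetical forecast over the 20 years 1980..1999
years <- 1980:1999
r <- rep(0.02, length(years))   # 0.02 on every year ...
r[years == 1985] <- 0.62        # ... except 0.62 on 1985

hit  <- which(years == 1985)    # a year given high probability
miss <- which(years == 1990)    # a year given low probability

# Payouts if the high-probability year is correct vs. a low-probability year
scores <- rbind(
  hit  = c(spherical   = spherical_score(r, hit),
           quadratic   = quadratic_score(r, hit),
           logarithmic = logarithmic_score(r, hit)),
  miss = c(spherical   = spherical_score(r, miss),
           quadratic   = quadratic_score(r, miss),
           logarithmic = logarithmic_score(r, miss))
)
print(round(scores, 3))
```

Under all three rules the "hit" row comes out higher than the "miss" row, and only the logarithmic rule is unbounded below.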

In the movie above, the top row shows various sets of probabilities someone might assign to the 20 years. (Imagine the categories along the x-axis are the years 1980 to 1999).  Each bar in the graphs in the bottom three rows shows the person’s payout if that year turns out to be correct, based on the probabilities assigned to each year in the top row.

As you can see, when they assign a high probability to a category and it turns out to be correct, their payout is high. When they assign a low probability to a category and it turns out to be correct, their payout is low.
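If high probabilities earn high payouts, why not just pile all the probability on your best guess? With a proper rule that does not pay in expectation. The R sketch below checks this numerically for the Quadratic rule in the simplest case, a two-outcome event (rain or no rain) where the participant's true belief is assumed to be 0.7. The linear rule (payout equal to the probability placed on the outcome) is not one of the three rules in this post; it is included only as a contrast, since it is a standard example of a rule that is not proper.

```r
# Binary event (rain / no rain); the participant truly believes P(rain) = 0.7.
p <- 0.7
q <- seq(0, 1, by = 0.01)   # candidate reports for P(rain)

# Quadratic rule payouts for a report q, by outcome: 2*r_i - r*r
sum_sq          <- q^2 + (1 - q)^2
quad_if_rain    <- 2 * q       - sum_sq
quad_if_no_rain <- 2 * (1 - q) - sum_sq

# Expected payout, taken under the participant's own belief p
expected_quad <- p * quad_if_rain + (1 - p) * quad_if_no_rain

# Linear rule (payout = probability placed on the outcome), for contrast only
expected_linear <- p * q + (1 - p) * (1 - q)

best_quad   <- q[which.max(expected_quad)]    # report that maximizes the quadratic rule
best_linear <- q[which.max(expected_linear)]  # report that maximizes the linear rule

cat("Best report under quadratic rule:", best_quad, "\n")
cat("Best report under linear rule:   ", best_linear, "\n")
```

Under the quadratic rule the best report is the true belief; under the linear rule the participant does best by exaggerating all the way to 1.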

You’ll notice that the Logarithmic scoring rule goes right off the bottom of the page. This is because the logs of small probabilities are negative numbers far beneath zero, and the log of 0 is negative infinity!

While I was at Stanford I heard that decision scientist extraordinaire Ron Howard (no relation) used to make students assign probabilities to the alternatives (A, B, C or D) on the multiple choice items on the final exam. The score for each question was the log of the probability they assigned to the correct answer. This means, of course, that if you assign a probability of 0 to alternative “B” and alternative “B” turns out to be correct, your score on that question is negative infinity. I always wondered if you got a negative infinity on one question if it meant you got negative infinity on the exam, or if there was some mercy clause.

But the main reason I am writing this post is because I wonder what experimental economists and psychologists are supposed to do when implementing log scoring rules in the lab. Naturally, you can endow the participant with cash at the beginning of the experiment and have them draw down with each question, but what do you do if they score a negative infinity? Take their life savings?

Winkler (1971) decided that he would treat probabilities less than .001 as .001 when it came time to impose the penalty. Does anyone know of other methods?
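A floor like that is a one-liner in R. This is only a sketch of the idea as described above, not Winkler's actual procedure; the function name, the `floor` argument and the four-alternative example are assumptions made for illustration.

```r
# Log score with a floor: stated probabilities below `floor` are scored as `floor`
floored_log_score <- function(r, i, floor = 0.001) {
  log(max(r[i], floor))
}

# Hypothetical answer to a four-alternative multiple choice item
r <- c(A = 0.9, B = 0, C = 0.1, D = 0)

log(r[["B"]])               # plain log score if B is correct: -Inf
floored_log_score(r, "B")   # floored score if B is correct: log(0.001)
floored_log_score(r, "A")   # unaffected when the probability is above the floor
```

With a floor of .001 the worst possible score on any one question is log(.001), roughly -6.9, so an endowment can be sized to cover it.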


Robert L. Winkler (1971). Probabilistic Prediction: Some Experimental Results. Journal of the American Statistical Association, Vol. 66, No. 336, pp. 675-685.


To make this simulation, I’ve drawn on the top row various beta distributions of differing modes between two fixed endpoints. This is akin to having a min and a max guess for the year of release, then entertaining various years between those two endpoints as most likely.
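A rough R sketch of that idea: pick a mode, build a beta distribution with that mode, and chop it into 20 bins, one per year. The mode (0.275, which lands in 1985) and the concentration n = 20 are hypothetical values chosen for illustration, and fixing n by hand is a simplification rather than a reproduction of how the simulation was built.

```r
years <- 1980:1999
mode  <- 0.275   # hypothetical: most likely point, as a fraction of the way from 1980 to 2000
n     <- 20      # hypothetical concentration (a + b); larger n gives a narrower peak

# Beta parameters with the requested mode, using mode = (a - 1) / (a + b - 2)
a <- 1 + mode * (n - 2)
b <- 1 + (1 - mode) * (n - 2)

# Probability mass in each of 20 equal-width bins on [0, 1], one bin per year
breaks <- seq(0, 1, length.out = length(years) + 1)
r <- diff(pbeta(breaks, a, b))
names(r) <- years

print(round(r, 3))
cat("Most likely year:", names(r)[which.max(r)], "\n")
```

Sliding `mode` between the two endpoints and redrawing gives a sequence of forecasts like the ones in the top row.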


  1. Alan Schwartz says:

    It’s obviously not mathematically perfect, but you could say that in this case, the study is over and they receive the lowest possible payout or highest possible loss. That’s as close to budget ruin as you can come.

    Of course, what you’re teaching them is to never assign a 0 probability to anything; the cost of assigning a tiny probability > 0, even if they truly believe the probability is 0, is like a minuscule insurance payment, and if people are at all risk averse, you’ll have convinced them to misstate their true beliefs, albeit in a tiny way – unless, of course, you believe that what constitutes a “true belief” is your willingness to risk everything for it – a logically consistent position, but a psychologically implausible one.

    July 21, 2009 @ 10:52 pm

  2. dan says:

    Very interesting post, Alan.

    One thing I think about is the following. Suppose research suggests (which it kind of does) that when you ask people for a 95% confidence interval, you get a very reasonable 50% confidence interval. If this is a pretty consistent finding a researcher might finally admit, ok, though it’s not the statistical definition, for whatever reason 95% confidence interval is “human” for 50% confidence interval. In this case, when eliciting a 95% confidence interval, perhaps the right thing to do in consulting settings is treat it as a 50% CI. I reckon that in the simulation above, if I treated the endpoints of the 95% CI as the endpoints of a 50% CI, there’d be very few cases in which 0% probability would be assigned to a cell.

    On the other hand, to quote you, this is “teaching” them to not learn the true meaning of a 95% CI, which can’t be a good thing.

    July 22, 2009 @ 1:29 pm

  3. dan says:

    I’m going to try something new here. I love R, but sometimes my code isn’t as beautiful as it should be. From now on, I’m going to paste the code behind my simulations so that if anyone sees room to improve it, they can just chime in.

    #This first part is based on Stephanie Kovalchik’s beta.prior code http://skoval.bol.ucla.edu/beta.prior.R

    beta.given.region.mode <- function(lowerq, upperq, region, mode, min=2, max=10000, drawplot=FALSE) {
      # NOTE: the text between "mode" and ">= 0" was lost when this comment was posted.
      # The upper-bound test, the mode error message and the region.check header are reconstructed.
      mode.check <- function(lowerq, upperq, mode) {
        if(!(mode > lowerq & mode < upperq)) {
          stop("Mode must be between lowerq and upperq")
        }
      }
      region.check <- function(region) {
        if(!(region >= 0 & region <= 1)) {
          stop("Region must be between 0 and 1")
        }
      }
      abcheck <- function(a, b) {
        if(a < 1 || b < 1) {
          stop("a or b was less than 1")
        }
      }
      mode.check(lowerq, upperq, mode)
      region.check(region)
      #This function will be zeroed when the difference in the pbetas equals the region size
      f <- function(n) {
        a <- 1 + mode*(-2+n)
        b <- -1 - mode*(-2+n) + n
        pbeta(upperq, a, b) - pbeta(lowerq, a, b) - region
      }
      #uniroot searches the interval from min to max (2 to 10000 by default) for the n at which f is zero, i.e. the n that makes the inner region hit the requested size.
      n <- uniroot(f, interval=c(min, max))$root
      a <- 1 + mode*(-2+n)
      b <- -1 - mode*(-2+n) + n
      abcheck(a, b)
      theta <- c(a, b)
      cat(a, b, "\n")
      return(theta)
    }

    ####This is the part that needs work, especially the super-lame pause function!
    pause_x = function (x) { sofar=0 while (sofar

    July 22, 2009 @ 1:46 pm

  4. Michael Smithson says:

    None of the 3 scoring rules take into account the number of alternatives. A rule that does would assign a score of 0 to a probability of 1/n, where n is the number of alternatives, a negative score to a probability 1/n. Here’s an example:
    f(r_i) = min[(r_i - 1/n)/(1 - 1/n), (r_i - 1/n)/(1/n)],
    where r_i is the probability assigned to the ith alternative. We have f(0) = -1, f(1) = 1, and f(1/n) = 0. And of course, the terms inside the min could be raised to a power other than 1.

    October 22, 2009 @ 11:53 pm

  5. Michael Smithson says:

    Hmm, the clause “a negative score to a probability of 1/n” should have read “a negative score to a probability less than 1/n.”

    October 22, 2009 @ 11:56 pm

  6. Giovanni Ciriani says:

    I’m working with a group of forecasters at UPenn, in a decision science project, and I’m pushing for Fair Skill, a log score with offset log(m), where m is the number of mutually exclusive events. Because our forecasts are in percentage points, I assume that a 0 forecast was actually 0.5%, i.e. 0.005; I then normalize the other probabilities to sum up to [1 - (m-1)*0.005]. In this way I minimize the damage of flunking a forecast. The overall results of our research project, by the way, do not change a whole lot if I replace the 0.005 with a 0.001.

    April 8, 2020 @ 2:55 pm
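The rule Michael Smithson gives in comment 4 is easy to try out. Here is a minimal R sketch of it with the power left at 1; the function name and the choice of n = 4 alternatives are assumptions made for illustration.

```r
# Score for the probability r_i placed on the correct alternative,
# out of n alternatives: 0 at r_i = 1/n, -1 at r_i = 0, 1 at r_i = 1
smithson_score <- function(r_i, n) {
  pmin((r_i - 1/n) / (1 - 1/n),
       (r_i - 1/n) / (1/n))
}

n <- 4                           # e.g. alternatives A, B, C, D
probs <- c(0, 0.1, 0.25, 0.5, 1)
print(data.frame(r_i = probs, score = smithson_score(probs, n)))
```

Unlike the log rule, this score is bounded between -1 and 1, so a probability of 0 on the correct alternative costs a fixed amount rather than negative infinity.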
