Originally posted on http://www.howtomeasureanything.com/forums/ on Thursday, April 30, 2009 10:42:04 AM, by djr.
“Let me start off by saying, I really appreciate this book and have found it very useful. I enjoyed the calibration exercises and decided to include them in a semester class on decision analysis I’ve just finished with graduate students. Unfortunately it didn’t work out as I hoped.
While I saw progress for the group as a whole initially, only a few of the 27 students even neared the 90% target and when we took a break for 5 weeks, the skills slipped. Furthermore, the students who got close primarily did so by using such extreme ranges that they (and I) felt the conclusion was they didn’t really know much about the questions. I sought to see if students who felt they were comfortable working with numbers did better, but they did not. Students who described themselves as analytical did somewhat better but it was not a strong relationship. Nevertheless, the students indicated for the most part they liked the exercises. It helped them realize they were overconfident and it made them think about estimation and uncertainty. However, making progress on getting more calibrated for the most part eluded them. I recognize that unlike the scenarios you described in the book, these students are not in the business of estimation and indeed many of them are quite possibly averse to such estimating. But I argued they all nevertheless would estimate in their professional careers (public and nonprofit management).
I’m planning on doing this again but I wanted to pose two questions.
1. One strategy for “getting calibrated” at the 90% level is to choose increasingly wider ranges, even to the point where they seem ridiculous. For example on a question about the height of Mt. Everest in miles above sea level, one student put 0.1 miles to 100,000 miles. While strictly speaking this was a range that captured the true value, its usefulness as an uncertainty range is probably approaching zero. However from the students’ perspectives, answering in this way was getting them closer to the 90% confidence range that I was pushing on them. (Even with such ranges, many students were still at 50-70%.) What would your response be to this strategy if you saw it being used and what might I as an instructor suggest to improve this? Is the conclusion to be left with, you don’t know anything if you have to choose wide ranges? Are there other measures we should combine with this such as the width of the confidence intervals? Are there other mental exercises besides those in the book that might help?
2. While students did not do well on the 90% confidence interval questions, they did do fairly well on true/false questions where they then estimated their degree of confidence. More than three-fourths of the class did get within ten percent of their estimated level of confidence by the second true/false trial (though these came after several 90% confidence interval exercises as well). At the same time, students’ average confidence level for individual questions did not correlate at all with the percent of the students who correctly guessed true/false. In the book there was no discussion of improvements or accuracy with the true/false type estimation questions and I wondered if you had any observations to offer on why this seemed easier and students were better on this type of estimation. In your experience, are these types of calibration more/less effective or representative? Should they be very different from the 90% confidence intervals in terms of calibration?
Again, great book that I think could almost be a course for my students as it is.”
Thank you for this report on your experiences. Please feel free to continue to post your progress on calibration so we can all share in the feedback. I am building a “library” of findings from different people and I would very much like to add your results to the list. I am especially interested in how you asked students to describe themselves as analytical vs. those who did not. Please feel free to send details on those results or to call or email me directly. Also, since I now have two books out discussing calibration, please let me know which book you are referring to.
I item-tested these questions for the general business, government, analyst and management crowd. Perhaps that is one reason for the difference in perceived difficulty, but I doubt that alone would account for the results you see. My experience is that about 70% of people achieve calibration after 5 tests. We might be emphasizing different calibration strategies. Here are my responses to your two questions:
1. We need to be sure to explain that “under-confidence” is just as undesirable for assessing uncertainty as overconfidence. I doubt that student really believed Mt. Everest could be several times larger than the diameter of the Earth, although if he/she literally had no sense of scale, I suppose that is possible. It is more likely that they didn’t really believe Mt. Everest could be 100,000 miles high or even 10,000 miles high. Remember to apply the equivalent bet. I suspect that person believed they had nearly a 100% chance of getting the answer within the range, not 90%. They should answer the questions such that they allow themselves a 5% chance that the true value is above the upper bound and a 5% chance it is below the lower bound. But if this truly is the range that best represents their honest uncertainty, then you are correct: they are telling you they have a lot of uncertainty and the ends of that range are not really that absurd to them.
2. Yes, they always appear to get calibrated on the binary questions first. But I do discuss how to improve the true/false questions. Remember that the “equivalent bet” can apply to true/false questions as well. Furthermore, repetition with feedback is a strategy for improving on either ranges or true/false questions. Finally, the corrective strategy against “anchoring” involves treating each range question as two binary questions (the anchoring phenomenon may be a key reason why ranges are harder to calibrate than true/false questions). When answering range questions, many people first think of one number, then add or subtract another “error” value to get a range. This tends to result in narrower (and overconfident) ranges. As an alternative strategy, ask the students to set the lower bound such that they would answer “True with 95% confidence” to the question “Is the true value above the lower bound?”, and to set the upper bound the same way for the question “Is the true value below the upper bound?” This seems to significantly reduce overconfidence in ranges.
Thanks for this information and feel free to send detailed records of your observations. I may even be able to incorporate your observations in the second edition of the How to Measure Anything book (which I’m currently writing).
Doug Hubbard
Originally posted on http://www.howtomeasureanything.com/forums/ on Thursday, April 30, 2009 6:20:57 AM.
“I want to thank you for your work in this area. Using the information in your book, I used Minitab 15 and created an attribute agreement analysis plot. The master has 10 correct and I then plotted 9, 8, 7, 6, 5, 4, 3, 2, 1, 0. From that I can see the overconfidence limits you refer to in the book. Based on the graph there does not appear to be an ability to state if someone is under-confident. Do you agree?
Can you assist me in the origin of the second portion of the test where you use the figure of -2.5 as part of the calculation in under-confidence?
I want to use the questionnaire as part of Black Belt training for development. I anticipate that someone will ask how the limits are generated and would like to be prepared.
Thanks in advance – Hugh”
The figure of 2.5 is based on an average of how confidently people answer the questions. We use a binomial distribution to work out the probability of just being unlucky when you answer. For example, if you are well-calibrated and your answers average 85% confidence (so you expect to get 8.5 out of 10 correct), then there is about a 5% chance of getting 6 or fewer correct (cumulative). In other words, at that level it is more likely that you were not just unlucky, but actually overconfident.
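The 85% example can be checked with a few lines of Python. This is only a minimal sketch, assuming all 10 answers are independent and each is given at the same 85% confidence (a simplification of "an average of 85%"); the function name `binomial_cdf` is made up for illustration.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_questions = 10
avg_confidence = 0.85  # assumed to apply to every question equally

expected_correct = n_questions * avg_confidence  # 8.5
p_unlucky = binomial_cdf(6, n_questions, avg_confidence)

print(f"Expected correct: {expected_correct:.1f}")
print(f"P(6 or fewer correct if calibrated): {p_unlucky:.3f}")  # about 0.050
```

A score of 6 is 2.5 below the expected 8.5, which is where the gap of 2.5 shows up in this particular case.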
I took a full distribution of how people answer these questions. Some say they are an average of 70% confident, some say 90%, and so on. Each one has a different score at which there is a 5% chance that the person was just unlucky as opposed to overconfident. But given the average of how most people answer these questions, a difference larger than 2.5 out of 10 between the expected and actual number correct means that there is generally less than a 5% chance a calibrated person would just be unlucky.
It’s a rule of thumb. A larger number of questions and a specific set of answered probabilities would allow us to compute this more accurately for an individual.
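For readers who want the per-person version, here is one way it might be computed. This is a sketch under stated assumptions, not the method used to derive the 2.5 rule: it treats each stated confidence as the true chance of being right if the person is calibrated, assumes answers are independent, and builds the distribution of the number correct one question at a time. The function names and the `stated` confidences are hypothetical.

```python
def correct_count_distribution(confidences):
    """Return a list where entry k is the probability of exactly k correct
    answers, given one stated confidence per question."""
    dist = [1.0]  # before any question: 0 correct with certainty
    for p in confidences:
        new_dist = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new_dist[k] += prob * (1 - p)  # this answer is wrong
            new_dist[k + 1] += prob * p    # this answer is right
        dist = new_dist
    return dist


def chance_of_score_or_worse(confidences, n_correct):
    """Probability that a calibrated person gets n_correct or fewer right."""
    dist = correct_count_distribution(confidences)
    return sum(dist[: n_correct + 1])


# Hypothetical stated confidences for one person on a 10-question test.
stated = [0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.0]
actual_correct = 5

print(f"Expected correct: {sum(stated):.1f}")
print(f"P({actual_correct} or fewer correct if calibrated): "
      f"{chance_of_score_or_worse(stated, actual_correct):.4f}")
```

If the resulting probability is below 5%, bad luck is an unlikely explanation for the score. When every stated confidence is the same, this reduces to the ordinary binomial calculation.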