Variation of Recatch Example

Originally posted to http://www.howtomeasureanything.com/forums/ on Monday, July 13, 2009 2:13:07 PM.

“I would love to see an example that builds on the idea of estimating the population of all prospective clients, using a sampling method similar to the recatch example. Could you do it for me?

Best regards

Adam”

We might need more details to work out the specific mechanics of this one, but we can discuss the concept. First, it is worth pointing out that the recatch example is just a way of using two independent sampling methods and comparing the overlap. In the case of the fish in the lake, the sampling methods were sequential (one was done after the other) and the overlap of the samples was determined by the tags that were left on the first sample of fish. Then when the second sample of fish was gathered, the proportion of that sample with tags would show how many fish were caught in both samples. From this and knowledge of each sample size, the entire population could be estimated.

But we don’t have to think of this as being sequential sampling where the first sampling leaves a mark of some kind (e.g. the tags on the fish) so that we see the overlap in the second sample. We can also run samples at the same time as long as we can identify individuals. People are simple enough to identify (since they have names, unique email addresses, etc.) so we don’t have to “tag” them between samples. (This is convenient, since I find that people rarely sit still while I try to apply the tag gun to their ear lobe.)

So if we had two independent sources attempt to identify prospects out of a population pool we could estimate the size of the prospect population. If two independent teams were using two different methods (perhaps two different phone surveyors or two different teams surveying people in malls), and if identification is captured, then the two teams could compare notes after the survey and determine how many individuals came up in both surveys.
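As a rough sketch of the arithmetic, the standard capture-recapture estimate (often called the Lincoln-Petersen estimator) multiplies the two sample sizes and divides by the overlap. The email addresses below are made up for illustration, and the function name is my own, not something from the book.

```python
def estimate_population(sample_a, sample_b):
    """Capture-recapture estimate: (size of A * size of B) / overlap.

    Assumes the two samples are independent and that individuals can be
    matched reliably (here, by email address).
    """
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("No overlap between samples; cannot estimate.")
    return len(set(sample_a)) * len(set(sample_b)) / overlap


# Hypothetical prospects identified by two independent survey teams.
team_a = {"ann@example.com", "bob@example.com", "cho@example.com",
          "dee@example.com", "eli@example.com", "fay@example.com"}
team_b = {"bob@example.com", "eli@example.com", "gus@example.com",
          "hal@example.com", "ida@example.com"}

estimate = estimate_population(team_a, team_b)
print(f"Estimated prospect population: {estimate:.0f}")
```

With samples this small the error would be very large; the point is only to show how the overlap between the two lists drives the estimate.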

The trick would be to find sampling methods that were truly independent of each other and the target population. If the population was “prospects in the city of Houston” and the sampling methods were mall surveys, then we should consider the possibility that not all prospects are equally likely to visit malls. If both survey methods were biased in the same way (tending to sample the same small subset of the target population), then the “recatch” method would underestimate the population size. If we used two completely different sampling methods (one mall survey and one phone survey) and the two methods were biased in a way that made prospects in one method less likely to be found by the other method, then the method will overestimate the total population.
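A small simulation can illustrate both directions of error. Everything here is an assumption made for illustration: a true population of 10,000 prospects, 2,000 of whom are "mall-goers", and samples of 500 per team. It is a sketch of the bias argument, not a model of any real survey.

```python
import random


def estimate(sample_a, sample_b):
    """Capture-recapture estimate from two samples of unique IDs."""
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        return float("inf")  # no overlap: the estimate is unbounded
    return len(sample_a) * len(sample_b) / overlap


rng = random.Random(42)          # fixed seed so the run is repeatable
population = range(10_000)       # hypothetical true prospect population
mall_goers = range(2_000)        # hypothetical subset that visits malls
non_mall_goers = range(2_000, 10_000)

# Case 1: both teams sample the whole population at random.
unbiased = estimate(rng.sample(population, 500),
                    rng.sample(population, 500))

# Case 2: both teams only ever reach mall-goers (same bias).
shared_bias = estimate(rng.sample(mall_goers, 500),
                       rng.sample(mall_goers, 500))

# Case 3: team A reaches only mall-goers, while team B's phone survey
# mostly reaches people who do not visit malls (opposing bias).
phone_sample = rng.sample(non_mall_goers, 450) + rng.sample(mall_goers, 50)
opposing_bias = estimate(rng.sample(mall_goers, 500), phone_sample)

print(f"Unbiased samples: {unbiased:,.0f}")
print(f"Shared bias:      {shared_bias:,.0f}")
print(f"Opposing bias:    {opposing_bias:,.0f}")
```

Under these assumptions, the shared-bias estimate lands near the size of the mall-going subset rather than the whole population, and the opposing-bias estimate lands well above the true 10,000.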

As you can see, there are many variations on this method and each has challenges. The error could be high but, as I point out in the book, if it told you more than you knew before, then it can be a useful measurement.

Thanks,

Doug Hubbard

Random Power Law Generator Example?

Originally posted on http://www.howtomeasureanything.com/forums/ on Monday, June 22, 2009 4:33:03 AM.

“Greetings!

On page 187 there is a claim that the author has generated an example of a random power law generator, but I can’t find it among the examples. Can someone who has found the example in the downloads help me with this problem?

Thanks,

Markus Kantor”

Thanks for your question. I had it up briefly and discovered a flaw in the automatic histogram generation. I’m out of the country, but I’ll have the power law generator up by the end of June (much sooner if I find the time before I head back to the US).

Thanks for your patience.

Measuring the Gross Margin of Specific Projects

This question was originally posted on Friday, June 12, 2009 10:47:57 AM by mzaret20000 on http://www.howtomeasureanything.com/forums/.

“Hi Douglas,

I have been asked to provide the gross margin for client engagements. My company is a recruiting firm that operates like an internal recruiting team (meaning that we charge fixed rates as we are doing the work regardless of outcome rather than billing a percentage of compensation hired). Most of our engagements are small so they aren’t profitable in the absence of other engagements. I’m wondering what your strategy would be to make this type of measurement.”

Thanks for your post. At first glance, your problem seems like it could just be a matter of accounting procedures (e.g. revenue minus expenses and divided by revenue with consideration for issues like how you allocate marketing costs across projects, etc.) but let me presume you might mean something that is more complex. It might not be what I strictly call a “measurement” since it sounds like you probably have the accounting data you need already and you probably do not actually need to make additional observations to calculate it. This is more of a calculation based on given data and the issue is more about what it is you really want to compute.

Since you mentioned that your projects are small and not profitable in the absence of other engagements, perhaps what you really want to compute is some kind of break-even point based on fixed costs and marginal costs of doing business. But even that’s a guess on my part.
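To make the two calculations concrete, here is a minimal sketch. All dollar figures are invented, and treating every engagement as having the same fee and the same marginal cost is a simplifying assumption; how costs are actually allocated across projects is exactly the accounting question raised above.

```python
import math


def gross_margin(revenue, direct_costs):
    """Gross margin as a fraction of revenue: (revenue - costs) / revenue."""
    return (revenue - direct_costs) / revenue


def break_even_engagements(fixed_costs, fee, marginal_cost):
    """Number of engagements needed for contributions to cover fixed costs."""
    contribution = fee - marginal_cost
    if contribution <= 0:
        raise ValueError("Each engagement loses money; no break-even exists.")
    return math.ceil(fixed_costs / contribution)


# Hypothetical figures for one typical engagement and one year of overhead.
fee_per_engagement = 20_000
marginal_cost_per_engagement = 12_000
annual_fixed_costs = 120_000

margin = gross_margin(fee_per_engagement, marginal_cost_per_engagement)
needed = break_even_engagements(annual_fixed_costs, fee_per_engagement,
                                marginal_cost_per_engagement)

print(f"Gross margin per engagement: {margin:.0%}")
print(f"Engagements needed to break even: {needed}")
```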

Perhaps you could describe why you need to know this. This is a typical question I ask. What decision could be different given this information? What actions of what parties would this information guide? Once you define that I find that the measurement problem is generally much clearer.

Pass/Fail Questions

Originally posted on http://www.howtomeasureanything.com/forums/ on Wednesday, July 08, 2009 2:46:05 PM by

“Hi Doug,

I want to share an observation a V.P. made after doing the 10 pass fail questions. If one was to input 50% confidence to all the questions and randomly selected T/F they would be correct 1/2 the time the difference would be 2.5.

The scoring would indicate that that person was probably overconfident. Can you help here?

I am considering making the difference between the overall series of answers (as a decimal) and the correct answers (as a decimal) as needing to be greater than 2.5 for someone to be probably overconfident.

Please advise.

Thanks in advance – Hugh”

Yes, that is a way to “game the system” and the simple scoring method I show would indicate the person was well calibrated (but not very informed about the topic of the questions). It is also possible to game the 90% CI questions by simply creating absurdly large ranges for 90% of the questions and ranges we know to be wrong for 10% of them. That way, they would always get 90% of the answers within their ranges.

If the test-takers were, say, students, who simply wanted to appear calibrated for the purpose of a grade, then I would not be surprised if they tried to game the system this way. But we assume that most people who want to get calibrated realize they are developing a skill they will need to apply in the real world. In such cases they know they really aren’t helping themselves by doing anything other than putting their best calibrated estimates on each individual question.

However, there are also ways to counter system-gaming even in situations where the test taker has no motivation whatsoever to actually learn how to apply probabilities realistically. In the next edition of How to Measure Anything I will discuss methods like the “Brier Score” which would penalize anyone who simply flipped a coin on each true/false question and answered them all as 50% confident. Under a Brier-type scoring rule scaled so that higher is better, the test taker would have gotten a higher score if they put higher probabilities on questions they thought they had a good chance of getting right. Simply flipping a coin to answer all the questions on a T/F test and calling them each 50% confident produces a score of zero on that scale (in the conventional Brier score, where lower is better, the same strategy scores 0.25 on every question).
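A minimal sketch of the idea follows. It uses the conventional Brier score (the mean squared difference between stated confidence and the outcome, lower is better) and then rescales it against an all-50% baseline so that higher is better and coin-flipping earns zero. That rescaling is my assumption about the scale being described, not a formula taken from the book, and the confidence values are made up.

```python
def brier_score(confidences, outcomes):
    """Conventional Brier score: mean of (confidence - outcome)^2.

    confidences: stated probability that each chosen answer is correct.
    outcomes: 1 if the chosen answer was correct, 0 otherwise.
    Lower is better; 0 is a perfect score.
    """
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


def skill_vs_coin_flip(confidences, outcomes):
    """Rescaled score: 1 - Brier / 0.25, where 0.25 is the Brier score
    of answering everything at 50%. Higher is better; coin-flipping = 0."""
    return 1 - brier_score(confidences, outcomes) / 0.25


# Someone who flips a coin and calls every answer 50% confident.
coin_conf = [0.5] * 10
coin_hits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Someone who states higher confidence where they know more (hypothetical).
informed_conf = [0.9, 0.9, 0.8, 0.7, 0.6]
informed_hits = [1, 1, 1, 1, 0]

print(skill_vs_coin_flip(coin_conf, coin_hits))
print(round(brier_score(informed_conf, informed_hits), 3))
print(round(skill_vs_coin_flip(informed_conf, informed_hits), 3))
```

Note that the coin-flipper's score does not depend on how many answers happen to be right: at 50% confidence every question contributes the same 0.25, so there is nothing to gain from the strategy.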

Thanks for your interest,

Doug Hubbard

Teaching Graduate Students Calibration

Originally posted on http://www.howtomeasureanything.com/forums/ on Thursday, April 30, 2009 10:42:04 AM, by djr.

“Let me start off by saying, I really appreciate this book and have found it very useful. I enjoyed the calibration exercises and decided to include them in a semester class on decision analysis I’ve just finished with graduate students. Unfortunately it didn’t work out as I hoped.

While I saw progress for the group as a whole initially, only a few of the 27 students even neared the 90% target and when we took a break for 5 weeks, the skills slipped. Furthermore, the students who got close primarily did so by using such extreme ranges, that they (and I) felt the conclusion was they didn’t really know much about the questions. I sought to see if students who felt they were comfortable working with numbers did better, but they did not. Students who self-described themselves as analytical did somewhat better but it was not a strong relationship. Nevertheless, the students indicated for the most part they liked the exercises. It helped them realize they were overconfident and it made them think about estimation and uncertainty. However, making progress on getting more calibrated for the most part eluded them. I recognize that unlike the scenarios you described in the book, these students are not in the business of estimation and indeed many of them are quite possibly averse to such estimating. But I argued they all nevertheless would estimate in their professional careers (public and nonprofit management).

I’m planning on doing this again but I wanted to pose two questions.

1. One strategy for “getting calibrated” at the 90% level is to choose increasingly wider ranges, even to the point where they seem ridiculous. For example on a question about the height of Mt. Everest in miles above sea level, one student put 0.1 miles to 100,000 miles. While strictly speaking this was a range that captured the true value, its usefulness as an uncertainty range is probably approaching zero. However from the students’ perspectives, answering in this way was getting them closer to the 90% confidence range that I was pushing on them. (Even with such ranges, many students were still at 50-70%.) What would your response be to this strategy if you saw it being used and what might I as an instructor suggest to improve this? Is the conclusion to be left with, you don’t know anything if you have to choose wide ranges? Are there other measures we should combine with this such as the width of the confidence intervals? Are there other mental exercises besides those in the book that might help?

2. While students did not do well on the 90% confidence interval questions, they did do fairly well on true/false questions where they then estimated their degree of confidence. More than three-fourths of the class did get within ten percent of their estimated level of confidence by the second true/false trial (though these came after several 90% confidence interval exercises as well). At the same time, students’ average confidence level for individual questions did not correlate at all with the percent of the students who correctly guessed true/false. In the book there was no discussion of improvements or accuracy with the true/false type estimation questions and I wondered if you had any observations to offer on why this seemed easier and students were better on this type of estimation. In your experience, are these types of calibrations more/less effective or representative? Should they be very different from the 90% confidence intervals in terms of calibration?

Again, great book that I think could almost be a course for my students as it is.”

Thank you for this report on your experiences. Please feel free to continue to post your progress on calibration so we can all share in the feedback. I am building a “library” of findings from different people and I would very much like to add your results to the list. I am especially interested in how you asked students to describe themselves as analytical vs. those who did not. Please feel free to send details on those results or to call or email me directly. Also, since I now have two books out discussing calibration, please let me know which book you are referring to.

I item-tested these questions for the general business, government, analyst and management crowd. Perhaps that is one reason for the difference in perceived difficulty, but I doubt that alone would make up for the results you see. My experience is that about 70% of people achieve calibration after 5 tests. We might be emphasizing different calibration strategies. Here are my responses to your two questions:

1. We need to be sure to explain that “under-confidence” is just as undesirable for assessing uncertainty as overconfidence. I doubt that student really believed Mt. Everest was several times larger than the diameter of the Earth, but if he/she literally had no sense of scale, I suppose that is possible. It is more likely that they didn’t really believe Mt. Everest could be 100,000 miles high or even 10,000 miles high. Remember to apply the equivalent bet. I suspect that person believed they had nearly a 100% chance of getting the answer within the range, not 90%. They should answer the questions such that they allow themselves a 5% chance that the true value is above the upper bound and a 5% chance it is below the lower bound. But if this truly is their range that best represents their honest uncertainty, then you are correct – they are telling you they have a lot of uncertainty and the ends of that range are not really that absurd to them.

2. Yes, they always appear to get calibrated on the binary questions first. But I do discuss how to improve the true/false questions. Remember that the “equivalent bet” can apply to true/false questions as well. Furthermore, repetition and feedback is a strategy for improving on either ranges or true/false questions. Finally, the corrective strategy against “anchoring” involves treating each range question as two binary questions (the anchoring phenomenon may be a key reason why ranges are harder to calibrate than true/false questions). When answering range questions, many people first think of one number, then add or subtract another “error” value to get a range. This tends to result in narrower – and overconfident – ranges. As an alternative strategy, ask the students to make the lower bound such that they could say they would answer “True with 95% confidence” to the question “Is the true value above the lower bound?” This seems to significantly reduce overconfidence in ranges.

Thanks for this information and feel free to send detailed records of your observations. I may even be able to incorporate your observations in the second edition of the How to Measure Anything book (which I’m currently writing).

Thanks,

Doug Hubbard

The Statistics Behind the Calibration Scores

Originally posted on http://www.howtomeasureanything.com/forums/ on Thursday, April 30, 2009 6:20:57 AM.

“Hi Douglas,

I want to thank you for your work in this area. Using the information in your book I used Minitab 15 and created an attribute agreement analysis plot. The master has 10 correct and I then plotted 9,8,7,6,5,4,3,2,1,0. From that I can see the overconfidence limits you refer to in the book. Based on the graph there does not appear to be an ability to state if someone is under-confident. Do you agree?

Can you assist me in the origin of the second portion of the test where you use the figure of -2.5 as part of the calculation in under-confidence?
I want to use the questionnaire as part of Black Belt training for development. I anticipate that someone will ask how the limits are generated and would like to be prepared.

Thanks in advance – Hugh”

The figure of 2.5 is based on an average of how confidently people answer the questions. We use a binomial distribution to work out the probability of just being unlucky when you answer. For example, if you are well-calibrated, and you answer with an average of 85% confidence (expecting to get 8.5 out of 10 correct), then there is about a 5% chance of getting 6 or fewer correct (cumulative). In other words, at that level it is more likely that you were not just unlucky, but actually overconfident.
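The 85% example can be checked directly with the binomial distribution. This sketch treats all 10 answers as if each had the same 85% chance of being right, which is the simplification the example above relies on; the function name is mine.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n trials with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Chance a calibrated person averaging 85% confidence gets 6 or fewer of 10.
p_unlucky = binomial_cdf(6, 10, 0.85)
print(f"{p_unlucky:.4f}")  # about 0.0500
```

Here the expected score is 8.5 and the cutoff is 6, a gap of 2.5, which is where the rule of thumb comes from at this confidence level.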

I took a full distribution of how people answer these questions. Some say they are an average of 70% confident, some say 90%, and so on. Each one has a different level for which there is a 5% chance that the person was just unlucky as opposed to overconfident. But given the average of how most people answer these questions, having a difference larger than 2.5 out of 10 between the expected and actual number correct means that there is generally less than a 5% chance a calibrated person would just be unlucky.

It’s a rule of thumb. A larger number of questions and a specific set of answered probabilities would allow us to compute this more accurately for an individual.
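One way to do that individual calculation, sketched here as an assumption rather than as the method used for the book, is to build the exact distribution of the number correct from each stated confidence (sometimes called a Poisson binomial distribution). The stated confidences below are hypothetical.

```python
def correct_count_distribution(confidences):
    """Return a list where entry k is P(exactly k answers correct),
    assuming the person is perfectly calibrated on every question."""
    dist = [1.0]  # before any question: 0 correct with certainty
    for p in confidences:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)   # this answer turns out wrong
            nxt[k + 1] += prob * p     # this answer turns out right
        dist = nxt
    return dist


def prob_at_most(confidences, correct):
    """Chance a calibrated person gets `correct` or fewer answers right."""
    return sum(correct_count_distribution(confidences)[: correct + 1])


# Hypothetical confidences one person stated on a 10-question test.
stated = [0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.0]
expected_correct = sum(stated)
actual_correct = 5

p_unlucky = prob_at_most(stated, actual_correct)
print(f"Expected {expected_correct:.1f} correct, got {actual_correct}")
print(f"Chance of doing this badly if calibrated: {p_unlucky:.3f}")
```

If that probability falls below 5%, overconfidence is a better explanation than bad luck for this particular person, without relying on the 2.5 rule of thumb.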

Thanks,

Doug