I’m reintroducing the Measurement Challenge for the blog.  I ran it for a couple of years on the old site and had some very interesting posts. 

Use this thread to post comments about the most difficult – or even apparently “impossible” – measurements you can imagine.  I am looking for truly difficult problems that might take more than a couple of rounds of query/response to resolve.  Give it your best shot!

Doug Hubbard

64 comments

  1. bcalcott

    Hi Doug,

    I thought the book was excellent and I am already making practical use of it in the area of project portfolio management, so many thanks.

    One comment I would make concerns the passage in the book describing the thoughts of Stephen J. Gould on the subject of IQ tests. As you probably know, Gould’s argument in “The Mismeasure of Man” was that tests for intelligence were primarily a tool to prove the superiority of one culture or race over another. I think his scepticism about IQ tests was not that they fail as a basis for measurement per se, but that they fail as a measurement of intelligence. Intelligence as a concept is ambiguous at best and, as you suggested at the start of the book, needs to be unpacked (including the motives for requiring measurement in the first place) before it can be meaningful. In the case of an IQ test, what you would be measuring is the ability to solve a particular set of problems. For measuring potential brain damage this would be useful. It would not, however, tell you whether a particular subject was more or less intelligent after the incident.

    Anyhow, fantastic book and more power to your elbow.

    Regards

    Brian Calcott

  2. dwhubbard

    Brian,

    Thanks for your input. But I was refuting Gould specifically regarding his claim that IQ tests are not a measure of intelligence. He stated that the IQ score is nothing more than an “artifact” of the mathematical method used to compute it. This alone is an easily testable and falsifiable claim. If the score really were nothing more than an arbitrary artifact of a calculation, we should not see any correlations between it and other measurable phenomena like income, incarceration, welfare, and so on. And yet we do see correlations between IQ scores and behaviours we would generally associate with high or low IQs.

    IQ tests do, however, have lots of error for all of the reasons Gould lists. But the complete lack of error was never the criterion for a measurement. Even a test that has 90% confidence errors of +/-20 points two thirds of the time and then gives a completely spurious result a third of the time is still a reduction in uncertainty. I think this is where IQ test skeptics miss the point. They seem to be saying that if an IQ test even *can* give a wrong answer then it is measuring nothing. This is a misunderstanding of how the term measurement is used in the sciences.

    What they are not considering is the prior state of uncertainty before a test. Suppose you have persons A and B. You know nothing about them prior to a test. You could only say that either one is equally likely to have higher intelligence than the other. Then they take an IQ test. Person A scores 145 and person B scores 95. Would Gould say that this result does not justify even a 70% probability that A has higher intelligence? How about just 60%? What if the test results were 165 and 80, respectively? Would the chance that A has higher intelligence than B still be only 50%? Suppose we repeated this test on many different pairs of people, and you had to place real-money bets on who would perform better at other tasks associated with intelligence (however it is defined), such as college grades, income, or ability to write. I’m sure you would prefer to keep betting on the person with the higher tested score, especially if the differences were 30 points or more.

    Again, it is important to separate the claim that a measure has a lot of noise and little signal from the claim that it is only noise and no signal. Gould’s statement that IQ is merely an artifact of the method used to calculate it implies he believes the latter. This would mean that no matter how far apart two people were on their IQ test scores, and no matter how many trial pairs of people we used, we would never have any reason to believe one person is even slightly more likely to be more intelligent than the other. When you think it through, this is an extraordinary claim.

    Of course, we have no basis for being certain that a person who scores 125 is actually more intelligent than one who scores 105. Maybe we couldn’t be certain even if the difference were twice as wide. But measurement isn’t about achieving certainty. It’s about reducing uncertainty. For example, would you really bet your own money that pairs of people whose test scores were 40 points apart are indistinguishable from pairs with the same score? How about 60 points apart? How about 100? Is there seriously no point at which the probability that one person in the pair is the more intelligent budges from 50%? I know where I would bet my money on repeated tests.
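    To put rough numbers on these bets, here is a minimal Python sketch. It assumes (hypothetically) that each observed score equals a true score plus independent normal error, sets that error from the +/-20 point 90% range mentioned above, and assumes a flat prior over true scores, which ignores regression toward the population mean. The helper name `prob_a_smarter` is illustrative only.

```python
from statistics import NormalDist

# Hypothetical noise model: observed score = true score + independent normal error.
# A 90% error range of +/-20 points implies sigma = 20 / 1.645.
Z90 = NormalDist().inv_cdf(0.95)   # ~1.645
SIGMA = 20 / Z90                   # ~12.2 points of test error

def prob_a_smarter(score_a: float, score_b: float, sigma: float = SIGMA) -> float:
    """P(true_A > true_B | observed scores), assuming a flat prior on true
    scores and independent normal errors (an illustrative simplification)."""
    # The difference of two independent errors has sd sigma * sqrt(2).
    diff_sd = sigma * 2 ** 0.5
    return NormalDist().cdf((score_a - score_b) / diff_sd)

for a, b in [(105, 105), (125, 105), (145, 95), (165, 80)]:
    print(f"{a} vs {b}: P(A is more intelligent) = {prob_a_smarter(a, b):.3f}")
```

    Under these assumptions, a 20-point gap (125 vs 105) already gives odds of roughly 88%, and a 50-point gap gives well over 99%. A noisier test would lower these figures, but it would not pin them at 50%.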

    Regarding your example, why would it necessarily be the case that IQ tests before and after a brain injury must tell us absolutely nothing? Are you saying that even if the person consistently scored above 150 before an injury and cannot score more than 100 after it, we can still conclude nothing at all about a change in intelligence? Are you saying that even if the difference was twice as much (a 100 point difference) we could still learn nothing? That also seems like an extraordinary claim to make.

    If IQ tests reduce uncertainty even slightly, even just some small percentage of the time, they meet the criterion of a measurement in its strictest mathematical sense. And when it comes to important issues of public policy (such as the example I used where methyl mercury is correlated with reduced IQ points in children), a slight uncertainty reduction is preferable (perhaps the equivalent of millions of dollars preferable) to no uncertainty reduction at all.

    Thanks,
    Doug Hubbard

  3. Model_Math

    Potential for organizational change.
    I am thinking of measuring the level of resistance to accepting change — what’s it going to take to blast an unproductive culture out of behaviors and methods that they freely admit are not working? (This is not academic or frivolous.)

  4. A follow-up on the analysis of IQ as a measurement of intelligence (and then on to a more general point):

    It seems to me that M reducing uncertainty in evaluating V is a rather low hurdle for calling M a measure of V. By this criterion, wouldn’t Income be a measure of Intelligence?

    Clearly, to say Income is a measure of anything other than Income makes the concept of measure unnecessarily confusing and much more prone to be misapplied. In this light, I would say IQ is only a measure of what has been defined as IQ. The question then becomes how well does IQ correlate with what we believe Intelligence to be, which seems to me to be a much more clear debate than whether IQ is a measure of Intelligence.

    The main policy problem with substituting IQ for Intelligence is not so much that the correlation is not 100%, but rather that the correlation seems to vary quite a bit among different populations. In so many cases, the method for deriving a concrete measure that correlates well with an intangible concept is very context sensitive. For example, deriving a concrete measure that correlates well with intelligence in western populations may not correlate well in African populations. Deriving a concrete measure that correlates well with what we think of as Productivity on a factory floor in company A may not correlate well on factory floors in company B or in software development offices in either company. To call the measure Productivity instead of what it directly measures makes it too easy to overlook this contextual factor, and start applying a Productivity measurement where it does not apply.

    This issue becomes clearer if our language for measurement makes explicit that:
    1. We can only directly measure concrete measurements (like IQ, throughput, income, MPG, age, …),
    2. We cannot directly measure intangible concepts (like Intelligence and Productivity),
    3. The book is about how to derive concrete measurements that correlate highly to intangible concepts in given contexts, not how to directly measure intangible concepts (and especially not in a context-independent way).

    Steven Gordon, PhD

  5. dwhubbard

    Steven,

    Thanks for your input. I would argue that whether a measurement is “direct” or not is an ambiguous distinction mathematically, semantically, and at the level of fundamental epistemology. This distinction adds no clarity whatsoever because most, or perhaps all, measures of properties in the physical sciences are not “direct”. They rely on readings of an instrument (e.g. a digital scale, ohmmeter or photometer) and on inferences from related observations (e.g. observing the deflection of a particle in a magnetic field, or the movement of a planet, to measure mass). I would even argue that we only perceive reality indirectly in the first place. All we can do is make observations and draw inferences that ultimately have some utility for practical decisions.

    In the definition I propose for measurement (which is, in fact, consistent with information theory, decision theory and measurement theory) income really is a measure of IQ, which is, in turn, a measure of intelligence. Of course, this would be a measure with a huge amount of error but, on average, estimates based on income would be slightly more accurate than estimates based on no information at all. If you doubt this, let me propose a game that would test whether you really believe your position. Let’s identify 100 people who have had IQ tests representing a wide range of IQs. We will know nothing about them and we will not be told their IQs, except that I will be told their income from last year and you will not. We will sort them into random pairs and we will bet on which person is smarter. You should be indifferent because you would have no information about whether one person has a higher IQ. But since I know their incomes, I will not be indifferent. For each pair, I will bet that the person with the higher income also has the higher IQ. Each time I’m wrong I pay you $100 and each time I win you pay me $100. I would love to play this game for a million people if we could. If I’m right only slightly more often than I’m wrong, then I would make a lot of money off of you. I would even be willing to pay you $500 to play this game with me for 100 people. Now, would you prefer to be the one of us who had the income information? Would you even be willing to pay me for it? You might say you are not a betting person, but I think betting is simply the ultimate test of whether a person really believes some position they hold. You might say this is impractical, but I think we could find a way to make it happen.
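    The game can be simulated. The Python sketch below is a hypothetical illustration, not data: it assumes income (or, more realistically, log income) and IQ are bivariate normal with an assumed correlation `rho`, and it compares a simulated win rate with the closed-form probability 1/2 + arcsin(rho)/pi that two correlated normal differences share the same sign. The function names are illustrative.

```python
import math
import random

def win_prob(rho: float) -> float:
    """Closed form: P(the higher earner in a random pair also has the higher IQ)
    when (income, IQ) is bivariate normal with correlation rho."""
    return 0.5 + math.asin(rho) / math.pi

def simulate_game(rho: float, n_pairs: int, stake: float = 100.0, seed: int = 1):
    """Play the bet n_pairs times; return (win rate, net winnings in dollars)."""
    rng = random.Random(seed)
    noise = math.sqrt(1 - rho ** 2)
    wins = 0
    for _ in range(n_pairs):
        # Two people: standardized IQ plus a standardized income correlated with it.
        iq1, iq2 = rng.gauss(0, 1), rng.gauss(0, 1)
        inc1 = rho * iq1 + noise * rng.gauss(0, 1)
        inc2 = rho * iq2 + noise * rng.gauss(0, 1)
        # Always bet that the higher earner also has the higher IQ.
        if (inc1 > inc2) == (iq1 > iq2):
            wins += 1
    net = stake * (2 * wins - n_pairs)   # +stake per win, -stake per loss
    return wins / n_pairs, net

RHO = 0.3   # hypothetical income-IQ correlation
print(f"closed-form win probability: {win_prob(RHO):.3f}")
rate, net = simulate_game(RHO, n_pairs=50)   # 100 people -> 50 pairs
print(f"one simulated game: won {rate:.0%} of bets, net ${net:,.0f}")
```

    With an assumed correlation of 0.3, the higher earner also has the higher IQ about 60% of the time, which at $100 per bet is worth roughly $19 per pair in expectation, or close to $1,000 over the 50 pairs formed from 100 people. Any single 50-pair game will scatter around that figure.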

    The fact is that even when there is a small correlation between X and Y, knowledge of X slightly reduces your uncertainty about Y. And I would say that any definition of measurement you propose that ignores the reduction in uncertainty would be far more confusing. You would have to resort to a definition that merely describes a procedure regardless of whether the result is informative. And if you said that the procedure only resulted in a measurement if it was informative, while rejecting any uncertainty reduction as the threshold, then you would have to define how much uncertainty reduction is required. Presumably, you would have to define this arbitrary threshold differently for every possible kind of measurement, since error in measurements varies widely among fields.
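    For the jointly normal case this can be stated exactly. Assuming X and Y are bivariate normal with correlation rho, the conditional variance of Y given X shrinks by a factor of (1 - rho^2), which is strictly less than 1 for any nonzero rho:

```latex
\operatorname{Var}(Y \mid X) = \sigma_Y^2\,(1 - \rho^2)
\qquad\Longrightarrow\qquad
\operatorname{sd}(Y \mid X) = \sigma_Y \sqrt{1 - \rho^2}
% Example: \rho = 0.2 gives \sqrt{0.96} \approx 0.980,
% i.e. about a 2% narrower spread for Y (4% less variance).
```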

    Taking your claim “Clearly, to say Income is a measure of anything other than Income makes the concept of measure unnecessarily confusing and much more prone to be misapplied.” to its logical conclusion, you would have to reject the validity of most measures in the physical sciences since, as I said, most are inferences based on indirect observations (nobody has ever “directly” measured the mass of an electron or a star, the age of a rock, or even time itself). Your statement that “IQ is only a measure of IQ” appears to say that if X is correlated with Y, X is still only a measure of X and reveals nothing about the possible values of Y. If this is the case, then you must also hold that the glow of a hot body is a measure only of its glow, not its temperature. You must hold that the depth of a rock formation is only a measure of its depth, not its age. You must hold that a credit score is only a measure of a credit score and says absolutely nothing about whether someone is an acceptable lending risk, meaning that 1,000 people with credit scores under 600 are just as good a credit risk as 1,000 people with scores over 750. There will be exceptions, of course, but given a portfolio of 1,000 people, do you really want to charge the same interest rate to the under-600 group as to the over-750 group?

    You stated: “The main policy problem with substituting IQ for Intelligence is not so much that the correlation is not 100%, but rather that the correlation seems to vary quite a bit among different populations. In so many cases, the method for deriving a concrete measure that correlates well with an intangible concept is very context sensitive. For example, deriving a concrete measure that correlates well with intelligence in western populations may not correlate well in African populations.” How is this any more “context specific” than any other measure I just mentioned? As before, if this context-specificity you speak of were any real obstacle at all to measurement, then most of what we know from science would not be possible. There is no complexity you can think of that does not have a measurement solution. Many of the procedures in the scientific method exist specifically to control for such issues.

    Or perhaps you believe that it is not the “directness” but the mere existence of error that undermines any measurement. There is a common fallacy that the existence of noise means a lack of signal. This would force you to defend yet another position that would be impossible to reconcile with all scientific knowledge. Or perhaps you believe that the existence of noise (i.e. error) in a measurement means that any signal that does exist must not have any utility. This would be impossible to reconcile with both decision theory and common sense. If you knew a coin favored heads just 52% of the time, that knowledge would be worth a lot of money over a large number of flips.
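    The coin example is easy to quantify. This Python sketch assumes even-money bets of a fixed stake on heads and uses a normal approximation to the binomial for the chance of ending up behind; the function name is hypothetical.

```python
from statistics import NormalDist

def biased_coin_bets(p_heads: float, n_flips: int, stake: float = 1.0):
    """Even-money bets on heads: return (expected profit, sd of profit, P(net loss)).
    Each flip wins +stake with probability p_heads and loses stake otherwise."""
    mean = n_flips * stake * (2 * p_heads - 1)
    # Per-flip variance of a +/-stake payoff is 4 * stake**2 * p * (1 - p).
    sd = 2 * stake * (n_flips * p_heads * (1 - p_heads)) ** 0.5
    # Normal approximation to the binomial for the chance of finishing behind.
    p_loss = NormalDist(mean, sd).cdf(0.0)
    return mean, sd, p_loss

mean, sd, p_loss = biased_coin_bets(0.52, 10_000, stake=100)
print(f"expected ${mean:,.0f}, sd ${sd:,.0f}, P(loss) {p_loss:.1e}")
```

    For example, 10,000 flips at $100 each give an expected profit of $40,000 with a standard deviation of about $10,000, so under this approximation the chance of a net loss is only a few in 100,000.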

    On the other hand, I don’t deny that people misuse measures. Regarding your comment about productivity measures, if someone actually equates a measure of productivity on a factory floor with productivity in creating software, I could only respond that this would be a straw man argument. I make no claim that two obviously uncorrelated factors say anything about each other. I also point out in How to Measure Anything that most organizations measure the wrong things (because they are not computing the value of uncertainty reduction). If they applied this method consistently, they would measure what matters. And, as a side note, don’t confuse measures used for decisions like project approval with measures used in incentive programs. They are two different issues. Good incentives are based on good measures, but that doesn’t mean every measurement should be part of some incentive. I clearly argue against this in the book.

    And don’t forget that you are always comparing a decision analysis method based on measurements to some *other* method, presumably your intuition. That also has a measurable performance, and research shows it is often not hard to beat with even simple quantitative methods.
    Again, if you really believe what you believe, then let’s start recruiting some people for our bet.

    Thanks for your contribution to the conversation and I look forward to your response.

    Doug Hubbard

  6. stefano.palestini

    Douglas,
    We are developing a new financial planning procedure whose goal is to narrow the gap between the forecast and the actual net financial position. Today the error is +/- 20 million with a likelihood of 30% (the greatest difference observed between actual and forecast) and +/- 5 million with a likelihood of 70% (the narrowest difference between actual and forecast). The target is +/- 15 million with a likelihood of 10% and +/- 5 million with a likelihood of 80%. The cost of the investment is €500K.
    Is it possible and meaningful to measure the value of the better information provided by the new financial planning procedure using EVPI, as described in the chapter “Measuring the Value of Information” in your book How to Measure Anything (section “The Value of Information for Ranges”)?
    If not, what would you suggest as a better way to frame the value-of-information problem?
    Thank you in advance for your answer.

    stefano palestini
    internal auditing & risk management

  7. Douglas Hubbard

    Stefano,

    This sounds like a perfectly viable problem for the value of information. However, I would need clarification on something you said. It seems you are saying that the current accuracy of forecasts to actuals has a 30% chance of being within 20 million and a 70% chance of being within 5 million, right? That seems backward to me. The wider range should have the higher probability unless I misunderstand how you are stating this. The same would seem to hold for the target accuracy.

    But, that aside, improving the accuracy of forecasts is certainly something information value applies to. You have a cost of overestimating and/or a cost of underestimating, right? How much would you lose for every million you overestimate or underestimate? Weighting this “loss function” by your probability distribution for the forecast error, and summing over all possible errors, gives the expected opportunity loss (EOL). You have an EOL for the current state and one for the desired target accuracy. The difference between the two EOLs is the value of information. Technically, you would only be computing EVPI if you were comparing the value of your current accuracy to perfect forecasts, which I’m sure is not what you mean.
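    As a rough illustration of that EOL comparison, the Python sketch below uses Stefano's stated accuracy bands and a linear loss function. The loss of €100K per €1M of error is a hypothetical placeholder, errors are assumed uniform within each band, and because the target bands only add up to 90%, the remaining 10% is assumed to fall between 15 and 20 million. Since the target is still imperfect, the result is the value of partial information, not EVPI.

```python
def expected_abs_error(bands):
    """bands: list of (probability, low, high) ranges for |forecast - actual|,
    in millions. Assumes the error is uniform within each band (a simplification)."""
    total = sum(p for p, _, _ in bands)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"band probabilities sum to {total}, not 1")
    return sum(p * (low + high) / 2 for p, low, high in bands)

def eol(bands, loss_per_million):
    """Expected opportunity loss under a linear loss function:
    each million of forecast error costs loss_per_million."""
    return loss_per_million * expected_abs_error(bands)

# Current accuracy from the question: 70% within +/-5M, 30% between 5M and 20M.
current = [(0.70, 0, 5), (0.30, 5, 20)]
# Target: 80% within 5M, 10% within 15M; the unstated remaining 10% is
# ASSUMED here to fall between 15M and 20M.
target = [(0.80, 0, 5), (0.10, 5, 15), (0.10, 15, 20)]

LOSS_PER_MILLION = 100_000   # hypothetical: EUR 100K lost per EUR 1M of error
value_per_forecast = eol(current, LOSS_PER_MILLION) - eol(target, LOSS_PER_MILLION)
annual_value = 4 * value_per_forecast   # four quarterly forecasts per year
print(f"Value of improved forecasts: EUR {annual_value:,.0f} per year vs EUR 500,000 investment")
```

    Under these made-up loss figures, the expected error falls from 5.5 to 4.75 million per quarter, worth about €300K a year against the €500K investment. The real answer depends entirely on the actual loss function, which may also differ for overestimates and underestimates.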

    Thanks for your comment,
    Doug Hubbard

  8. stefano.palestini

    The likelihoods come from comparing the forecast and actual quarterly financial position over 3 years (12 observations). In 9 cases (75%) the difference was within +/- 5 million, and in the other 3 cases it was within +/- 20 million.

    Thank you in advance for your answer.

    stefano palestini

  9. Michael_Carman

    Hi Doug,

    I have just finished reading How to Measure Anything and found it immensely valuable. The organisation I work for (a public sector transport agency in Australia) reaped the benefit of it (even before I finished reading chapter 9!) as I ran a regression analysis and quantified the link between infrastructure maintenance and reliability. Your book has aided me in linking hitherto unrelated areas of data and information with a view to bolstering and enhancing performance. So – thank you.

    Another area I am starting to turn my attention to, which I thought might interest you from an HTMA perspective, is quantifying the benefits of ‘reform’. Reform programs are often undertaken with rubbery objectives and vaguely defined benefits. Process can often dominate outcomes. I think there are some powerful applications of your work in more clearly and quantifiably linking the process inputs that form the basis of a reform program with tangible outcomes. More sharply defining reform (“what exactly do you mean by ‘reform’?”) and correlating associated changes with performance measures such as reliability and customer satisfaction are promising areas here.

    One question, if I may: do you have any recommended further reading in the areas of Monte Carlo simulation and regression? I need to dig deeper, especially into regression, to find out how many pairs of data points are required for a regression to be valid, and how to deal with coefficients and p-values changing depending on whether a variable is treated in a simple regression or a multiple regression. My preferred reading would have plenty of worked examples (which is a key strength of your book).

    Thanks again.

    Michael Carman

  10. Hi Doug,
    I am a professional investor and a fan of Warren Buffett. My question is: At the time Buffett started his first partnership in 1956, how would a potential investor have measured Buffett’s value? We can measure the value of Buffett to his investors in retrospect by looking at the returns that Buffett earned for his investors. Is this just a case of where measurement could take decades, at which point the opportunity has been missed? How would I apply your principles if I was looking for the next Warren Buffett? Or to phrase this another way: How would I measure the value to me of investing in a fund run by someone who claims to be the next Buffett, but does not have a long track record?
    Many thanks,
    Tom

  11. Douglas Hubbard

    Tom,

    If what you mean is how can we predict the next Warren Buffett, I don’t think we can, exactly. But if your question is “How do we – even slightly – improve the odds of selecting successful investors compared to my unaided judgment?”, then I think we can do that.

    Paul Meehl assembled a large body of work comparing simple statistical models against human intuition. He collected research comparing the judgments of humans to statistical models in areas such as predicting the outcomes of football games, predicting business failures, and the prognosis of liver disease over a large number of trials. The results were conclusive. In a wide variety of domains, even in areas where the human expert was assumed to be essential, simple statistical models outperformed the humans at predicting outcomes.

    So, to your question: can you collect data about characteristics of various investors and track them over time? You would have to be sure to test properly for how much variance can be accounted for by luck, of course. Then start comparing your own estimates to those of the regression model and see which does better. Remember, you don’t have to collect data on ALL investors; that’s what random sampling is for. Also, in my book, I describe something called the Lens method that seems to work even for the “data challenged”. The Lens method works by statistically smoothing your own judgments, without any historical data on outcomes. Past research shows that our own judgments are so inconsistent that even a model that just removes our inconsistencies is a significant improvement.
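    A minimal sketch of the Lens idea in Python: regress your own past estimates on the cues you looked at, then use the fitted model in place of your raw judgment. The cues (years of audited track record, share of own wealth in the fund) and all numbers are hypothetical, and the least-squares fit is a plain normal-equations solve written out to avoid dependencies; this is an illustration, not the exact procedure from the book.

```python
def fit_linear(rows, y):
    """Ordinary least squares with an intercept, solved via the normal
    equations and Gaussian elimination (adequate for a handful of cues)."""
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    c = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    # Back-substitution on the upper-triangular system.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

def predict(b, row):
    """Apply fitted weights (intercept first) to one set of cues."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], row))

# HYPOTHETICAL cues for managers you have already judged:
# (years of audited track record, % of own wealth invested in the fund)
cues = [(3, 10), (8, 40), (5, 5), (12, 60), (2, 0), (7, 25), (10, 15), (4, 50)]
# Your own (inconsistent) estimates of each manager's annual excess return, in %.
judgments = [1.0, 4.5, 0.5, 6.0, -1.0, 3.5, 2.0, 3.0]

weights = fit_linear(cues, judgments)
new_candidate = (6, 30)
print(f"Lens-model estimate: {predict(weights, new_candidate):.2f}% excess return")
```

    Because the model applies the same weights every time, it strips out the case-to-case inconsistency in the original judgments, which is the source of the improvement described above.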

    Doug Hubbard

  12. Douglas Hubbard

    Michael,

    Thanks for your comments. Regarding further reading on Monte Carlo, you might have noticed I’m a fan of Sam Savage’s work. But I also provide webinars and seminars on the topic. You can find the details at http://www.hubbardresearch.com/store/merchant.mvc?Screen=CTGY&Store_Code=HDR&Category_Code=Training

    Regarding regression, “validity” has to consider your prior state of uncertainty. If you use a tool like Excel for regression, the error of the estimate is based on a z-stat, which is recommended for sample sizes greater than 30. But there are other regression methods that don’t have that kind of constraint.

    Suppose your current subjective 90% CI for Y is 2 to 15. Now suppose you have 8 values for X and Y: (1.1, 1.2), (2.1, 2.5), (4.5, 3.9), (5.1, 5.5), (6.6, 7.1), (8.5, 8.6), (9.0, 9.1), (9.5, 9.5). Plot them on a scatter chart in Excel if that helps. Suppose I also tell you that the X value for the Y you wish to estimate is 4.1. Given what you have seen in these 8 samples of (X, Y), you wouldn’t really think it is now very likely for Y to be over 10, would you? This is especially useful if you had a reason to believe there should be a relationship, like foot traffic in front of a store and sales, or years of experience in some task and the score on a related test.
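    The same intuition can be made numeric. This Python sketch fits an ordinary least-squares line to the 8 points above and computes a 90% prediction interval for Y at X = 4.1, assuming a linear relationship with independent, normally distributed residuals. The t critical value 1.943 (6 degrees of freedom) is hard-coded to keep the example dependency-free.

```python
import math

# The 8 (X, Y) pairs from the example.
xs = [1.1, 2.1, 4.5, 5.1, 6.6, 8.5, 9.0, 9.5]
ys = [1.2, 2.5, 3.9, 5.5, 7.1, 8.6, 9.1, 9.5]

def prediction_interval(xs, ys, x0, t_crit):
    """Simple linear regression prediction interval for a new Y at x0.
    t_crit is the two-sided Student t critical value for n - 2 degrees of freedom."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # residual standard error
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    y_hat = intercept + slope * x0
    return y_hat, y_hat - t_crit * se_pred, y_hat + t_crit * se_pred

# 1.943 is the 95th percentile of Student's t with 6 df (for a 90% interval).
y_hat, lo, hi = prediction_interval(xs, ys, x0=4.1, t_crit=1.943)
print(f"Estimate {y_hat:.2f}, 90% prediction interval {lo:.2f} to {hi:.2f}")
```

    Under those assumptions, the estimate is about 4.2 with a 90% prediction interval of roughly 3.5 to 5.0, far narrower than the prior range of 2 to 15 and nowhere near 10.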

    I hope that helps.

    Doug Hubbard

  13. I have been tasked with creating a value-based pricing model for my employer’s products so that we can maximize profitability. This means that instead of using our cost-plus approach, we would know how different customer sectors value our products. We would price products at a premium that captures some, but probably not all, of this value. This usually involves educating customers on how paying more for our product actually saves money over the near and long term versus paying a lower price for our competitors’ products. My employer’s products are not very distinguishable from our competitors’ products, but our on-time delivery, lead time, financial stability, global reach and other features are definitely distinguishable and, hence, measurable.

    My employer sells industrial products in a business-to-business model supported by field salesmen. Buyers representing our customers are quite keen on hiding how much they value my employer’s products and services in order to cut the best deal for themselves, so I have little confidence that I can take direct measurements from the customer; however, I feel confident that I can get some kind of value measures on the aforementioned attributes from our outside sales team.

    Current value-based pricing models rely on something called a fair value graph, where a product’s price is plotted on the vertical axis and its dimensionless value coefficient is plotted on the horizontal axis. The value coefficient is calculated from a series of rankings on various product attributes using 1 to 10 scales. A competitor’s product is ranked the same way and, depending on how the points fall on the graph, one can determine whether a product is overvalued or undervalued relative to the competition. After reading your book, I have little confidence that such a method has credibility, because it does not dollarize value; it merely provides a dimensionless rank relative to chosen competitors using subjective ordinal rankings that have little substance. For example, what is a “7” in on-time delivery?

    My challenge to you is: how do I use your methods to dollarize my employer’s value? I can picture calibrating salespeople and getting 90% confidence ranges on the value of on-time delivery, lead time, etc., but after that is where I get stuck. Do I perform a Monte Carlo analysis and, if so, what would that look like? Your book example uses a $400,000 investment that has a variety of productivity gains/losses and shows a 14% chance of not breaking even. I cannot conceptualize a result like “there is a 14% chance customer group A will reject our value-based price of $200 for product X”.

    Thank you,
    Tom B.

  14. Douglas Hubbard

    Tom,

    Thanks for your question.

    Since you read my book, you know my position on these ordinal “1 to 10” scales. They are never really necessary once someone figures out what the real problem is. They simply gloss over the problem, making managers feel like it was solved in some way.

    In the book example you cite, the 14% does not indicate a discrete “all or nothing” outcome. I wouldn’t model a discrete, binary chance that an entire customer group would reject a given price. A more realistic model is that some uncertain percentage of that group will not purchase the product at a given price. For example “At $200, our 90% CI is that 10% to 35% of customers will decline to purchase product X”. This is, in fact, what all price optimization models are doing directly or indirectly.
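    To show what a Monte Carlo analysis of this kind could look like, here is a Python sketch. Only the 10% to 35% decline range at $200 comes from the example above; the unit cost, the current $170 price and its 5% to 15% decline range are hypothetical placeholders. Each 90% CI is treated as a normal distribution (clipped to 0 to 1), and the output is a continuous quantity, the expected change in profit per customer, plus the chance that the higher price does worse, rather than an all-or-nothing rejection.

```python
import random
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)   # ~1.645: converts a 90% CI half-width to a sd

def sample_from_ci(rng, low, high):
    """Draw from a normal distribution whose 90% CI is [low, high],
    clipped to [0, 1] because these are fractions of customers."""
    mean = (low + high) / 2
    sd = (high - low) / (2 * Z90)
    return min(1.0, max(0.0, rng.gauss(mean, sd)))

def simulate_price_change(n_trials=100_000, seed=42):
    """Compare per-customer profit at a proposed price vs the current price.
    All inputs except the $200 decline range are HYPOTHETICAL placeholders."""
    rng = random.Random(seed)
    unit_cost = 120.0
    new_price, new_ci = 200.0, (0.10, 0.35)   # 90% CI for share who decline at $200
    old_price, old_ci = 170.0, (0.05, 0.15)   # 90% CI for share who decline at $170
    diffs = []
    for _ in range(n_trials):
        new_profit = (new_price - unit_cost) * (1 - sample_from_ci(rng, *new_ci))
        old_profit = (old_price - unit_cost) * (1 - sample_from_ci(rng, *old_ci))
        diffs.append(new_profit - old_profit)
    mean_gain = sum(diffs) / n_trials
    p_worse = sum(d < 0 for d in diffs) / n_trials
    return mean_gain, p_worse

mean_gain, p_worse = simulate_price_change()
print(f"Expected gain per customer: ${mean_gain:.2f}; "
      f"chance the $200 price does worse: {p_worse:.1%}")
```

    With these invented inputs, the $200 price gains about $17 per customer on average and does worse than the current price in well under 1% of trials. In practice the decline ranges would come from calibrated estimates or price-elasticity data.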

    But recall another one of the maxims I mention in the book: no matter what you are measuring, assume it has been measured before. This certainly turns out to be true in this case. Not only is there a large body of academic work on estimating price elasticity and then computing optimal prices, there are well-established and proven tools available to you on the market now.
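    One textbook example of computing an optimal price from an elasticity estimate, sketched in Python: if demand has a constant price elasticity e > 1, profit (p - c) * q is maximized at p* = c * e / (e - 1). The cost and elasticity below are hypothetical, and this is a simplified illustration, not a description of any commercial price-optimization tool.

```python
def optimal_price(unit_cost: float, elasticity: float) -> float:
    """Profit-maximizing price for constant-elasticity demand q = A * p**(-e).
    Setting d/dp [(p - c) * A * p**(-e)] = 0 gives p* = c * e / (e - 1)."""
    if elasticity <= 1:
        raise ValueError("a finite optimum requires elasticity > 1")
    return unit_cost * elasticity / (elasticity - 1)

def profit(price: float, unit_cost: float, elasticity: float, scale: float = 1e6) -> float:
    """Profit (p - c) * q with q = scale * p**(-elasticity); scale is arbitrary."""
    return (price - unit_cost) * scale * price ** (-elasticity)

c, e = 120.0, 3.0   # hypothetical unit cost and estimated elasticity
p_star = optimal_price(c, e)
# Sanity check: coarse grid search over prices from $120.50 to $319.50.
best_grid = max((c + 0.5 * i for i in range(1, 400)), key=lambda p: profit(p, c, e))
print(f"closed form ${p_star:.2f}, grid search ${best_grid:.2f}")
```

    The grid search is only a sanity check on the closed-form result; both give $180 for a $120 unit cost and an elasticity of 3.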

    Is your business in the B2B sales area? If so, one of my current clients is Zilliant and your problem is exactly the sort their software addresses. Zilliant has a large number of very able “price scientists” who have developed the algorithms for price optimization for customers in many B2B situations. Their customers include many of the largest manufacturers and distributors you can think of. I would start by giving them a call (go to Zilliant.com). Then you can do real price science and drop the whole “1 to 10” activity.

    Thanks again,
    Doug Hubbard
