I am concerned about the CI, median and normal distribution

Originally posted at http://www.howtomeasureanything.com, on Wednesday, February 11, 2009 2:16:38 PM, by andrey.

“Hello Douglas,

First of all, let me say I have thoroughly enjoyed reading your book. I have a technical background (software engineering) and have always been surprised at how “irresponsible” some business-level decision making can be – based on gut instincts and who-knows-what. This ‘intuitive’ approach is plagued with biases and heuristics, and the effects of such an approach have been widely publicized (for example here). This is one of many reasons I found your book very stimulating and the AIE approach as a whole very promising.

However, I have reservations about a few points you make. Please forgive my ignorance if my questions are silly; my math has become rusty over the years.

One of my concerns is the validity of the assumption that you make when explaining ‘The Rule of Five’, the 90% CI, and especially when using Monte Carlo simulation. I can believe (although it would’ve been great to see the sources) that ‘there is a 93% chance that the median of a population is between the smallest and largest values in any random sample of five from that population’. But when you apply this to the Monte Carlo simulations, you assume that the mean (which is also the median for symmetric probability distributions) is exactly in the middle of the confidence interval. This, I think, makes a big difference to the outcome because of the shape of the normal distribution function. If you assume the median is, for example, very close to the lower or upper bound of the confidence interval, then putting a different value into the =norminv(rand(), A, B) formula would produce different results.

I am still working through your book (second reading), trying to ‘digest’ and internalise it properly. I would be very grateful if you could explain this to me.

Thank you very much,

Andrey”

Thanks for your comment.

I don’t show a source (I’m the one who coined the term “Rule of Five” in this context), but I show the simple calculation, which is easily verifiable. The chance of randomly picking one sample with a parameter value above the true population median for that parameter is, by definition, 50%. We ask “what is the probability that I could happen to randomly choose five samples in a row that are all above the true population median?” It is the same chance as flipping five coins and getting all heads. The answer is 1/2^5, or 3.125%. Likewise, the probability that we could have just picked 5 in a row that were all below the true population median is 3.125%. That means that there is a 93.75% chance that some were above and some were below – in other words, that the median is really between the min and max values in the sample. It’s not an “assumption” at all – it is a logically necessary conclusion from the meaning of the word “median”.
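
Written out as a quick Python sketch (nothing here beyond the arithmetic above):

    # Chance that all five random samples land above the true population median
    p_all_above = 0.5 ** 5                      # 0.03125, i.e., 3.125%
    # By symmetry, the chance that all five land below the median is the same
    p_all_below = 0.5 ** 5                      # 3.125%
    # So the chance that the median falls between the sample min and max
    p_median_inside = 1 - p_all_above - p_all_below
    print(p_median_inside)                      # 0.9375, i.e., 93.75%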

You can verify this experimentally as well. Try generating any large set of continuous values you like using any distribution you like (or just define a distribution function for such a set). Determine the median for the set. Then randomly select 5 from the set and see if the known population median is between the min and max values of those 5 samples. Repeat this a large number of times. You will find that 93.75% of the time the known median will be between the min and max values of the sample of 5.
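
If you would rather not set that experiment up by hand, here is a minimal Python sketch of it; the lognormal population is an arbitrary choice, just to show the result does not depend on the shape of the distribution:

    import random
    import statistics

    # Build an arbitrary skewed population and find its true median
    population = [random.lognormvariate(0, 1) for _ in range(100_000)]
    true_median = statistics.median(population)

    trials = 100_000
    hits = 0
    for _ in range(trials):
        sample = random.sample(population, 5)
        # Count the trials where the known median lies between the sample min and max
        if min(sample) < true_median < max(sample):
            hits += 1

    print(hits / trials)                        # converges on about 0.9375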

I believe I also made it clear that this applies only to the median and not the mean. I further stated that if, on the other hand, you were able to determine that the distribution is symmetrical, then, of course, it applies to the mean as well. Often you may have reason to do this, and it is no different from the assumption in any application of a t-stat or z-stat based calculation of a CI (which is always assumed to be symmetrical).
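
To see why the distinction matters: for a skewed distribution the mean and median can differ substantially, so the 93.75% figure applies to one and not the other. A short sketch, again using an arbitrary lognormal population:

    import random
    import statistics

    # A right-skewed population: the long tail pulls the mean above the median
    population = [random.lognormvariate(0, 1) for _ in range(100_000)]
    print(statistics.mean(population))          # roughly 1.65 (exp(0.5))
    print(statistics.median(population))        # roughly 1.0 (exp(0))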

Furthermore, you certainly should not use a function that generates a normally distributed random number if you know the quantity is not normally distributed, and I don’t believe I recommended otherwise. If you know the median and the mean of the distribution you intend to generate are not the same, then you can’t count on Excel’s norminv function to be a good approximation. For the sake of simplicity, I gave a very limited set of distribution functions for random values in Excel, but we can certainly add a lot more (triangular, beta, lognormal, etc.).
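
As a sketch of what such alternatives can look like (in Python rather than Excel, and with made-up bounds), here are two helpers: the normal version mirrors the =norminv(rand(), A, B) formula you mention, and the lognormal version is one common choice for quantities that cannot go below zero. The 3.29 reflects that a 90% interval spans about 3.29 standard deviations (±1.645 around the mean):

    import math
    import random

    def normal_draw(lower, upper):
        """Draw from a normal distribution matching a calibrated 90% CI."""
        mean = (upper + lower) / 2
        sd = (upper - lower) / 3.29             # 90% CI spans about 3.29 standard deviations
        return random.normalvariate(mean, sd)

    def lognormal_draw(lower, upper):
        """Draw from a lognormal distribution matching a 90% CI on a positive quantity."""
        mean_ln = (math.log(upper) + math.log(lower)) / 2
        sd_ln = (math.log(upper) - math.log(lower)) / 3.29
        return random.lognormvariate(mean_ln, sd_ln)

    # Example: a made-up calibrated 90% CI of 100 to 300 for some uncertain quantity
    print(normal_draw(100, 300))
    print(lognormal_draw(100, 300))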

My approach is not to assume anything you don’t have to. If you don’t know whether the distribution is lopsided, you can simulate that uncertainty, too. Is it possible that the real distribution could actually be lognormal? Then put a probability on that and generate accordingly. Why “assume” something is true when we can explicitly model our uncertainty about it?
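
Here is a concrete sketch of what “put a probability on that and generate accordingly” can look like in a simulation; the 40% chance of a lognormal shape is purely illustrative, not a number from the book:

    import math
    import random

    def draw_with_shape_uncertainty(lower, upper, p_lognormal=0.4):
        """On each trial, first pick which distribution family applies, then draw from it."""
        if random.random() < p_lognormal:
            mu = (math.log(upper) + math.log(lower)) / 2
            sigma = (math.log(upper) - math.log(lower)) / 3.29
            return random.lognormvariate(mu, sigma)
        mean = (upper + lower) / 2
        sd = (upper - lower) / 3.29
        return random.normalvariate(mean, sd)

    # One Monte Carlo draw with a 40% chance the quantity is really lognormal
    print(draw_with_shape_uncertainty(100, 300))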

Thanks,

Doug

Computing the Value of Information

Originally posted at http://www.howtomeasureanything.com, on Monday, December 08, 2008 10:08:08 PM, by Unknown.

“I’m trying to quickly identify the items I need to spend time measuring. In your book, you determine what to measure by computing the value of information. You refer to a macro that you run on your spreadsheet that automatically computes the value of information and thus permits you to identify those items most worth spending extra time to refine their measurements. Once I list potential things that I might want to measure, do I estimate, using my calibrated estimators, a range for the chance of being wrong and the cost of being wrong, and then, using something like @RISK, multiply these two lists of probability distributions together to arrive at a list of distributions for all the things I might want to measure? Then, do I look over this list of values of information and select the few that have significantly higher values of information?

I don’t want you to reveal your proprietary macro, but am I on the right track to determining what the value of information is?”

You were on track right up to the sentence that starts “Once I list potential things that I might want to measure.” You already have calibrated estimates by that point. Remember, you have to have calibrated estimates first before you can even compute the value of information. Once the value of information indicates that you need to measure something, then it’s time to get new observations. As I mention in the book, you could also use calibrated estimators for this second round, but only if you are giving them new information that will allow them to reduce their uncertainty.

So, first you have your original calibrated estimates, THEN you compute the value of information, THEN you measure the things that matter. In addition to using calibrated estimators again (assuming you are finding new data to give them so they can reduce their ranges), I mention several other methods in the book, including decomposition, sampling methods, and controlled experiments. It just depends on what you need to measure.

Also, the chance of being wrong and the cost of being wrong can already be computed from the original calibrated estimates you provided and the business case you put them in. You do not have to estimate them separately in addition to the original calibrated estimates themselves. Look at the examples I gave in the chapter on the value of information.
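
If you want to see the mechanics without Excel or @RISK, here is a bare-bones Python sketch of that idea for the simplest kind of case in that chapter: a go/no-go decision with a break-even threshold, where the expected opportunity loss (which is what the value of perfect information works out to in this simple case) is the chance of coming in below break-even weighted by how costly each shortfall would be. All of the specific numbers (a 90% CI of 100 to 300 units, a break-even of 150, $25 lost per unit of shortfall) are made up for illustration and are not from any example in the book:

    import random

    def evpi_threshold_decision(lower, upper, threshold, loss_per_unit, trials=100_000):
        """Expected opportunity loss (value of perfect information) for a simple
        go/no-go decision, given a calibrated 90% CI on the uncertain quantity."""
        mean = (upper + lower) / 2
        sd = (upper - lower) / 3.29             # 90% CI spans about 3.29 standard deviations
        total_loss = 0.0
        for _ in range(trials):
            x = random.normalvariate(mean, sd)
            # We are "wrong" only when the quantity comes in below break-even, and the
            # cost of being wrong grows with how far below break-even it falls
            if x < threshold:
                total_loss += (threshold - x) * loss_per_unit
        return total_loss / trials

    # Hypothetical inputs: 90% CI of 100 to 300 units, break-even at 150, $25 per unit short
    print(evpi_threshold_decision(100, 300, 150, 25))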

My macros make it more convenient to compute more complicated information values, but they are not necessary for the simplest examples. Did you see how I computed the information values in that chapter? I already had calibrated estimates on the measurements themselves. Try a particular example and ask me about that example specifically if you are still having problems.

Thanks for your use of my book and please stay in touch.

Doug Hubbard