Statistics for Decision- Making in Business - Maricopa Community ...
Statistics for Decision-Making in Business
1st Edition
Milos Podmanik
Foreword: What Is This Book Good For?
You're probably thinking to yourself, “Who does this guy think he is, trying to write his own
book?”
The answer is both satisfying and deceiving to those who expect the traditional math course with
the traditional instructor. I wrote this course manual to match my personal teaching
philosophy as closely as possible. What might that be? Well, I firmly believe that math education
focuses too much on processes, templates, and repetitive, mundane computational skills. Are these
of any importance? To some extent, yes. For the most part, however, students fail to make
connections from math to the real world and vice versa. We tend to teach students how to “do”
and not how to “think.” As a result, I believe it is far more important to promote deep
understanding, engagement, and connections to the planet we live on. After all, do you really
want to become a calculator? If your answer is “yes,” then this will come as a major
disappointment: decades ago, a computer could already calculate faster and more accurately than you!
Not to mention, computers will only continue to get faster and better than you at computing.
Here's the good news: computers don't understand why they're doing what they're doing! They
are simply computing machines. It takes (and most likely will always take) a rational, deep-thinking
human being to provide a contextual and meaningful analysis of the inputs and outputs
of a numerical process. And that, my friends, is what this book is all about.
Statistics for Decision-Making in Business © Milos Podmanik
A Note to Students
This book is far from perfect. In fact, it will never be perfect. There is, however, a lot of blood,
sweat, and tears in this book (paper cuts hurt!). I spent much of my 2012 winter break
thinking, writing, and rewriting its contents to make it feel “right” for both you and me.
As such, I don't believe it's too much to ask for you to read the book.
What's my point?
… Read this book!
Table of Contents
1: Fundamentals of Statistics
  1.1 Data and Their Uses
  1.2 Descriptive vs. Inferential Statistics
  1.3 Statistics in Excel
2: Visual Representations of Data
  2.1 Visualizing Categorical Data
  2.2 Visualizing Quantitative Data
  2.3 Descriptive Statistics – Center and Position
  2.4 Descriptive Statistics – Variability
3: Probability and Decision-Making
  3.1 The Idea of Probability
  3.2 Joint Probability
  3.3 Probability of Unions
  3.4 Conditional Probability
  3.5 Combinations and Permutations
  3.6 Expected Value
4: Discrete Probability Distributions
  4.1 The Binomial Distribution
5: Continuous Probability Distributions
  5.1 The Ideas Behind the Continuous Distribution
  5.2 The Normal Distribution
6: Sampling Distributions and Estimation
  6.1 Sampling Distribution for x̄
  6.2 Confidence Interval for x̄
  6.3 Confidence Interval for p̂
7: Hypothesis Testing
  7.1 The Concept Behind Hypothesis Testing
Appendices
  Appendix A: Answers to Select Problems
Chapter 1
Fundamentals of Statistics
1.1 Data and Their Uses
Our lives are filled with information. While at one point we didn't have enough data in the
world, we now have so much of it that computers must be upgraded continually in order to
keep up. Facebook records rich information about hundreds of millions of users. Studies
are revealing new conclusions that help us choose the right type of
treatment for medical conditions. Scientific data is establishing a strong correlation between
humans' interaction with the planet and changes in climate. The power of data is limitless.
However, our media regularly fall short on expertise, so the results of studies are often
miscommunicated because they are not understood. In order to fully extract meaning
from data, we must understand how to analyze them. We must be accurate and precise in what we
measure and how we measure it.
1.1.1 Three Good Reasons to Study Statistics
In no particular order, these are:
1. To be informed
2. To be able to make good decisions based on data and to understand current issues
3. To be able to evaluate decisions that affect the operations of a business and our personal
lives
1. To Be Informed
What does it mean to be informed? To be informed, we should be able to understand and interpret
tables, charts, and graphs. We should be able to make sense of the conclusions of others' research
based on their numerical results. Moreover, we should have insight into the gathering,
summarization, and analysis of data, and so we should always approach numerical results with a
slight bit of doubt. In other words, we ideally want to adopt the attitude of "doubt until there is
enough evidence to trust." Let's take a look at some examples of where statistics have helped inform
society.
Examples:
- Does it matter how long children are bottle-fed? A study was run to determine
whether iron deficiency is related to the length of time that a child is bottle-fed.
- In 2005, Medicare candidates faced a decision about which prescription medication plan to choose.
A program called PlanFinder was made available online to compare available options. But are
senior citizens online?
- A study in 2005 attempted to answer the question: are students ruder today than in the past? A
survey was conducted.
- Is domestic violence common? A study in 2005 interviewed about 24,000 women to attempt to
answer this question.
- What factors are involved in student achievement in school? Is study time the most important
factor? A study concluded that things such as prioritizing student
achievement and encouraging teacher collaboration may have some impact.
- Do the accounts receivable reported by a business accurately reflect the true accounts
receivable? The IRS randomly audits businesses to try to answer this question.
- A stock's share value change has fluctuated between -1.2% and 8.9% over the last year. What
predictions should an investor make about the stock over the coming year in order to decide
whether to purchase?
- CVS Pharmacy sells 5 lb. bags of 100% Pure Cane Granulated Sugar. As a quality control
measure, the company would like to know the amount of variability in the true weight of sugar
placed into each of the bags.
2. Making Good Decisions
How can we ever be sure that the results we're seeing or reading are truly the ones we should
believe? Although those who talk about data are assumed to understand
statistics, you'd be surprised how poor some of their conclusions are. We'll definitely see why by
the time this course is over. You'll learn how to summarize data, how to analyze it, and, most
importantly, how not to draw conclusions about it. The title "Making Good Decisions" should
not be new to you, hopefully.
3. Evaluating Decisions that Affect Our Lives
Are you satisfied that the Food and Drug Administration (FDA) has allowed a new patent for the
drug Zoloft, which is now also useful for Social Anxiety Disorder (in addition to depression), but
which has undergone no additional research to prove the claim? Do you know why you're paying
$720 for car insurance every six months, while your roommate is paying only $450? If a
mammogram comes back positive for breast cancer, is there any chance that this is a false
positive? Should you be surprised that no Hispanic applicants were hired by a company if three
applicants were to be selected from a pool in which 15 were Caucasian and 5 were Hispanic? Is
there a reason to suspect inequality? It should not surprise you that these questions can be
answered with probability and statistics.
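The hiring question above can be made concrete with a short calculation. The sketch below assumes, as a simplification that the text does not state, that every group of three applicants is equally likely to be chosen. The variable names are illustrative only.

```python
from math import comb

caucasian = 15   # applicants, from the scenario above
hispanic = 5
hired = 3

total = caucasian + hispanic

# Assumption: all groups of three are equally likely (purely random hiring).
# Count the all-Caucasian groups and divide by the number of possible groups.
p_no_hispanic = comb(caucasian, hired) / comb(total, hired)

print(f"P(no Hispanic applicant hired) = {p_no_hispanic:.3f}")
```

Under that assumption the outcome occurs roughly 40% of the time by chance alone, so by itself it would be weak evidence of inequality.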
1.1.2 Types of Data
In order to reach the goals mentioned above, we need some sort of information
on which to base our decisions – we call this information data.
Data comes in two main categories: quantitative and qualitative/categorical.
Quantitative variables, as the name implies, deal with numerical quantities. For example, the
average revenue of a Whole Foods Market store is considered a quantitative variable, since the
measurement is a number.
Qualitative variables, on the other hand, deal with qualities. For example, the type of television
that a customer is likely to purchase is considered a qualitative variable, since its value will be,
for instance, plasma, LED, or LCD.
1.1.3 Not All Quantitative Variables Are As They Appear!
Just because a variable is stated as a numerical value doesn't mean that it can be treated as a
numerical value. A variable must be classified according to its scale of measurement.
For instance, suppose you are to test three marketing tactics on customers. You call these
Tactics 1, 2, and 3, respectively. These tactics have numerical values, but the numbers do not
have any ordering significance. That is, Tactic 1 is not necessarily better than Tactic 3. These
numbers serve simply as names for the values of the variable and cannot be numerically
compared. We call this a variable of nominal scale.
Suppose that a business magazine reports the top three new businesses in the city each month.
That is, we have businesses 1, 2, and 3, where 1 is considered the best of the three, 2 the second
best, and 3 the third best. In this case, we can talk about 1 being better than 2 and 3, and 3 being
worse than 1 and 2. This type of variable has the properties of a nominal-scale variable, but also
has the property of order. We call this a variable of ordinal scale.
In another example, consider the variable IQ. Suppose two people have IQs of 100 and 120.
Based on this information, we can say that the person with 120 has a higher IQ. However, we
can also say that the second person has an IQ that is 20 points higher than the first person's. We
couldn't really say this for the ranking example above. In addition to being nominal (a person can be
identified by their value) and ordinal (we can rank the scores), we can also talk about the differences
in scores. This type of variable is of interval scale.
The most powerful type of variable is one that has all of the above properties, and whose
ratio between two values is also meaningful and whose value of zero means a complete absence of
the characteristic. While IQ is of an interval scale, it does not make much sense to say that the
person with the 120 IQ is 20% smarter, (120 − 100)/100 = 0.20, than the person with the 100 IQ.
Certainly we cannot say that a person with 0 IQ has no intelligence at all (this person is probably
not even alive!). Consider, however, the median salary of different types of employees. One
employee makes $100,000 and another makes $120,000. We can definitely say that the second
person makes 20% more than the first person, and we can also say that a value of $0 would
indicate that a person makes no money at all (total absence of that variable). This variable is of ratio
scale.
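One way to keep the four scales straight is to note that each one allows every comparison of the scale before it, plus one more. The sketch below is only a summary of the descriptions above. The names `SCALE_ORDER`, `ADDED_COMPARISON`, and `meaningful_comparisons` are made up for illustration.

```python
# Each scale keeps the comparisons of the scale before it and adds one more.
SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]
ADDED_COMPARISON = {
    "nominal": "same or different",   # values act as names only
    "ordinal": "greater or less",     # values can be ranked
    "interval": "difference",         # gaps between values are meaningful
    "ratio": "ratio",                 # true zero, so "20% more" is meaningful
}

def meaningful_comparisons(scale):
    """Return every comparison that is meaningful on the given scale."""
    position = SCALE_ORDER.index(scale)
    return [ADDED_COMPARISON[s] for s in SCALE_ORDER[: position + 1]]

# IQ is interval scale: differences are fine, ratios are not.
print(meaningful_comparisons("interval"))

# Salary is ratio scale, so a percent comparison is meaningful.
percent_more = (120_000 - 100_000) / 100_000 * 100
print(f"{percent_more:.0f}% more")
```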
1.1.4 How We Obtain Data
The first question we have after learning a bit about data is, how do we get it?
Existing Data
In some instances, this data already exists and is available to the researcher. For instance, one
can easily go online and find existing data on the U.S. public. We can view things like the
average credit card debt per person by state, pounds of grains produced in the United States since
1950, etc. This data is usually available through a number of websites, such as:
- U.S. Statistical Abstract (U.S. Census) – http://www.census.gov/compendia/statab/
- Federal Reserve Board – http://www.federalreserve.org
- Office of Management and Budget – http://www.whitehouse.gov/omb
- Department of Commerce – http://www.doc.gov
- Bureau of Labor Statistics – http://www.bls.gov
- FedStats – http://www.fedstats.gov/
There are literally thousands of other repositories for existing data. Sometimes a little bit of
research unveils a plethora of results.
If a company is doing a study of its clients, it may already have a myriad of existing internal
data.
Conducting a Study to Obtain Data
We hear a lot of things coming from our failing media sources. Data is blindly reported, while
the method of data collection is ignored. Why do you think there are so many conflicting
conclusions reached? One week coffee is linked to cancer, while the next it fights cancer. Which
is it?
Many times, observational studies are conducted. There is no experimenter manipulation in this
type of study. For example, a zoologist might study elephant eating patterns in various climates
to determine whether climate has an effect on caloric intake (the response variable – what is
measured). He probably cannot manipulate the climate (the predictor variable – serves to predict
responses) in which the elephant lives (for many reasons, not the least of which is the difficulty
of transporting such an animal. Not to mention, there are startling ethical concerns with such an
action!). He probably cannot dictate how much food is in the environment, either. Certainly, he
can get an accurate reading of the elephant's food intake by following the animal for several
days. At the end of the day, though, the zoologist is merely observing what happens. His
conclusions are limited.
An experiment, on the other hand, is a type of study in which the experimenter is able to control
and manipulate most, if not all, environmental factors. If the experimenter is studying the effects
of caffeine on math test scores, for instance, he might have a control group of students
to whom he gives no coffee and an experimental group to which he gives coffee with 60
mg of caffeine. He then measures each group on test score performance (% of total correct).
Suppose the experimental group does poorly compared to the control group. Can we be sure that
it was due to the caffeine? As long as test conditions were the same in each group, yes. If,
however, there was something different between the two groups in addition to the
presence/absence of caffeine, then the results are not so clear. What if, for instance, they played
music with the control group and none with the experimental group? How do we know better
performance in the control group wasn't an effect of soothing music calming the nerves? It could
even have been a combination of no caffeine and music.
Punchline: In an experiment, we manipulate one factor and hold all other conditions constant.
Most of the time it is desirable to run an experiment. The number one reason for this is that we
can usually collect evidence that leads to a cause-and-effect relationship, assuming the
experiment is conducted properly. In an observational study it is impossible to do this, as there
are many confounding variables, or variables that might be related to both the explanatory and
response variables. Consider this classic example: a researcher counts the number of crimes
committed in a city and then the number of churches in that city. She does this for quite a few
cities. It is found that there is a positive relationship between the number of crimes committed
and the number of churches. That is, as crime increases, so does the number of churches. What
gives? Do these people just repent more often for their guilty consciences?
It may not come as a large shock that we're dealing with potentially many confounding variables.
The simplest one is population. As a city's population increases, more crime is committed and
more churches are needed. This is but one possible explanation.
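The population explanation can be illustrated with a small simulation. Everything in the sketch below is made up for illustration: the 50 hypothetical cities, the crime and church rates, and the helper `pearson_r` are assumptions, not data from any study. Crimes and churches are each generated from population alone, so neither affects the other.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the made-up data are reproducible

# Hypothetical cities: both counts depend on population, never on each other.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(50)]
crimes = [p * 0.04 * (1 + rng.uniform(-0.2, 0.2)) for p in populations]
churches = [p * 0.001 * (1 + rng.uniform(-0.2, 0.2)) for p in populations]

r_counts = pearson_r(crimes, churches)

# Dividing by population removes the confounder from both variables.
crime_rates = [c / p for c, p in zip(crimes, populations)]
church_rates = [c / p for c, p in zip(churches, populations)]
r_rates = pearson_r(crime_rates, church_rates)

print(f"correlation of raw counts:       {r_counts:.2f}")
print(f"correlation of per-capita rates: {r_rates:.2f}")
```

The raw counts move together strongly because both grow with city size. Once each is expressed per resident, the built-in link disappears and the relationship should be much weaker.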
Example 1: An educational researcher finds that there is a strong relationship between the
number of hours a student studies and his/her grade point average (GPA). List a few possible
confounding variables.
SOLUTION: There is no guarantee that studying more causes a higher GPA. There are many
factors that might contribute to a higher GPA:
- More sleep
- Less stress (maybe due to lack of a job)
- Less television viewing
- Better study environment
- More support from family/friends
Issues in Planning a Study
There are many. Let's consider the following scenario to help illustrate a few.
Scenario: Suppose we want to test whether or not a newly designed Freud circular saw blade
runs at a lower temperature, and hence causes fewer burn marks in the wood, than the old blade at
7200 revolutions per minute (RPM).
Can we just run the cuts, take the temperatures, and compare? I think you know the answer to
this.
First off, we face many extraneous factors, or variables that are not of interest in the current
study but that are thought to affect the response variable. Examples? The person doing the
cutting with each blade (same or not?). The type of wood being cut (is one pine and
the other oak?). The type of saw (low-power Craftsman, or professional Jet?).
In order to keep these types of factors from affecting our measurement, we must control them.
We can do this by having the same person do the cutting, cutting boards that are exactly
the same, and using the same saw for both tests.
Secondly, is it sufficient to cut just one board using each blade? Definitely not. We must expect
that there will be some variation, or variability, in the temperatures we measure. That is, if I run
the cut with the old blade four times, I may read temperatures of 205°, 202°, 209°, and 219°. This
difference among the measurements is called variability. Thus, to take the
variability into account, we must take several replications, or repeated measurements. Then, we
would likely use the mean, or average, of the replications.
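As a quick check of the idea, the four readings quoted above can be averaged directly. This is a minimal sketch. The variable names are illustrative, and the range is included only as one simple way to see the variability.

```python
from statistics import mean

# The four temperature readings quoted above for the old blade.
old_blade_temps = [205, 202, 209, 219]

average_temp = mean(old_blade_temps)                   # center of the replications
spread = max(old_blade_temps) - min(old_blade_temps)   # simple view of variability

print(f"mean = {average_temp}, range = {spread}")
```

The four replications average 208.75 degrees, with 17 degrees separating the lowest and highest readings.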
Although it is far from the last, we will consider here one more important concept. You might not think
anything of it at first, but do you suppose that it's a good idea to use just two saw blades for the
experiment - one old, one new? What if we happened to get a faulty blade out of the
batch? If we run 4 replications with each blade, we might consider having 4 of the old blades and
4 of the new blades.
If you have a total of 8 sheets of wood to be cut, is it okay to cut the first 4 with the old blade and
the last 4 with the new blade? Surprisingly, the answer is "no." Why not? Suppose the sheets
were delivered freshly cut and still moist. Well, moisture is subject to gravity, and so the bottom
four boards might be more moist than the top four. Thus, we must randomize each board to one
of the two types of saw blades. In other words, we randomly assign each board to a blade. We
will not consider this any further at this point.
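Random assignment of the 8 boards can be sketched in a few lines. This is one simple way to do it, not a procedure prescribed by the text. The board numbering and the seed value are arbitrary choices made for illustration.

```python
import random

boards = list(range(1, 9))   # 8 sheets, numbered top (1) to bottom (8)

rng = random.Random(2012)    # any seed works; fixed here so the result repeats
shuffled = boards[:]
rng.shuffle(shuffled)

# First half of the shuffled order goes to the old blade, the rest to the new.
old_blade_boards = sorted(shuffled[:4])
new_blade_boards = sorted(shuffled[4:])

print("old blade:", old_blade_boards)
print("new blade:", new_blade_boards)
```

Because position in the stack no longer decides which blade a board gets, any moisture difference between top and bottom boards is spread across both blades by chance rather than lined up with one of them.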
Homework Problems - 1.1
1. Classify each of the following variables as nominal, ordinal, interval, or ratio scale.
Justify your answer.
a. Favorite flavor of ice cream
b. Temperature (°F)
c. Accounts Receivable Balance
d. Ranking of Presidential Candidates According to Preference
2. Based on a study of 2121 children between the ages of one and four, researchers at the
Medical College of Wisconsin concluded that there was an association between iron
deficiency and the length of time that a child is bottle-fed (Milwaukee Journal Sentinel,
November 26, 2005).
a. How many elements does this dataset contain?
b. Is the variable categorical or quantitative? Explain.
3. The student senate at a university with 15,000 students is interested in the proportion of
students who favor a change in the grading system to allow for plus and minus grades
(e.g., B+, B, B-, rather than just B). Two hundred students are interviewed to determine
their attitude toward this proposed change.
a. How many elements does this dataset contain?
b. Is the variable categorical or quantitative? Explain.
4. An article titled “Guard Your Kids Against Allergies: Get Them a Pet” (San Luis Obispo
Tribune, August 28, 2002) described a study that led researchers to conclude that “babies
raised with two or more animals were about half as likely to have allergies by the time
they turned six.”
a. Is this study an observational study or an experiment? Explain.
b. Describe a potential confounding variable that illustrates why it is unreasonable to
conclude that being raised with two or more animals is the cause of the observed
lower allergy rate.
5. The article “Television's Value to Kids: It's All in How They Use It” (Seattle Times, July
6, 2005) described a study in which researchers analyzed standardized test results and
television viewing habits of 1700 children. They found that children who averaged more
than two hours of television viewing per day when they were younger than 3 tended to
score lower on measures of reading ability and short-term memory.
a. Is the study described an observational study or an experiment?
b. Is it reasonable to conclude that watching two or more hours of television is the
cause of lower reading scores? Explain.
6. “More than half of California's doctors say they are so frustrated with managed care they
will quit, retire early, or leave the state within three years.” This conclusion from an
article titled “Doctors Feeling Pessimistic, Study Finds” (San Luis Obispo Tribune, July
15, 2001) was based on a mail survey conducted by the California Medical Association.
Surveys were mailed to 19,000 California doctors, and 2000 completed surveys were
returned.
a. Is this study an observational study or an experiment? Explain.
b. Describe any concerns you have regarding the conclusion drawn.
1.2 Descriptive vs. Inferential Statistics
1.2.1 The Purpose of Statistics and “Statistics”
Statistics is a branch of mathematics that deals with the analysis of data. This is often confusing
to some people, since the lower-case version of this word, statistic, actually means a piece of
data. So, we have statistics, which are the data themselves, and we have Statistics, which deals
with the analysis of statistics. Confusing, huh? We generally use the word statistics loosely to
mean “data.”
A statistician is a special type of mathematician who deals with the analysis of data. Many
people confuse the statistician with a person who simply has many statistics
memorized. While some certainly may, most do not.
Needless to say, our purpose in the field of Statistics is to understand data. Depending on one's
goal, statistics may be used simply to describe an obtained set of data or to extrapolate from the
data to describe something much larger. These two goals are called descriptive and
inferential statistics, respectively.
1.2.2 Descriptive Statistics<br />
Suppose you work in the accounting department and have collected the following data on<br />
revenues earned from new and existing customers over the past day:<br />
Account Type Revenue ($)<br />
New $5,296<br />
Old $2,230<br />
Old $7,643<br />
Old $3,897<br />
Old $9,590<br />
Old $2,689<br />
Old $5,890<br />
Old $9,561<br />
New $3,643<br />
New $8,861<br />
Old $3,946<br />
Your goal is to summarize the data in some meaningful way(s). Descriptive statistics is the<br />
method of describing or summarizing data. How could this be done?<br />
We first consider the types of variables we have present:<br />
- Account type – Categorical (values: New, Old)<br />
- Revenue – Quantitative (values range from $2,230 to $9,590)<br />
With categorical variables, we cannot mathematically manipulate the observed values, or<br />
observations (here we have “New” and “Old” for observations). We can only provide<br />
descriptions of the values.<br />
We can provide the relative frequency of these values. A relative frequency is the ratio of the<br />
number of observations of a given value to the total number of observations. Here, with 3 “New”<br />
and 8 “Old” accounts among the 11 observations, we could summarize by saying:<br />
Account Type Relative Frequency<br />
New 3/11 ≈ 27%<br />
Old 8/11 ≈ 73%<br />
This allows us to conclude that 27% of the day’s sales (by count, not by dollar amount) came from<br />
new clients while 73% came from existing clients. This is very valuable information! It shows that the<br />
company added new customers over the course of this one day.<br />
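For readers who want to check the percentages outside a spreadsheet, here is a minimal Python sketch of the relative frequency calculation. The function name relative_frequencies is made up for this illustration; the account types are copied from the revenue table above.<br />

```python
from collections import Counter

# Account types from the revenue table, in the order listed.
account_types = ["New", "Old", "Old", "Old", "Old", "Old",
                 "Old", "Old", "New", "New", "Old"]


def relative_frequencies(values):
    """Return {value: count / total} for a list of categorical observations."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


rel_freq = relative_frequencies(account_types)
for value, share in rel_freq.items():
    # Format each share as a whole-number percentage.
    print(f"{value}: {share:.0%}")
```

The shares always add up to 1 (that is, 100%), which is a quick check that no observation was missed.<br />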
We could present these two descriptive statistics to management either by providing the raw<br />
percentages or by some visual display, such as a pie chart or a bar graph. A pie chart shows<br />
the ratios (all parts of one whole) of the categorical variable, and thus the entire circle<br />
represents 100% of all account types (100% of the categorical variable values):<br />
[Pie chart of Account Type: New 27%, Old 73%]<br />
This literally shows the “ingredients” of the pie. A corresponding bar graph might be:<br />
[Bar graph: Account Type (New, Old) on the horizontal axis; vertical axis scaled from 0 to 9]<br />
In a similar way, we could describe Revenue, the quantitative variable. Typically, quantitative<br />
variables are described by:<br />
- Central tendency – a measure of the “typical” or center-most observation. Examples are the<br />
mean (average), the median (the middle value when the data are sorted), and the mode (the most<br />
frequently occurring value; it is typically not used, since data sets often do not have one).<br />
- Variability – a measure of how spread out the data values are. A number of possible<br />
measures exist, including (but not limited to) the range, the interquartile range, and the standard<br />
deviation.<br />
For the present time, we’ll proceed to describe one of each of the above descriptive statistics.<br />
The rest will be discussed in later sections.<br />
Since we’re most used to finding a simple average, or the mean, we will do that here. Recall that<br />
the mean can be found by summing the observations and dividing by the number of<br />
observations:<br />
\[ \text{Mean} = \frac{\$63{,}246}{11} \approx \$5{,}750 \]<br />
Recall that when we find an average, we are placing all values into a common “pot.” We then<br />
divide the pot into equal parts. That is to say, if each company had spent the same amount of<br />
money on each purchase, they would each have spent about $5,750. We like to think of this as a measure of<br />
the center value. Spending less than this amount puts a company below the average, and spending<br />
more puts the company above the average.<br />
Mean (Simple Average)<br />
The mean, or simple average, of a quantitative variable is expressed as:<br />
\[ \text{Mean} = \frac{\text{sum of all observations}}{\text{number of observations}} \]<br />
This value represents the amount allocated to each observation, if each observation were to<br />
receive an equal share of the total. We think of this as the “center” value.<br />
In conjunction with measures that summarize the center, it is critical to focus also on how spread<br />
out the data are. One such measure is the range. The range is simply the difference between the<br />
maximum and minimum values in the dataset. In this instance, we have:<br />
Minimum: $2,230<br />
Maximum: $9,590<br />
The difference is:<br />
\[ \$9{,}590 - \$2{,}230 = \$7{,}360 \]<br />
Thus, the range of the dataset is $7,360. This tells us that the amount spent varied by as much as<br />
$7,360 from company to company.<br />
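The mean and range found above can be double-checked with a few lines of Python. This is only a sketch: the helper names mean and value_range are chosen for this illustration, and the revenue figures are copied from the table.<br />

```python
# Revenue ($) for the eleven accounts, in table order.
revenues = [5296, 2230, 7643, 3897, 9590, 2689,
            5890, 9561, 3643, 8861, 3946]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def value_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)


print(f"Total: ${sum(revenues):,}")
print(f"Mean:  ${mean(revenues):,.0f}")
print(f"Range: ${value_range(revenues):,}")
```

Note that the mean is rounded to the nearest dollar only when it is displayed; the unrounded value is kept for any further calculation.<br />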
Range<br />
Range, a measure of the variability (or spread) of a dataset, is measured by taking the difference<br />
between the largest and smallest observed values. That is,<br />
\[ \text{Range} = \text{maximum value} - \text{minimum value} \]<br />
Example 1: For the example considered above, summarize the center and spread of revenue<br />
by account type. Describe any information revealed by splitting up the data in this fashion.<br />
SOLUTION: We are being asked to look at values specific to the account type. Thus, we will<br />
have two means and two ranges.<br />
For “New” accounts:<br />
Account Type Revenue ($)<br />
New $5,296<br />
New $3,643<br />
New $8,861<br />
For “Old” accounts:<br />
Account Type Revenue ($)<br />
Old $2,230<br />
Old $7,643<br />
Old $3,897<br />
Old $9,590<br />
Old $2,689<br />
Old $5,890<br />
Old $9,561<br />
Old $3,946<br />
We summarize this information in a table:<br />
Account Type | Average Revenue ($) | Max Revenue ($) | Min Revenue ($) | Range ($)<br />
New | 5,933 | 8,861 | 3,643 | 5,218<br />
Old | 5,681 | 9,590 | 2,230 | 7,360<br />
Grand Total | 5,750 | 9,590 | 2,230 | 7,360<br />
We see that both account types tend to have about the same average purchase amount. However, it<br />
appears that the amount spent by old customers is prone to more fluctuation than that of new<br />
customers. This might be due simply to the fact that there are only three new customers.<br />
Technology Note: All of the information above was generated using Microsoft Excel.<br />
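The same by-group summary can be sketched in Python as a cross-check on Example 1. The function summarize_by_group and its dictionary layout are assumptions made for this illustration, not part of Excel.<br />

```python
# (account type, revenue in $) pairs, copied from the revenue table.
records = [("New", 5296), ("Old", 2230), ("Old", 7643), ("Old", 3897),
           ("Old", 9590), ("Old", 2689), ("Old", 5890), ("Old", 9561),
           ("New", 3643), ("New", 8861), ("Old", 3946)]


def summarize_by_group(rows):
    """Group revenues by account type, then compute mean, max, min, range."""
    groups = {}
    for label, amount in rows:
        groups.setdefault(label, []).append(amount)
    summary = {}
    for label, amounts in groups.items():
        summary[label] = {
            "mean": sum(amounts) / len(amounts),
            "max": max(amounts),
            "min": min(amounts),
            "range": max(amounts) - min(amounts),
        }
    return summary


summary = summarize_by_group(records)
for label, stats in summary.items():
    print(label, round(stats["mean"]), stats["max"], stats["min"], stats["range"])
```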
1.2.3 Inferential Statistics<br />
Descriptive statistics is a great way to describe what you have, but how can we describe data that<br />
we do not have?<br />
Let’s consider an example. You are the manager of the production branch at Healthy Heart<br />
Organic Foods. Due to recent workload increases, you are concerned that your employees’ team<br />
morale has decreased. You have 864 employees working in your department. You would like to<br />
conduct a survey, but you do not have the means to investigate the data in each of the surveys<br />
that would be returned. Certainly, you could pay your assistant overtime to analyze them for you, but that<br />
would be costly in both his time and your payroll. Instead, you decide to randomly survey 50 of the<br />
employees in your department in order to get an idea of the overall morale. This process of<br />
collecting data on a smaller portion of the whole in order to generalize to the whole is known as<br />
statistical inference. This branch of statistics is called inferential statistics.<br />
It is of utmost importance to draw appropriate conclusions when reporting the findings of any study, whether<br />
a survey or an experiment. For example, if we find that rats die after ingestion of 20mg of<br />
caffeine, does that mean caffeine will kill a human as well? This brings up the worthwhile<br />
discussion of a population versus the sample. Let’s consider the figure below:<br />
First off, a researcher must decide who his target population is. That is, is he trying to describe<br />
all people in the United States? All Asian children between the ages of 2 and 5? All elk in<br />
Minnesota? The population is the set of all people, creatures, things, etc., that we wish to<br />
describe.<br />
It is often quite time-consuming and costly to conduct a study based on whole populations. Even<br />
presidential polls rarely involve more than a couple thousand participants. Through one of a<br />
variety of processes, only a select number of elements of the target population will be selected.<br />
This select number is referred to as the sample. The process of selecting a sample from the<br />
population that we will consider is simple random sampling (SRS). This process helps to<br />
ensure that any differences that we notice among sample elements are entirely due to chance and,<br />
importantly, that every element in the target population has an equal chance of being in<br />
the sample.<br />
Simple random sampling can be done by many means. You’ve probably heard of the random<br />
process of drawing a name from a box to declare the winner of a raffle. A more sophisticated<br />
approach uses a random number generator on a computer, wherein every element of<br />
the population is assigned a whole number. Then, a series of random numbers is drawn by the<br />
computer, and the elements with those numbers are selected to be in the sample.<br />
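As a rough illustration of the random number approach, the Python sketch below numbers the 864 employees from the morale example 1 through 864 and draws 50 of them without repeats. The function name simple_random_sample and the seed value are arbitrary choices for this example; the seed only makes the draw repeatable.<br />

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct ID numbers from 1..population_size.

    Every ID has the same chance of being chosen, and no ID is chosen twice.
    """
    rng = random.Random(seed)
    ids = range(1, population_size + 1)
    return sorted(rng.sample(ids, sample_size))


# 864 employees, 50 to be surveyed (numbers from the morale example).
sample_ids = simple_random_sample(864, 50, seed=2024)
print(sample_ids)
```

Leaving out the seed gives a different sample on every run, which is what you would want in an actual survey.<br />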
We can see in the illustration above that our goal is then to make inferences about the population<br />
based on our observations of the sample. Just as you might hear from Gallup, “55% of voters<br />
plan on voting for Candidate X,” we try to make generalizations about the target population based on the sample.<br />
As another example, consider a lighting company that is hoping to manufacture a light bulb with<br />
a new type of filament. As with any light bulb, a consumer would want to know how long the<br />
light bulb is expected to last. Unfortunately, not every light bulb will last as long as every<br />
other light bulb. This means that an average will have to be taken. To add to this, it is not<br />
possible to test every single light bulb to determine how long it will last. So, the company<br />
decides to randomly test 200 bulbs that come through the assembly line. They hope to use this<br />
sample, since it is random and is assumed to be representative of all light bulbs, to estimate the<br />
true average lifespan of a light bulb with this new filament. Here is an overview of their<br />
inferential statistics process:<br />
(SOURCE: Essentials of Modern Business Statistics, 4th Edition, Anderson et al.)<br />
Though it might seem simple enough to conclude that the average light bulb survives for 76<br />
hours, we have to take into account the variability in the lifetimes. That is to say, we need some<br />
way to produce a reasonable interval for the true average, since it is the entire population we are<br />
looking to describe. A discussion of this inference process is left for future sections.<br />
Homework Problems - 1.2<br />
1. Over its first week in the Box Office (12/14/2012 to 12/20/2012), the movie The Hobbit:<br />
An Unexpected Journey grossed the following amounts, in millions of dollars (no<br />
particular order):<br />
6.9 9.2 1.6 1.9 1.9 1.6 4.9<br />
(SOURCE: www.the-numbers.com)<br />
a. Calculate the mean.<br />
b. Explain the real-world meaning of the mean.<br />
c. Calculate the range.<br />
d. Explain the real-world meaning of the range.<br />
e. Provide a brief written report (summary) to the producers of the film on how the<br />
film is doing and the stability of gross revenues.<br />
2. A marketing firm conducts a focus group with eighteen randomly selected college<br />
students to determine their preference for a variety of clothing lines.<br />
a. Describe the sample.<br />
b. Describe the population.<br />
c. What variables might the marketing firm want to measure?<br />
d. Is the firm’s goal to conduct descriptive or inferential statistics?<br />
3. In a quality control process, 250 packages of cheese are randomly selected from an<br />
assembly line. Each package of cheese will be described as either “pass” or “fail,”<br />
depending on whether or not it passes the inspection.<br />
a. Describe the sample.<br />
b. Describe the population.<br />
c. Quality control will fail if more than 1% of the packages fail. How many<br />
packages must pass?<br />
4. Two datasets have a range of 30. Describe how it is possible that one dataset is<br />
considered to be more spread out than the other dataset.<br />
5. One hundred randomly selected CGCC students are surveyed and asked, “Do you believe<br />
that racism is an issue in the college setting?” The survey makers would like to generalize<br />
to college students. What is wrong with their study?<br />
1.3 Statistics in Excel<br />
When conducting an analysis of realistic amounts of data, it is tiresome, mundane, and even<br />
infeasible to carry out computations by hand. Microsoft Excel is a powerful and widely<br />
accessible piece of software that does this for us. As such, we seek to better understand how it<br />
works in this section. All images below come from the version of Microsoft Excel that was most recent at the time of writing.<br />
Excel is spreadsheet-based software. This means that each entry, or cell, holds one piece<br />
of information and is part of a larger grid of cells. A cell may contain numerical or textual<br />
information.<br />
1.3.1 Sum(), Average(), Min(), and Max()<br />
Eventually, you will learn to make beautiful spreadsheets, but for now we are concerned only with<br />
some basic features. Let’s begin by entering the following accounting data from Section 1.2:<br />
Account Type Revenue ($)<br />
New $5,296<br />
Old $2,230<br />
Old $7,643<br />
Old $3,897<br />
Old $9,590<br />
Old $2,689<br />
Old $5,890<br />
Old $9,561<br />
New $3,643<br />
New $8,861<br />
Old $3,946<br />
We can choose any cell we want to begin entering data. Let’s choose cell A1 to type in the<br />
header. This cell reference means that we are looking at column A and row 1. We will enter our<br />
second column’s label into cell B1. We will list the data vertically, as shown in the table above.<br />
After clicking on a cell and typing in each entry, simply press ENTER or TAB to move to the<br />
next cell. Do not press ESC, or the data you are typing will be cancelled.<br />
In order to see the entire labels in cells A1 and B1, we can widen a column by placing the<br />
cursor on the boundary between the grey-shaded labels for columns A and B, clicking, holding, and dragging the<br />
column to an appropriate width.<br />
We can make it a bit more presentable by centering and bolding the labels.<br />
Excel is extremely useful because it allows us to create formulas based on the values<br />
of existing cells or cell ranges (a collection of one or more cells).<br />
A formula can act either on a provided value or on a provided set of cells. For example, suppose<br />
we want to add up the total revenue. We want the result to appear in cell D3. To initiate a<br />
formula, we must begin with = in the desired formula cell. Thus, we could click cell D3 and<br />
type every revenue value, separated by plus signs:<br />
= 5296+2230+7643+3897+9590+2689+5890+9561+3643+8861+3946<br />
This, however, would defeat the purpose of having entered all the data already! So, we will<br />
use the built-in sum function. To use this, we type:<br />
= sum(B2:B12)<br />
This tells Excel to sum up the range of values from B2 to B12. The colon indicates that we want<br />
the full range and not just the two cells B2 and B12. If we had wanted to sum only cells<br />
B2 and B12 (nothing in between), then we would have replaced the colon with a comma.<br />
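The difference between a colon and a comma can be imitated in Python. In the sketch below, a dictionary stands in for column B of the spreadsheet; the helper names sum_range and sum_cells are invented for this illustration and are not Excel functions.<br />

```python
# Revenue values as they would sit in cells B2 through B12.
revenues = [5296, 2230, 7643, 3897, 9590, 2689,
            5890, 9561, 3643, 8861, 3946]
cells = {f"B{row}": amount for row, amount in enumerate(revenues, start=2)}


def sum_range(sheet, column, first_row, last_row):
    """Like sum(B2:B12): add every cell from first_row through last_row."""
    return sum(sheet[f"{column}{row}"] for row in range(first_row, last_row + 1))


def sum_cells(sheet, *names):
    """Like sum(B2,B12): add only the cells that are named."""
    return sum(sheet[name] for name in names)


full_total = sum_range(cells, "B", 2, 12)    # all eleven revenues
two_cells = sum_cells(cells, "B2", "B12")    # first and last revenue only
print(full_total, two_cells)
```

The two results differ by the nine values that lie between the first and last cells.<br />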
NOTE: Excel is not case-sensitive when it comes to formulas. You can type SUM or Sum or<br />
even sUm and Excel will recognize what you are asking it to do. The same holds for text criteria<br />
in functions such as countif(): “New” and “new” are treated as the same value. Spelling and<br />
stray spaces do matter, however, so enter categorical values consistently.<br />
We get:<br />
(NOTE: It is highly recommended that you label your spreadsheet values. Before or after<br />
inserting the sum into D3, it is a good idea to label that cell’s content, perhaps in cell C3 as<br />
shown above. This will be very helpful when your spreadsheet is loaded with information.)<br />
To get the proper formatting, highlight cell D3 and select “Currency” from the Number group<br />
in the Home tab. This formatting applies only to the selected cell(s).<br />
To find the average revenue, we would simply type the following into the desired cell (we’ll use<br />
D4):<br />
= average(B2:B12)<br />
Excel does not have a built-in function for the range. It does have<br />
functions to locate the maximum and minimum values in a range of cells. Into cell D5, we will<br />
type:<br />
= max(B2:B12) - min(B2:B12)<br />
This will find the maximum value from B2 to B12 and subtract the minimum from B2 to<br />
B12, giving us precisely the range. If it is desirable to see the max or the min, you can choose a<br />
cell and simply type in the max portion or the min portion without doing the subtraction, as<br />
shown below:<br />
Suppose that this company assumes that the day’s revenue of $63,246 is (roughly) what it can expect to<br />
earn each day over the next 30-day month. To get the month’s revenue we would like to<br />
multiply this amount by 30. To do this, we would simply type into our desired output cell:<br />
= 30*D3<br />
NOTE: To indicate multiplication in Excel formulas, you must use the multiplication sign (*).<br />
Using parentheses alone to indicate multiplication, as in 30(D3), will produce an error.<br />
There are literally hundreds of functions available in Excel. A very useful way to learn<br />
how to do new things in Excel is to Google what you are trying to accomplish. For example, if I<br />
wanted to find the standard deviation of revenues, I might search Google for “standard deviation<br />
in Excel.” Thousands of results are bound to pop up. Why stop there? Try YouTube for many<br />
useful videos.<br />
1.3.2 Countif()<br />
It is nice to know that Excel has formulas to operate on quantities, but it would still be<br />
tedious to have to count categorical values by hand.<br />
The countif() function is useful for this task. It works as follows: you provide a<br />
range of cells for the function to evaluate. You then provide a condition that it should search for,<br />
and it counts the number of cells that meet it. Suppose we want to count the number of new<br />
accounts. The account types are in cells A2 to A12, so we would enter:<br />
= countif(A2:A12, "New")<br />
NOTE: We separate the cell range from the condition with a comma. After the comma, we type, in quotation marks, the<br />
word it is to search for. The match ignores letter case, but the spelling must otherwise match<br />
what is in the cells.<br />
We get:<br />
We can do the same for Old.<br />
A neat little trick is to modify our formula. Let’s say that we want to minimize the number of<br />
places in our spreadsheet that we would need to change if, say, we began calling “New” accounts<br />
“NB” for “New Business.” We would need to change all the account type names, as well as the<br />
search criterion in the formula. To make this easier, we can tell our formula to search for<br />
something that is already typed into an existing cell. Since C10 contains the actual word we want<br />
to search for, we will simply put C10 after the comma instead of the word “New.”<br />
= countif(A2:A12, C10)<br />
This tells Excel which cells to count and which cell holds the search criterion. We still<br />
get the same result. A word of caution: if you modify the entry in C10, your result in D10 will<br />
change accordingly (or it might produce an error).<br />
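The idea of keeping the search word in one place can be mirrored in Python. In this sketch the variable criterion_cell plays the role of the cell that holds the search word, and count_if is a hypothetical helper that counts plain matches; it does not model every rule Excel applies inside countif().<br />

```python
account_types = ["New", "Old", "Old", "Old", "Old", "Old",
                 "Old", "Old", "New", "New", "Old"]


def count_if(values, criterion):
    """Count how many entries in values are equal to criterion."""
    return sum(1 for value in values if value == criterion)


# The criterion lives in one place, like a label typed into a single cell.
criterion_cell = "New"
new_count = count_if(account_types, criterion_cell)

# Renaming the category: update the data and the one criterion "cell".
renamed_types = ["NB" if value == "New" else value for value in account_types]
criterion_cell = "NB"
nb_count = count_if(renamed_types, criterion_cell)

print(new_count, nb_count)
```

Because the counting step refers only to criterion_cell, the formula itself never has to be edited when the label changes.<br />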
Homework Problems - 1.3<br />
1. A telemarketing company enforces a new policy prohibiting employees from sending personal<br />
emails. A climate survey was then conducted, asking a number of randomly selected<br />
employees whether they agree with the policy and how long they have been with<br />
the company. The results are below:<br />
Agrees w/Policy Change (Y/N)    Years at Company<br />
Y 4<br />
Y 8<br />
N 3<br />
Y 10<br />
N 3<br />
N 3<br />
N 6<br />
Y 3<br />
Y 5<br />
N 8<br />
N 1<br />
Y 8<br />
Y 10<br />
Y 5<br />
Y 8<br />
Y 3<br />
N 8<br />
N 8<br />
Y 9<br />
a. Determine the mean number of years this sample has been with the company.<br />
b. Determine the minimum and maximum number of years a person from this<br />
sample has been with the company.<br />
c. Determine the combined overall number of years this sample has been with the<br />
company.<br />
d. Determine the frequency with which people within this sample agreed and<br />
disagreed with the policy change.<br />
e. Calculate the mean, the minimum and maximum, and the range for each of the<br />
two groups (agree and disagree).<br />
f. Describe any patterns that emerged when considering the two groups separately.<br />
Chapter 2<br />
Visual Representations of Data<br />
2.1 Visualizing Categorical Data<br />
When summarizing data, it goes without saying that there are appropriate and inappropriate ways to<br />
display the data. For example, if you collected a person’s age and income, you might be<br />
interested in studying income as a function of age. In this case, you probably would not want to<br />
build a pie chart, since you’re studying quantitative variables (two of them, at that).<br />
In the previous chapter, the main types of categorical data visualizations were mentioned – bar<br />
graphs and pie charts. Our aim here is simply to summarize and to show how to use them in<br />
conjunction with Excel. We’ll create three types of representations:<br />
- Pie Chart<br />
- Frequency Bar Graph – Vertical axis tracks the number of instances of each<br />
observation<br />
- Relative Frequency Bar Graph – Vertical axis tracks the proportion of instances of each<br />
observation (decimal or percentage, typically)<br />
2.1.1 Creating a Pie Chart Using Excel<br />
Suppose a hotel owner asks 20 randomly selected recent guests to respond to the following<br />
statement regarding their experiences at the new hotel lounge:<br />
“The dining experience in Harlan’s Hotel Lounge is worth revisiting.”<br />
Respondents circle one of the following letter combinations:<br />
- SD - Strongly Disagree<br />
- D - Disagree<br />
- A - Agree<br />
- SA - Strongly Agree<br />
The resulting data are shown below:<br />
Participant 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20<br />
Opinion D A SD A SD SA A A A A A A D A A A A A A A<br />
To present the data to his shareholders, his marketing team constructs the visual<br />
representations listed above.<br />
Since the participant number is not important, it is okay to ignore that line of the dataset. Our<br />
focus is on the Opinion row. This is a categorical variable, so we’ll begin by counting the<br />
number of SD, D, A, and SA responses by using Excel’s countif() function. Further, we’ll calculate<br />
the relative frequency of each response by dividing the number of responses for each category by<br />
the total number of observations, which we tally below all the individual frequencies:<br />
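As a cross-check on the spreadsheet, the frequency and relative frequency of each opinion can be tallied with a short Python sketch. The responses are copied from the Opinion row above; the variable names are chosen for this illustration.<br />

```python
from collections import Counter

# The 20 responses from the Opinion row, in participant order.
opinions = "D A SD A SD SA A A A A A A D A A A A A A A".split()
categories = ["SD", "D", "A", "SA"]

counts = Counter(opinions)
total = sum(counts[category] for category in categories)

for category in categories:
    frequency = counts[category]
    # Relative frequency = category count divided by the total count.
    print(f"{category:>2}  {frequency:>2}  {frequency / total:.0%}")
print(f"Total {total}")
```

The total is computed from the category counts themselves, just as the spreadsheet tallies it below the individual frequencies.<br />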
One new trick worth mention<strong>in</strong>g is Excel‟s ability to recognize patterns <strong>in</strong> our <strong>for</strong>mulas. Let‟s<br />
say that we typed <strong>in</strong> our countif() <strong>for</strong>mula <strong>for</strong> SD <strong>in</strong> G7 as follows.<br />
<strong>Statistics</strong> <strong>for</strong> <strong>Decision</strong>-<strong>Mak<strong>in</strong>g</strong> <strong>in</strong> Bus<strong>in</strong>ess © Milos Podmanik Page 30
We now have to enter a <strong>for</strong>mula <strong>for</strong> the three rema<strong>in</strong><strong>in</strong>g op<strong>in</strong>ions. This can get time-consum<strong>in</strong>g.<br />
So, we attempt to copy cell G7 and paste it <strong>in</strong> G8:<br />
This does work! Note that, since we shifted the formula down one row, F7 turned into F8. That<br />
is, the search criterion is now being “pulled” from F8, the cell corresponding to an opinion of ‘D’.<br />
However, we have one problem: the counting region also shifted from D6:D25 to D7:D26. We<br />
don’t want that! To tell Excel that we still want the counting region to be D6:D25 and to not<br />
change when we copy our formula, we “lock” the columns and rows by putting a dollar sign ($)<br />
before the column letter and before the row number, so that the range reads $D$6:$D$25.<br />
(HINT: If you place your cursor on a cell name in the formula and press the F4 key on your<br />
keyboard, Excel will toggle the dollar signs for you.)<br />
Notice that F7 contains no dollar signs. This tells Excel that we want the criteria cell to move<br />
down one row (still in column F) each time the formula moves down one row. We can now copy-paste<br />
the formula down the remaining cells.<br />
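To see why the dollar signs matter, the copy-paste behavior can be imitated outside Excel. The sketch below is only an illustration and not Excel's actual implementation: `copy_down` is a hypothetical helper, and the formula text `=COUNTIF(D6:D25,F7)` is an assumption based on the counting region and criteria cell named above. It models copying down rows only and handles only simple cell references.<br />

```python
import re

def copy_down(formula: str, rows: int) -> str:
    """Imitate copying a formula down `rows` rows.

    Row numbers without a leading $ are shifted; row numbers
    locked with $ are left unchanged. Columns never change
    because the formula only moves vertically.
    """
    def shift(match):
        col_lock, col, row_lock, row = match.groups()
        if not row_lock:                      # unlocked row: shift it
            row = str(int(row) + rows)
        return f"{col_lock}{col}{row_lock}{row}"

    # A reference is: optional $, column letters, optional $, row digits.
    return re.sub(r"(\$?)([A-Z]+)(\$?)(\d+)", shift, formula)

unlocked = "=COUNTIF(D6:D25,F7)"        # assumed text of the G7 formula
locked = "=COUNTIF($D$6:$D$25,F7)"      # same formula with the range locked

print(copy_down(unlocked, 1))   # the counting region drifts
print(copy_down(locked, 1))     # only the criteria cell moves
```

With the unlocked formula, both the range and the criteria cell move down a row; with the locked one, only F7 becomes F8.<br />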
In G11, we would like the sum of the frequencies, so we type:<br />
= sum(G7:G10)<br />
The result is 20, which matches the number of participants in the data, so this value is correct!<br />
To get the relative frequencies, we want to divide each frequency by the constant 20. For<br />
instance, the relative frequency of ‘SD’ would be 2/20 = 0.1. Instead of telling Excel to divide 2<br />
by 20, we will type the following formula into H7:<br />
= G7/$G$11<br />
Note that we lock cell G11 so that, when we copy this formula to the remaining cells, we<br />
continue to divide by 20, the value in G11.<br />
It is neat to note that we can copy the formula all the way down to H11, since it will simply take<br />
20 and divide it by 20, indicating that the total is 1 or 100% of the data.<br />
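As a cross-check on the spreadsheet, the same frequency table can be sketched in a few lines of Python. This is a minimal sketch rather than part of the Excel workflow: `Counter` stands in for countif(), and the variable names are chosen here for illustration.<br />

```python
from collections import Counter

# The 20 survey responses from the Opinion row, in participant order.
opinions = ["D", "A", "SD", "A", "SD", "SA", "A", "A", "A", "A",
            "A", "A", "D", "A", "A", "A", "A", "A", "A", "A"]

categories = ["SD", "D", "A", "SA"]           # same order as the table
counts = Counter(opinions)                    # plays the role of countif()
total = sum(counts[c] for c in categories)    # plays the role of sum()

# Relative frequency = frequency / total, as in = G7/$G$11.
relative = {c: counts[c] / total for c in categories}

for c in categories:
    print(f"{c:>5}  {counts[c]:>2}  {relative[c]:.2f}")
print(f"Total  {total:>2}  {sum(relative.values()):.2f}")
```

The relative frequencies add up to 1, just as the copied formula in H11 shows.<br />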
We are now prepared to construct visuals.<br />
To build a pie chart, we can simply highlight the four opinions and the corresponding<br />
frequencies (click and drag from cell F7 to G10), select the Insert tab, click on Pie in the<br />
Charts column, and select the desired pie chart. We’ll select the first one.<br />
Alternatively, it is possible to <strong>in</strong>sert a blank pie chart and to then select the data afterwards. The<br />
above process saves a couple of steps.<br />
Now we would like to label the chart. It would be nice to see a title and the percentages <strong>for</strong> each<br />
of the slices. To do this, select the chart and click on Design <strong>in</strong> the Chart Tools tab that appears.<br />
In the Chart Layouts column, we can select the style of chart most appropriate to our needs. For<br />
demonstration purposes, the first option will be shown below:<br />
To add a suitable title, click “Chart Title” and overwrite it with an appropriate name. If the pie<br />
chart becomes distorted or labels are moved undesirably, the chart box can be adjusted by dragging<br />
out its corners.<br />
There are many options when it comes to <strong>for</strong>matt<strong>in</strong>g graphs and charts. This will be left <strong>for</strong><br />
exploration. Note also that many onl<strong>in</strong>e sources, such as YouTube, offer tutorials on professional<br />
<strong>for</strong>matt<strong>in</strong>g with<strong>in</strong> Excel.<br />
2.1.2 Creating a Bar Graph Using Excel<br />
Depending on what one would like to emphasize, a bar graph may be suitable to meet that need.<br />
We can create either a frequency bar graph or a relative frequency bar graph, depending on whether we<br />
want to display the number of times an observation appears or the percentage of observations<br />
resulting in each of the possible variable values.<br />
Using our example from above, since the frequencies are in the column adjacent to the opinion<br />
values, we can simply highlight all opinions and frequencies, select the Insert tab, go to the<br />
Charts column, and select the first 2-D Column graph from Column. Be careful not to select the<br />
Total row.<br />
[Figure: column chart of the opinion frequencies; horizontal axis: SD, D, A, SA; vertical axis: 0 to 16; legend: Series1]<br />
Since there is only one variable here, we can click on “Series1” in the legend and press DELETE.<br />
This will free up some space.<br />
[Figure: the same column chart with the “Series1” legend removed; horizontal axis: SD, D, A, SA; vertical axis: 0 to 16]<br />
With the graph selected, choose the Layout tab that appears in the Chart Tools area.<br />
You can label the graph by selecting appropriate options from “Chart Title” and “Axis Titles” on<br />
the left side of the selected tab.<br />
[Figure: bar graph “Guest Opinions of Harlan's Lounge”; horizontal axis: Opinion (SD, D, A, SA); vertical axis: Frequency (0 to 16)]<br />
In the relative frequency bar graph, we wish only to change the measurement on the vertical axis.<br />
We want to draw the proportions from the third column of our data.<br />
We can update our current bar graph to reflect this. If you do not want to lose the <strong>in</strong><strong>for</strong>mation <strong>in</strong><br />
your frequency bar graph, you can copy the graph and paste it beside the exist<strong>in</strong>g graph. This<br />
will allow us to modify the data that is be<strong>in</strong>g drawn <strong>in</strong>.<br />
Select the copied graph. In Chart Tools, select the Design tab. From there, click on Select<br />
Data.<br />
Select the “Edit” option above the “Legend Entries” box.<br />
Beside the “Series values” box, click the icon. This will now allow you to select the values<br />
of the dependent variable. Click and drag to select all the relative frequencies, except the total<br />
frequency. Then press the icon to close the dialogue box. After relabel<strong>in</strong>g the vertical axis,<br />
you should now see:<br />
[Figure: bar graph “Guest Opinions of Harlan's Lounge”; horizontal axis: Opinion (SD, D, A, SA); vertical axis: Relative Frequency (0 to 0.8)]<br />
We notice that both graphs look nearly identical. This is due to the fact that the relative<br />
frequencies are proportional to the frequencies: each one is the frequency multiplied by 1/20.<br />
2.1.3 Conclusions<br />
The owner of the hotel can reasonably conclude that 80% of the 20 guests surveyed enjoyed the lounge<br />
(enough to consider revisiting!). He can conclude that 20% of these guests either did not care for it<br />
or absolutely hated it! If he is interested in additional repeat visitors, perhaps he might like to<br />
determine how to make the experience better for those who seem to be highly dissatisfied. Are<br />
these descriptive measures representative of the entire population of visitors? To a greater or<br />
lesser extent – perhaps.<br />
Homework Problems - 2.1<br />
1. The follow<strong>in</strong>g dataset represents the meat selection made by <strong>in</strong>dividuals at a d<strong>in</strong>ner<br />
banquet. Attendees selected from beef (B), chicken (C), veal (V), or pork (P).<br />
B C B C V B C<br />
C C P P B B C<br />
a. Is this data categorical or quantitative?<br />
b. Create a table that shows the frequency and relative frequency <strong>for</strong> each of the<br />
choices. Use Excel.<br />
c. Create a frequency bar graph. Label all axes.<br />
d. Create a relative frequency bar graph. Label all axes.<br />
e. Create a pie chart. Label all slices.<br />
f. Write a brief report (summary) describ<strong>in</strong>g the meal preferences of these attendees.<br />
Describe any general trends. Use specific data and make appropriate conclusions.<br />
2. The follow<strong>in</strong>g data represents per capita meat consumption (pounds per person) <strong>in</strong> 2009<br />
<strong>for</strong> a variety of meats (SOURCE: U.S. Statistical Abstract, Table 217).<br />
Meat Pounds per Person<br />
Beef 58.1<br />
Veal 0.3<br />
Lamb and mutton 0.7<br />
Pork 46.6<br />
Chicken 56.0<br />
Turkey 13.3<br />
a. Us<strong>in</strong>g Excel, f<strong>in</strong>d the mean and range of the data.<br />
b. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the mean you found.<br />
c. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the range you found.<br />
d. What conclusions can be made about the center and spread of per-capita meat<br />
consumption?<br />
3. On open<strong>in</strong>g day, the owners of Green Heart Restaurant <strong>in</strong>vited 29 food critics to be a part<br />
of the cul<strong>in</strong>ary experience. Each critic gave a grade of A (Best), B, C, D, or F (Worst) to<br />
reflect the quality of the overall d<strong>in</strong><strong>in</strong>g experience. The scores are shown below:<br />
A B B A C B C B B<br />
D C B B A A C C C<br />
C B A D C C B B B<br />
A B<br />
a. Generate a relative frequency bar chart.<br />
b. Generate a pie chart.<br />
c. What should the owners take away from the experiences of the critics?<br />
4. Consider the scenario in problem 1.<br />
a. What is the sample?<br />
b. What is the population of interest?<br />
c. What other variable(s) might be of interest to the data analyst to better study<br />
attendees’ eating preferences?<br />
5. Consider the scenario in problem 3.<br />
a. What is the sample?<br />
b. What is the population of interest?<br />
c. What other variable(s) might be of interest to the data analyst to better study the<br />
target demographic?<br />
6. Suppose you are the owner of an accounting firm. You would like to better understand<br />
the employment of the residents within ten miles of your firm.<br />
a. What variables would you collect? Which are quantitative and which are<br />
qualitative?<br />
b. What is the population of interest?<br />
c. How would you go about collecting data for this study? Be specific.<br />
2.2 Visualizing Quantitative Data<br />
To make an assessment of how efficient the technical support department is <strong>in</strong> help<strong>in</strong>g customers<br />
solve software issues, management keeps track of the length of each phone call tak<strong>in</strong>g place over<br />
the day. They f<strong>in</strong>d the follow<strong>in</strong>g:<br />
Length of Call (m<strong>in</strong>.)<br />
1 2 13 4 12<br />
4 10 6 6 9<br />
4 3 4 0 12<br />
6 4 4 13 15<br />
0 4 10 4 10<br />
7 2 10 8 4<br />
7 0 4 4 4<br />
Since this data is quantitative, the visual displays discussed so far (pie charts and bar graphs) are not<br />
appropriate. However, management would still like to visualize the 35 observations.<br />
One quick, by-hand technique to visualize how the times are distributed is a dot plot: a simple<br />
number line, with any repeated values stacked above one another. Given the presence of great technology, we<br />
will use Excel to create a histogram, which is a graph similar to a bar graph (it can show either<br />
frequency or relative frequency). The difference is that, instead of having nominal categories on<br />
the horizontal axis, we will create numerical categories. For example, we could simply create<br />
a tick mark for each observation value present in the table and then display the number of times<br />
it appears. Often, with small amounts of data, the graph may appear spread out. In this case, we<br />
might decide to create a bar representing, say, all calls that fall between 0 and 3 minutes. Let’s<br />
demonstrate both:<br />
[Figure: histogram “Call Times”; horizontal axis: Length (min.), one bar for each observed value 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15; vertical axis: Frequency (0 to 14)]<br />
We clearly see that most calls are between about 4 and 10 minutes (a 4-minute call is most<br />
frequent – the mode). Alternatively, we might choose to create equal-width categories. Let’s say<br />
we have categories that show the times as 3-minute blocks (the last block, 12-15, is one minute wider so that the longest call is included):<br />
[Figure: histogram “Call Times”; horizontal axis: Length (min.), with bins 0-2, 3-5, 6-8, 9-11, 12-15; vertical axis: Frequency (0 to 14)]<br />
Beautiful! Now it is clearer how call times are distributed. This visualization is a bit simpler<br />
than the one above, as it groups times into more manageable categories. Note that the bars are<br />
touching. This is what distinguishes a histogram from a bar graph – we want to emphasize that<br />
times are continuous and that every length of time between 0 and 15 minutes is accounted for (even<br />
fractions of a minute, potentially).<br />
We can make these categories as wide or narrow as we‟d like. We call these categories b<strong>in</strong>s.<br />
Th<strong>in</strong>k about this as you would about sort<strong>in</strong>g recycl<strong>in</strong>g materials <strong>in</strong>to one of several b<strong>in</strong>s.<br />
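The idea of sorting observations into bins can be sketched in Python before we turn to Excel. This is an illustrative sketch only: `bin_counts` is a helper name made up here, and the bin edges are taken from the labels on the grouped histogram above.<br />

```python
# Call lengths in minutes, copied row by row from the table above.
call_times = [1, 2, 13, 4, 12,
              4, 10, 6, 6, 9,
              4, 3, 4, 0, 12,
              6, 4, 4, 13, 15,
              0, 4, 10, 4, 10,
              7, 2, 10, 8, 4,
              7, 0, 4, 4, 4]

# Bins as (low, high) pairs with both ends included. The last bin is
# one minute wider so that the 15-minute call has somewhere to go.
bins = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 15)]

def bin_counts(values, bins):
    """Return how many values fall into each (low, high) bin."""
    return {f"{low}-{high}": sum(low <= v <= high for v in values)
            for low, high in bins}

frequencies = bin_counts(call_times, bins)
for label, count in frequencies.items():
    print(f"{label:>5}  {'*' * count}  ({count})")
```

Changing the pairs in `bins` makes the categories wider or narrower, which is the same choice we make when setting bin widths in Excel.<br />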
2.2.1 Creating a Frequency Histogram Using Excel<br />
The most time-consum<strong>in</strong>g part of build<strong>in</strong>g a histogram by hand is organiz<strong>in</strong>g the data and<br />
count<strong>in</strong>g the number of observations. Excel does this quite easily via the use of a pivot table. A<br />
pivot table is a “live” table whose values can be <strong>for</strong>matted <strong>in</strong> many different ways.<br />
We must first beg<strong>in</strong> with the dataset <strong>in</strong> Excel as a raw column or row of data:<br />
To <strong>in</strong>sert a pivot table, highlight the entire set of data, <strong>in</strong>clud<strong>in</strong>g the data label. Click on the<br />
Insert tab and choose the PivotTable option from the Tables column. A data prompt should<br />
appear with the table range already appear<strong>in</strong>g <strong>in</strong> the box:<br />
You can either choose to have Excel place the table with<strong>in</strong> the same worksheet, or you can have<br />
it create a new one. This choice is up to you. If you choose “Exist<strong>in</strong>g Worksheet” you will have<br />
to specify a cell to paste it to. Choose a cell that is out of the way of any exist<strong>in</strong>g data so that it<br />
doesn‟t “bump” <strong>in</strong>to it if the pivot table becomes quite large.<br />
You should now see someth<strong>in</strong>g similar to the table below:<br />
When highlighted, a “PivotTable Field List” w<strong>in</strong>dow should appear to the right of your screen<br />
with the name(s) of the variable(s) <strong>in</strong> the “Choose fields to add to report” box.<br />
This generic template will now allow us to construct a table. From the PivotTable Field List<br />
w<strong>in</strong>dow, we will drag the Times variable <strong>in</strong>to the Row Labels box. This will create a series of<br />
rows with each of the observations appear<strong>in</strong>g, only once. Thus, we will not have to see repeats!<br />
If we had additional variables, the row labels can be any variable desired. For each of these rows,<br />
we would like to see a frequency count. This is where the “Values” box comes <strong>in</strong> handy. Drag<br />
the Times variable <strong>in</strong>to the “Values” box:<br />
The values of time are, by default, the sums of the times <strong>for</strong> each of the row labels. This is not<br />
what we want. We want “Count of Times.” To change the type of value, click the arrow on the<br />
“Sum of Times” button. Choose “Value Field Sett<strong>in</strong>gs.” Change “Summarize value field by”<br />
option to “Count” and close the dialogue box:<br />
We can double-check that these values are correct by not<strong>in</strong>g that the Grand Total is 35, the same<br />
as the number of observations. We would like a histogram to show the “Row Labels” along the<br />
horizontal axis and the “Count of Times” along the vertical axis. To do this, select the pivot table<br />
and choose the Options tab from the PivotTable Tools menu.<br />
Select PivotChart and select the first graph<strong>in</strong>g option:<br />
We make a few adjustments: delete the legend, re-label the chart title, and remove the two grey<br />
boxes. Now that a graph has been <strong>in</strong>serted, a PivotChart Tools menu appears when the graph is<br />
highlighted. This is very similar to <strong>in</strong>sert<strong>in</strong>g a regular graph. Select Layout to add axis labels. To<br />
remove the grey boxes, right-click either box and select “Hide All Field Buttons on Chart.”<br />
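The “Count of Times” column that the pivot table produces can be double-checked outside Excel. The following is a minimal sketch, assuming the 35 call times are typed in by hand; `Counter` is used here in place of the pivot table, and the variable names are illustrative.<br />

```python
from collections import Counter

# The 35 call lengths (minutes) from the start of this section.
call_times = [1, 2, 13, 4, 12, 4, 10, 6, 6, 9, 4, 3, 4, 0, 12,
              6, 4, 4, 13, 15, 0, 4, 10, 4, 10, 7, 2, 10, 8, 4,
              7, 0, 4, 4, 4]

# One entry per distinct time, like the Row Labels of the pivot table.
count_of_times = Counter(call_times)

for minutes in sorted(count_of_times):
    print(f"{minutes:>2}  {count_of_times[minutes]}")

# The Grand Total should equal the number of observations.
grand_total = sum(count_of_times.values())
print("Grand Total", grand_total)
```

Each distinct time appears once with its count, and the total matches the number of observations.<br />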
[Figure: “Histogram of Call Times” with gaps between the bars; horizontal axis: Times (min.), values 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15; vertical axis: Frequency (0 to 14)]<br />
To make the gaps between bars disappear, select the graph and choose the eighth graph option<br />
from the Design tab <strong>in</strong> the PivotChart Tools menu shown below (NOTE: this option will<br />
automatically put <strong>in</strong> axis labels):<br />
To make solid black l<strong>in</strong>es appear as the outl<strong>in</strong>es <strong>for</strong> each bar, change the bar styles from “Chart<br />
Styles.”<br />
[Figure: “Histogram of Call Times” with touching, outlined bars; horizontal axis: Times (min.), values 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15; vertical axis: Frequency (0 to 14)]<br />
We now would like to adjust the bin widths. Doing this is simple!<br />
Select the pivot table. From the Options tab under the PivotTable Tools menu, choose “Group<br />
Selection” from the Group column. In the dialogue box that appears, the “Starting at” and<br />
“Ending at” boxes should reflect the smallest and largest values of the variable. You can adjust<br />
these to cover a wider or narrower range, for example if you choose to show less than the full dataset. In the “By:” box,<br />
put the width of the classes. In this case, we chose 3. Press “OK” and you should then see the<br />
updated pivot table and graph!<br />
To change frequency to relative frequency, we must now change “Count of Times” <strong>in</strong> the<br />
“Values” box of the “PivotTable Field List.” Click on “Count of Times” and select “Value Field<br />
Sett<strong>in</strong>gs.” With<strong>in</strong> the dialogue box, choose the “Show Value As” tab and choose values to show<br />
as “% of Grand Total.” Press “OK.” Adjust the vertical axis label accord<strong>in</strong>gly.<br />
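The “% of Grand Total” setting simply divides each bin’s count by the total number of observations. A short Python sketch of that arithmetic follows; the bin edges are assumed from the grouped histogram labels, and the names are illustrative only.<br />

```python
# The 35 call lengths (minutes) from the start of this section.
call_times = [1, 2, 13, 4, 12, 4, 10, 6, 6, 9, 4, 3, 4, 0, 12,
              6, 4, 4, 13, 15, 0, 4, 10, 4, 10, 7, 2, 10, 8, 4,
              7, 0, 4, 4, 4]

# Inclusive (low, high) bins matching the grouped pivot table.
bins = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 15)]
n = len(call_times)

# Relative frequency as a percentage: 100 * (count in bin) / n.
percent_of_total = {
    f"{low}-{high}": 100 * sum(low <= t <= high for t in call_times) / n
    for low, high in bins
}

for label, pct in percent_of_total.items():
    print(f"{label:>5}  {pct:5.2f}%")
```

The percentages add up to 100%, so the shape of the histogram is unchanged and only the vertical scale differs.<br />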
[Figure: “Histogram of Call Times”; horizontal axis: Times (min.), with bins 0-2, 3-5, 6-8, 9-11, 12-15; vertical axis: Relative Frequency (0.00% to 40.00%)]<br />
Homework Problems - 2.2<br />
1. An <strong>in</strong>structor grades a math test and produces the follow<strong>in</strong>g histogram:<br />
[Figure: “Histogram of Test Percentages”; horizontal axis: Percentage Earned, with bins 60-64, 65-69, 70-74, 75-79, 85-90; vertical axis: Frequency (0 to 10)]<br />
a. What can the instructor conclude about the fairness of the test?<br />
b. What appears to be the mean score, based on the histogram?<br />
c. What is the approximate range of scores, and why is it only possible to<br />
approximate this from the given information?<br />
2. A cashier at a mall retail cloth<strong>in</strong>g outlet asked customers their age <strong>for</strong> an anonymous<br />
survey. The ages he collected can be found below:<br />
31 34 30 30 31 27 33 36<br />
33 30 29 28 20 32 24 30<br />
32 30 30 22 31 38 28 31<br />
25 24 25 31 25 24 36 32<br />
24 31 31 32 31 31 28 31<br />
33 20 32 32 52 31 27 30<br />
a. Us<strong>in</strong>g Excel, f<strong>in</strong>d the mean and range of the data.<br />
b. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the mean you found.<br />
c. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the range you found.<br />
d. Create a relative frequency histogram <strong>for</strong> age. Leave your b<strong>in</strong> width as 1 year.<br />
e. Create a relative frequency histogram <strong>for</strong> age with b<strong>in</strong> width 5 years.<br />
f. Describe any trends <strong>in</strong> the age of shoppers at this store.<br />
g. Based on your answer to e), which age group(s) can be omitted from the<br />
company’s marketing tactics, in an effort to focus only on the regular shoppers?<br />
3. The total number of people (<strong>in</strong> millions) work<strong>in</strong>g <strong>in</strong> all of the various <strong>in</strong>dustries <strong>in</strong> the<br />
United States <strong>in</strong> 2010 is given <strong>in</strong> the table below:<br />
2.206 0.731 9.077 14.081 8.789 5.293 3.805 15.934<br />
7.134 5.88 1.253 3.149 9.35 6.605 2.745 15.253<br />
9.115 6.138 32.062 13.155 18.907 6.249 9.406 3.252<br />
12.53 2.966 9.564 6.769 6.102 0.667 6.983<br />
(SOURCE: U.S. Statistical Abstract, Table 619)<br />
a. Using Excel, find the mean and range of the data.<br />
b. Explain the real-world meaning of the mean you found.<br />
c. Explain the real-world meaning of the range you found.<br />
d. Create a relative frequency histogram for the number of workers. Leave your bin width as 2 million<br />
people.<br />
e. Create a relative frequency histogram for the number of workers with bin width 5 million people.<br />
f. The federal government regularly publishes reports on employment across the<br />
many <strong>in</strong>dustries. Us<strong>in</strong>g the <strong>in</strong><strong>for</strong>mation you have gathered, generate a brief report<br />
detail<strong>in</strong>g your f<strong>in</strong>d<strong>in</strong>gs, <strong>in</strong>clud<strong>in</strong>g any trends <strong>in</strong> employment.<br />
4. A resort cha<strong>in</strong> that wishes to expand is constantly search<strong>in</strong>g <strong>for</strong> new sites to add<br />
properties that will be profitable. A good place to start is by consider<strong>in</strong>g climates.<br />
Suppose Starwood Hotels and Resorts Worldwide obta<strong>in</strong>s the follow<strong>in</strong>g data from the<br />
U.S. Census Bureau on highest temperatures ever recorded <strong>in</strong> various cities <strong>in</strong> the United<br />
States:<br />
112 100 128 120 134 114 106 110<br />
109 112 100 118 117 116 118 121<br />
114 114 105 109 107 112 115 115<br />
118 117 118 125 106 110 122 108<br />
110 121 113 120 119 111 104 111<br />
120 113 120 117 107 110 118 112<br />
114 115<br />
(SOURCE: U.S. Statistical Abstract, Table 391)<br />
a. Us<strong>in</strong>g Excel, f<strong>in</strong>d the mean and range of the data.<br />
b. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the mean you found.<br />
c. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of the range you found.<br />
d. Create a relative frequency histogram for temperature. Leave your bin width as 5 degrees.<br />
e. Create a relative frequency histogram for temperature with bin width 10 degrees.<br />
f. What percentage of states can be eliminated from consideration if the company<br />
will not take any risks with states that have had a record high over 115 °F?<br />
g. Summarize the distribution of high temperatures <strong>in</strong> the U.S.<br />
2.3 Descriptive Statistics – Center and Position<br />
Histograms provide us with a great visualization of the overall distribution of values. A<br />
distribution describes the layout of the values of a quantitative or categorical variable. To further<br />
describe the differences between two similar distributions, it is helpful to use statistics that<br />
describe center, location, and spread.<br />
2.3.1 Mean and Median<br />
To make peace with some regularly occurr<strong>in</strong>g notation <strong>in</strong> statistics, we will use<br />
mean “the sum of.” For <strong>in</strong>stance,<br />
(“sigma”) to<br />
Let‟s say that we have a set of variable values. To dist<strong>in</strong>guish each of these “ ‟s” we‟ll use<br />
subscripts, denot<strong>in</strong>g them:<br />
Then, to <strong>in</strong>dicate that we want to sum these values across all subscripts, we would write:<br />
Which means, “sum up all<br />
values <strong>in</strong> the dataset,” or<br />
Us<strong>in</strong>g this new notation, we already know how to calculate the mean:<br />
Mean – x-bar notation<br />
The mean value, or average, of a dataset conta<strong>in</strong><strong>in</strong>g<br />
values can be written as:<br />
̅, is used to denote the mean of a sample and can be read as “x-bar.”<br />
A common po<strong>in</strong>t of confusion <strong>for</strong> students is the difference <strong>in</strong> the subscript and the<br />
denom<strong>in</strong>ator . Many people th<strong>in</strong>k that the subscript should be to match the number of<br />
elements <strong>in</strong> the dataset. However, specifically refers to the very last value <strong>in</strong> the dataset. We<br />
treat the as an <strong>in</strong>dex that goes across all subscripts from 1 all the way up to and <strong>in</strong>clud<strong>in</strong>g . To<br />
account for this discrepancy, mathematicians usually write the starting value of the index below<br />
the sigma and its maximum value above the sigma. For example, if there are 3 values in the dataset,<br />
we would write the mean as:<br />
\[ \bar{x} = \frac{\sum_{i=1}^{3} x_i}{3} = \frac{x_1 + x_2 + x_3}{3} \]<br />
As you can see, the sigma notation can quickly become convoluted, and so we typically just<br />
write \(\sum x\) to indicate the sum of all \(x\)-values.<br />
Median<br />
The median value of a dataset is the value that represents the physical center of the dataset. To<br />
locate the median:<br />
Organize the data values from smallest to largest. Then,<br />
If there is an odd number of values in the dataset, the center value can be located by counting in<br />
\(\frac{n+1}{2}\) positions from the smallest value, including the smallest value. Alternatively, one can count in an<br />
equal number of values from the left and right endpoints to locate the center value.<br />
If there is an even number of values in the dataset, average the two middle-most values together.<br />
The locations of the two middle-most values are<br />
\(\frac{n}{2}\) and \(\frac{n}{2} + 1\) positions from the smallest value, including the smallest value. Once again, these values can be<br />
found by counting from the left and the right endpoints of the dataset.<br />
Example 1: Find the mean and median salaries for the company represented by the following<br />
dataset (in thousands). Explain which measure better reflects the overall company<br />
demographic.<br />
SOLUTION: We first find the mean:<br />
\[ \bar{x} = \frac{\sum x_i}{n} = 148.2 \]<br />
This means that, on average, employees earn $148,200 per year.<br />
Next, we list the salaries in ascending order:<br />
The two middle values are 48 and 50 (four values lie on either side of this pair). These<br />
represent the 10/2 = 5th and 10/2 + 1 = 6th values in the dataset. To find the median, we average them<br />
together to get<br />
\[ \frac{48 + 50}{2} = 49 \]<br />
The median salary is $49,000 per employee per year.<br />
The median is clearly the more representative measure. The mean takes into account all values, including<br />
the outlier, or “extreme” salary of $1.1 million per year. The median is not influenced by<br />
extreme outliers.<br />
To find the mean and median salaries in Excel we use the functions average() and median().<br />
The parameter for both functions is the cell range corresponding to the dataset.<br />
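The same comparison can be sketched outside of Excel. The short Python example below uses the standard library's statistics module. The salary list is hypothetical (it is not the dataset from Example 1) and was chosen only so that it contains one extreme value.<br />

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars (illustrative only)
salaries = [22, 24, 31, 40, 45, 52, 58, 63, 70, 900]

# Mean: sum of all values divided by the number of values
salary_mean = mean(salaries)

# Median: middle value (average of the two middle values when n is even)
salary_median = median(salaries)

print(f"mean = {salary_mean}, median = {salary_median}")
```

In this sketch the single salary of 900 pulls the mean far above the typical salary, while the median stays near the middle of the other nine values.<br />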
<strong>Statistics</strong> <strong>for</strong> <strong>Decision</strong>-<strong>Mak<strong>in</strong>g</strong> <strong>in</strong> Bus<strong>in</strong>ess © Milos Podmanik Page 58
2.3.2 Percentile<br />
Another useful tool for describing the location of data points is a percentile.<br />
Percentile<br />
The \(p\)th percentile is a value such that \(p\) percent of the values in a dataset (of \(n\) values) are less<br />
than or equal to this value.<br />
To find the location of this value, that is, the index, \(i\), first arrange the data in ascending order.<br />
The index can be calculated by:<br />
\[ i = \left(\frac{p}{100}\right) n \]<br />
That is, find \(p\) percent of the number of observations. Round up if the index is a decimal,<br />
and take the average of the values in positions \(i\) and \(i + 1\) if the calculated value of \(i\) is an<br />
integer. Exactly one of these two actions will be taken.<br />
Example 2: Find the 50th percentile for the salaries in Example 1. Interpret the real-world<br />
meaning of this value.<br />
The values, in ascending order, are:<br />
We take \(i = \left(\frac{50}{100}\right)(10) = 5\). Since this is an integer, we average together the values in positions 5 and<br />
6, giving us a value of 49. This means that 50% of employees represented in this dataset make<br />
$49,000 or less.<br />
Not surprisingly, the 50th percentile is actually the median of the dataset! This is always true.<br />
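The index rule can also be written as a small function. The Python sketch below is one possible reading of that rule, not a reproduction of any Excel function. The salary list is hypothetical, and the handling of the case where the index equals the number of observations (there is no next position to average with) is an assumption, since the text does not cover it.<br />

```python
import math
from statistics import median

def percentile(values, p):
    """p-th percentile (0 < p <= 100) using the index rule i = (p/100) * n."""
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    i = p * n / 100                   # index of the percentile
    if i.is_integer():
        i = int(i)
        # Integer index: average the values in positions i and i + 1 (1-based).
        # Assumption: position n + 1 does not exist, so reuse the largest value when i == n.
        upper = ordered[i] if i < n else ordered[n - 1]
        return (ordered[i - 1] + upper) / 2
    # Decimal index: round up and take the value in that position.
    return ordered[math.ceil(i) - 1]

# Hypothetical salaries in thousands of dollars (illustrative only)
salaries = [22, 24, 31, 40, 45, 52, 58, 63, 70, 900]

p50 = percentile(salaries, 50)   # i = 5, so positions 5 and 6 are averaged
p25 = percentile(salaries, 25)   # i = 2.5, so round up to position 3
print(p50, p25, median(salaries))
```

For this list the 50th percentile and the median come out the same, as the text says they should.<br />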
In Excel, we can use the Percentile() function. The set-up of this function's parameters is:<br />
=percentile(cell range, p/100)<br />
Thus, for this dataset, we would have:<br />
2.3.3 Quartiles<br />
Oftentimes, data analysts like to think about data in terms of quartiles, or quarters. There are 4<br />
quartiles, which can be represented as follows:<br />
Quartile 1 = 25th Percentile<br />
Quartile 2 = 50th Percentile<br />
Quartile 3 = 75th Percentile<br />
Quartile 4 = 100th Percentile<br />
2.3.4 Rank<br />
What if, on the other hand, an employee knows his salary and wants to know where it ranks, that is,<br />
its percentile? This requires reverse-engineering the idea of a percentile. Without the<br />
use of any mathematical formulas, we would need to count the number of values that are less than or<br />
equal to the salary in question. To make this easier, we can use Excel's Rank() function. The<br />
parameters we will use are as follows:<br />
= rank(value, cell range, 1)<br />
This will return the position of the value when the data are ranked from smallest to largest. As long as<br />
the value is not repeated in the dataset, this is the number of values that are less than or equal to<br />
the value in question. If we changed the parameter of 1 to a 0, Excel would rank from largest to smallest<br />
instead, giving the largest value a rank of 1, much as the winner of a race is ranked first.<br />
We will then need to divide this output by the number of observations in the dataset. To make<br />
the counting process more automated, we can take this output and divide it by the output of the<br />
count() function. This function will simply count the number of entries in the specified range,<br />
and has the following parameter:<br />
= count(cell range)<br />
Let's say the employee making $24,000 would like to know his salary's rank. To calculate, we<br />
would type the following:<br />
Giving us:<br />
Thus, his salary is in the 30th percentile. This means that 30% of people represented in this<br />
dataset make $24,000 or less.<br />
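The counting approach (count the values at or below the salary, then divide by the number of observations) can be sketched in Python as follows. The function name percentile_rank and the salary list are hypothetical. The list is not the Example 1 dataset, only a ten-value illustration in which 24 is the third-smallest value.<br />

```python
def percentile_rank(values, x):
    """Percent of values in the dataset that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)  # count of values <= x
    return 100 * at_or_below / len(values)          # divide by the number of observations

# Hypothetical salaries in thousands of dollars (illustrative only)
salaries = [20, 22, 24, 35, 41, 47, 53, 60, 68, 90]

rank_pct = percentile_rank(salaries, 24)
print(f"A salary of 24 is in the {rank_pct:.0f}th percentile")
```

Unlike a rank-based formula, counting values with `<=` also behaves sensibly when the salary in question appears more than once in the data.<br />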
Another approach would be to use the “Rank and Percentile” tool in an Excel add-in called<br />
Analysis ToolPak. This method will show the ranks and percentiles of all values in the dataset<br />
and is only useful for relatively small, manageable datasets. The Analysis ToolPak will be<br />
important later on, so we'll describe its installation here.<br />
2.3.5 Analysis ToolPak<br />
To install the Analysis ToolPak, select the File tab within Excel. Then select Options from the<br />
ribbon that appears. Select the Add-Ins option. Click Analysis ToolPak and press Go.<br />
Check the “Analysis ToolPak” and “Analysis ToolPak – VBA” features from the pop-up<br />
window and press OK.<br />
You now have the ToolPak installed.<br />
To use the “Rank and Percentile” tool, select the Data tab. Choose Data Analysis from the<br />
Analysis column. Pick “Rank and Percentile” from the pop-up window and press OK.<br />
Select the input range:<br />
You can either specify an output range, or have Excel create a new worksheet with the results.<br />
This is up to your preferences. Check “Labels in First Row” and be sure that the data label has<br />
been selected.<br />
The results are shown below:<br />
You'll immediately notice that a salary of $24,000 is shown as being at the 22.2 percentile,<br />
which does not agree with our calculation. The reported figure is consistent with dividing the number of<br />
values below $24,000 by one less than the number of observations (2/9 ≈ 22.2%), rather than dividing<br />
the rank by the number of observations (3/10 = 30%) as we did. Software packages use different conventions<br />
for this calculation, and no single convention is universally agreed upon.<br />
Fortunately, both results are in the same “ballpark.”<br />
Homework Problems - 2.3<br />
1. Suppose your instructor releases scores on a recent project. The scores are as follows:<br />
83 89 76 41 92 85 76 71<br />
95 92 80 84 77 78 81 75<br />
64 30 80 79 78 70 75 81<br />
99 85 80 82 70 69 71 70<br />
a. Generate a relative frequency histogram and comment on any interesting<br />
observations of the distribution.<br />
b. Compare the mean and median. What causes them to be different in this particular<br />
way?<br />
c. What score would be required in order to be in the 80th percentile?<br />
d. In what percentile is a person who scores 71% on this project?<br />
2. In order to make way for new products, a grocery store chain would like to determine<br />
whether the Lunch Pack or Family Pack of Flaxem Crackers generates more revenue. The<br />
following two datasets show the revenue generated by each over a 10-month period:<br />
Lunch | 450 | 510 | 550 | 330 | 400 | 500 | 550 | 290 | 310 | 300<br />
Family | 500 | 400 | 600 | 310 | 350 | 600 | 200 | 200 | 600 | 430<br />
a. Compare the mean and median of each dataset. What can be said about the<br />
middle-most revenues?<br />
b. Find all four quartiles for each dataset. Use this information to make an argument<br />
for why this grocer should hang on to the Family Pack.<br />
c. For each of the datasets, determine the top 10% of revenues that can be expected.<br />
d. Find the range of the data. Comment on how this might influence the grocer's<br />
decision.<br />
3. Suppose that Budget Car Rentals assesses a variety of new 2012 and 2013 sedans for its<br />
new line of rental cars. It finds the following information on city and highway fuel<br />
efficiencies (mpg) for eight vehicles in consideration:<br />
Year | 2012 | 2013 | 2013 | 2012 | 2012 | 2012 | 2012 | 2012<br />
Make | Toyota | Ford | Ford | Honda | Toyota | Toyota | Hyundai | VW<br />
Model | Prius Hyb. | Fusion Hyb. | C-Max Hyb. | Insight | Camry LE Hyb. | Camry XLE Hyb. | Sonata Hyb. | Passat<br />
City | 51 | 47 | 44 | 41 | 43 | 40 | 34 | 31<br />
Highway | 49 | 47 | 41 | 44 | 39 | 38 | 39 | 43<br />
(SOURCE: www.fueleconomy.gov)<br />
a. Find the mean and median fuel efficiency for city and highway mileages of the<br />
vehicles being considered. Comment on any differences between the two values.<br />
b. What is the rank percentage of a vehicle that has 43 city mpg?<br />
c. If the company makes its choice based on the top 15% of city and highway mileages for the<br />
vehicles being considered, what will be the minimum city and highway mileages<br />
they should consider?<br />
d. Make a recommendation for which vehicle(s) should be purchased, if any.<br />
2.4 Descriptive Statistics – Variability<br />
The measure of center is always a good start. But what does a sample mean not tell us? It fails to<br />
describe how far apart the data are from one another. In other words, we need to assess the<br />
variability, or variance, of the numbers we have collected.<br />
The simplest way we might go about describing the variability is by simply looking at the range<br />
of the data, such that:<br />
Range = largest observation - smallest observation<br />
However, this still does not tell us how spread out the data are. For example, suppose we<br />
find our range to be 110 units (see dataset below). This might seem rather daunting at first, but<br />
what if all values were clumped between 0 and 10, and there existed an outlier of 110?<br />
Clearly, the range can be determined by outliers alone.<br />
0 1 3 10 8 7 4 110<br />
2.4.1 Standard Deviation<br />
To create a better measure of variability that takes all data points into account, just like the mean<br />
does, statisticians established the standard deviation. As the name implies, this is a standard tool<br />
that measures the average deviation (that is, by how much each value deviates) from the mean. This<br />
requires us to find the deviation, \(x_i - \bar{x}\), for every point in our dataset.<br />
Let's demonstrate with the above dataset:<br />
Value \(x - \bar{x}\)<br />
0 -17.875<br />
1 -16.875<br />
3 -14.875<br />
10 -7.875<br />
8 -9.875<br />
7 -10.875<br />
4 -13.875<br />
110 92.125<br />
Mean: 17.875<br />
The values that fall below the mean produce negative deviations, and the one value<br />
above the mean produces a positive deviation. To find an average deviation, we would ideally add<br />
them. However, observe that the sum of the deviations is 0! This is true of any dataset, since the<br />
mean represents the “balance” point of the dataset. Due to mathematical concerns that we won't state<br />
here, mathematicians decided to square these values, since squaring converts all signed numbers<br />
into nonnegative values.<br />
Value \(x - \bar{x}\) \((x - \bar{x})^2\)<br />
0 -17.875 319.5156<br />
1 -16.875 284.7656<br />
3 -14.875 221.2656<br />
10 -7.875 62.01563<br />
8 -9.875 97.51563<br />
7 -10.875 118.2656<br />
4 -13.875 192.5156<br />
110 92.125 8487.016<br />
Mean: 17.875 Sum: 9782.88<br />
Great, now they can be summed up to give 9782.88! Thus, we have found the following:<br />
\[ \sum (x - \bar{x})^2 = 9782.88 \]<br />
One would think that dividing by 8 would now be appropriate to find the average. Due to<br />
mathematical properties that are beyond the scope of this course, the division will be by 7, which<br />
is \(n - 1\). Thus:<br />
\[ \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{9782.88}{7} \approx 1397.55 \]<br />
This value that we have found is called the variance.<br />
NOTE: The division by \(n - 1\) has to do with the fact that we are often dealing with a sample in<br />
inferential statistics and hope to make conclusions about a population.<br />
Sample Variance<br />
The variance of a sample, a measure of variability that is hard to interpret directly and is denoted by \(s^2\), can be found<br />
by the following formula:<br />
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]<br />
To make all of these calculations more meaningful (to have a true average), we should probably<br />
“unsquare” the value that we have. When we do this, we get the sample standard deviation:<br />
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{9782.88}{7}} \approx \sqrt{1397.55} \approx 37.38 \]<br />
This is what we can think of as the average deviation of each point from the mean. It is clearly<br />
high for this dataset. What is causing it? The outlier of 110!<br />
Conclusion: On average, values in the dataset deviate from the mean by about 37 units.<br />
Sample Standard Deviation<br />
The standard deviation of a sample, denoted \(s\), is given by the following formula:<br />
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]<br />
Note that this is simply the square root of the variance.<br />
In Excel, the standard deviation can be calculated simply by using the function below:<br />
= stdev(cell range)<br />
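The step-by-step calculation above can be retraced in a few lines of Python. This is a minimal sketch using the eight-value dataset from this section. The variable names are illustrative, and the standard library's statistics.stdev (which also divides by n - 1) is used only as a cross-check.<br />

```python
from math import sqrt
from statistics import stdev

data = [0, 1, 3, 10, 8, 7, 4, 110]  # dataset used in the walkthrough above

n = len(data)
x_bar = sum(data) / n                              # mean: 17.875
data_range = max(data) - min(data)                 # range: largest minus smallest
deviations = [x - x_bar for x in data]             # x - x_bar for each value
sum_of_squares = sum(d ** 2 for d in deviations)   # sum of squared deviations
variance = sum_of_squares / (n - 1)                # sample variance (divide by n - 1)
std_dev = sqrt(variance)                           # sample standard deviation

print(f"sum of deviations = {sum(deviations)}")
print(f"sum of squares = {sum_of_squares}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"cross-check with statistics.stdev: {stdev(data):.2f}")
```

The printed sum of squares and standard deviation can be compared against the 9782.88 and the “about 37 units” found by hand.<br />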
Example 1: A river with mild current is known to have an average depth of 3 feet with a<br />
standard deviation of 3 feet. The bottom is not visible. Is the river safe to cross by foot? Also,<br />
what is the variance?<br />
SOLUTION: Since there is a standard deviation of 3 feet, we can conclude that, on average, the<br />
river depth deviates by 3 feet from the mean. It would not be unusual to encounter a part of the<br />
river with a depth of 6 or more feet. Therefore, the river should not be crossed by foot.<br />
Since the standard deviation is the square root of the variance, the variance is the square of the<br />
standard deviation. That is,<br />
\[ s^2 = 3^2 = 9 \]<br />
Thus, the variance is 9. The variance does not have a valuable interpretation.<br />
2.4.2 How Do We Interpret the Value We Get?<br />
Think about this: \(n\) is a fixed value for our sample (8 in the dataset above). The only thing that could make<br />
\(s^2\) large or small is the numerator. Thus, if the deviations are large (a bad thing!), then the<br />
squared deviations will be large, and so the sum of squares will be large. This implies a large<br />
standard deviation.<br />
If the deviations are small (good thing!), then the squared deviations will be small, and so the<br />
sum of squares will be small. This implies a small standard deviation.<br />
So, a large standard deviation means that there is a lot of variability, or that the values are vastly<br />
different from one another. A small standard deviation means the values in the dataset are quite<br />
alike. In the near future, you'll see why it is important to have a small standard deviation. In<br />
general, as the variance and standard deviation get larger, our ability to make precise statements<br />
about the population quickly evaporates.<br />
We will be using variance and standard deviation consistently for the rest of the semester. It is<br />
important to get comfortable with them.<br />
2.4.3 Do Population Variances and Standard Deviations Come into Play?<br />
Indeed they do. Do you think that we can find them? Usually not! The population variance<br />
requires the use of the population mean, \(\mu\). How do we get \(\mu\)? We take the average of all the<br />
values in the entire population. Since we typically don't know this value, we also typically don't<br />
know the population variance, so certainly we don't know the population standard deviation<br />
(since it's the square root of the population variance).<br />
The table below summarizes the notations we need to recognize:<br />
 | Variance | Standard Deviation<br />
Sample | \(s^2\) | \(s\)<br />
Population | \(\sigma^2\) | \(\sigma\)<br />
The population parameter, \(\sigma\), is the lowercase Greek letter “sigma.” (This is as opposed to the<br />
sample statistic, \(s\).)<br />
2.4.4 Interquartile Range<br />
The standard deviation, much like the mean, is easily skewed by excessively small or large<br />
values. We noticed this in the first example in this section. Using the idea of medians and<br />
percentiles is a safe bet for outlier-proofing our spread estimates. The interquartile range is the<br />
difference between the 3rd quartile and the 1st quartile. Remember, these are simply the 75th and<br />
25th percentiles, respectively. This difference is the width of the middle 50% of the dataset.<br />
This gives us a nice measure of how spread out the data is about the median.<br />
Example 2: Consider the following home prices and find both the standard deviation and the<br />
interquartile range. Describe what conclusions can be drawn from these values.<br />
Values (thous. $) 95 875 96 89 87 88 93 91<br />
SOLUTION: Using Excel, we find a standard deviation of about 277.1 and an interquartile range of 6.5 (both in thousands of dollars).<br />
The standard deviation indicates that home prices, on average, vary by $277,100 from the mean<br />
value. However, we see from the interquartile range that the middle 50% of homes only vary by<br />
$6,500. The standard deviation is being skewed by the home that is priced at $875,000. The<br />
interquartile range tells us that the majority of home values stay pretty close to the median value.<br />
Additionally, we see that most home values are between $88,000 and $96,000.<br />
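For readers who want to check these two figures without a spreadsheet, here is a minimal Python sketch using the home prices from this example. It assumes that the quartiles behind the $6,500 figure were found by linear interpolation between the smallest and largest values, which is what the "inclusive" method of the standard library's statistics.quantiles does. The round-up index rule from Section 2.3.2 would give slightly different quartiles.<br />

```python
from statistics import quantiles, stdev

prices = [95, 875, 96, 89, 87, 88, 93, 91]  # home prices in thousands of dollars

s = stdev(prices)  # sample standard deviation (divides by n - 1)

# Quartiles by linear interpolation between the smallest and largest values.
# Assumption: this "inclusive" convention matches the Excel figures quoted above.
q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1      # interquartile range: 3rd quartile minus 1st quartile

print(f"s = {s:.1f}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The single home priced at 875 dominates the standard deviation but has no effect on the interquartile range.<br />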
2.4.5 Descriptive Statistics: Analysis ToolPak in Excel<br />
To generate most of the features we have discussed up until now, we turn to Excel's Analysis<br />
ToolPak for a more automated approach.<br />
Let's consider the house data above:<br />
Values (thous. $)<br />
95<br />
875<br />
96<br />
89<br />
87<br />
88<br />
93<br />
91<br />
Access the Data Analysis tool from the Data tab in Excel. Select “Descriptive Statistics” from<br />
the menu and select the data from the spreadsheet containing the data.<br />
Be sure that you check “Summary Statistics.”<br />
We can immediately see the mean and the median of the dataset. Additionally, we see the<br />
standard deviation, variance, range, min/max, sum of the values, and the number of values in the<br />
dataset, among other statistics that we will ignore for now. We see, as expected, that the dataset does not have<br />
a mode, or most frequently occurring value.<br />
2.4.6 Shapes of Distributions<br />
Now that we have a basis for measuring data in terms of its center and spread, we turn back to<br />
making connections with the visual shape of the distribution.<br />
There are many different shapes that we encounter for distributions. Let's discuss a few. First,<br />
note that the following do not look like the rectangular histograms from earlier on. These are<br />
smoothed out forms of what we experienced earlier. They are often used to describe the general<br />
shape of a distribution. And, of course, they are much easier to sketch.<br />
A histogram is said to be (a) unimodal if it has a single peak, (b) bimodal if it has two peaks,<br />
and (c) multimodal if it has more than two peaks.<br />
If we follow the curves from left to right, we begin at the lower tail, move over the peak(s), and<br />
arrive back down to what is called the upper tail.<br />
A unimodal histogram is said to be symmetric if we are able to draw a line down the center<br />
such that the left side of the line is a mirror image of the right side. Consider the following<br />
unimodal symmetric histograms:<br />
A unimodal histogram that is not symmetric is said to be skewed. If the upper tail of the<br />
histogram stretches out much farther than the lower tail, then the distribution of values is<br />
positively (right) skewed. On the other hand, if the lower tail is much longer than the upper tail,<br />
the histogram is negatively (left) skewed. Can you identify the following unimodal histograms<br />
as positively or negatively skewed?<br />
Lastly, a normal curve is the most desired type, due to its (in general) nice properties. A normal<br />
curve occurs quite frequently. It has a bell shape and is sometimes called the Gaussian curve.<br />
Here are examples of normal curves:<br />
2.4.7 Skewness<br />
Excel also produces a nice measure that allows us to make conclusions about the general shape<br />
of the distribution. This measure is called skewness.<br />
If the skewness measure is:<br />
Positive, then the distribution is skewed right<br />
Negative, then the distribution is skewed left<br />
Zero, then the distribution is symmetric<br />
The farther from 0 that the skewness measure is, the more skewed in the respective direction the<br />
distribution will be.<br />
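The text does not give the formula behind the skewness measure. The Python sketch below assumes the adjusted sample skewness formula, n / ((n - 1)(n - 2)) times the sum of the cubed standardized deviations, which is the formula Excel's SKEW function is documented to use. The function name sample_skewness and the two extra test lists are illustrative.<br />

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    cubed_z_sum = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_z_sum

right_tailed = [0, 1, 3, 10, 8, 7, 4, 110]   # dataset from Section 2.4.1, with its large outlier
mirrored = [-x for x in right_tailed]        # same shape flipped left to right
balanced = [1, 2, 3, 4, 5]                   # symmetric about its mean

print(sample_skewness(right_tailed))  # positive: skewed right
print(sample_skewness(mirrored))      # negative: skewed left
print(sample_skewness(balanced))      # approximately zero: symmetric
```

Flipping a dataset left to right flips the sign of the measure, which is why the sign alone indicates the direction of the longer tail.<br />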
Consider the following data showing the number of televisions owned by randomly sampled<br />
individuals in a big city:<br />
3 4 3 2 3 2 1 1 0<br />
4 0 4 4 4 3 1 0 1<br />
4 3 3 0 4 2 1 2 4<br />
2 4 2 4 0 3 4 3 3<br />
2 2 0 2 1 1 3 2 2<br />
0 0 3 1 0 3 4 3 3<br />
0 1 4 4 2 1 2 0 2<br />
4 3 2 4 2 4 3 3 3<br />
1 2 0 3 0 2 3 2 0<br />
0 2 0 4 4 3 4 1 0<br />
Using Excel, we produce descriptive statistics using the Analysis ToolPak:<br />
We notice that the Skewness measure is positive: 0.51. This means the dataset is slightly skewed<br />
to the right:<br />
[Figure: “Histogram of TV's Owned”; horizontal axis: Number of TV's; vertical axis: Frequency]<br />
2.4.8 Outlier Detection<br />
After analyzing a dataset, how do we assess likely values for data and deem other values as<br />
outliers?<br />
One approach is to determine how many standard deviations above (positive value) or below the<br />
mean (negative value) a given data value is.<br />
For instance, suppose we have a dataset with mean 20 and standard deviation 3. We have an<br />
observation of 14. In terms of units, this value is 6 units below the mean. Thus, it has a deviation<br />
of -6. This deviation tells us that the data value in question is 2 standard deviations below the<br />
mean, since:<br />
\[ \frac{14 - 20}{3} = \frac{-6}{3} = -2 \]<br />
This measure is often called a z-score. Let's recap:<br />
\(z\)-Score<br />
A \(z\)-score tells us the number of standard deviations a data point, \(x\), is from its mean, \(\bar{x}\).<br />
Mathematically,<br />
\[ z = \frac{x - \bar{x}}{s} \]<br />
The idea of a \(z\)-score is quite helpful, in that it tells us how far a value is from the mean, relative to the<br />
size of the standard deviation (the average spread). If \(z\) is very close to 0, then the value is not far<br />
from the mean. If \(z\) is very large in absolute value, the value is very far from the mean.<br />
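As a quick check of the arithmetic, the z-score calculation can be written as a one-line Python function. The function and parameter names are illustrative. The numbers are the mean of 20, standard deviation of 3, and observation of 14 used above.<br />

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z = z_score(14, mean=20, std_dev=3)  # the observation discussed above
print(z)  # -2.0
```

A negative result means the observation lies below the mean, and a positive result means it lies above.<br />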
A very useful theorem established by the Russian mathematician Pafnuty Lvovich Chebyshev allows us to<br />
determine how large is very large. Chebyshev established the following theorem:<br />
Chebyshev’s Theorem<br />
For any dataset, at least \(1 - \frac{1}{k^2}\) of the data values must be within \(k\) standard deviations of the mean (to the left and the right),<br />
for any \(k > 1\).<br />
This works for any and all distributions.<br />
Example 3:<br />
A data value is 3 standard deviations above the mean. Is this an extreme value?<br />
SOLUTION: Chebyshev's Theorem states that at least<br />
\[ 1 - \frac{1}{3^2} = \frac{8}{9} \approx 89\% \]<br />
of all data points in this distribution will lie between -3 and +3 standard deviations from the<br />
mean. Thus, there is, at most, an 11% chance of observing something higher than +3 standard<br />
deviations. This data value is fairly unlikely and might be considered a mild outlier.<br />
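The bound in Chebyshev's Theorem is easy to tabulate for several values of k. The Python sketch below is illustrative only. The function name chebyshev_lower_bound is hypothetical, and the check for k greater than 1 reflects the fact that the bound says nothing useful otherwise.<br />

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    within = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {within:.1%} within, at most {1 - within:.1%} outside")
```

These are worst-case guarantees that hold for any distribution, so a particular dataset may have far fewer values in its tails than the bound allows.<br />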
Homework Problems - 2.4<br />
1. The Connecticut Agricultural Experiment Station conducted a study of the calorie content<br />
of different types of beer. The calorie contents (calories per 100 mL) for 26 brands of<br />
light beer are:<br />
29 28 33 31 30 33 30 28 27 41 39 31 29<br />
23 32 31 32 19 40 22 34 31 42 35 29 43<br />
a. F<strong>in</strong>d the standard deviation. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of this value.<br />
b. F<strong>in</strong>d the <strong>in</strong>terquartile range. Expla<strong>in</strong> the real-world mean<strong>in</strong>g of this value.<br />
c. F<strong>in</strong>d the skewness. What type of shape does this distribution have<br />
2. The UNICEF report “Progress for Children” (April 2005) included the accompanying<br />
data on the percentage of primary-school-age children who were enrolled in school for 23<br />
countries in Central Africa.<br />
58.3 34.6 35.5 45.4 38.6 63.8 53.9 61.9 69.9 43 85 63.4<br />
58.4 61.9 40.9 73.9 34.8 74.4 97.4 61 66.7 79.6 98.9<br />
a. Find the range, standard deviation, and interquartile range. Explain what these<br />
three values tell us about the shape of the distribution.<br />
b. Explain the real-world meaning of the standard deviation and the interquartile<br />
range.<br />
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.<br />
d. Is the distribution skewed? If so, in which direction?<br />
e. Create a relative frequency histogram. Describe any trends in the data.<br />
f. Is an observation of 79.6 an outlier? Use Chebyshev’s Theorem to justify your<br />
answer.<br />
3. The article “Determination of Most Representative Subdivision” (Journal of Energy<br />
Engineering [1993]: 43-55) gave data on various characteristics of subdivisions that<br />
could be used in deciding whether to provide electrical power using overhead lines or<br />
underground lines. Data on the variable x = total length of streets within a subdivision (in<br />
feet) are as follows:<br />
1280 5320 4390 2100 1240 3060 4770 1050<br />
360 3330 3380 340 1000 960 1320 530<br />
3350 540 3870 1250 2400 960 1120 2120<br />
450 2250 2320 2400 3150 5700 5220 500<br />
1850 2460 5850 2700 2730 1670 100 5770<br />
3150 1890 510 240 396 1419 2109<br />
a. Find the range, standard deviation, and interquartile range. Explain what these<br />
three values tell us about the shape of the distribution.<br />
b. Explain the real-world meaning of the standard deviation and the interquartile<br />
range.<br />
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.<br />
d. Is the distribution skewed? If so, in which direction?<br />
e. Find the z-score for the observation 79.6. Explain what your answer means in<br />
real-world terms.<br />
f. Create a relative frequency histogram. Is an observation of 79.6 an outlier? Use<br />
Chebyshev’s Theorem to justify your answer.<br />
4. Using the five class intervals 100 to 120, 120 to 140, . . ., 180 to 200, devise a frequency<br />
distribution based on 70 observations whose histogram could be described as follows:<br />
a. symmetric b. bimodal c. positively (right) skewed d. negatively (left) skewed<br />
5. The Highway Loss Data Institute publishes data on repair costs resulting from a 5-mph<br />
crash test of a car moving forward into a flat barrier. The following table gives data for<br />
10 midsize luxury cars tested in October 2002:<br />
Model Repair Cost<br />
Audi A6 0<br />
BMW 328i 0<br />
Cadillac Catera 900<br />
Jaguar X 1254<br />
Lexus ES300 234<br />
Lexus IS300 979<br />
Mercedes C320 707<br />
Saab 9-5 670<br />
Volvo S60 769<br />
Volvo S80 4194<br />
a. Using Analysis ToolPak in Excel, generate all descriptive statistics. Discuss the<br />
best measure of center and the best measure of spread based on what you see.<br />
Justify why these measures were selected.<br />
b. Find the z-score for the observation 4194. Explain what your answer means in<br />
real-world terms.<br />
c. Is $4,194 considered an extreme outlier? Also use Chebyshev’s Theorem to<br />
numerically reinforce your answer.<br />
6. Cost-to-charge ratios were reported for the 10 hospitals in California with the lowest<br />
ratios (San Luis Obispo Tribune, December 15, 2002). The 10 cost-to-charge values<br />
were<br />
8.81 10.26 10.2 12.66 12.86 12.96 13.04 13.14 14.7 14.84<br />
Discuss relevant descriptive statistics and a relative frequency distribution. Use your<br />
information to draw a conclusion about the state of hospitals in California.<br />
7. The technical report “Ozone Season Emissions by State” (U.S. Environmental Protection<br />
Agency, 2002) gave the following nitrous oxide emissions (in thousands of tons) for 20<br />
states in the continental United States:<br />
76 22 40 7 30 5 6 136 72 33<br />
0 89 136 39 92 40 13 27 1 63<br />
Generate a brief report about the distribution of nitrous oxide emissions in the sampled<br />
states. Use descriptive measures and visuals to justify your answer.<br />
Chapter 3<br />
Probability and Decision Theory<br />
When you stop, I mean really stop, and think about<br />
how often you think in terms of probabilities, I am<br />
confident you’ll find you use them more often than not.<br />
Do you ever decide to get to work by taking one<br />
route as opposed to another? Would you find<br />
yourself making health decisions based on your<br />
doctor’s advice instead of the advice you might<br />
receive from a ten-year-old child? Have you ever<br />
purchased a birthday gift for someone after deep<br />
contemplation of what that person might like? Do<br />
you trust one news network over another? What are<br />
your decisions based on in these situations?<br />
Whether or not you’re willing to give in to your<br />
inner nerd, you should admit that you think in terms of chances and likelihood. I imagine that<br />
you do have a preferred route. I think that you do trust an expert’s medical opinion. I believe that<br />
you do make a gift purchase after considering what you think the recipient enjoys. I should think<br />
there are some networks that you trust more than others.<br />
In this chapter, we’ll explore the nature of probabilistic thinking. You’ll also notice the phrase<br />
“Decision Theory” in the title. Instead of focusing on the trite probability questions involving<br />
situations that we don’t ever encounter, we’ll concern ourselves with real-world situations where<br />
probabilistic reasoning will help us make a decision.<br />
3.1 The Idea of Probability<br />
In this section, we’ll address what probability is (and isn’t).<br />
Example 1: A weather report by the National Weather Service (NWS) stated on July 31, 2011<br />
that, overnight, there was a 50% chance of precipitation in the 85225 zip code in which<br />
Chandler-Gilbert Community College is located. What does this mean?<br />
(SOURCE: www.crh.noaa.gov/)<br />
SOLUTION: This is actually quite a loaded statement. One might want to say that, out of 100<br />
times, it will rain 50 times. This is a very misleading approach for a couple of different reasons.<br />
First off, what is meant by “times”? We are only concerned with one time: overnight on July 31,<br />
2011.<br />
A probability is actually a measure of how likely something is to occur in the long run. That is,<br />
if something were to be repeated in trials over and over again then, theoretically, the specified<br />
outcome would occur a certain percentage of the time. Importantly, the<br />
conditions under which we are measuring a probability must be in place in order for the<br />
probability to be a valid measure.<br />
In our case, NWS states that, under the exact same environmental conditions taking place<br />
throughout the night of July 31, 2011, it would be expected to rain 50% of the time.<br />
The graph below shows a hypothetical scenario in which there is a 50% chance of precipitation<br />
under the set of conditions that occurred on the above night. Notice that it rained on the initial<br />
day and so immediately the proportion of rainy days (our estimate of the probability) is 100%. As the same<br />
conditions occur on different days, sometimes it rains and sometimes it does not. That said,<br />
any given day has a 50% chance of precipitation. We notice that the proportion is quite<br />
unstable at first, jumping from 100% down to nearly 40%; however, as many days with this<br />
same set of conditions pass (in the long run), we notice that the proportion becomes more stable<br />
and approaches the theoretical probability of 50%.<br />
[Figure: “Proportion of Rainy Days Under July 31, 2011 Overnight Conditions.” Horizontal axis: Day with Specific Conditions (0 to 140). Vertical axis: Proportion of Rainy Days (0 to 1.2).]<br />
Graph: Based on a random simulation involving the true probability of a 50% chance of precipitation<br />
and what occurs in the long run.<br />
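A simulation like the one behind this graph can be sketched as follows. This is not the author's original simulation; the function name `running_proportions`, the fixed seed, and the choice of 140 days are assumptions made for illustration.

```python
import random

def running_proportions(p, n_days, seed=0):
    """Simulate n_days independent days, each rainy with probability p,
    and return the proportion of rainy days observed after each day."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rainy = 0
    proportions = []
    for day in range(1, n_days + 1):
        if rng.random() < p:
            rainy += 1
        proportions.append(rainy / day)
    return proportions

props = running_proportions(p=0.5, n_days=140)
# Early proportions swing widely; later ones settle closer to 0.5.
print(props[0], props[9], props[-1])
```

Changing `seed` gives a different path each time, but with enough days every path should drift toward the theoretical probability `p`.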
As an interesting note, NWS has sophisticated helium “balloons” that they send up into the air to<br />
measure properties such as wind speed and direction, humidity, and barometric pressure. Then<br />
physics, based on theories of fluid mechanics, is used to make the prediction.<br />
Among many others that we could begin to state, there is one other major misconception about<br />
probability: that if the probability that it rains is said to be very small and yet it rains, then the<br />
probability must be wrong. This is incorrect. Probability is a measure of uncertainty. As in the<br />
case of meteorology, the predictions are scientific and are based upon prior data. Just because it<br />
has only rained, say, 10% of the time on days like today, this is not to say that it won’t rain. In<br />
fact, it very well might! The moral of the story is that probability talks about likelihood. Only in<br />
the instance of 0% and 100% probabilities is anything guaranteed. If there are situations in which<br />
something either never happens or always happens, then we’re probably not concerned about<br />
understanding probabilities.<br />
Probability<br />
Probability is a measure of uncertainty, typically expressed as a number between 0 (0%) and 1<br />
(100%), that describes how likely it is that an event will or will not occur under a specified set of<br />
conditions in the long run.<br />
Measuring Probabilities<br />
While probability is considerably more complicated than we’ll let on, the basic idea is that a<br />
probability can be calculated by considering the number of times some event occurs relative to<br />
the total number of “trials,” or observable situations. In simpler terms, it is the number of<br />
“successes” out of the total number of trials.<br />
Calculating Probability<br />
The probability that event $E$ occurs, denoted $P(E)$, is the ratio (or fraction) of successes<br />
to the number of trials. Mathematically, we write the number of times $E$ occurs as $n(E)$ and the<br />
total number of trials as $n(S)$. That is,<br />
$$P(E) = \frac{n(E)}{n(S)}$$<br />
This formula works when all elements in the sample space are equiprobable, that is, each<br />
individual outcome in the sample space has the same probability of occurring as any other<br />
outcome.<br />
As a note, the $n(\,)$ notation stands for “the number of ways” the event in parentheses can occur.<br />
The $S$ in the denominator stands for sample space, or the total number of things/situations/trials<br />
being considered in the experiment.<br />
Example 2: In a 2009 study of high-fructose corn syrup (HFCS), a corn-based sweetener used in<br />
a wide variety of foods, beverages, and condiments, 20 samples of HFCS were analyzed. Of<br />
those, nine were found by researchers to contain mercury. Based on the results of this<br />
study, find the probability that a random sample of HFCS contains mercury and explain what this<br />
result means.<br />
SOURCE: http://www.washingtonpost.com/wpdyn/content/article/2009/01/26/AR2009012601831.html<br />
SOLUTION: The event in this scenario is that mercury is found. Out of the total 20 trials, nine of<br />
them contained mercury. Therefore,<br />
$$P(\text{mercury}) = \frac{9}{20} = 0.45$$<br />
This means that if HFCS were to be sampled randomly and repeatedly, we would expect to<br />
find that about 45% of all samples contain traces of mercury in the long run. This does not guarantee that<br />
exactly 45 samples out of 100 will contain mercury.<br />
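The calculation in Example 2, and the warning that 45% does not mean exactly 45 out of every 100, can be illustrated with a short Python sketch. The helper name `probability` is hypothetical, and the simulation assumes each new sample independently contains mercury with probability 0.45, which goes beyond what the study itself establishes.

```python
from fractions import Fraction
import random

def probability(successes, trials):
    """Relative frequency n(E) / n(S) as an exact fraction."""
    return Fraction(successes, trials)

p_mercury = probability(9, 20)
print(float(p_mercury))  # 0.45

# Draw 100 hypothetical samples, five separate times, assuming each sample
# independently contains mercury with probability 0.45.
rng = random.Random(1)
counts = [sum(rng.random() < 0.45 for _ in range(100)) for _ in range(5)]
print(counts)  # counts of mercury-positive samples; they need not equal 45
```

Each entry of `counts` is one possible result of testing 100 samples, which is why a probability describes the long run rather than any single batch.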
Example 3: In July 2011, temperatures in Gilbert, Arizona were above 100°F every day<br />
(SOURCE: www.weather.com). Based on this data, a researcher concludes that the probability of<br />
above-100°F temperatures in Arizona is 100%. Comment on his findings.<br />
SOLUTION: Since temperatures in July 2011 were above 100°F on all 31 days of the<br />
month, it is fair to make the experimental observation that approximately 100% of July days<br />
have temperatures exceeding 100°F in the long run (although there have been July days in the past<br />
when temperatures were below 100°F); however, because we know that temperatures are<br />
periodic, or that they go from low to high and back to low over the course of a year, 100% is not<br />
a good estimate for temperatures in Arizona in general (temperatures are essentially never above<br />
100°F in January!).<br />
This example truly stresses the importance of critical thinking when using probabilities.<br />
Probabilities are often used and abused in the media, education, and politics, just to name<br />
a few. We want to make sure that we are as specific as possible.<br />
It will often be helpful to display counts and probabilities in tabular form, that is, through the<br />
use of tables. This type of table is called a contingency table. It not only helps to organize<br />
data, but also lets us see the big picture at the same time. Let’s consider an example.<br />
Example 4: A 1950 study considered 1,418 hospital patients in London, half with<br />
and half without lung cancer, and recorded whether or not they had smoked over the course of their lives. The<br />
following was found:<br />
Smoker (rows) / Lung Cancer (columns) | Yes | No<br />
Yes | 688 | 650<br />
No | 21 | 59<br />
Assuming this data can be used as a representation of the entire population of London residents,<br />
analyze the data by discussing the following:<br />
a. What is the probability that a randomly selected participant within this study develops<br />
lung cancer?<br />
b. Provided that a person was a smoker, what is the probability that he has lung cancer?<br />
c. Provided that a person was not a smoker, what is the probability that he has lung cancer?<br />
d. Given that a person has lung cancer, what is the probability that he smokes?<br />
SOLUTION:<br />
When answering these questions, it is useful to fully organize the data by providing all<br />
totals:<br />
Smoker (rows) / Lung Cancer (columns) | Yes | No | Smoker TOTALS<br />
Yes | 688 | 650 | 1,338<br />
No | 21 | 59 | 80<br />
Lung Cancer TOTALS | 709 | 709 | 1,418<br />
a. Since there is a total of 1,418 individuals being considered and, of those, 709 developed<br />
lung cancer,<br />
$$P(\text{lung cancer}) = \frac{709}{1418} = 0.50$$<br />
We must be careful in using this probability as it doesn’t really reveal anything about the<br />
link between lung cancer and smoking, since 709 patients with lung cancer and 709<br />
without lung cancer were chosen to participate in the study to begin with. This is a<br />
probability that was fixed by the researchers.<br />
b. There is a total of 1,338 individuals in the study who smoke (we are limited to the<br />
smokers only, per the way the question is stated). Of those individuals, 688 have lung<br />
cancer.<br />
$$P(\text{lung cancer, given smoker}) = \frac{688}{1338} \approx 0.514$$<br />
Slightly over half of the patients who are smokers developed lung cancer. This number is<br />
frighteningly large. Before we jump the gun in assuming that smoking is the culprit here,<br />
we should probably consider what happens with nonsmokers.<br />
c. There is a total of 80 nonsmokers in the group. Of them, 21 developed lung cancer.<br />
$$P(\text{lung cancer, given nonsmoker}) = \frac{21}{80} \approx 0.263$$<br />
Slightly more than one-fourth of nonsmokers developed lung cancer. This number<br />
appears to be considerably smaller than the one for the smokers. We speculate (but did not<br />
prove) that smoking increases the likelihood that one will develop lung cancer.<br />
d. There are 709 patients with lung cancer. Of these, 688 smoke.<br />
$$P(\text{smoker, given lung cancer}) = \frac{688}{709} \approx 0.970$$<br />
Are we confident in accusing a lung cancer patient of being a smoker? According to this<br />
data, perhaps.<br />
The moral of the story is: analyze the situation through a variety of lenses. What appears to be true<br />
might be an illusion of what we see immediately! Sometimes, however, it is about what the<br />
naked eye does not detect. This is what makes good analysts.<br />
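The four probabilities in Example 4 all come from dividing one cell or total of the contingency table by another. The sketch below shows one way to organize that arithmetic in Python; the dictionary layout and variable names are illustrative choices, not part of the original study.

```python
# Counts from the 1950 London study, indexed as counts[smoking status][outcome].
counts = {
    "smoker":    {"cancer": 688, "no_cancer": 650},
    "nonsmoker": {"cancer": 21,  "no_cancer": 59},
}

total = sum(sum(row.values()) for row in counts.values())     # all patients
cancer_total = sum(row["cancer"] for row in counts.values())  # lung cancer column total

# Each probability restricts attention to a different group (the denominator).
p_cancer = cancer_total / total
p_cancer_given_smoker = counts["smoker"]["cancer"] / sum(counts["smoker"].values())
p_cancer_given_nonsmoker = counts["nonsmoker"]["cancer"] / sum(counts["nonsmoker"].values())
p_smoker_given_cancer = counts["smoker"]["cancer"] / cancer_total

for label, value in [
    ("P(lung cancer)", p_cancer),
    ("P(lung cancer, given smoker)", p_cancer_given_smoker),
    ("P(lung cancer, given nonsmoker)", p_cancer_given_nonsmoker),
    ("P(smoker, given lung cancer)", p_smoker_given_cancer),
]:
    print(f"{label}: {value:.3f}")
```

Notice that the numerator 688 appears in two of the results; only the denominator changes, which is exactly what the phrase “provided that” controls.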
Homework Problems - 3.1<br />
1. A classmate of yours was absent when this section was discussed. Explain to her what a<br />
probability is in your own words.<br />
2. In a study performed by Cambridge University in the United Kingdom, it was found that,<br />
“One out of three people is overwhelmed by the latest breakthroughs in technology.”<br />
(SOURCE: http://www.gev.com/2011/07/study-one-out-of-three-people-overwhelmedby-technology/).<br />
Primarily, individuals are overwhelmed by how much information is<br />
available through the use of social networks and smartphones, to name just two. Explain<br />
what is meant by this statement in terms of probabilistic reasoning.<br />
3. In a 2007 survey conducted by DDB Worldwide, an internationally known advertising<br />
company, the following question was asked of a random group of 217 participants: “Is<br />
consistency in branding becoming any more or less important?” The following table<br />
displays the results:<br />
Response Number of respondents<br />
More important 143<br />
Less important 74<br />
Find the probability that a respondent believes that consistency in branding is:<br />
a. More important, then explain what this means.<br />
b. Less important, then explain what this means.<br />
4. The probability that a visit to a primary care physician’s (PCP) office results in neither<br />
lab work nor referral to a specialist is 35%. Of those coming to a PCP’s office, 30% are<br />
referred to specialists and 40% require lab work.<br />
Determine the probability that a visit to a PCP’s office results in both lab work and<br />
referral to a specialist. (Video Solution)<br />
5. A public health researcher examines the medical records of a group of 937 men who died<br />
in 1999 and discovers that 210 of the men died from causes related to heart disease.<br />
Moreover, 312 of the 937 men had at least one parent who suffered from heart disease,<br />
and, of these 312 men, 102 died from causes related to heart disease.<br />
Determine the probability that a man randomly selected from this group died of causes<br />
related to heart disease, provided that neither of his parents suffered from heart disease.<br />
(PROBLEM SOURCE: SOA/CAS Exam P Sample Questions, Page 5) (Video Solution)<br />
3.2 Joint Probability<br />
In the previous section, we began computing probability using some fairly basic ideas. In<br />
calculating probabilities, we made a huge assumption: that the number found represents what<br />
will occur in the long run. For instance, if we conduct a study and find that out of 100 people, 94<br />
respond positively to a new energy drink, can we conclude the drink is effective in providing<br />
added energy?<br />
The answer to this question is humbling: it depends upon how the data was collected. Suppose the<br />
participants are all college students who tend to consume a large amount of caffeine as it is.<br />
Would it be fair for the advertisement to say, “There is a 94% chance that this energy drink will<br />
energize you”? Not necessarily, since the result only appeared to be valid in a<br />
sample of college students. This means that the population from which the<br />
sample was taken must be specified. In this case, the population is the set of all college students and the sample is<br />
the 100 students who were selected. Thus, perhaps the advertisement should say, “Are you a<br />
college student? If so, there is a 94% chance that this energy drink will energize you.” That is,<br />
provided that this sample was a random sample and not a group of college students hand-picked<br />
from the respective population.<br />
Okay, so you have a data sample collected from a specific population and your goal is to now<br />
talk about probabilities.<br />
Example 1: Imagine that you work for a marketing agency and your goal is to determine the<br />
effectiveness of two different branding approaches to a new line of clothing. The first approach<br />
involves establishing a group of Facebook followers by offering discounts on<br />
clothing as an incentive for becoming a friend of the company. The company hypothesizes that, by seeing the<br />
company logo on their Facebook account each week,<br />
followers will gain a strong familiarity and comfort level with<br />
the company’s product. The second approach involves<br />
hiring Hollywood actors to endorse the product at film<br />
festivals and celebrity appearances. The company then<br />
tracks the degree of success of the branding tactic by<br />
measuring the number of retail outlets that agree to stock<br />
the product based on the branding used. They find that, of<br />
the 6 companies exposed to Tactic 1 (T1), 5 agreed to stock<br />
the product. Of the 7 companies exposed to Tactic 2 (T2), 5<br />
agreed to stock the product.<br />
Because of the amount of resources involved in selling the product to retail stores, a single<br />
marketing analyst can only reach out to about 15 businesses per month; however, if successfully<br />
sold, the result is a high level of profit for the clothing company, which, in turn, means you<br />
might get that raise after all.<br />
SOLUTION: Let’s start with a simpler question, and first consider T1. We find that the<br />
probability of a successful sale is:<br />
$$P(\text{sale under T1}) = \frac{5}{6} \approx 0.83$$<br />
Rounding to 80% to keep the arithmetic simple, this means that we should expect about 80% of all companies to sell the clothing line, in the long run.<br />
Suppose that a marketing analyst is to offer T1 to two different companies. He would like to<br />
know: what is the probability that both companies agree to sell the product? Is the answer 80%?<br />
Unfortunately, no. There is an 80% chance that each company agrees to sell the clothing line.<br />
We should expect that the probability that both sign on is less.<br />
We know that about 8 out of 10 times, Company 1 (C1) will sign on and that 8 out of 10 times<br />
Company 2 (C2) will sign on. Let’s compare the possibilities by using a tabular approach:<br />
[Table: a 10 × 10 grid of possible outcomes. Rows are Company 1 Choices: Y, Y, Y, Y, Y, Y, Y, Y, N, N. Columns are Company 2 Choices: Y, Y, Y, Y, Y, Y, Y, Y, N, N.]<br />
Each cell in the table represents a particular combination of C1’s choice and C2’s choice. So,<br />
the 1-1 entry (remember, this means first row, first column) of the table is the situation in which<br />
it does indeed turn out that C1 and C2 agree to sell the clothing line. The question was, what is<br />
the probability that both sign on? Since the definition of probability is the ratio of the number of<br />
ways the event can occur to the total number of possible outcomes, let’s do a bit of<br />
counting by highlighting important features of the table:<br />
[Table: the same 10 × 10 grid, with rows for Company 1 Choices (Y, Y, Y, Y, Y, Y, Y, Y, N, N) and columns for Company 2 Choices (Y, Y, Y, Y, Y, Y, Y, Y, N, N). The 8 × 8 block of cells in which both companies choose Y is shaded.]<br />
The shaded region represents the number of ways in which both companies sign<br />
on. This region is 8 x 8, which gives 64 possibilities. The total number of possibilities is simply<br />
the total number of cells in the table. Since the table is 10 x 10, we have 100 possibilities.<br />
So,<br />
$P(\text{C1 and C2 sign on}) = \frac{64}{100} = .64$<br />
This is, as speculated, less than the 80% probability that either company alone signs on. Let's consider<br />
what we really did here:<br />
$P(\text{C1 and C2 sign on}) = \frac{64}{100} = \frac{8 \times 8}{10 \times 10}$<br />
$= \frac{8}{10} \times \frac{8}{10}$<br />
Notice that<br />
$P(\text{C1 signs on}) = \frac{8}{10}$<br />
$P(\text{C2 signs on}) = \frac{8}{10}$<br />
Or, in short,<br />
$P(\text{C1 and C2 sign on}) = P(\text{C1 signs on}) \cdot P(\text{C2 signs on})$<br />
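The counting argument above can be checked with a short script. This is only a sketch: the variable names are illustrative and not from the text, and each company's 80% chance is modeled as 8 "Y" slots out of 10 equally likely slots, as in the table.<br />

```python
from fractions import Fraction
from itertools import product

# Ten equally likely slots per company: 8 yes (Y) and 2 no (N), i.e. an 80% chance of Y.
c1_choices = ["Y"] * 8 + ["N"] * 2
c2_choices = ["Y"] * 8 + ["N"] * 2

# Every cell of the 10 x 10 table is one (C1 choice, C2 choice) pair.
cells = list(product(c1_choices, c2_choices))
both_yes = sum(1 for c1, c2 in cells if c1 == "Y" and c2 == "Y")

p_by_counting = Fraction(both_yes, len(cells))     # shaded cells / all cells
p_by_product = Fraction(8, 10) * Fraction(8, 10)   # P(C1 signs on) * P(C2 signs on)

print(both_yes, len(cells), float(p_by_counting), float(p_by_product))
```

Counting cells and multiplying the two individual probabilities give the same value, which is the point of the "in short" formula.<br />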
Example 3: The idea of red-light cameras has been disputed quite often in Arizona and all across<br />
the United States. While unable to find any specific details, the author will assume that red-light<br />
runners have about a 70% chance of being caught by a red-light camera on any given instance.<br />
Suppose that on a given day, two cars run through an intersection during separate red lights,<br />
setting off the camera. What is the probability that both drivers are<br />
caught?<br />
SOLUTION: We can fairly assume that the first driver being caught<br />
and the second driver being caught (calling these events $D_1$ and $D_2$,<br />
respectively) are events that do not affect one another. Thus,<br />
$P(D_1 \text{ and } D_2) = P(D_1) \cdot P(D_2) = (.70)(.70) = .49$<br />
There is a 49% chance that both drivers are caught. This is about the likelihood of getting heads<br />
on the toss of a coin.<br />
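One way to build trust in the 49% figure is to simulate many pairs of red-light runners and see how often both are caught. The sketch below assumes the author's 70% catch rate and independence between the two drivers; the function name, the number of trials, and the seed are arbitrary illustrative choices.<br />

```python
import random

def simulate_both_caught(p_caught=0.70, trials=100_000, seed=0):
    """Estimate P(both drivers caught) by simulating two independent events."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    both = 0
    for _ in range(trials):
        first_caught = rng.random() < p_caught
        second_caught = rng.random() < p_caught
        if first_caught and second_caught:
            both += 1
    return both / trials

estimate = simulate_both_caught()
exact = 0.70 * 0.70  # the "and" rule for independent events
print(f"simulated: {estimate:.3f}, exact: {exact:.2f}")
```

The simulated relative frequency should land close to the exact product, though it will not match it digit for digit.<br />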
Example 5: In a crop of corn, the Food & Drug Administration<br />
(FDA) finds that two of the 20 bushels of corn are potentially<br />
contaminated with E. coli. Supposing that two bushels have<br />
already gone out for shipment to county marketplaces, how likely<br />
is it that both of the contaminated bushels have gone out?<br />
SOLUTION: The question asks about the probability that both<br />
have been shipped, that is, that the first bushel shipped is contaminated and the<br />
second bushel shipped is contaminated. We will refer to these events as simply $B_1$ and $B_2$. We will first<br />
write the "and" probability in the form for dependent events and will then determine whether or<br />
not a dependency exists (see Independence Property box above).<br />
$P(B_1 \text{ and } B_2) = P(B_1) \cdot P(B_2 \text{ given } B_1)$<br />
We know that $P(B_1) = \frac{2}{20}$. Now, since the first shipment "removes" one of the two<br />
contaminated bushels and one bushel out of the 20 available, the probability of shipping a second<br />
contaminated bushel changes to:<br />
$P(B_2 \text{ given } B_1) = \frac{1}{19}$<br />
Thus, the events are indeed dependent, and so the probability becomes:<br />
$P(B_1 \text{ and } B_2) = \frac{2}{20} \cdot \frac{1}{19} = \frac{2}{380} \approx .0053$<br />
There is less than a 1% chance that both contam<strong>in</strong>ated bushels went out.<br />
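As a check on the dependent "and" rule, the sketch below computes the product of fractions and then confirms it by listing every ordered way of shipping two different bushels. The labels "C" and "U" and the variable names are assumptions made for illustration only.<br />

```python
from fractions import Fraction
from itertools import permutations

# 20 bushels: 2 contaminated ("C") and 18 uncontaminated ("U").
bushels = ["C"] * 2 + ["U"] * 18

# "And" rule for dependent events: P(first contaminated) * P(second contaminated given first).
p_rule = Fraction(2, 20) * Fraction(1, 19)

# Brute-force check: every ordered way to ship two different bushels.
shipments = list(permutations(range(20), 2))
both_bad = sum(1 for i, j in shipments if bushels[i] == "C" and bushels[j] == "C")
p_count = Fraction(both_bad, len(shipments))

print(p_rule, p_count, float(p_rule))
```

Both routes give the same fraction, which is a little over half of one percent.<br />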
Does this outcome satisfy the farm producing these bushels of corn? Thinking in more detail, the<br />
main concern is actually whether one or more (at least one) contaminated bushel went out!<br />
To see how to find this, it is useful to think about the following, perhaps obvious,<br />
characteristics.<br />
Basic Properties of Probability (Kolmogorov Axioms)<br />
1) A particular event is guaranteed not to occur, is guaranteed to occur, or lies somewhere<br />
between these extremes.<br />
2) In a given situation, or sample space, it is guaranteed that something will occur (however<br />
small or insignificant).<br />
3) The summed probabilities of all the possible (non-overlapping) events in a situation constitute the entire, or<br />
the whole of, all possibilities.<br />
Mathematically, suppose that a sample space consists of n events $E_1, E_2, \ldots, E_n$. Then, the<br />
above verbal rules translate into:<br />
1) For any arbitrary event between events 1 and n, let's call this event $E_i$, then:<br />
$0 \le P(E_i) \le 1$<br />
2) Using $S$ to denote the sample space,<br />
$P(S) = 1$<br />
3) Summing the probabilities gives 100% of all possible outcomes:<br />
$P(E_1) + P(E_2) + \cdots + P(E_n) = 1$<br />
These basic properties are often referred to as the Kolmogorov axioms, named after the<br />
mathematician Andrey Kolmogorov. An axiom can be thought of as a necessary assumption. For<br />
instance, when physicists develop new concepts in physics, they assume that gravity follows<br />
certain properties. Thus, they have gravity axioms.<br />
The Kolmogorov axioms are extremely important in probability and the development of new<br />
ideas.<br />
In fact, recall Example 5 dealing with the contaminated corn crop. What are all the possibilities<br />
for shipping out two bushels from the total of 20? Let's list them out:<br />
0 contaminated bushels and 2 uncontaminated bushels ship (call it $E_0$)<br />
1 contaminated and 1 uncontaminated bushel ship (call it $E_1$)<br />
2 contaminated bushels ship (call it $E_2$)<br />
Are there any others? Not unless there is a possibility we have not considered. Since two bushels<br />
are guaranteed to go out, the outcome must fall into one of the three categories listed.<br />
Let's calculate the probability for each of these by hand:<br />
$P(E_0) = P(\text{first uncontaminated and second uncontaminated})$<br />
$= \frac{18}{20} \cdot \frac{17}{19} = \frac{306}{380} \approx .805$<br />
$P(E_1)$: there are two possibilities; either the first is contaminated and the second is not, or<br />
vice versa. We must consider both outcomes below:<br />
o $P(\text{first contaminated and second uncontaminated})$<br />
$= \frac{2}{20} \cdot \frac{18}{19} = \frac{36}{380} \approx .095$<br />
o $P(\text{first uncontaminated and second contaminated})$<br />
$= \frac{18}{20} \cdot \frac{2}{19} = \frac{36}{380} \approx .095$<br />
These two possibilities give about 9.5% + 9.5% = 19% of the sample space.<br />
$P(E_2) = \frac{2}{380} \approx .005$ (from previous calculation)<br />
(NOTE: Importantly, summing these three probabilities gives 1, as stated in the axioms!)<br />
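The three hand calculations can be reproduced by tallying every ordered shipment of two bushels by how many contaminated bushels it contains. This is a sketch under the same assumptions as the example (2 contaminated bushels among 20, every ordered pair equally likely); the names `tally` and `probs` are illustrative.<br />

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

bushels = ["C"] * 2 + ["U"] * 18  # 2 contaminated, 18 uncontaminated

# For each ordered shipment of two different bushels, count the contaminated ones (0, 1, or 2).
tally = Counter(
    (bushels[i] == "C") + (bushels[j] == "C")
    for i, j in permutations(range(20), 2)
)
total = sum(tally.values())  # 20 * 19 ordered shipments
probs = {k: Fraction(v, total) for k, v in tally.items()}

for k in sorted(probs):
    print(k, "contaminated:", probs[k], round(float(probs[k]), 4))
print("sum of the three probabilities:", sum(probs.values()))
```

Because the three categories cover every shipment and do not overlap, their probabilities add to exactly 1.<br />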
We can now see that the situation in which there is at least one contaminated bushel will occur<br />
about $19\% + .5\% = 19.5\%$ of the time. This is much higher than when we concerned ourselves with<br />
both going out! This is quite a frightening situation!<br />
Needless to say, this was a lot of work; however, we can use the axioms to reduce the amount<br />
of work we commit ourselves to.<br />
According to axiom 3:<br />
$P(E_0) + P(E_1) + P(E_2) = 1$<br />
Our earlier statement involved wanting to know the likelihood that at least one contaminated<br />
bushel went out. That only involves $E_1$ and $E_2$ (exactly one or exactly two contaminated bushels)! Solving for the sum of these two probabilities:<br />
$P(E_1) + P(E_2) = 1 - P(E_0)$<br />
That is,<br />
$P(\text{at least one contaminated}) = 1 - P(\text{0 contaminated})$<br />
$P(\text{at least one contaminated}) \approx 1 - .805 = .195$<br />
This is the same number we achieved taking the long route! We only had to find the probability<br />
of shipping 0 contaminated bushels, which is a little bit of work as compared to a lot of work!<br />
Probability of At Least One…<br />
Given any number of events involving quantities, the probability of at least one in quantity is 1<br />
minus the probability of 0 in quantity. That is:<br />
$P(\text{at least one}) = 1 - P(\text{none})$<br />
Mathematically, let subscripts $0, 1, 2, \ldots, n$ represent quantity, where corresponding events are<br />
denoted $E_0, E_1, E_2, \ldots, E_n$. Then,<br />
$P(E_1) + P(E_2) + \cdots + P(E_n) = 1 - P(E_0)$<br />
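The boxed rule is a one-line computation. The sketch below applies it to the corn example and compares it with the long route of adding the "exactly one" and "exactly two" cases; the helper name `prob_at_least_one` is hypothetical and chosen only for this illustration.<br />

```python
from fractions import Fraction

def prob_at_least_one(p_none):
    """Complement rule: P(at least one) = 1 - P(none)."""
    return 1 - p_none

# Corn example: P(0 contaminated bushels ship) = (18/20) * (17/19).
p_none = Fraction(18, 20) * Fraction(17, 19)
p_short_route = prob_at_least_one(p_none)

# Long route for comparison: P(exactly 1) + P(exactly 2).
p_exactly_one = 2 * Fraction(2, 20) * Fraction(18, 19)
p_exactly_two = Fraction(2, 20) * Fraction(1, 19)
p_long_route = p_exactly_one + p_exactly_two

print(p_short_route, p_long_route, round(float(p_short_route), 3))
```

The short route needs only one probability, which is why it scales better when there are many "at least one" cases to add up.<br />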
Homework Problems - 3.2<br />
1. In 2009 the H1N1 virus, commonly referred to as the “Swine Flu,” reportedly infected an<br />
estimated 10% of New Yorkers (SOURCE:<br />
http://www.reuters.com/article/2009/08/30/us-flu-newyork-idUSTRE57T26Y20090830).<br />
Suppose that an emergency room in New York City has two individuals with flu-like<br />
symptoms. (Video Solution)<br />
a. What condition(s) do you believe would make it appropriate to assume<br />
independence in this situation?<br />
b. By using the tabular approach and assuming independence, find the probability<br />
that both people have the H1N1 virus.<br />
c. By using the “and” rule, verify that you get the same answer that you found in<br />
Part b.<br />
d. Find the probability that neither of these individuals has the H1N1 virus.<br />
e. Find the probability that at least one of them has the H1N1 virus.<br />
f. Exposure to flu germs for even a short period of time can significantly increase<br />
one's chances of catching the flu. Suppose that if one is exposed to an individual<br />
with the flu virus, their chance of becoming infected is 15 percentage points<br />
higher than normal. Find the probability that both individuals have the flu virus.<br />
2. Many fire stations handle emergency calls for medical assistance as well as calls<br />
requesting firefighting equipment. A particular station says that the probability that an<br />
incoming call is for medical assistance is .85. This can be expressed as P(call is for<br />
medical assistance) = .85.<br />
a. Give a relative frequency interpretation of the given probability. That is, interpret<br />
what the number .85 means based on the definition of probability.<br />
b. What is the probability that a call is not for medical assistance?<br />
c. Assuming that successive calls are independent of one another (i.e., knowing that<br />
one call is for medical assistance doesn't influence our assessment of the<br />
probability that the next call will be for medical assistance), calculate the<br />
probability that both of the two successive calls will be for medical assistance.<br />
d. Still assuming independence, calculate the probability that for two successive<br />
calls, the first is for medical assistance and the second is not for medical<br />
assistance.<br />
e. Still assuming independence, calculate the probability that exactly one of the next<br />
two calls will be for medical assistance. (Hint: There are two different<br />
possibilities that you should consider. The one call for medical assistance might<br />
be the first call, or it might be the second call.)<br />
f. Do you think it is reasonable to assume that the requests made in successive calls<br />
are independent? Explain.<br />
3. "N.Y. Lottery Numbers Come Up 9-1-1 on 9/11" was the headline of an article that<br />
appeared in the San Francisco Chronicle (September 13, 2002). More than 5600 people<br />
had selected the sequence 9-1-1 on that date, many more than is typical for that sequence.<br />
A professor at the University of Buffalo is quoted as saying, "I'm a bit surprised, but I<br />
wouldn't characterize it as bizarre. It's randomness. Every number has the same chance of<br />
coming up. People tend to read into these things. I'm sure that whatever numbers come up<br />
tonight, they will have some special meaning to someone, somewhere." The New York<br />
state lottery uses balls numbered 0-9 circulating in 3 separate bins. To select the winning<br />
sequence, one ball is chosen at random from each bin. What is the probability that the<br />
sequence 9-1-1 would be the one selected on any particular day?<br />
4. On August 8, 2011, the Dow Jones Industrial Average fell 635 points (5.5%) to 10,810 points,<br />
representing the 6th worst point loss ever experienced. On that day, President Obama's<br />
approval ratings also suffered tremendously; only 22% of the nation's voters said they “Strongly<br />
Approve” of how he is performing in the presidential role (SOURCE:<br />
http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_presidential_tracking_poll).<br />
Suppose presidential hopeful Randall Terry (Democrat) speaks at a rally shortly<br />
thereafter and assumes that his approval rating as a candidate will likely closely mirror<br />
President Obama's. Suppose there are 40 swing voters (voters that are “on the fence”<br />
about who to vote for). (Video Solution)<br />
a. What is the probability that all 40 voters will strongly approve of Terry's plan?<br />
b. What is the probability that none of the 40 voters will strongly approve of Terry's<br />
plan?<br />
c. What is the probability that at least one voter will approve of Terry's plan?<br />
5. The following case study is reported in the article "Parking Tickets and Missing<br />
Women," which appears in an early edition of the book Statistics: A Guide to the<br />
Unknown. In a Swedish trial on a charge of overtime parking, a police officer testified<br />
that he had noted the position of the two air valves on the tires of a parked car: To the<br />
closest hour, one valve was at the 1 o'clock position and the other was at the 6 o'clock<br />
position. After the allowable time for parking in that zone had passed, the policeman<br />
returned, noted the valves were in the same position, and ticketed the car. The owner of<br />
the car claimed that he had left the parking place in time and had returned later. The<br />
valves just happened by chance to be in the same positions. An "expert" witness<br />
computed the probability of this occurring as (1/12)(1/12) = 1/144.<br />
a. What reasoning did the expert use to arrive at the probability of 1/144?<br />
b. Can you spot the error(s) in the reasoning that leads to the stated probability of<br />
1/144? What effect does this error(s) have on the probability of occurrence? Do<br />
you think that 1/144 is larger or smaller than the correct probability of occurrence?<br />
6. Jeanie is a bit forgetful, and if she doesn't make a "to do" list, the probability that she<br />
forgets something she is supposed to do is .1. Tomorrow she intends to run three errands,<br />
and she fails to write them on her list.<br />
a. What is the probability that Jeanie forgets all three errands? What assumptions did<br />
you make to calculate this probability?<br />
b. What is the probability that Jeanie remembers at least one of the three errands?<br />
c. What is the probability that Jeanie remembers the first errand but not the second<br />
or third?<br />
7. One of the myths most commonly believed by students on multiple choice exams is that,<br />
as long as they always use letter 'C' as their guess, they increase their chances of<br />
guessing correctly. This, of course, is absurd, since there is not usually a set pattern used<br />
by instructors in pairing correct answers with certain letters (certainly not for me,<br />
anyhow).<br />
Suppose that a multiple-choice quiz has two problems on it and that the student has no<br />
idea how to answer them, so he guesses. Each problem has letters A-E corresponding to<br />
the answers to choose from. Using counting techniques discussed in class, find and<br />
explain how you found the following: (Video Solution)<br />
a. What is the probability that both guesses are correct?<br />
b. What is the probability that both guesses are incorrect?<br />
c. What is the probability that he receives a 50% on the test?<br />
d. How likely is it that he gets at least one problem correct?<br />
e. What is the probability that he receives a 90% on the exam (assume no partial<br />
credit is possible)?<br />
f. How did the idea of “counting tables” allow you to answer these questions<br />
without having to do additional work for each subsequent table?<br />
3.3 Probability of Unions<br />
Imagine that you toss a fair, two-sided quarter. You let it land and take a look at the side facing<br />
up. What is the probability that you see heads or tails? (Assume the toss will be ignored if it<br />
happens to land on its side.)<br />
You can probably see fairly quickly that the outcome desired is guaranteed; when a coin is<br />
tossed, it will result in one of two outcomes: heads or tails. If someone in a bet were to tell you<br />
that he will win if the toss of a coin results in heads or tails, then you could probably tell him,<br />
“Congratulations!”<br />
Adding to our intuition (no pun intended), we will write the situation in the form of a<br />
mathematical probability. The sample space will have two outcomes: $S = \{H, T\}$<br />
Then,<br />
$P(H \text{ or } T) = 1$<br />
We also know that<br />
$P(H) = .5$ and $P(T) = .5$<br />
So, we can gladly write:<br />
$P(H \text{ or } T) = P(H) + P(T) = .5 + .5 = 1$<br />
Simple enough! We feel pretty satisfied and so we hope to<br />
tackle another problem:<br />
Example 1: A large company offers a self-insured health<br />
insurance policy to its employees to help them reduce premium and copay costs. Using its<br />
historical data from the last two years, the company analyst considers the risk status of the<br />
employees (low or high) based on preexisting conditions, and the type of claim filed (physical<br />
health or mental health). He finds that 70% of employees have filed a mental health claim and<br />
that 40% of employees have been categorized as high risk. Further, he finds that 20% of<br />
employees are low risk and have filed a physical health claim. The company only insures the<br />
first claim. All claims thereafter are paid for by a third-party insurer.<br />
For reporting purposes, he would like to find the probability that a randomly selected employee<br />
(or an employee that is to be hired in the future) is high risk or will file a mental health claim. As<br />
he is writing his report, he hits a speed bump:<br />
Letting $R$ be the event that an employee is high risk and $M$ the event that an employee files a mental health claim,<br />
$P(R \text{ or } M) = P(R) + P(M) = .40 + .70 = 1.10$<br />
He quickly realizes that this probability is invalid because a probability cannot be greater than 1,<br />
or 100%. What happened?<br />
SOLUTION:<br />
We first organize his data into a table to help us better see what is happening:<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | |<br />
Mental | | | .70<br />
Total | | .40 |<br />
The probabilities outside of the boxes represent totals for mental health claims and for high risk<br />
claims. The probability in the 1-1 entry of the table represents the probability of being low risk<br />
and filing a physical health claim. Since we know that this data represents all of those who have<br />
filed claims, we know that 100% have filed one type or the other. Additionally, each employee<br />
considered falls into one of the two risk categories. So we fill in more details:<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | | .30<br />
Mental | | | .70<br />
Total | .60 | .40 |<br />
We can also proceed to fill in the boxes in the table, since each person falls into exactly one of<br />
the four positions (low physical, low mental, high physical, high mental):<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | .10 | .30<br />
Mental | .40 | .30 | .70<br />
Total | .60 | .40 |<br />
Now, the analyst added the second row total to the second column total, as highlighted in the<br />
table below:<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | .10 | .30<br />
Mental | .40 | .30 | .70<br />
Total | .60 | .40 |<br />
(Highlighted: the Mental row total, .70, and the High column total, .40.)<br />
The problem seems to be that the .40 and the .70 both include the probability of High Risk and<br />
Mental Claim! In other words, it is being counted twice, hence the end probability that is greater<br />
than 1.<br />
Instead, let's add up the individual box probabilities as illustrated in the table below:<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | .10 | .30<br />
Mental | .40 | .30 | .70<br />
Total | .60 | .40 |<br />
(Highlighted: the three boxes that are high risk or a mental claim, namely .10, .40, and .30.)<br />
We find that $P(\text{high risk or mental claim}) = .10 + .40 + .30 = .80$, which is a number that rests between<br />
0% and 100%. We conclude that, in fact, there is an 80% chance that a claim-filing employee is<br />
high risk or files a mental claim (or both!).<br />
While this does not seem like a huge amount of work, suppose that we instead had three types of<br />
claims and three different statuses. It would probably be convenient to have some sort of<br />
mathematical approach to the solution.<br />
Let's go back to the table in which the double-count occurred:<br />
Claim\Risk | Low | High | Total<br />
Physical | .20 | .10 | .30<br />
Mental | .40 | .30 | .70<br />
Total | .60 | .40 |<br />
We are free to add the two probabilities, $P(R)$ (high risk) and $P(M)$ (mental health claim), but we must be sure to take out the .30<br />
one time, so that it is single-counted and not double-counted:<br />
$P(R \text{ or } M) = .40 + .70 - .30 = .80$<br />
This is the same answer as before! Notice what we really did:<br />
$P(R \text{ or } M) = .40 + .70 - .30$<br />
$= P(R) + P(M) - P(R \text{ and } M)$<br />
Regardless of the context/application of the probability, this issue can be resolved as shown.<br />
Probability of One Event “Or” the Other<br />
Given two events, $A$ and $B$, the probability that one or the other occurs is the sum of the<br />
individual probabilities with the double-count removed once. Mathematically,<br />
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$<br />
Typically, $\cup$ (called a union) is used to replace the word “or”, making the above equation<br />
$P(A \cup B) = P(A) + P(B) - P(A \text{ and } B)$<br />
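The "Or" rule can be checked against the completed claim/risk table from Example 1. The sketch below stores the four box probabilities in a dictionary (a representation chosen here for illustration, not something from the text) and compares the formula with a direct sum of the qualifying boxes.<br />

```python
from fractions import Fraction

# Joint probabilities from the completed claim/risk table in Example 1.
joint = {
    ("physical", "low"): Fraction(20, 100),
    ("physical", "high"): Fraction(10, 100),
    ("mental", "low"): Fraction(40, 100),
    ("mental", "high"): Fraction(30, 100),
}

p_mental = sum(p for (claim, _), p in joint.items() if claim == "mental")  # row total
p_high = sum(p for (_, risk), p in joint.items() if risk == "high")        # column total
p_both = joint[("mental", "high")]                                         # the double-counted box

# "Or" rule: add the two totals, then remove the overlap once.
p_or_formula = p_high + p_mental - p_both

# Direct check: add every box that is high risk, a mental claim, or both.
p_or_boxes = sum(p for (claim, risk), p in joint.items()
                 if claim == "mental" or risk == "high")

print(float(p_high + p_mental), float(p_or_formula), float(p_or_boxes))
```

The first printed value is the analyst's invalid total; the other two agree once the overlap is removed.<br />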
At the beginning of this section, we addressed a coin-tossing problem that involved the<br />
summation of the probability of heads and the probability of tails. Let's see why we could get<br />
away with not subtracting away the double-count. We use the “Or” probability set-up:<br />
$P(H \text{ or } T) = P(H) + P(T) - P(H \text{ and } T)$<br />
We already know the first two probabilities on the right-hand side, but what is the third<br />
probability value? Let's analyze its meaning:<br />
$P(H \text{ and } T) = P(\text{the coin lands on both heads and tails})$<br />
Of course, it is impossible to get both heads and tails in one toss of a coin! Any impossible<br />
outcome has a probability of 0%. That is:<br />
$P(H \text{ and } T) = 0$<br />
So,<br />
$P(H \text{ or } T) = P(H) + P(T) - P(H \text{ and } T) = .5 + .5 - 0 = 1$<br />
We simply “lucked out” when this problem worked out according to our intuition. In general,<br />
you need only remember the “Or” probability formula, for the reasons given, to solve any<br />
problem involving the occurrence of one outcome or another.<br />
Example 2: It is often interesting to note how political preference (Democrat or Republican)<br />
varies within a married couple. Suppose that in a survey of<br />
160 couples it is found that 60 of the couples agree on a<br />
preference to vote Democrat and 40 are such that the<br />
husband votes Democrat and the wife votes Republican. The<br />
total number of wives that vote Democrat is 90. What is the<br />
probability that the couple has a husband or a wife that is<br />
Republican?<br />
SOLUTION: We first arrange this information in a table:<br />
Husband\Wife | Democrat | Republican | Total<br />
Democrat | 60 | 40 |<br />
Republican | | |<br />
Total | 90 | | 160<br />
Note that the bottom-right corner represents the table total.<br />
We know that the number of husbands voting Democrat is<br />
$60 + 40 = 100$. This means that the<br />
number of husbands voting Republican is<br />
$160 - 100 = 60$. Additionally, we conclude that the<br />
number of couples where the husband votes Republican and the wife votes Democrat is<br />
$90 - 60 = 30$. We fill this information in:<br />
Husband\Wife | Democrat | Republican | Total<br />
Democrat | 60 | 40 | 100<br />
Republican | 30 | | 60<br />
Total | 90 | | 160<br />
This allows us to fill in the remaining details in the table:<br />
Husband\Wife | Democrat | Republican | Total<br />
Democrat | 60 | 40 | 100<br />
Republican | 30 | 30 | 60<br />
Total | 90 | 70 | 160<br />
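We convert the counts into proportions by dividing each cell entry by the total number of couples,<br />
160:<br />
Husband\Wife | Democrat | Republican | Total<br />
Democrat | .375 | .25 | .625<br />
Republican | .1875 | .1875 | .375<br />
Total | .5625 | .4375 |<br />
Let $H_R$ be the event that the husband votes Republican and $W_R$ the event that the wife votes Republican.<br />
So,<br />
$P(H_R \text{ or } W_R) = P(H_R) + P(W_R) - P(H_R \text{ and } W_R) = .375 + .4375 - .1875 = .625$<br />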
We find that there is a 62.5% chance that in a couple either the husband votes Republican, the<br />
wife votes Republican, or both vote Republican.<br />
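The table-filling steps and the final "Or" calculation can be traced in a few lines. This is a sketch that starts from the three counts given in the example plus the survey total; all variable names are illustrative.<br />

```python
from fractions import Fraction

total = 160
both_dem = 60        # husband Democrat, wife Democrat
h_dem_w_rep = 40     # husband Democrat, wife Republican
wives_dem = 90       # column total for wives voting Democrat

# Fill in the missing cells in the same order as the worked solution.
husbands_dem = both_dem + h_dem_w_rep      # 60 + 40
husbands_rep = total - husbands_dem        # 160 - 100
h_rep_w_dem = wives_dem - both_dem         # 90 - 60
both_rep = husbands_rep - h_rep_w_dem      # remaining cell in the Republican row
wives_rep = h_dem_w_rep + both_rep         # column total for wives voting Republican

# "Or" rule on proportions: P(H_R) + P(W_R) - P(H_R and W_R).
p_or = (Fraction(husbands_rep, total) + Fraction(wives_rep, total)
        - Fraction(both_rep, total))

# Cross-check: everyone except the couples where both vote Democrat.
p_check = 1 - Fraction(both_dem, total)

print(husbands_rep, wives_rep, both_rep, float(p_or), float(p_check))
```

The cross-check works because the only couples without a Republican spouse are the ones in which both vote Democrat.<br />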
At this point you might be wondering why we don't simply draw out the table and ignore the<br />
mathematical formulas. When possible, tables are extremely useful, but they might not always be<br />
available. Consider the following example.<br />
Example 3: Testing has determined that a particular ballistic missile has an<br />
80% chance of hitting its intended target. Suppose that an enemy jet<br />
approaches a military base and so two missiles are fired at the incoming jet.<br />
What is the probability that this threat is eliminated?<br />
SOLUTION: This is the probability that one or both missiles hit the target.<br />
We only have one probability, so filling out a table would not be possible.<br />
Let $M_1$ be the event that the first missile hits and $M_2$ the event that the second missile hits.<br />
We want to know<br />
$P(M_1 \text{ or } M_2) = P(M_1) + P(M_2) - P(M_1 \text{ and } M_2)$<br />
We already know the first two probabilities on the right-hand side (.80 each), but we are not given<br />
information on $P(M_1 \text{ and } M_2)$. We can fairly assume that the outcome of one missile has no (or<br />
very minimal) impact on the outcome of another missile, and so we assume the events are<br />
independent. This allows us to write:<br />
$P(M_1 \text{ and } M_2) = P(M_1) \cdot P(M_2) = (.80)(.80) = .64$<br />
And so,<br />
$P(M_1 \text{ or } M_2) = .80 + .80 - .64 = .96$<br />
We conclude that there is a 96% chance that the enemy jet is eliminated.<br />
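The missile example combines the "Or" rule with the independence assumption, and the result can be cross-checked with the "at least one" rule from Section 3.2. The sketch below assumes, as the solution does, that the two shots are independent; the variable names are illustrative.<br />

```python
from fractions import Fraction

p_hit = Fraction(80, 100)  # chance that a single missile hits (given in the example)

# "Or" rule, using independence for the overlap: P(M1 and M2) = P(M1) * P(M2).
p_both_hit = p_hit * p_hit
p_threat_eliminated = p_hit + p_hit - p_both_hit

# Cross-check with the "at least one" rule: 1 - P(both missiles miss).
p_both_miss = (1 - p_hit) ** 2
p_check = 1 - p_both_miss

print(float(p_both_hit), float(p_threat_eliminated), float(p_check))
```

If the shots were not independent (for example, if both relied on the same faulty guidance data), the overlap term would differ and so would the final answer.<br />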
Homework Problems - 3.3<br />
1. A gaming investor is considering becoming a financial partner in a new casino. In<br />
deciding to go in on the deal, he reviews gaming revenues for previous years. From<br />
experience and industry research, he decides that the gaming industry tends to be<br />
successful when total gross revenues for card rooms are above $1 million or when gross<br />
revenues for lotteries are above $20 billion. Between 2000 and 2009, he found that 50%<br />
of the time, both sectors have been successful and that 0% of the time only card tables<br />
were successful (and lotteries were not). Lotteries were unsuccessful 30% of the time<br />
(SOURCE: 2011 U.S. Statistical Abstract, Table 1258). What is the probability that the<br />
investor’s conditions will be met? In your professional opinion, is it likely that he will<br />
decide to become a partner in the proposal? (Video Solution)<br />
2. A researcher conducts a study on a total of 600 cats to determine whether or not they tend<br />
to be adaptive to danger and whether or not their time to respond to those dangers is fast<br />
enough to avoid harm. The animals were exposed to non-harmful stimuli to assist in<br />
answering the researcher’s questions. In his report he details that, “207 non-adaptive cats<br />
were studied and, of them, 180 were found to have response times that were simply not<br />
fast enough. By comparison, a total of 300 cats were both adaptive and had response<br />
times that were fast enough.” How likely is it that a cat is adaptive to environmental<br />
physical dangers or has a response time that is fast enough? (Video Solution)<br />
3. In the March 3, 2011 episode of the Dr. Oz Show entitled “Dangerous Doctors: Is Your<br />
MD Hazardous to Your Health?” Dr. Oz mentioned that 20% of the time doctors order<br />
scans to protect themselves from a lawsuit. Dr. Oz also said, “Up to 1/3 of all tests and<br />
treatments are entirely unnecessary.” (Video Solution)<br />
a. Two patients are given orders for scans from a particular doctor. What is the<br />
probability that one patient or the other was given scans to protect the doctor<br />
against a lawsuit?<br />
b. One patient is given orders for two different tests/treatments. What is the<br />
probability that one or both of them was/were unnecessary?<br />
c. A patient is prescribed a scan and a blood test. What is the probability that an<br />
unnecessary prescription was made, through the patient’s eyes?<br />
4. In all of his Fall 2010 classes, Milos discovered that 44% of his students earned a ‘B’ or<br />
better on their homework average. He also discovered that 50% of his students had a ‘B’<br />
or better homework average or a ‘B’ or better overall grade in the class (SOURCE:<br />
Milos’ Fall 2010 Grade Spreadsheet). If 30% of all his students received a ‘B’ or better<br />
homework average and a ‘B’ or better class grade, what percentage of his students earned<br />
a ‘B’ or better in the class? (Video Solution)<br />
5. In all of his Fall 2010 classes, Milos discovered that, of all students who<br />
earned a ‘C’ or better homework average, 87% earned a ‘C’ or better<br />
final class grade. 70% of all students in his classes earned a ‘C’ or better homework<br />
average or earned a ‘C’ or better final class grade (SOURCE: Milos’ Fall 2010 Grade<br />
Spreadsheet), while only 49% earned a ‘C’ or better on homework and as a final class<br />
grade (some still did well in the class, but maybe failed to turn in homework). What is the<br />
probability that a randomly selected student in his class earned a ‘C’ or better final class<br />
grade? (Video Solution)<br />
3.4 Conditional Probability<br />
In many cases, a probability depends on what we already know. For instance, would we believe<br />
that the likelihood of a car accident changes, provided that the roads are slick from snow? We<br />
would probably agree that the likelihood increases if we already know the road conditions.<br />
Suppose a fair, two-sided coin is tossed. You are told that the outcome is not a head. What is the<br />
likelihood that the outcome is tails?<br />
The answer is probably obvious: if you know the outcome was not heads, and the only two<br />
possibilities are heads and tails, then there is a 100% chance the outcome is tails.<br />
This is a conditional probability. That is, if<br />
\(H\) = the outcome is heads and \(T\) = the outcome is tails.<br />
Further, to indicate that the outcome is not one of the above, we often put a bar on top of the<br />
event name:<br />
\(\bar{H}\) = the outcome is not heads<br />
\(\bar{T}\) = the outcome is not tails<br />
Then,<br />
\[P(H) = 0.5\]<br />
However, given that we know the outcome was not tails, the probability of heads jumped to 1.<br />
We might write:<br />
\[P(H \text{ given } \bar{T}) = 1\]<br />
Instead of using the word “given” we often use a vertical line (called a “pipe”), |. That is,<br />
\[P(H \mid \bar{T}) = 1\]<br />
Conditional Probability<br />
The conditional probability of event \(A\) provided that \(B\) already occurred is written as<br />
\[P(A \mid B)\]<br />
and implies that the likelihood of \(A\) may be different, knowing that \(B\) already took place.<br />
Example 1: Due to wars at sea, shipwrecks, and other such disasters, there are roughly<br />
3,000,000 sunken vessels in all of the seas in the world! Suppose an area of the ocean<br />
is mapped out due to the historic ships that have wrecked in that area. There is speculation that,<br />
of the estimated 20 ships in that region, 11 are original pirate ships. Given that a pirate ship is the<br />
first of the 20 recovered, what is the probability that the next one found will also be a pirate ship?<br />
SOLUTION:<br />
We would like to find the probability that a pirate ship is found, given that one pirate ship has<br />
already been removed. If one ship is removed, there are 19 ships left. Since the ship removed<br />
was a pirate ship, there are only 10 pirate ships remaining. That is,<br />
\[P(\text{pirate ship 2nd} \mid \text{pirate ship 1st}) = \frac{10}{19} \approx 0.526\]<br />
Note that this is different than<br />
\[P(\text{pirate ship})\]<br />
Why?<br />
This probability has no condition placed on it. It assumes the very basic information: 20 ships, 11<br />
pirate ships. So,<br />
\[P(\text{pirate ship}) = \frac{11}{20} = 0.55\]<br />
The conditional probability, in this case, is different than the unconditional probability.<br />
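To see the difference between the two probabilities in the pirate ship example side by side, here is a small Python sketch. The variable names are invented for illustration; the counts (20 ships, 11 pirate ships) are taken from the example. Exact fractions are used so no rounding hides the comparison.<br />

```python
from fractions import Fraction

# Counts from the example; variable names are illustrative only.
total_ships = 20
pirate_ships = 11

# Unconditional probability: nothing has been recovered yet.
p_pirate = Fraction(pirate_ships, total_ships)

# Conditional probability: one pirate ship has already been recovered,
# so both the pirate count and the total drop by one.
p_pirate_given_pirate = Fraction(pirate_ships - 1, total_ships - 1)

print(p_pirate, float(p_pirate))
print(p_pirate_given_pirate, float(p_pirate_given_pirate))
```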
Example 2: Determine whether or not the following situations represent \(A\) and \(B\) as<br />
independent or dependent events.<br />
a) \(A\): It rains in Chandler today<br />
\(B\): There is a car accident in Chandler<br />
b) \(A\): The Arizona Cardinals make it to the playoffs<br />
\(B\): Subway runs out of whole wheat bread<br />
c) \(A\): Dow Jones Industrial reports an enormous loss<br />
\(B\): Microsoft stocks plummet<br />
d) ( ) ( ) ( )<br />
e) ( ) ( ) ( ) )<br />
f) ( ) ( ) ( ) )<br />
SOLUTION:<br />
a) Dependent; rain likely increases the likelihood of accidents<br />
b) Independent; these events probably don’t have any impact on one another<br />
c) Dependent; Microsoft is part of the Dow Jones Industrial and so there is a strong<br />
relationship between the two<br />
d) Independent; we see that the likelihood of one event does not change given that the other has occurred;<br />
it is still .75<br />
e) Dependent; the likelihood of one event does change given that the other has occurred; it drops to .3<br />
f) If the product of the two given probabilities does equal the probability of \(A\) and \(B\), then the<br />
events are independent, as this would mean that the conditional probability is .75, which is the same as<br />
the unconditional probability. We see that the product does equal the joint probability, and so we conclude that the events are independent.<br />
Example 3: An aircraft radar system detects 30 aircraft in a 100-mile radius. Of these, 18 are<br />
ally planes, 6 are cargo planes, and 6 are enemy planes. Given that a plane approaching the radar<br />
is ruled out as being an enemy plane, what is the probability that it is a cargo plane?<br />
SOLUTION: First off, define:<br />
\(E\) = the plane is an enemy plane and \(C\) = the plane is a cargo plane.<br />
We want to know,<br />
\[P(C \mid \bar{E})\]<br />
Since it is not an enemy plane, it must be one of the remaining 24 aircraft. Of those, 6 are cargo<br />
planes, so<br />
\[P(C \mid \bar{E}) = \frac{6}{24} = 0.25\]<br />
Example 4: Suppose that Company 1 (C1) and Company 2 (C2) are<br />
competitors in the clothing business. In fact, they both have locations<br />
within Chandler Fashion Center Mall. Given previous business<br />
experience, the marketing analyst knows that each company has an<br />
80% chance of agreeing to sell a particular line of clothing; however, if<br />
C1 agrees to sell the clothing line, C2 wants to stay competitive and so<br />
definitely purchases the clothing line. How is the probability that both<br />
will agree affected by this new knowledge?<br />
SOLUTION: In this situation, the decision of C2 is <strong>dependent</strong> (conditional) upon the decision<br />
of C1. Consider a table in which C2’s choices will reflect the decision of C1.<br />
Company 1 Choices: Y Y Y Y Y Y Y Y N N<br />
Company 2 Choices When C1 Agrees: Y Y Y Y Y Y Y Y Y Y<br />
The difference is that C2’s decisions are all to agree, provided that C1 has agreed. If C1 does not<br />
agree, then we’re not really sure how C2 will act, but we don’t really care, since the probability<br />
we are in search of is when both companies agree!<br />
Here we have:<br />
\[P(C1 \text{ and } C2) = P(C1) \cdot P(C2 \mid C1) = \frac{8}{10} \cdot \frac{10}{10} = \frac{80}{100} = 0.80\]<br />
We could just as well have written,<br />
\[P(C1 \text{ and } C2) = (0.80)(1.00) = 0.80\]<br />
so as to be using the decimal form instead of the tabular fractions.<br />
If you look back at the reasoning here, you’ll notice that we have bolded the word “dependent.”<br />
In previous sections, we didn’t have to worry about dependency, since we assumed that the<br />
choices of C1 and C2 were independent, that is, one outcome did not affect the other, and vice<br />
versa.<br />
How do we know whether events are dependent or independent? Oftentimes this is based upon<br />
some knowledge of the situation or, perhaps, our intuition. Let’s set up the important ideas here<br />
and then we’ll look at a few examples of dependence versus independence.<br />
Probability of Two Events Occurring Simultaneously<br />
Given two events, \(A\) and \(B\), then<br />
If \(A\) and \(B\) are independent events, then<br />
\[P(A \text{ and } B) = P(A) \cdot P(B)\]<br />
Or, as it is often written<br />
\[P(A \cap B) = P(A) \cdot P(B)\]<br />
where \(\cap\) is a symbol to represent the word “and”. We use this in mathematics often.<br />
And if \(A\) and \(B\) are dependent events, then<br />
\[P(A \text{ and } B) = P(A) \cdot P(B \mid A)\]<br />
Or, as it is often written<br />
\[P(A \cap B) = P(A) \cdot P(B \mid A)\]<br />
In either instance, the end result involves multiplication.<br />
NOTE: \(A\) and \(B\) are generic names and thus can be attached to an event in an arbitrary order.<br />
As an interesting note, we can make the following conclusion:<br />
Independence Property<br />
Given two events, \(A\) and \(B\), if \(P(B \mid A) = P(B)\), then \(B\) does not depend on \(A\), and so the<br />
dependence formula reduces to:<br />
\[P(A \text{ and } B) = P(A) \cdot P(B \mid A) = P(A) \cdot P(B)\]<br />
This result is important, because it allows you to only have to remember the “and” rule for<br />
dependent events. If the next event does not depend on the prior event, then the end probability is<br />
just a product of the two individual probabilities.<br />
Though the ideas presented above might at first seem confusing, you’ll notice that the idea of<br />
joint probabilities has not changed. The only new caution is to take care to acknowledge whether<br />
the events are independent or not. We’ll consider a few more examples below.<br />
Example 5: The probability that a resistor and capacitor both fail in a portable electronic<br />
device in the fifth year of use is 0.95%. The probability that the resistor fails is 1.22% and the<br />
probability that the capacitor fails is 1%. Are the events independent? If they are not<br />
independent, what is the probability that the capacitor fails given that the resistor fails?<br />
SOLUTION:<br />
Let \(R\) = the resistor fails and \(C\) = the capacitor fails.<br />
If the two events are independent, then the product of unconditional probabilities should give us<br />
the provided joint probability.<br />
We have that,<br />
\[P(R) = 0.0122\]<br />
\[P(C) = 0.01\]<br />
If they are independent events, then<br />
\[P(R \text{ and } C) = P(R) \cdot P(C) = (0.0122)(0.01) = 0.000122\]<br />
However, the joint probability under independence would be 0.0122%, not the 0.95% we were given.<br />
Thus,<br />
\[P(R \text{ and } C) = P(R) \cdot P(C \mid R)\]<br />
That is, the probability that the capacitor fails is dependent upon the resistor failing. Filling in<br />
what we know:<br />
\[0.0095 = 0.0122 \cdot P(C \mid R)\]<br />
Dividing gives,<br />
\[P(C \mid R) = \frac{0.0095}{0.0122} \approx 0.779\]<br />
Thus, there is a 77.9% chance the capacitor fails if the resistor fails. The resistor is an integral<br />
part in this device. The likelihood of the capacitor failing increases if the resistor fails first.<br />
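The independence check in Example 5 can be sketched in Python as follows. The helper name `looks_independent` and its tolerance are hypothetical choices made for this illustration; the three probabilities are the ones given in the example, converted from percentages to decimals.<br />

```python
import math

def looks_independent(p_a, p_b, p_a_and_b, rel_tol=1e-6):
    """Return True when P(A and B) matches P(A) * P(B) within a tolerance."""
    return math.isclose(p_a_and_b, p_a * p_b, rel_tol=rel_tol)

# Values from Example 5, written as decimals.
p_resistor = 0.0122
p_capacitor = 0.01
p_both = 0.0095

# What the joint probability would be if the failures were independent.
p_both_if_independent = p_resistor * p_capacitor

independent = looks_independent(p_resistor, p_capacitor, p_both)

# Since they are not independent, solve P(R and C) = P(R) * P(C | R) for P(C | R).
p_capacitor_given_resistor = p_both / p_resistor

print(independent, round(p_capacitor_given_resistor, 3))
```

With real data the joint probability will rarely match the product exactly, so some tolerance (or a formal statistical test) is usually needed; the tolerance used here is arbitrary.<br />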
The above example brings up a useful result.<br />
Calculating the Conditional Probability of A given B<br />
Since \(P(A \text{ and } B) = P(B) \cdot P(A \mid B)\),<br />
We have that,<br />
\[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}\]<br />
Example 6: In a demographic study of a small town, it is found that 5% of the adult residents are<br />
unemployed and living at or below poverty level. A total of 8% are unemployed. What is the<br />
probability that a person in this town is living at or below the poverty level, given that they are<br />
unemployed? Interpret the meaning of your answer.<br />
SOLUTION:<br />
Letting \(L\) = a person lives at or below the poverty level and \(U\) = a person is unemployed, we<br />
would like to know \(P(L \mid U)\).<br />
We have that \(P(L \text{ and } U) = 0.05\) and \(P(U) = 0.08\). Thus:<br />
\[P(L \mid U) = \frac{P(L \text{ and } U)}{P(U)} = \frac{0.05}{0.08} = 0.625\]<br />
This says that, if a person is unemployed, there is a 62.5% chance they are living at or below the<br />
poverty level. We would probably expect this figure to be quite high.<br />
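The conditional probability formula is short enough to wrap in a small helper. The function below is a sketch written for this illustration (its name and error handling are assumptions, not something from the text), applied to the numbers in Example 6.<br />

```python
def conditional_probability(p_a_and_b, p_b):
    """Return P(A | B) = P(A and B) / P(B).

    Raises ValueError when P(B) is zero, since conditioning on an
    event that cannot occur is undefined.
    """
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_a_and_b / p_b

# Example 6: 5% are unemployed and at or below the poverty level,
# and 8% are unemployed.
p_poverty_given_unemployed = conditional_probability(0.05, 0.08)
print(round(p_poverty_given_unemployed, 3))
```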
Conditional probability is quite useful when used in the correct way. The counterintuitive<br />
problem below will allow us to shed light on how important it really is to think about<br />
dependencies.<br />
Example 7: As part of a narcotics checkpoint, officers randomly search freight trucks for<br />
shipments of illegal drugs. The officers search a small number of crates in the trucks that are<br />
chosen for random inspection. Suppose that, unbeknownst to the officers, there are two trucks<br />
ahead, one of which contains one crate with illegal drugs. This truck has a total of 8 crates, while<br />
the truck without drugs has a total of 5 crates. One of the two trucks will be randomly chosen.<br />
What is the probability that the officers find the drugs?<br />
SOLUTION: At first, it is tempting to say that the probability is \(\frac{1}{13}\); however, this is not accurate.<br />
The probability that the officers find the crate with drugs is dependent on them choosing the<br />
correct truck first!<br />
Let \(T\) = the truck containing the drugs is chosen and \(C\) = the crate containing the drugs is chosen.<br />
Two things must happen: they must choose the correct truck and they must choose the correct<br />
crate. Randomly choosing one of the two trucks is equiprobable, \(P(T) = \frac{1}{2}\). If the correct truck is<br />
chosen, then the probability of choosing the correct crate is \(\frac{1}{8}\), that is, \(P(C \mid T) = \frac{1}{8}\).<br />
\[P(T \text{ and } C) = P(T) \cdot P(C \mid T) = \frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16} = 0.0625\]<br />
Why is it not valid to say 1/13? It might appear that probability is simply pulling a “fast one” on<br />
our intuition.<br />
A simple way to think about it is as follows: there is not just one random process here. If all the<br />
crates were in the same truck, there would indeed be a 1/13 chance that we’d get the right crate.<br />
However, there are two random processes here. If you don’t choose the correct truck, then<br />
choosing the correct crate is impossible. The likelihood of the second random process leading to<br />
the correct crate is indeed deeply affected by the outcome of the first random process!<br />
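One way to convince yourself that two random processes are at work is to simulate them. The sketch below is an illustration only: the function name, the seed, and the number of trials are arbitrary choices, and it assumes (as the solution does) that exactly one crate is inspected from whichever truck is picked.<br />

```python
import random

def simulate_checkpoint(trials, seed=0):
    """Estimate the chance of finding the drug crate over many simulated stops."""
    rng = random.Random(seed)
    found = 0
    for _ in range(trials):
        # First random process: pick one of the two trucks.
        truck = rng.choice(["drug_truck", "clean_truck"])
        if truck == "drug_truck":
            # Second random process: pick one of that truck's 8 crates.
            # Crate 0 stands in for the one crate holding drugs.
            if rng.randrange(8) == 0:
                found += 1
        # If the clean truck is picked, the drugs cannot be found.
    return found / trials

estimate = simulate_checkpoint(200_000)
exact = (1 / 2) * (1 / 8)
print(estimate, exact)
```

The simulated proportion should land close to 1/16 (0.0625) rather than 1/13 (about 0.077), though any single run will differ slightly from the exact value.<br />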
Example 8: Reconsider Example 7: Let’s say that the second truck had two crates with<br />
shipments of drugs. As before, one of the two trucks will be randomly chosen. What is the<br />
probability that the officers find the drugs?<br />
SOLUTION:<br />
This can happen in one of two ways:<br />
• the truck with 8 crates (\(T_1\)) is selected and the one correct crate is chosen OR<br />
• the truck with 5 crates (\(T_2\)) is selected and one of the two correct crates is chosen<br />
We will first create a small tree diagram showing the possible outcomes.<br />
The beauty of this diagram is that it displays the conditional probabilities on the right “stems” of<br />
the tree for each initial choosing of the truck.<br />
The probability that drugs are found would thus be:<br />
Truck 1: \(\frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16}\)<br />
Truck 2: \(\frac{1}{2} \cdot \frac{2}{5} = \frac{1}{5}\)<br />
Since these are distinct outcomes and cannot both occur (there is no overlap in the events), it is<br />
okay to add them:<br />
\[\frac{1}{16} + \frac{1}{5} = \frac{21}{80} \approx 0.26\]<br />
Thus, there is about a 26% chance that drugs are found between the two trucks. Again, note that the<br />
probability is not simply \(\frac{3}{13}\), as our intuition might falsely lead us to believe.<br />
To formalize the tree above,<br />
Let \(T_1\) = the truck with 8 crates is selected, \(T_2\) = the truck with 5 crates is selected, and \(D\) = drugs are found.<br />
\[P(D) = P\big((T_1 \text{ and } D) \text{ or } (T_2 \text{ and } D)\big)\]<br />
\[= P(T_1 \text{ and } D) + P(T_2 \text{ and } D) - P\big((T_1 \text{ and } D) \text{ and } (T_2 \text{ and } D)\big)\]<br />
Since only one truck will be chosen, the probability of finding drugs in T1 and T2 is 0.<br />
\[P(T_1 \text{ and } D) = P(T_1) \cdot P(D \mid T_1) = \frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16}\]<br />
\[P(T_2 \text{ and } D) = P(T_2) \cdot P(D \mid T_2) = \frac{1}{2} \cdot \frac{2}{5} = \frac{1}{5}\]<br />
Summing these together yields<br />
\(\frac{21}{80} \approx 0.26\), as with the tree diagram.<br />
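The branch-by-branch reasoning of Example 8 can be written as a short Python sketch. The data layout and names are invented for illustration, and it assumes, as in Example 7, that each truck is equally likely to be chosen and that one crate is inspected at random from the chosen truck.<br />

```python
from fractions import Fraction

# Each truck: how many crates it carries and how many hold drugs.
trucks = [
    {"name": "T1", "crates": 8, "drug_crates": 1},
    {"name": "T2", "crates": 5, "drug_crates": 2},
]

# Each truck is equally likely to be chosen.
p_truck = Fraction(1, len(trucks))

# For each branch: P(truck) * P(drugs found | truck).
branch_probs = {
    t["name"]: p_truck * Fraction(t["drug_crates"], t["crates"]) for t in trucks
}

# The branches cannot both happen, so their probabilities add.
p_found = sum(branch_probs.values())

# The tempting (but incorrect) pooled answer: 3 drug crates out of 13.
p_pooled = Fraction(3, 13)

print(branch_probs, p_found, float(p_found), float(p_pooled))
```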
Homework Problems - 3.4<br />
1. A deck of standard playing cards has 52 cards. There are four suits: clubs, diamonds,<br />
hearts, and spades. There are two colors of cards – red and black. Diamonds and hearts<br />
are red, and clubs and spades are black. The cards are labeled A (Ace), 2-10, J (Jack), Q<br />
(Queen), and K (King). To better visualize, consider the illustration below:<br />
Suppose you are given various conditions and that you must determine the probability of<br />
the specified draw on the next card. Use the card descriptions above to find the<br />
probability that: (Video Solution)<br />
a. Given that one Jack is removed, a Jack is drawn<br />
b. Given that all red cards are removed, a black card is drawn<br />
c. Given that a red Queen is removed, a red Queen is drawn<br />
d. Given that all red Queens are removed, a black Queen is drawn<br />
e. Given that all Kings are removed, a red card is drawn<br />
f. Given that all numerical red cards are removed, a King is drawn<br />
g. Given that a red King is removed, a black King is drawn<br />
2. An auto insurance company finds that there is an 18% chance that a teenager gets into a<br />
car accident between ages 16 and 19. There is a 34% chance that a teenager gets a traffic<br />
ticket during this same age range. They find that the chance of getting into a car accident<br />
and getting a traffic ticket (not necessarily because of the accident) is 10%. (Video<br />
Solution)<br />
a. Based on the probabilities provided, are the two events independent? Perform a<br />
calculation to justify your answer.<br />
b. Given that a teenager gets into an accident, what is the probability that he gets a<br />
traffic ticket?<br />
c. Why did the probability change in this way, as compared to the unconditional<br />
probability of getting a traffic ticket?<br />
d. Given that a teenager gets a traffic ticket, what is the probability that he gets into<br />
an accident?<br />
e. Explain, in practical terms, what your answer in d) means.<br />
3. Let , , and be events <strong>in</strong> a sample space. Do the follow<strong>in</strong>g: a) expla<strong>in</strong> whether or<br />
not the events are <strong>in</strong>dependent or dependent, and b) answer the questions below regard<strong>in</strong>g<br />
these events with the <strong>in</strong><strong>for</strong>mation provided. Assume the first event listed <strong>in</strong> each<br />
probability statement occurs first (e.g. ( ) means occurs first). (Video<br />
Solution)<br />
a. ( )<br />
b. ( )<br />
c. ( )<br />
( )<br />
( )<br />
( )<br />
( )<br />
( )<br />
( )<br />
4. Gregor Mendel was a monk who, in 1865, suggested a theory of inheritance based on the<br />
science of genetics. He identified heterozygous individuals for flower color that had two<br />
alleles (one r = recessive white color allele and one R = dominant red color allele). When<br />
these individuals were mated, ¾ of the offspring were observed to have red flowers and<br />
¼ had white flowers. The table summarizes this mating; each parent gives one of its<br />
alleles to form the gene of the offspring.<br />
            Parent 2<br />
Parent 1    r     R<br />
r           rr    rR<br />
R           Rr    RR<br />
We assume that each parent is equally likely to give either of the two alleles and that, if<br />
either one or two of the alleles in a pair is dominant (R), the offspring will have red<br />
flowers. (Problem source: Mathematical Statistics with Applications, 6th Ed., Wackerly,<br />
et al.) (Video Solution)<br />
a. What is the probability that an offspring has one recessive allele, given that the<br />
offspring has red flowers?<br />
b. What is the probability that an offspring has one dominant allele, given that the<br />
offspring has white flowers?<br />
c. What is the probability that an offspring has white flowers, given that it has one<br />
recessive allele?<br />
d. What is the probability that an offspring has white flowers, given that it has one<br />
dominant allele?<br />
e. What is the probability that an offspring has red flowers, given that it has one<br />
dominant allele?<br />
5. There are 5 candidates for 2 town council positions. Three of them are for the removal of<br />
a landfill just outside of the city limits. The same candidate cannot fill both seats. (Video<br />
Solution)<br />
a. What is the probability that one randomly chosen candidate in the group is for the<br />
removal of the landfill?<br />
b. Given that one of the positions is filled with a candidate in favor of the landfill<br />
removal, what is the probability that the second candidate chosen is also in favor?<br />
c. What is the probability that two candidates in favor of the landfill removal are<br />
chosen?<br />
d. What is the probability that only one seat is filled by a candidate in favor of the<br />
landfill removal?<br />
e. What is the probability that at least one seat is filled by a candidate in favor of the<br />
landfill removal?<br />
3.5 Combinations and Permutations<br />
Recall from Section 3.2 the problem faced by a corn growing business: the FDA determines that<br />
two of the 20 bushels are potentially contaminated with E. coli. Two bushels had been shipped<br />
out and the question was: what is the probability that both bushels that were shipped to the local<br />
grocer were uncontaminated?<br />
We wrote the simultaneous probability as<br />
\[P(\text{both uncontaminated}) = P(\text{1st uncontaminated}) \cdot P(\text{2nd uncontaminated} \mid \text{1st uncontaminated})\]<br />
Due to the fact that one of the uncontaminated bushels was removed from the “pool”, there was<br />
now only a 17/19 chance that the second uncontaminated bushel would be pulled. In short, we<br />
wrote:<br />
\[P(\text{both uncontaminated}) = \frac{18}{20} \cdot \frac{17}{19} = \frac{18 \cdot 17}{20 \cdot 19}\]<br />
We notice that the numerator and denominator both have a product of two sequential numbers.<br />
Had they shipped, say, four bushels, the probability that all four were uncontaminated would be:<br />
\[\frac{18 \cdot 17 \cdot 16 \cdot 15}{20 \cdot 19 \cdot 18 \cdot 17}\]<br />
As you might imagine, this pattern continues.<br />
How painful, though, would it be to have to multiply eight or nine probabilities of this nature<br />
together? You could certainly do it, but you might think, “It sure would be nice to take advantage<br />
of this pattern!” Well, we’re in luck!<br />
Let’s define an important term:<br />
A factorial is a descending product of whole numbers down to 1, beginning at a specified whole<br />
number. To start with a generic whole number, \(n\), we denote this product by \(n!\), and write:<br />
\[n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1\]<br />
Example 1: F<strong>in</strong>d .<br />
SOLUTION: By def<strong>in</strong>ition of factorial, we write<br />
This definition is great, but it still does not resolve our crisis: how do we multiply only a specific<br />
number of sequential whole numbers?<br />
Here’s a little trick: write the factorial out, then divide out the factors that are not needed. For us,<br />
this means:<br />
\[18 \cdot 17 = \frac{\overbrace{18 \cdot 17 \cdot 16 \cdot 15 \cdots 2 \cdot 1}^{18!}}{\underbrace{16 \cdot 15 \cdots 2 \cdot 1}_{16!}}\]<br />
But this is the same thing as:<br />
\[18 \cdot 17 = \frac{18!}{16!}\]<br />
In a similar way, we can write the denominator of our probability by:<br />
\[20 \cdot 19 = \frac{20!}{18!}\]<br />
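The factorial trick is easy to check numerically. The sketch below uses Python's built-in `math.factorial`; the variable names are illustrative, and the four-bushel case is included only to show that the same pattern extends.<br />

```python
import math
from fractions import Fraction

# Two bushels shipped: 18 * 17 over 20 * 19, written with factorials.
numerator = math.factorial(18) // math.factorial(16)    # 18 * 17
denominator = math.factorial(20) // math.factorial(18)  # 20 * 19
p_two_clean = Fraction(numerator, denominator)

# The same probability multiplied out step by step.
p_two_direct = Fraction(18, 20) * Fraction(17, 19)

# Four bushels shipped: 18 * 17 * 16 * 15 over 20 * 19 * 18 * 17.
p_four_clean = Fraction(
    math.factorial(18) // math.factorial(14),
    math.factorial(20) // math.factorial(16),
)

print(numerator, denominator, p_two_clean, p_four_clean)
```

Dividing one factorial by another cancels the unwanted trailing factors, which is exactly what "divide out the factors that are not needed" means.<br />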
Before we push this too far and get ourselves into a trap, let’s consider a different example with a<br />
smaller sample space.<br />
Suppose that there are only 3 bushels of corn and that only one is contaminated with E. coli.<br />
Again, let’s say that two are shipped out. Then,<br />
\[P(\text{both uncontaminated}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}\]<br />
If you recall the tabular approach to thinking about this, we might show the possibilities for<br />
uncontaminated bushels, U1 and U2, and the way in which they can appear:<br />
              2nd Bushel<br />
1st Bushel    U1          U2<br />
U1            (U1, U1)    (U1, U2)<br />
U2            (U2, U1)    (U2, U2)<br />
We know that the pairs (U1, U1) and (U2, U2) for the 1st and 2nd bushels cannot be possible,<br />
since that particular bushel is removed from the population. So, we denote that in the table by<br />
blacking-out those cells:<br />
              2nd Bushel<br />
1st Bushel    U1          U2<br />
U1            ■■■■■■■■    (U1, U2)<br />
U2            (U2, U1)    ■■■■■■■■<br />
Perfect! So we see the remaining two possibilities, right? Well, actually, is there a difference<br />
between (U2, U1) and (U1, U2)? Not unless those two bushels are actually different than one<br />
another! So, blacking out either one of these pairs leaves:<br />
              2nd Bushel<br />
1st Bushel    U1          U2<br />
U1            ■■■■■■■■    ■■■■■■■■<br />
U2            (U2, U1)    ■■■■■■■■<br />
One possibility!<br />
You might be wondering why we’re bothering with this if we’ve already found the probability.<br />
This is a good thing to wonder.<br />
Recall that a probability is the number of ways an event can happen divided by the total number<br />
of outcomes. To be consistent with this definition, we really should be putting 1 in the<br />
numerator. Does that mean we miscomputed the probability? Not in this particular example, but<br />
it can happen.<br />
To make our denominator consistent, let's look at the total number of possibilities for selecting<br />
bushels, adding in the contaminated bushel, C:<br />
1st Bushel \ 2nd Bushel | U1 | U2 | C<br />
U1 | (U1, U1) | (U1, U2) | (U1, C)<br />
U2 | (U2, U1) | (U2, U2) | (U2, C)<br />
C | (C, U1) | (C, U2) | (C, C)<br />
Again, it is not possible to select the same bushel twice, so we black out the diagonal:<br />
1st Bushel \ 2nd Bushel | U1 | U2 | C<br />
U1 | [blacked out] | (U1, U2) | (U1, C)<br />
U2 | (U2, U1) | [blacked out] | (U2, C)<br />
C | (C, U1) | (C, U2) | [blacked out]<br />
Are we done? Not unless we feel that (U2, U1) is different from (U1, U2). We notice that the<br />
three cells to the right of our blacked-out diagonal are duplicates of those to the left. Thus we can<br />
cross them out, as well:<br />
1st Bushel \ 2nd Bushel | U1 | U2 | C<br />
U1 | [blacked out] | [crossed out] | [crossed out]<br />
U2 | (U2, U1) | [blacked out] | [crossed out]<br />
C | (C, U1) | (C, U2) | [blacked out]<br />
This leaves us with three possibilities, only one of which contains no contaminated bushel. So, our probability should be:<br />
\[ P(\text{neither shipped bushel is contaminated}) = \frac{1}{3} \]<br />
Wait! This is the same as our earlier calculation of<br />
\[ \frac{2 \cdot 1}{3 \cdot 2} = \frac{2}{6} = \frac{1}{3} \]<br />
Since we get the same answer, one might think that it must not matter which approach we take.<br />
Many times, it doesn't; however, “many” is not satisfying enough, since this leaves us prone to<br />
mistakes under different circumstances.<br />
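The three-bushel example is small enough to check by brute force. The following is a minimal sketch in Python, offered only as an illustration: the labels U1, U2, and C come from the tables above, while the variable names are assumptions of this sketch. It lists every ordered pair and every unordered pair and computes the probability both ways.<br />

```python
from fractions import Fraction
from itertools import combinations, permutations

bushels = ["U1", "U2", "C"]  # two uncontaminated bushels and one contaminated bushel

# Ordered pairs: (U1, U2) and (U2, U1) are counted separately.
ordered = list(permutations(bushels, 2))
ordered_clean = [pair for pair in ordered if "C" not in pair]

# Unordered pairs: order differences are eliminated.
unordered = list(combinations(bushels, 2))
unordered_clean = [pair for pair in unordered if "C" not in pair]

p_ordered = Fraction(len(ordered_clean), len(ordered))        # 2 out of 6
p_unordered = Fraction(len(unordered_clean), len(unordered))  # 1 out of 3

print(p_ordered, p_unordered)
```

Both fractions reduce to the same value here because the same number of duplicates is removed from the numerator and from the denominator.<br />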
Let's analyze the full situation in two different ways. We found that if we don't eliminate order<br />
differences, then we can write the probability as:<br />
\[ \frac{18 \cdot 17}{20 \cdot 19} = \frac{18!/16!}{20!/18!} \]<br />
If we did (correctly) eliminate order differences, notice that we cut the number of possibilities in<br />
half, that is, divided by 2. You'll notice that 2 is the same thing as \(2!\). So, let's divide out<br />
the number of duplicates from top and bottom:<br />
\[ \frac{18!}{16!} \div 2! = \frac{18!}{16! \, 2!} \]<br />
And<br />
\[ \frac{20!}{18!} \div 2! = \frac{20!}{18! \, 2!} \]<br />
Which, in its final state, gives:<br />
\[ \frac{\dfrac{18!}{16! \, 2!}}{\dfrac{20!}{18! \, 2!}} \]<br />
This does look rather complicated, but remember that it follows from some fairly simple things<br />
that we have built up. Also notice that both the top fraction and the bottom fraction have a factor of \(\frac{1}{2!}\).<br />
Ah, yes! So that's why the order-not-eliminated and order-eliminated answers are the same:<br />
\[ \underbrace{\frac{18!/16!}{20!/18!}}_{\text{order not eliminated}} = \underbrace{\frac{18!/(16! \, 2!)}{20!/(18! \, 2!)}}_{\text{order eliminated}} \]<br />
While this works out beautifully in this example, it is not always true, and so we must take care<br />
to observe whether order is important. We will see examples later where this<br />
difference will come into play, but those situations are a bit more advanced.<br />
Let's simplify this horrid notation a bit. Suppose that there are a total of \(n\) items and that \(r\) of those<br />
are to be drawn.<br />
Permutation – Order Does Matter<br />
If order is not to be eliminated (in cases where order is important), then the number of ways to<br />
select \(r\) things from the given \(n\) is called a permutation and is denoted:<br />
\[ {}_nP_r = \frac{n!}{(n-r)!} \]<br />
NOTE: \((n-r)! \neq n! - r!\), that is, factorial is not distributable!! Subtract first, then use<br />
factorial.<br />
For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated<br />
bushels. According to our new notation, this can be written as:<br />
\[ {}_{18}P_2 = \frac{18!}{(18-2)!} = \frac{18!}{16!} \]<br />
And this is precisely what we have written for the numerator!<br />
For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels,<br />
since we want to know the total number of ways 2 objects can come out of 20.<br />
\[ {}_{20}P_2 = \frac{20!}{(20-2)!} = \frac{20!}{18!} \]<br />
And this is precisely what we have written for the denominator!<br />
In simplified notation,<br />
\[ P(\text{neither shipped bushel is contaminated}) = \frac{{}_{18}P_2}{{}_{20}P_2} = \frac{306}{380} \approx 0.805 \]<br />
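For readers who would rather check these counts in software, here is a minimal sketch in Python. It is an illustration only and assumes Python 3.8 or later, where math.perm(n, r) returns n!/(n-r)!; the variable names are chosen for this sketch.<br />

```python
import math

# Ordered ways to pick 2 of the 18 uncontaminated bushels: 18!/(18-2)!
numerator = math.perm(18, 2)      # 18 * 17 = 306

# Ordered ways to pick 2 of all 20 bushels: 20!/(20-2)!
denominator = math.perm(20, 2)    # 20 * 19 = 380

probability = numerator / denominator
print(numerator, denominator, round(probability, 4))

# Factorial is not distributable: (18 - 2)! is not the same as 18! - 2!
assert math.factorial(18 - 2) != math.factorial(18) - math.factorial(2)
```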
Calculator Clinic – Using Permutations<br />
To evaluate a permutation \({}_nP_r\),<br />
1. First enter \(n\) in your home screen.<br />
2. Go to MATH and move to the left to the PRB tab.<br />
3. Select 2: nPr. This will return you to your home screen.<br />
4. Enter \(r\) and press ENTER.<br />
TIP: Sometimes the value of the numerator or denominator is so large that the calculator<br />
throws an overflow error. It is advisable to enter the entire probability, numerator and<br />
denominator, in one expression to avoid this potential problem.<br />
Let's now consider the case where it is important to eliminate order differences.<br />
Combination – Order Does NOT Matter (Eliminated)<br />
If order is to be eliminated (in cases where order is not important), then the number of ways to<br />
select \(r\) things from the given \(n\) is called a combination and is denoted:<br />
\[ {}_nC_r = \frac{n!}{(n-r)! \, r!} \]<br />
NOTE: \((n-r)! \neq n! - r!\), that is, factorial is not distributable!! Subtract first, then use<br />
factorial. Additionally, the factorial of a product is not the product of factorials, that is,<br />
\((a \cdot b)! \neq a! \cdot b!\).<br />
For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated<br />
bushels, eliminating the number of repeats, which was 2, or \(2!\). According to our new notation,<br />
this can be written as:<br />
\[ {}_{18}C_2 = \frac{18!}{(18-2)! \, 2!} = \frac{18!}{16! \, 2!} \]<br />
And this is precisely what we have written for the numerator!<br />
For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels,<br />
since we want to know the total number of ways 2 objects can come out of 20, order aside.<br />
\[ {}_{20}C_2 = \frac{20!}{(20-2)! \, 2!} = \frac{20!}{18! \, 2!} \]<br />
And this is precisely what we have written for the denominator!<br />
In simplified notation,<br />
\[ P(\text{neither shipped bushel is contaminated}) = \frac{{}_{18}C_2}{{}_{20}C_2} = \frac{153}{190} \approx 0.805 \]<br />
Calculator Clinic – Using Combinations<br />
Follow the steps for finding permutations, but in Step 3, use 3: nCr instead.<br />
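The relationship between the two counts can also be checked in software. This is a minimal sketch in Python (3.8 or later, for math.comb and math.perm), not a replacement for the calculator steps; the variable names are assumptions of the sketch. It confirms that each unordered pair of bushels corresponds to 2! ordered pairs, so the factor cancels in the probability.<br />

```python
import math

unordered_top = math.comb(18, 2)      # 18!/(16! * 2!) = 153
unordered_bottom = math.comb(20, 2)   # 20!/(18! * 2!) = 190

ordered_top = math.perm(18, 2)        # 18!/16! = 306
ordered_bottom = math.perm(20, 2)     # 20!/18! = 380

# Each unordered pair corresponds to 2! = 2 ordered pairs.
assert ordered_top == unordered_top * math.factorial(2)
assert ordered_bottom == unordered_bottom * math.factorial(2)

# The factor of 2! cancels, so both approaches give the same probability.
p_unordered = unordered_top / unordered_bottom
p_ordered = ordered_top / ordered_bottom
print(unordered_top, unordered_bottom, round(p_unordered, 4))
```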
Example 2: Every week, Cori stops at Chipotle Mexican Grill for<br />
lunch with his colleagues. Each time, he drops a business card into<br />
the fishbowl for a chance to win lunch for his entire office. After the<br />
seventh visit, Cori begins to wonder about his chances of winning. He<br />
estimates that there are approximately 40 cards in the bowl. If two<br />
were to be drawn, what is the probability Cori wins both draws?<br />
SOLUTION: We first think about what it is that we need to know. Per the question asked, the<br />
event is that the first and second cards drawn are both Cori's.<br />
This event occurs when the 2 cards drawn both come out of the 7 he has put in thus far. Since the<br />
order in which his two cards are drawn doesn't matter (as the prize is the same), we would like to<br />
know the value of<br />
\[ {}_7C_2 = \frac{7!}{5! \, 2!} = 21 \]<br />
The sample space is simply the total number of outcomes. Two cards will be drawn from the<br />
stack of 40, and since order doesn't matter, the number of outcomes is<br />
\[ {}_{40}C_2 = \frac{40!}{38! \, 2!} = 780 \]<br />
Thus, the probability of this event is<br />
\[ P(\text{both cards are Cori's}) = \frac{{}_7C_2}{{}_{40}C_2} = \frac{21}{780} \approx 0.027 \]<br />
There is about a 3% chance that both of the cards drawn are Cori's.<br />
Example 3: Probability is often used in police investigations to help<br />
determine probable cause. Suppose that in a gang-related report it was<br />
stated that three gang members were spotted. In an interrogation room,<br />
20 gang members are suspects, three of whom are certain to have<br />
committed the crime. A detective has a suspicion that the three came<br />
from a gang of which 5 of its members are present. Just by chance,<br />
how likely is it that the three members came from the gang he believes to be behind the<br />
crime? Does this give him what you might consider “probable cause” to pursue the group?<br />
SOLUTION: The event is that the three criminals come from a group of five particular gang<br />
members. The number of ways this can happen is<br />
\[ {}_5C_3 = \frac{5!}{2! \, 3!} = 10 \]<br />
The total number of three-criminal groups that can be formed out of the 20 suspects is<br />
\[ {}_{20}C_3 = \frac{20!}{17! \, 3!} = 1140 \]<br />
This means,<br />
\[ P(\text{all three come from the presumed gang}) = \frac{10}{1140} \approx 0.009 \]<br />
There is only a 0.9% chance that the three gang members all come from the presumed gang. The<br />
detective should consider more evidence to narrow down the search before making<br />
assumptions.<br />
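Examples 2 and 3 share one pattern: count the unordered selections that fall entirely inside a particular group, then divide by the unordered selections from everyone. A minimal sketch in Python follows; the helper name probability_all_from_group is hypothetical and introduced only for this illustration.<br />

```python
import math

def probability_all_from_group(group_size, population_size, draws):
    """Chance that every one of `draws` unordered selections comes from the group."""
    favorable = math.comb(group_size, draws)
    total = math.comb(population_size, draws)
    return favorable / total

# Example 2: both of the 2 cards drawn are among Cori's 7, out of about 40 cards.
p_cori = probability_all_from_group(7, 40, 2)      # 21 / 780

# Example 3: all 3 culprits are among the 5 gang members, out of 20 suspects.
p_gang = probability_all_from_group(5, 20, 3)      # 10 / 1140

print(round(p_cori, 4), round(p_gang, 4))
```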
Example 4: A business creates a new system to keep track of client relations, such that<br />
information about the client and the particular orders placed can be accessed by a non-repeating,<br />
four-character code of letters and digits. For instance, KA23 and<br />
AK23 are possible codes. Any code containing only letters<br />
will be reserved for large clients. How many such codes of<br />
non-repeating letters can they make available, and,<br />
assuming all such codes will eventually be used up, what<br />
percentage of the company's clients will be considered<br />
large clients?<br />
SOLUTION: There are 26 letters in the alphabet and, of<br />
those, four will comprise a single, large-client code. There<br />
are \({}_{26}P_4 = \frac{26!}{22!} = 358{,}800\) different codes without the same<br />
letters being repeated, but where order does matter.<br />
In order to know what percentage (or probability) of the total number of possible codes this<br />
represents, we need to compute the total number of codes that can be formed, where no letter or<br />
number is repeated, but where order does matter. This is precisely what permutations are for.<br />
Since there are 26 letters and 10 numbers, a total of 36 different “symbols” can be selected from.<br />
The number of permutations is \({}_{36}P_4 = \frac{36!}{32!} = 1{,}413{,}720\)<br />
total different codes [1] without the same letters or numbers being repeated, but<br />
where order does matter.<br />
So, the percentage/probability, then, is:<br />
\[ P(\text{code contains only letters}) = \frac{358{,}800}{1{,}413{,}720} \approx 0.25 \]<br />
We conclude that about 25% of all clients (the large clients) will have completely alphabetical codes.<br />
[1] Notice that the increase in the number of possibilities after increasing the size of the sample space is not<br />
proportional to the size of the increase. The growth is much faster than linear.<br />
Example 5: In Example 4, it was necessary that letters and numbers not be repeated.<br />
Recalculate the number of large-client codes and the percentage of them by assuming that<br />
numbers and letters actually can be repeated.<br />
SOLUTION: Recall that a permutation or a combination is intended to handle situations in which<br />
repeats are not allowed. Recall from the beginning of this section that to find the number of ways<br />
in which two bushels of corn could be selected from a crop of 20 (and after one is selected, the<br />
sample space reduces in size), we wrote:<br />
\[ 20 \cdot 19 = 380 \]<br />
In this situation, we are allowing repeats. For the number of ways to form a 4-letter code, we<br />
have 26 possibilities for each character. That is 26 for the first, the second, the third, and the fourth.<br />
Crossing all of these possibilities gives:<br />
\[ 26 \cdot 26 \cdot 26 \cdot 26 = 26^4 = 456{,}976 \]<br />
Which we expect to be larger than in the previous example since we are allowing repeats.<br />
Similarly, the number of letter/number codes that are possible can be calculated by noting<br />
that, in general, each piece of the code has 36 possibilities. So,<br />
\[ 36 \cdot 36 \cdot 36 \cdot 36 = 36^4 = 1{,}679{,}616 \]<br />
The percentage/probability is<br />
\[ P(\text{code contains only letters}) = \frac{456{,}976}{1{,}679{,}616} \approx 0.27 \]<br />
The percentage changes to 27%; that is, about 27% of all codes will contain only letters.<br />
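Examples 4 and 5 can be compared side by side. The following minimal Python sketch is an illustration only (the variable names are assumptions); it computes the letter-only share of codes when repeats are not allowed and when they are.<br />

```python
import math

letters, symbols, length = 26, 36, 4  # 26 letters, 26 + 10 symbols, 4-character codes

# Example 4: no repeats, order matters, so permutations apply.
no_repeat_letters = math.perm(letters, length)   # 358,800
no_repeat_total = math.perm(symbols, length)     # 1,413,720

# Example 5: repeats allowed, so every position has the full set of choices.
repeat_letters = letters ** length               # 456,976
repeat_total = symbols ** length                 # 1,679,616

share_without_repeats = no_repeat_letters / no_repeat_total
share_with_repeats = repeat_letters / repeat_total
print(round(share_without_repeats, 4), round(share_with_repeats, 4))
```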
Moral of the Story with Counting<br />
Counting problems should be approached by first<br />
determining some key pieces of information:<br />
1. Are repeats/replacements allowed? If yes, permutations/combinations are likely<br />
the incorrect approach.<br />
2. Does order matter? If yes, permutations should be used. If no, combinations should<br />
be used.<br />
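The two questions above can be written as a small decision procedure. This is a sketch under stated assumptions: the function name count_selections and its parameters are hypothetical, and the case of unordered selections with repeats is deliberately left out because this section does not cover it.<br />

```python
import math

def count_selections(n, r, repeats_allowed, order_matters):
    """Count the ways to select r items from n, following the two questions above."""
    if repeats_allowed:
        if order_matters:
            return n ** r          # each of the r positions has all n choices
        raise ValueError("unordered selection with repeats is not covered in this section")
    if order_matters:
        return math.perm(n, r)     # permutation: n!/(n-r)!
    return math.comb(n, r)         # combination: n!/((n-r)! r!)

print(count_selections(26, 4, repeats_allowed=True, order_matters=True))
print(count_selections(26, 4, repeats_allowed=False, order_matters=True))
print(count_selections(20, 2, repeats_allowed=False, order_matters=False))
```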
You Might Be Wondering:<br />
You might be wondering why we must divide by \(r!\) to remove all repeats. This was<br />
probably somewhat obvious when working with two objects. Say there are 5 objects to<br />
select from. One is now gone, so for the second selection there are only 4. We proceed to<br />
cross out everything along and to the right of the diagonal, since those cells are either not<br />
possible or are repeats of cells to the left:<br />
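(X = crossed out) | Object 1 | Object 2 | Object 3 | Object 4<br />
Object 1 | X | X | X | X<br />
Object 2 | open | X | X | X<br />
Object 3 | open | open | X | X<br />
Object 4 | open | open | open | X<br />
Object 5 | open | open | open | open<br />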
We have essentially multiplied the first five possibilities by the next number of possibilities,<br />
which is only 4 (this is accounted for by crossing out the diagonal, since this subtracts out<br />
five possibilities to give \(5 \cdot 4 = 20\)), and then divided that result by 2, since half of the table is a<br />
repeat. That is,<br />
\[ \frac{5 \cdot 4}{2} = 10 \]<br />
What happens when we select a third object? We extend the above table as a multiple of 3,<br />
since there are three objects left. Each table represents a pairing with one of the three<br />
remaining objects, as shown in the upper-left corner:<br />
[Tables: three copies of the table above, labeled OBJECT 1, OBJECT 2, and OBJECT 3 in the upper-left corner, each with rows Object 1 through Object 5 and columns Object 1 through Object 4.]<br />
In the first table, we can cross out the first column (and first row, if it were there), since it is<br />
not possible to select object 1 for a third time. In the second table, we can cross out the<br />
second column/row and in the third table we can cross out the third column/row for the<br />
same reason as table 1.<br />
[Tables: the OBJECT 1, OBJECT 2, and OBJECT 3 tables again, with the columns and rows just described crossed out.]<br />
Also notice that the second column of table 1 and the last three rows of table 2 are the same:<br />
(1, 2, 3), (1, 2, 4), and (1, 2, 5). For a similar reason, the third column of table 1 can be<br />
crossed out, since it is a repeat of what we have in column 1 of table 3.<br />
[Tables: the OBJECT 1, OBJECT 2, and OBJECT 3 tables again, with the second and third columns of table 1 also crossed out.]<br />
Nothing else in table 1 can be eliminated, since (1, 4, 5) cannot be found in either of the two<br />
remaining tables (this is a unique characteristic of the bottom, right-most entry).<br />
In table 2, we will try to eliminate any entries that can be found in table 3. These<br />
eliminations will involve any entries that contain Object 3. We can do so with the (2, 1, 3)<br />
entry and the third column:<br />
[Tables: the OBJECT 1, OBJECT 2, and OBJECT 3 tables again, with the (2, 1, 3) entry and the third column of table 2 also crossed out.]<br />
Now, notice that we have 10 white spots left. This happens to be exactly one-third of what<br />
we had after we tripled the table. That is,<br />
\[ \underbrace{\frac{5 \cdot 4}{2}}_{\text{two objects}} \cdot \underbrace{\frac{3}{3}}_{\text{tripled, then one-third kept}} = \frac{5 \cdot 4 \cdot 3}{3 \cdot 2} = 10 \]<br />
Which can be simplified to,<br />
\[ {}_5C_3 = \frac{5!}{(5-3)! \, 3!} = 10 \]<br />
Selecting more items allows this process to repeat, ad nauseam, any number of times.<br />
Mathematicians discovered that this tabular process could be reduced into the formula we<br />
use for combinations. The general case (where we<br />
allow \(r\) to be any value between 0 and the number of items we have to choose from)<br />
tends to be discussed in more theoretical mathematics courses such as Discrete<br />
Mathematical Structures (our MAT227).<br />
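The crossing-out argument can be mirrored by listing the selections directly. The sketch below is a minimal Python illustration, with the objects simply numbered 1 through 5 as an assumption. It counts ordered and unordered selections of three objects and confirms that each unordered group appears 3! times among the ordered ones.<br />

```python
import math
from itertools import combinations, permutations

objects = [1, 2, 3, 4, 5]
r = 3

ordered = list(permutations(objects, r))     # (1, 2, 3) and (2, 1, 3) both appear
unordered = list(combinations(objects, r))   # each group of three appears once

# Every unordered group shows up r! times among the ordered selections.
assert len(ordered) == len(unordered) * math.factorial(r)
assert len(unordered) == math.comb(len(objects), r)

print(len(ordered), len(unordered))
```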
Homework Problems - 3.5<br />
1. If possible, give an imaginary (but realistic) scenario for each of the following. If not<br />
possible, state why.<br />
a.<br />
b.<br />
c.<br />
d.<br />
e.<br />
2. Your classmate was absent when permutations and combinations were covered. Explain when he<br />
should and when he should not use permutations and combinations. (Video<br />
Solution)<br />
3. A police officer has been brought before the court on accusations of racial profiling.<br />
This occurs when a person of a particular race has been pulled over or detained by<br />
the police due to his race. The officer stopped 2 vehicles out of 10 that passed<br />
through a freeway tollbooth. Both of the suspects were Asian and there were a total<br />
of 3 Asian drivers in the 10. (Video Solution)<br />
a. In how many ways could 2 drivers have been selected from the 10?<br />
b. In how many ways could 2 Asian drivers have been selected from the 3?<br />
c. How likely is it that the 2 selected drivers would both have been Asian if the<br />
stops were truly random?<br />
4. In the United States, 20 out of the 50 states spend more than 50% of their state park<br />
and recreation areas revenue on keeping the state park operable (SOURCE: 2012<br />
U.S. Statistical Abstract). Suppose a survey of 10 states is to be conducted next year<br />
to see if anything has changed. (Video Solution)<br />
a. In how many ways can 10 states be selected for the survey?<br />
b. In how many ways can 10 states be drawn so that all 10 are operating on<br />
more than 50% of their state park and recreation areas revenue?<br />
c. What is the probability that all 10 of the states drawn are operating on more<br />
than 50% of their state park and recreation areas revenue?<br />
5. Ten pieces of furniture are to be arranged in a long row in a furniture store. In how<br />
many ways can all 10 be arranged? (Video Solution)<br />
6. At Chandler-Gilbert Community College high-school math competitions, students<br />
enter into a raffle to win various prizes, including a graphing calculator. There are<br />
typically around 200 students. Suppose there are 5 different types of calculators to<br />
be given out and that the best is saved for last. (Video Solution)<br />
a. In how many ways can the prizes be distributed among the 200 students?<br />
b. Suppose a school has 5 attendees. In how many ways can all 5 students from<br />
this school win a calculator?<br />
c. What is the probability that all 5 students from this school win a calculator?<br />
7. A frequent concern of cautious consumers is the idea of the last four digits of a credit<br />
card number being displayed on receipts. Suppose a consumer has a Visa, which has a<br />
total of 16 digits, each of which can be between 0 and 9. For the sake of simplicity,<br />
suppose any combination is possible. A customer left the following receipt lying around<br />
and is now concerned about his identity: (Video Solution)<br />
a. First, how many different credit-card numbers are possible with 16 digits?<br />
b. How many different credit-card numbers can be arranged with 6781 as the last<br />
four digits?<br />
c. On any one guess by a potential thief, what is the probability that he correctly<br />
guesses this person's credit card number?<br />
3.6 Expected Value<br />
Imagine that you are an insurance salesperson with many years of experience. A<br />
new client has requested that your business provide him with auto insurance. He is 20<br />
years old and has never been in an accident before. Considering age alone, you<br />
look at industry data and find that, as recently as 2008,<br />
there was about a 15% chance that someone his age would get into an accident (SOURCE: U.S.<br />
Statistical Abstract, Table 1113). Using your own expertise you find that, of your 20-year-old<br />
clients, the typical accident payment for his particular make and model of vehicle is about<br />
$3,200. He brings forward a quote from another insurance agency for a $100/month premium<br />
with no deductible (nothing to pay when an accident does occur except the running premium).<br />
The question is, do you insure him?<br />
Let's look at the possibilities in tabular form. Since there's a 15% chance the driver will get<br />
into an accident, there is an 85% chance he won't (since it either does happen or it doesn't). If<br />
there is no accident, then the insurance company receives $1200 for the entire year. If an<br />
accident does occur, the insurer pays out $3200 (hence a negative effect), but still receives the<br />
year's premiums. Thus, the net difference is a $2000 loss, which the insurer is responsible for.<br />
Action | Likelihood | Monetary Value to Insurer<br />
Accident | 15% | -$2000<br />
No Accident | 85% | +$1200<br />
If we now consider 100 years, it is expected that in 15 of those years there would be an accident<br />
and in 85 of them there would be no accident, assuming the constant probability. That means the<br />
insurer would pay $2000 a total of 15 times and receive $1200 a total of 85 times. Let's consider<br />
the net difference:<br />
\[ 85(\$1200) + 15(-\$2000) = \$102{,}000 - \$30{,}000 = \$72{,}000 \]<br />
This amount looks very good! In fact, on average, the company received<br />
\(\$72{,}000 / 100 = \$720\) per year. This customer is definitely profitable to the company, in the long run. Of course,<br />
we know that an accident could occur the first year, in which case a $2000 loss would be incurred<br />
right away.<br />
Notice what we really did here. We took the sum of the amounts and divided by 100:<br />
\[ \frac{85(1200) + 15(-2000)}{100} \]<br />
By properties of a common denominator we can write:<br />
\[ \frac{85(1200)}{100} + \frac{15(-2000)}{100} \]<br />
\[ = \left(\frac{85}{100}\right)(1200) + \left(\frac{15}{100}\right)(-2000) \]<br />
\[ = (0.85)(1200) + (0.15)(-2000) = 720 \]<br />
In reality, we multiplied each monetary value by its respective probability. This idea is<br />
known as expected value, since it is what we expect to happen in the long run.<br />
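The two views of the insurance example, the 100-year average and the probability-weighted sum, can be placed next to each other. This minimal Python sketch uses only the figures stated above; the variable names are assumptions chosen for readability.<br />

```python
# Figures from the insurance example: a $1200 yearly premium and a $3200 payout.
yearly_premium = 1200
accident_payout = 3200
net_if_accident = yearly_premium - accident_payout   # -2000
net_if_no_accident = yearly_premium                  # +1200

# Long-run view: 15 accident years and 85 accident-free years out of 100.
total_over_100_years = 15 * net_if_accident + 85 * net_if_no_accident
average_per_year = total_over_100_years / 100

# Expected-value view: weight each outcome by its probability.
expected_per_year = 0.15 * net_if_accident + 0.85 * net_if_no_accident

print(total_over_100_years, average_per_year, round(expected_per_year, 2))
```

Both calculations describe the same quantity, which is why dividing the 100-year total by 100 and weighting by probabilities agree.<br />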
Expected Value and Random Variable<br />
Expected value is the expected, or average, quantity that should occur in the long run,<br />
provided that each quantity occurs with a certain probability.<br />
Suppose there are \(n\) quantities, \(x_1, x_2, \ldots, x_n\), each of which occurs with a certain<br />
probability, \(P(x_1), P(x_2), \ldots, P(x_n)\), respectively. Then the expected value, denoted \(E[X]\), is<br />
\[ E[X] = x_1 P(x_1) + x_2 P(x_2) + \cdots + x_n P(x_n) \]<br />
A capital \(X\) is used to denote what is called a discrete random variable, a variable that<br />
takes on one of \(n\) (a natural number of) values with a certain probability. This variable is<br />
defined by what it measures in the given situation.<br />
Importantly, \(P(x_1) + P(x_2) + \cdots + P(x_n) = 1\), that is, we must account for 100% of all possible<br />
outcomes in order for the expected value to be meaningful.<br />
An expected value is actually not something terribly new. To see this more explicitly,<br />
suppose a student earns three test scores: 95%, 80%, and 85%. Then the average<br />
percentage is:<br />
\[ \frac{95 + 80 + 85}{3} \approx 86.7 \]<br />
Observe that we can use properties of fractions to separate the sum as follows:<br />
\[ \left(\frac{1}{3}\right)(95) + \left(\frac{1}{3}\right)(80) + \left(\frac{1}{3}\right)(85) \]<br />
While one-third in this situation is not a probability (since the scores have already been<br />
earned), each score is “weighted” as one-third of the overall<br />
class grade.<br />
Example 1:<br />
A company sells consumer electronics, such as televisions, stereos, and<br />
computers. For each product, the company offers the consumer<br />
a warranty that protects against any problems that might occur within<br />
the first two years, with the exception of accidental damage<br />
and theft. For a particular television that runs $1200, it offers a<br />
2-year warranty for $175. The company<br />
determines that 3% of these televisions malfunction each year.<br />
Is the company offering the warranty at a profitable price?<br />
Explain your answer and define the random variable.<br />
SOLUTION: We should determine what will happen, on average. We first see that the<br />
warranty is a 2-year warranty and the defect rate is for one year. If 3% malfunction each<br />
year, then 6% of all televisions are expected to malfunction within the first two years.<br />
This means that the company will make $175 with a 94% probability and will lose<br />
$1200 - $175 = $1025 with a 6% probability, since it will still receive the payment, but will have to<br />
either replace the product or offer a credit to the consumer.<br />
Letting \(X\) be the amount the company gains from one warranty sale, the expected amount to be gained, or<br />
\(E[X]\), is<br />
\[ E[X] = (175)(0.94) + (-1025)(0.06) = 164.50 - 61.50 = 103 \]<br />
This means that, after selling this product for a while, it should earn an average of $103<br />
from each consumer that purchases the warranty. This is a profitable outcome.<br />
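The definition of expected value translates into a few lines of code. Below is a minimal sketch in Python applied to the warranty figures from Example 1; the function name expected_value and its tolerance for the probability check are assumptions of this sketch, not part of the text.<br />

```python
def expected_value(values, probabilities):
    """Return the sum of value * probability after checking the inputs."""
    if len(values) != len(probabilities):
        raise ValueError("each value needs exactly one probability")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must account for 100% of the outcomes")
    return sum(x * p for x, p in zip(values, probabilities))

# Example 1: gain $175 with probability 0.94, lose $1025 with probability 0.06.
warranty_value = expected_value([175, -1025], [0.94, 0.06])
print(round(warranty_value, 2))
```

The probability check reflects the requirement that all possible outcomes be accounted for before the expected value is meaningful.<br />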
Example 2: The Arizona Lottery has a number of different lottery games that a person<br />
can play. One in particular is Fantasy 5. The rules of the game are simple: pay $1 per<br />
ticket and select five numbers between 1 and 41. Five numbers are then selected at<br />
random. If you correctly selected two or more of these numbers, then you are<br />
considered a winner. The following table describes the likelihood of winning:<br />
(SOURCE: www.arizonalottery.com)<br />
The estimated jackpot for the Wednesday, August 17, 2011 lottery was $54,000. Is the<br />
game in your favor? Why or why not?<br />
SOLUTION:<br />
We must first consider the fact that these prizes do not take into account that $1 was lost to<br />
purchase the ticket; we should subtract $1 from each of the prizes. Additionally, we note<br />
that the probabilities do not add to 1:<br />
\[ \frac{1}{749{,}398} + \frac{1}{4163} + \frac{1}{119} + \frac{1}{11} \approx 0.0996 \]<br />
The remainder of the time, it is simply the case that $1 is lost:<br />
\[ 1 - 0.0996 = 0.9004 = \frac{9{,}004}{10{,}000} \]<br />
We rebuild the table to show all of the values and probabilities:<br />
\(x\) 53,999 499 4 0 -1<br />
\(P(x)\) 1/749,398 1/4163 1/119 1/11 9,004/10,000<br />
where \(x\) is the net amount won on one ticket (the prize less the $1 ticket price).<br />
The expected value is:<br />
\[ E[X] = 53{,}999\left(\tfrac{1}{749{,}398}\right) + 499\left(\tfrac{1}{4163}\right) + 4\left(\tfrac{1}{119}\right) + 0\left(\tfrac{1}{11}\right) + (-1)\left(\tfrac{9{,}004}{10{,}000}\right) \approx -0.67 \]<br />
This means that if one were to play time after time, taking into consideration the small<br />
likelihood of winning occasionally, one would be expected to lose, on average, $0.67 per<br />
ticket.<br />
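As a check on the arithmetic, the lottery table can be evaluated with exact fractions. This is a sketch<br />
that uses the values and probabilities from the table in the text; the variable names are assumptions.<br />

```python
from fractions import Fraction

# Net winnings (prize minus the $1 ticket) and their probabilities,
# taken from the table in the text.
fantasy5 = [
    (53_999, Fraction(1, 749_398)),
    (499, Fraction(1, 4163)),
    (4, Fraction(1, 119)),
    (0, Fraction(1, 11)),
    (-1, Fraction(9_004, 10_000)),
]

total_probability = sum(prob for _, prob in fantasy5)
lottery_ev = sum(value * prob for value, prob in fantasy5)

# The total is very close to 1; the last probability is rounded in the text.
print(float(total_probability))
print(round(float(lottery_ev), 2))
```

A negative expected value says the game is not in the player's favor, on average.<br />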
Notice that we represented the outcomes by using a table, in which we listed the outcomes,<br />
or the individual \(x\)'s, along with the probability that each occurs, \(P(x)\). This is one way in<br />
which to display a probability distribution, or how all probabilities are distributed among<br />
the various outcomes.<br />
Example 3: A fair, six-sided die is tossed repeatedly. The number of dots that are facing<br />
up after each throw is recorded. Define the random variable, find its probability<br />
distribution, and find and interpret the expected value of the random variable.<br />
SOLUTION: We define the random variable, \(X\) = the number of dots facing up after a toss.<br />
The different values that \(X\) can take on are \(x = 1, 2, 3, 4, 5, 6\), since we know there are six<br />
sides. Since this is a fair die, each of these six outcomes has an equally likely chance of<br />
appearing, so \(P(x) = 1/6\), for all values, \(x\), of \(X\). Our probability distribution is thus,<br />
\(x\) 1 2 3 4 5 6<br />
\(P(x)\) 1/6 1/6 1/6 1/6 1/6 1/6<br />
The expected value is the sum of the products of each outcome value and its associated<br />
probability.<br />
\[ E[X] = 1\left(\tfrac16\right) + 2\left(\tfrac16\right) + 3\left(\tfrac16\right) + 4\left(\tfrac16\right) + 5\left(\tfrac16\right) + 6\left(\tfrac16\right) = 3.5 \]<br />
The average value of a die that is repeatedly tossed will be 3.5. If we were to conduct a<br />
simulation we would probably see something similar to the one in the introductory section of this<br />
chapter:<br />
[Figure: Average Die Roll Outcome (vertical axis, 0 to 6) versus Number of Times Die Has Been Tossed (horizontal axis, 0 to 140)]<br />
As time passes, we see that the average roll becomes more stable and seems to be approaching<br />
3.5, as we have shown mathematically.<br />
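The stabilizing average can be reproduced with a short simulation. The sketch below is an illustration<br />
under stated assumptions: the helper `running_averages`, the seed, and the number of rolls are all<br />
hypothetical choices, not the author's original simulation.<br />

```python
import random
from fractions import Fraction

# Exact expected value of one roll of a fair six-sided die.
exact_ev = sum(Fraction(face, 6) for face in range(1, 7))

def running_averages(num_rolls, seed=0):
    """Simulate die rolls and return the average after each roll."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    total = 0
    averages = []
    for roll_number in range(1, num_rolls + 1):
        total += rng.randint(1, 6)
        averages.append(total / roll_number)
    return averages

averages = running_averages(10_000)
print(float(exact_ev))  # the theoretical value
print(averages[-1])     # the simulated average, close to the theoretical value
```

Early entries of `averages` jump around; later entries settle near the expected value, which is the pattern the plot describes.<br />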
Example 4: In hopes of understanding the directions in which married couples are naturally<br />
inclined to walk at an outdoor mall in Arizona, a marketing group conducts a study. It is the<br />
experience of the mall that men and women tend to walk in different directions once they<br />
park (catching up later). The first question is: how many individuals within a couple can<br />
they expect to start their walk through a street that has one or more clothing stores?<br />
SOLUTION: We first note that there are three paths out of five with one or more clothing stores.<br />
We assume there are two people per couple and that each takes a different initial route. The<br />
random variable we are interested in is:<br />
\(X\) = the number of individuals in the couple who take a route with a clothing store.<br />
The random variable can take on values \(x = 0, 1, 2\), since it is possible that neither of them<br />
takes a clothing store route, only one does, or both do.<br />
We need to find the probability for each of the three events.<br />
\(x = 0\) individuals taking a route with a clothing store would occur when, from the three clothing<br />
store routes, none are selected, and both routes without clothing stores are selected. We then<br />
must compare this to the number of ways two routes can be chosen from five. That is,<br />
\[ P(0) = \frac{\binom{3}{0}\binom{2}{2}}{\binom{5}{2}} = \frac{1}{10} \]<br />
Similarly, for \(x = 1\), we want to know how many ways one clothing-store route and one<br />
non-clothing-store route can be selected. That is,<br />
\[ P(1) = \frac{\binom{3}{1}\binom{2}{1}}{\binom{5}{2}} = \frac{6}{10} \]<br />
For \(x = 2\),<br />
\[ P(2) = \frac{\binom{3}{2}\binom{2}{0}}{\binom{5}{2}} = \frac{3}{10} \]<br />
Our probability distribution is:<br />
\(x\) 0 1 2<br />
\(P(x)\) 1/10 6/10 3/10<br />
We can see that the probabilities sum to 1, which suggests that we have accounted for all<br />
possibilities.<br />
The number of individuals expected to take a clothing store route is the expected value of this<br />
distribution,<br />
\[ E[X] = 0\left(\tfrac{1}{10}\right) + 1\left(\tfrac{6}{10}\right) + 2\left(\tfrac{3}{10}\right) = 1.2 \]<br />
Thus, it can be expected that, on average, 1.2 people per couple (slightly more than one person)<br />
will walk along a route that contains a clothing store.<br />
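The counting argument for the mall routes can be sketched with Python's `math.comb`. The constant and<br />
function names below are illustrative assumptions; the route counts (three with clothing stores, two<br />
without) come from the solution.<br />

```python
from fractions import Fraction
from math import comb

CLOTHING_ROUTES = 3  # routes with one or more clothing stores
OTHER_ROUTES = 2     # routes without clothing stores
PEOPLE = 2           # each person takes a different route

def route_probability(x):
    """P(exactly x of the two people take a clothing-store route)."""
    ways = comb(CLOTHING_ROUTES, x) * comb(OTHER_ROUTES, PEOPLE - x)
    return Fraction(ways, comb(CLOTHING_ROUTES + OTHER_ROUTES, PEOPLE))

distribution = {x: route_probability(x) for x in range(PEOPLE + 1)}
expected_people = sum(x * p for x, p in distribution.items())

for x, p in distribution.items():
    print(x, p)
print(float(expected_people))
```

Note that `Fraction` reduces 6/10 to 3/5 when printing; the values agree with the table.<br />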
One additional way to represent a probability distribution is by using a probability histogram.<br />
A histogram looks similar to a bar graph, except that it has a numerical horizontal axis and<br />
measures the probability along the vertical axis. Additionally, the bars touch in order to show<br />
continuity, where applicable. For the above situation, we would expect to see:<br />
[Figure: probability histogram "Clothing Store Route Probabilities"; horizontal axis: Number of Individuals (0, 1, 2); vertical axis: Probability (0 to 0.7)]<br />
This is a convenient visual way to view the distribution of probabilities. It is clear to us that it is<br />
quite unlikely (a probability of only 0.1) that neither of the individuals in the couple will walk a<br />
route with a clothing store.<br />
Homework Problems - 3.6<br />
1. While working in downtown Phoenix, the author tracked the number of minutes that the Blue Line<br />
bus going through downtown Phoenix, AZ was late in arriving at a specific bus stop. He<br />
discovered the following: (Video Solution)<br />
\(x\) (minutes late) On time 1 2 3 4<br />
\(P(x)\) 0.53 0.25 0.18 0.03 0.01<br />
a. Construct a probability histogram.<br />
b. What does the probability histogram reveal?<br />
c. Find and interpret the expected value of the random variable.<br />
(SOURCE: Author's data)<br />
2. A Geico auto insurance policy for a 21-year-old Chandler male driver of a 2012 BMW<br />
M5 with no previous tickets has a semi-annual premium of $312.41. In the instance of an<br />
accident, there is a $1,000 deductible that the policyholder must pay before insurance will<br />
cover the damages (SOURCE: www.geico.com). The vehicle costs about $115,000 to<br />
replace. From past experience, suppose Geico knows there is a 2.5% chance (annually)<br />
that this situation will result in an accident. Find the expected payout for Geico and<br />
comment on its profitability in a situation like this. (Video Solution)<br />
3. An insurance policy pays $100 per day for up to 3 days of hospitalization and $50 per<br />
day for each day of hospitalization thereafter. (Video Solution)<br />
The number of days of hospitalization, \(X\), is a random variable with probability given by<br />
the function<br />
\(P(x) = \{ \ldots \)<br />
a. Define the random variable.<br />
b. Give the probability distribution for \(X\) by using a probability histogram.<br />
c. What does the probability histogram tell you about hospitalization?<br />
d. Determine the expected payment for hospitalization under this policy.<br />
(SOURCE: Society of Actuaries (SOA), Spring 2003 Exam P, #36)<br />
4. You work on a dairy farm and are in charge of quality control for eggs. Your primary<br />
concern is that broken eggs do not go out. You know from past experience that about<br />
25% of the outgoing boxes contain one or more broken eggs (based on complaints). If a<br />
local restaurant purchases 4 boxes of eggs from you, what is the expected number of<br />
boxes with broken eggs that this vendor should receive? (Video Solution)<br />
5. At a major seafood restaurant, shrimp fettuccini is a popular dish. The company is<br />
considering adding a family-sized fettuccini dish, but would first like to make sure that it<br />
will be a profitable endeavor. The company randomly surveys customers who<br />
purchase the original $14.99 dish and finds that 15% would purchase the larger family<br />
dish. What should they charge for the family-sized dish so that average revenue from<br />
shrimp fettuccini will be $17.00? (Video Solution)<br />
Chapter 4<br />
Discrete Probability Distributions<br />
It might seem paradoxical to say that uncertainty occurs in certain ways, but the truth is that it<br />
does, provided certain assumptions are satisfied. As we build a probability distribution,<br />
whether in the form of a table or histogram, we can often save ourselves a lot of labor by<br />
focusing on the type of experiment that lies before us. The purpose of this chapter is to<br />
(hopefully) simplify some of our efforts.<br />
4.1 The Binomial Distribution<br />
1.1.1 Why Probability Distributions Are Useful<br />
Suppose a friend of yours, let's call him Kyle, tells you that his brother is 6-feet, 9-inches tall.<br />
You are most likely wide-eyed and surprised by what he just told you.<br />
Why is this?<br />
You likely have some idea of how tall people generally are. You would probably consider a<br />
height of 6-feet, 9-inches to be uncommon in the environment you're used to. In fact, you might<br />
even go as far as to call this height an outlier, or a value that falls outside the usual data range.<br />
How can you be absolutely sure that this height is uncommon? What if you live in a region that<br />
tends to have shorter people?<br />
The statistician would say that it would be nice to see a probability distribution associated with<br />
heights of all people living in the region, state, country, or continent in which you live. She<br />
would argue that, if you are trying to describe the people in the U.S. based on people living in<br />
Arizona, you are drawing from a biased sample.<br />
While we will not discuss continuous random variables here (variables that can take on any<br />
number in a specified range), we will show a theoretical distribution for heights in the U.S.<br />
below:<br />
For men, we see that the most frequently occurring height is near 70 inches (5-feet, 10-inches). It<br />
is very uncommon to have someone who is 81 inches tall (6-feet, 9-inches). This type of<br />
information allows us to conclude that your friend's brother is indeed very tall.<br />
You might be wondering how we know that the shapes of the distributions should look like bells.<br />
This is based on the data collection process. It is common in nature for distributions to have a<br />
heavily loaded center with lower frequencies out towards the left and right tails. While the<br />
histogram of all heights might not have a perfect bell shape as we indicate, having this shape<br />
allows us to use mathematics to model the curve.<br />
Although many variables do take on a continuous set of values, we will begin with discrete<br />
random variables, as these are slightly simpler to describe.<br />
1.1.2 The Binomial Distribution<br />
When we talk about any variable that can take on a finite (as opposed to infinite) number of<br />
possibilities, we are dealing with a discrete random variable.<br />
Specifically, a binomial random variable counts the successes in a series of trials, each of which<br />
takes on one of two possible values, as indicated by the prefix “bi.” We will simply refer to the<br />
outcome of each trial as either a “success” or a “failure.”<br />
Consider this example: let's say that you and a friend are tossing a coin (since this is one of the<br />
most exciting things to do). Your friend tosses 9 heads out of 10 tosses. Curious about this, you<br />
begin to analyze the results: how likely is it that this type of event could take place?<br />
By letting \(H\) and \(T\) represent the events that a head/tail is facing up on a coin toss, respectively,<br />
we know that one possible way in which this can happen is:<br />
\(HHHHHHHHHT\)<br />
The probability of this particular sequence of 9 heads and 1 tail is:<br />
\[ \left(\tfrac12\right)^9 \left(\tfrac12\right)^1 \approx 0.000977 \]<br />
This is definitely a small probability, but it is not the only way in which this can happen. The tail<br />
can occur first, second, third, fourth, etc., with heads all around it. Another one would be:<br />
\(HHHHHHHHTH\)<br />
The probability of this sequence is the same: 9 heads, 1 tail. This is okay, since the probability of<br />
tossing a certain sequence does not affect the probability of getting a head or tail on the next toss.<br />
So,<br />
\[ P(HHHHHHHHTH) = \left(\tfrac12\right)^8 \left(\tfrac12\right)^1 \left(\tfrac12\right)^1 = \left(\tfrac12\right)^9 \left(\tfrac12\right)^1 \approx 0.000977 \]<br />
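Not surprisingly, there are 8 more places for the tail to have appeared. We'll summarize in the<br />
table below:<br />
Arrangement of 9 \(H\), 1 \(T\) | Probability<br />
\(THHHHHHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HTHHHHHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHTHHHHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHTHHHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHTHHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHHTHHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHHHTHHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHHHHTHH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHHHHHTH\) | \((1/2)^9 (1/2)^1\)<br />
\(HHHHHHHHHT\) | \((1/2)^9 (1/2)^1\)<br />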
Since these are 10 distinct ways of getting this outcome, each with probability 0.000977 (that is,<br />
each takes up 0.0977% of the entire sample space), the probability of getting 9 heads and 1 tail<br />
is:<br />
\[ 10(0.000977) \approx 0.00977 \]<br />
As suspected, this particular event is not very likely.<br />
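The count of arrangements can also be confirmed by brute force. The following sketch simply lists every<br />
possible sequence of 10 tosses and keeps those with exactly 9 heads; the variable names are assumptions<br />
made for illustration.<br />

```python
from itertools import product

# Every possible sequence of 10 tosses of a fair coin ("H" or "T").
sequences = list(product("HT", repeat=10))

# Keep the sequences with exactly 9 heads (and therefore 1 tail).
nine_heads = [seq for seq in sequences if seq.count("H") == 9]

# Each individual sequence has probability (1/2)^10.
probability_each = 0.5 ** 10
probability_nine_heads = len(nine_heads) * probability_each

print(len(sequences), len(nine_heads))
print(probability_nine_heads)
```

Enumeration works here because there are only 2^10 sequences; for larger experiments the counting<br />
shortcut developed next is far more practical.<br />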
What if we complicated the problem a little more and asked, what would be the probability of<br />
having two tails mixed up in 10 total tosses?<br />
This gets more complicated, since the two tails could occur one after another, two tosses apart,<br />
three tosses apart, etc. To simplify our lives, it can be shown that the total number of ways in<br />
which a given number of binary “successes” can occur is found with the following combination:<br />
\[ \binom{\text{number of trials}}{\text{number of successes}} \]<br />
So, we had 10 trials and wanted to know the number of ways in which 9 heads (successes) can be<br />
included in the mix. We have:<br />
\[ \binom{10}{9} = 10 \]<br />
Then, we simply need to find the probability of just one of those arrangements and multiply it by<br />
the number of different arrangements.<br />
Since we defined a head as a success, what we just calculated was:<br />
\[ \binom{\text{trials}}{\text{successes}} \, P(\text{success})^{\text{successes}} \, P(\text{failure})^{\text{trials} - \text{successes}} \]<br />
At first glance, it might seem a little confusing that the second exponent is the number of trials<br />
less the number of successes.<br />
Why is this?<br />
Suppose there are 10 trials and you want 6 successes. This necessarily means that the other 4<br />
trials would result in failures. This is precisely \(10 - 6 = 4\), or the number of trials less the<br />
number of successes.<br />
Let's make this formula easier to consider. First off, let's define some variables:<br />
Let \(n\) = the number of trials, \(x\) = the number of successes, and \(p\) = the probability of a success<br />
on any one trial.<br />
Now, in any event, success and failure make up the whole sample space. That is:<br />
\(\{\text{success}, \text{failure}\}\)<br />
Since they make up the sample space,<br />
\[ P(\text{success}) + P(\text{failure}) = 1 \]<br />
So,<br />
\[ P(\text{failure}) = 1 - P(\text{success}) \]<br />
\[ P(\text{failure}) = 1 - p \]<br />
We rewrite our formula with the above defined components:<br />
\[ \binom{n}{x} p^x (1 - p)^{n - x} \]<br />
This is known as the binomial probability density function, or binomial pdf.<br />
To make this more clear, we first define a random variable, \(X\). In the case of a binomial<br />
experiment (one in which there are two possible outcomes for each trial), \(X\) takes its values from<br />
the set listing all possible numbers of successes that can be achieved (between 0 and the number of trials).<br />
For example, if \(X\) is the number of heads in \(n = 10\) coin tosses, then \(X\) takes values in<br />
\(\{0, 1, 2, \ldots, 10\}\). That is, between 0 and 10 heads can possibly<br />
be achieved in 10 tosses of the coin (though not all have the same probability). To indicate a<br />
binomial pdf calculation, we often write:<br />
The probability that \(X\) takes on \(x\) successes is \(\binom{n}{x} p^x (1 - p)^{n - x}\), or,<br />
\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \]<br />
We summarize a binomial pdf below, along with the necessary assumptions to use this.<br />
Binomial Probability Density Function (pdf)<br />
If the following assumptions are met:<br />
1) An experiment is carried out with \(n\) trials,<br />
2) Each trial can result in only one of two possible values: a success or a failure,<br />
3) The probability of a success in each trial is \(p\) (it is always the same), and<br />
4) Each trial is independent of all other trials (the outcome of one trial in no way affects the<br />
outcome of any other trial),<br />
then the experiment is a binomial experiment and the probability of \(x\) successes can be<br />
calculated by<br />
\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \]<br />
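The binomial pdf translates almost directly into code. The sketch below is one possible implementation;<br />
the function name `binomial_pdf` and its argument order are assumptions made for illustration, and the<br />
code presumes the four assumptions above already hold.<br />

```python
from math import comb

def binomial_pdf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    if not 0 <= x <= n:
        return 0.0  # impossible number of successes
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The coin example from the text: 9 heads in 10 tosses of a fair coin.
p_nine_heads = binomial_pdf(9, 10, 0.5)
print(round(p_nine_heads, 5))
```

Here `comb(n, x)` supplies the number of arrangements and the two powers supply the probability of any<br />
single arrangement.<br />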
Example 1:<br />
A fair, two-sided coin is tossed 10 times. The goal is to get 8 heads.<br />
a) In how many different ways can this event occur?<br />
b) Verify that all assumptions are met to conduct a binomial experiment.<br />
c) What is the probability of this event?<br />
SOLUTION:<br />
a) Since there are 10 trials and 8 successes desired, there are:<br />
\[ \binom{10}{8} = 45 \text{ ways} \]<br />
b)<br />
1) There are \(n = 10\) trials<br />
2) Each outcome is either a head (success) or a tail (failure)<br />
3) The probability of success on any trial is \(p = 0.5\)<br />
4) One toss does not influence the outcome of any other toss<br />
Thus, all assumptions have been met.<br />
c)<br />
\[ P(X = 8) = \binom{10}{8} (0.5)^8 (0.5)^2 \approx 0.044 \]<br />
Thus, there is about a 4.4% chance of tossing 8 heads in 10 tosses.<br />
The fact that the probability of getting 8 heads in 10 tosses is higher than getting 9 heads in 10<br />
tosses should not surprise us. Getting 9 heads is a rather extreme request. Getting 8 heads, while<br />
still extreme, is a bit more likely.<br />
Let's now build the probability distribution histogram for \(X\). We first display the probabilities in<br />
a table below by applying the binomial pdf:<br />
Successes Probability<br />
0 0.001<br />
1 0.010<br />
2 0.044<br />
3 0.117<br />
4 0.205<br />
5 0.246<br />
6 0.205<br />
7 0.117<br />
8 0.044<br />
9 0.010<br />
10 0.001<br />
Does this match our expectations? The table indicates that getting 5 heads has the highest<br />
likelihood of all 11 possible events. Even more importantly, the probability of getting between 4<br />
and 6 heads in 10 tosses is \(0.205 + 0.246 + 0.205 = 0.656\). The probability of getting very few<br />
or very many successes is quite small. This data is displayed in the histogram below:<br />
[Figure: probability histogram "Tossing X Heads in 10 Tosses"; horizontal axis: Successes; vertical axis: Probability (0.000 to 0.300)]<br />
This further validates our argument above.<br />
Additionally, note that the event probabilities sum to 1. This is necessary and<br />
important in describing the distribution.<br />
Sum of Success Probabilities in a Binomial Experiment<br />
With \(n\) trials in a binomial experiment, the probabilities of 0 up to \(n\) successes must<br />
together constitute the sample space and hence sum to 1.<br />
That is,<br />
\[ P(X = 0) + P(X = 1) + P(X = 2) + \cdots + P(X = n) = 1 \]<br />
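The coin-toss table and the sum-to-1 property can both be reproduced in a few lines. This is a sketch only;<br />
the names `table`, `total`, and `middle` are illustrative assumptions.<br />

```python
from math import comb, isclose

n, p = 10, 0.5

# Probability of each possible number of heads, 0 through 10.
table = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

for x, prob in table.items():
    print(f"{x:>2}  {prob:.3f}")

total = sum(table.values())               # should equal 1
middle = table[4] + table[5] + table[6]   # P(4 <= X <= 6)
print(total, round(middle, 3))
```

Summing the unrounded probabilities for 4, 5, and 6 heads gives a value that rounds to the 0.656 found<br />
from the table.<br />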
Example 2: A fair, 6-sided die is rolled 8 times. The goal is to roll a 1 or a 2 four times during<br />
the experiment.<br />
a) Is this a binomial experiment?<br />
b) In how many different ways can this event occur?<br />
c) What is the probability of this event?<br />
SOLUTION:<br />
a) A success is classified as rolling a 1 or a 2. A failure is classified as rolling a 3, 4, 5, or 6.<br />
Thus, \(p = 2/6 = 1/3\). There are \(n = 8\) trials and the probability of a success is always \(1/3\), since<br />
the 8 rolls are independent. Thus, this is indeed a binomial experiment.<br />
b) It is possible to have four successes occur in \(\binom{8}{4} = 70\) different ways.<br />
c) Let \(X\) be the number of successes possible. Then \(X\) takes values in \(\{0, 1, 2, \ldots, 8\}\).<br />
\[ P(X = 4) = \binom{8}{4} \left(\tfrac13\right)^4 \left(1 - \tfrac13\right)^{8 - 4} \]<br />
\[ = \binom{8}{4} \left(\tfrac13\right)^4 \left(\tfrac23\right)^4 \approx 0.171 \]<br />
There is about a 17% chance of getting a 1 or 2 on four out of 8 die rolls.<br />
A question that follows from Example 2 is: what does the distribution look like? Let's develop<br />
the distribution in tabular form first. To do this, we calculate binomial probabilities for each of<br />
the 9 possible outcomes (anywhere between 0 and 8 successes possible).<br />
Successes Probability<br />
0 0.039<br />
1 0.156<br />
2 0.273<br />
3 0.273<br />
4 0.171<br />
5 0.068<br />
6 0.017<br />
7 0.002<br />
8 0.000<br />
We see clearly that the number of successes with the highest probability is 2 or 3. The histogram<br />
follows:<br />
[Figure: probability histogram "Rolling a 1 or 2 in 8 Die Rolls"; horizontal axis: Successes; vertical axis: Probability (0.000 to 0.300)]<br />
Notice that this distribution is not symmetric. It is said to be skewed to the right, since<br />
the distribution has its probabilities heavily concentrated towards the left and so has a tail to the<br />
right (hence the name).<br />
Distribution Types<br />
There are three types of single-peaked (called unimodal) distributions: skewed to the left, symmetric,<br />
and skewed to the right, as illustrated below:<br />
1.1.3 Expected Value<br />
Expected Value of a Binomial Random Variable<br />
It can be shown that the expected value of \(X\), or the average number of successes we expect to<br />
see, given that \(X\) is a binomial random variable, is:<br />
\[ E[X] = np \]<br />
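The shortcut \(E[X] = np\) can be compared with the definition of expected value (the sum of each value<br />
times its probability). The sketch below does this with exact fractions for the two experiments discussed<br />
earlier; the function name `binomial_mean` is an assumption for illustration.<br />

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Expected value computed directly as the sum of x * P(X = x)."""
    return sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

# Rolling a 1 or 2 in 8 die rolls: n = 8, p = 1/3.
die_mean = binomial_mean(8, Fraction(1, 3))
# Heads in 10 tosses of a fair coin: n = 10, p = 1/2.
coin_mean = binomial_mean(10, Fraction(1, 2))

print(die_mean, Fraction(8) * Fraction(1, 3))
print(coin_mean, Fraction(10) * Fraction(1, 2))
```

In both cases the long sum and the product \(np\) agree, which is what the formula claims.<br />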
Example 3: Pristine Air Conditioning uses a digital phonebook to call homeowners in a large<br />
city regarding a $55.99 A/C maintenance special. In an hour, a telemarketer can make about<br />
10 calls. If the probability that a randomly called homeowner signs up for the maintenance<br />
special is 0.40,<br />
a. What is the probability that a telemarketer gets at least 80% of his hourly customers<br />
to sign up?<br />
b. Represent this probability in a histogram.<br />
c. Find and explain the expected value of the random variable.<br />
SOLUTION:<br />
a) We first need to determine whether or not this is a binomial experiment. Since the<br />
probability of success is 0.40 on every one of the 10 trials, and we assume that the population is<br />
large enough that removing one called homeowner from the callable pool does not<br />
significantly change that probability, we conclude<br />
that this is a binomial experiment. Thus, \(X\) = the number of called homeowners that<br />
accept the offer.<br />
We want to know the probability of getting business from 8, 9, or all 10 of the called<br />
individuals. We want:<br />
\[ P(X \ge 8) = P(X = 8) + P(X = 9) + P(X = 10) \]<br />
because each of these accounts for disjoint pieces of the sample space.<br />
With \(n = 10\) and \(p = 0.40\), we have:<br />
\[ \binom{10}{8}(0.40)^8(0.60)^2 + \binom{10}{9}(0.40)^9(0.60)^1 + \binom{10}{10}(0.40)^{10}(0.60)^0 \approx 0.0123 \]<br />
Thus, there is only about a 1.23% chance that the A/C company gets the business of 80%<br />
or more of the homeowners called.<br />
b) The histogram is below. The probability we are looking at is the sum of the probabilities for<br />
more than 7 successes:<br />
c) The expected value is \(E[X] = np = 10(0.40) = 4\). Thus, we expect that each hour 4 out of 10<br />
homeowners accept the maintenance offer.<br />
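The "at least 8 of 10" probability is a sum of three binomial terms, which is easy to verify. This sketch<br />
repeats a small `binomial_pdf` helper (a hypothetical name) so that it runs on its own.<br />

```python
from math import comb

def binomial_pdf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.40

# P(X >= 8) = P(X = 8) + P(X = 9) + P(X = 10)
at_least_eight = sum(binomial_pdf(x, n, p) for x in range(8, n + 1))
expected_signups = n * p

print(round(at_least_eight, 4))
print(expected_signups)
```

The same loop handles any "at least" question by changing the starting value of the range.<br />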
Homework Problems - 4.1<br />
1. Determine whether or not each of the following experiments represents a binomial<br />
experiment. (Video Solution)<br />
a. A die is rolled 20 times and the number of 6's is counted.<br />
b. A die is rolled until ten 6's show up.<br />
c. In a stream with 1,500 fish, 700 are Rainbow Trout. A total of 20 fish are caught<br />
and the number of Rainbow Trout is counted.<br />
d. About 10% of the U.S. population is suspected to have a form of bacteria. A<br />
sample of 100 people is drawn from the population and the number of people with<br />
the strain of bacteria is counted.<br />
e. A brand of LED light bulb has a 0.5% chance of going out prior to the advertised<br />
life of 30,000 hours. In the testing phase, 850 bulbs are sampled for quality<br />
assurance. The number of bulbs that don't die prior to the 30,000 hour life is<br />
counted.<br />
2. Suppose the outcome of random variable \(X\) is conducted with \(n\) trials each with<br />
independent probability of success, \(p\). (Video Solution)<br />
a. Is this a binomial experiment?<br />
b. What is the probability that<br />
c. What is the probability that<br />
d. What is the probability that<br />
e. What is the probability that<br />
f. What is the probability that<br />
g. What is \(E[X]\)? Does it coincide with the resulting \(x\) that has the highest<br />
probability?<br />
3. In preparing for a New Year's Eve celebration, police look at past records for arrests due
to driving under the influence (DUI). In the U.S., 10.5% of arrests made are for DUI
(SOURCE: U.S. Statistical Abstract, Table 324). If it is expected that each police officer
makes 10 arrests, what is the probability that all 10 arrests are for DUI? (Video Solution)
4. Pancreatic cancer is a vicious killer. The 5-year survival rate between 2001 and 2007 was
only 5.9%, meaning that the majority of people with pancreatic cancer die within 5 years
of contracting the cancer. In a group of 25 patients, 5 survive beyond 5 years. How likely is such
an event? Assume that the survival of one person is independent of another person.
(SOURCE: U.S. Statistical Abstract, Table 182). (Video Solution)
5. A new herbal drink blend is being compared to an older blend via a blind taste-test
comparison. Four judges will taste each of the two drinks and will state their preference.
It is anticipated that both blends are equally impressive. (Video Solution)
a. Find the probability distribution for the number of judges that vote in favor of the
new blend.
b. Construct a probability histogram.
c. What is the probability that at least two of the judges prefer the new blend?
d. What is the expected value of this distribution and what is its real-world meaning?
6. Goranson and Hall (1980) explain that the probability of detecting a crack in an airplane
wing is the product of $p_1$, the probability of inspecting a plane with a wing crack; $p_2$, the
probability of inspecting the detail in which the crack is located; and $p_3$, the probability
of detecting the damage. (Problem Source: Mathematical Statistics with Applications, 6th
Ed., Wackerly, et al.) (Video Solution)
a. What assumptions justify the multiplication of these probabilities?
b. Suppose and for a certain fleet of planes. If three planes
are inspected from this fleet, find the probability that a wing crack will be
detected on at least one of them.
c. Find the probability distribution for the number of planes in this fleet with
detected wing cracks.
d. Construct a probability histogram.
e. What is the expected value of this distribution and what is its real-world meaning?
Chapter 5
Continuous Probability Distributions

Up until this point, we have only considered distributions that have discrete values (non-negative
integers). There are many variables, however, that are continuous in nature. In fact, almost every
variable you studied in algebra and calculus was continuous!

Take, for example, heights of NBA basketball players, hourly wage, response time of a database
server, temperature, depth of a lake, the value of a share of Intel stock, and the lifespan of a car
engine, to name just a few. These are all variables that can take on infinitely many values,
even within a limited range. For example, the response time of a database could be anywhere
between 0 seconds and 1 second. It could be 0.01 seconds, 0.00001 seconds, or 0.98727495 seconds.

5.1 The Ideas Behind the Continuous Distribution
5.1.1 Conceptual Approach to Continuous Distributions

Think back to a discrete distribution. The probability of a particular value was found by
observing the height of the relative frequency bar. While relative frequency represents the
percentage of observations found to have the value specified, it can also be thought of as a
probability, if we feel that it accurately models the predictions we might use it for. Consider the
example below showing the number of children in a classroom of 30 that are likely to
have the flu.
[Figure: Number of Children with Flu in a Class. Probability (relative frequency) histogram; horizontal axis: Number of Children w/Flu (0 to 5); vertical axis: Probability; bar labels, tallest to shortest: 0.4, 0.2, 0.16, 0.14, 0.1, 0.1.]
For instance, we see that the probability that exactly 2 children in a classroom have the flu is 0.2.
Let's call this random variable
$X$ = # of children in a classroom of 30 that have the flu.
Then, we will write the probability that exactly 2 children have the flu as:
$P(X = 2)$
This reads, “the probability that the number of children that have the flu is 2.”
The output of this statement is:
$P(X = 2) = 0.2$
What would it mean to ask: What is $P(X \le 2)$?
This is asking us to find the probability that 2 or fewer children have the flu. In other words,
what is the probability that 0, 1, or 2 children have the flu? To answer this, we simply add the bar
heights corresponding to $X = 0, 1, 2$.
$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.4 + 0.14 + 0.2 = 0.74$
Thus, there is a 74% chance that 2 or fewer children in a class of 30 children have the flu.
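
The bar-adding step can be sketched in a few lines of Python. This is only an illustration: the dictionary name `pmf` and the helper `prob_at_most` are made up for this sketch, and the assignment of 0.40 to 0 children and 0.14 to 1 child is an assumption (swapping them leaves the total unchanged).

```python
# Bar heights for 0, 1, and 2 children, read from the histogram.
# Assumption: 0.40 belongs to 0 children and 0.14 to 1 child; swapping
# them would not change the cumulative probability.
pmf = {0: 0.40, 1: 0.14, 2: 0.20}


def prob_at_most(pmf, k):
    """Return P(X <= k) by adding the bar heights for every value up to k."""
    return sum(p for value, p in pmf.items() if value <= k)


p_exactly_2 = pmf[2]
p_at_most_2 = prob_at_most(pmf, 2)

print(f"P(X = 2)  = {p_exactly_2:.2f}")
print(f"P(X <= 2) = {p_at_most_2:.2f}")
```

For a discrete variable, a cumulative probability is nothing more than a sum of individual bar heights, which is exactly what will stop working once the variable becomes continuous.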
With continuous distributions, we cannot simply read the “height of the bar”! For instance,
consider the following continuous probability distribution that shows the likelihood of various
wait times in line at a fast-food restaurant:

[Figure: Time Spent Waiting in Line. A horizontal line at a height of 0.2 from 0 to 5 minutes; horizontal axis: Minutes; vertical axis: Probability.]
In this case, $X$ = minutes spent waiting in line is a continuous random variable. The reason is
that a person doesn't wait a whole number of minutes! It is perfectly okay for a person to wait
1.42 minutes, for example.

In this example, suppose we wish to find $P(X = 2.5)$, that is, the probability that the wait time is
2-and-a-half minutes. At first glance, we might simply decide to locate 2.5 minutes and read off
the probability output. We would find:

$P(X = 2.5) = 0.2$

If this were the case, wouldn't all wait times have a probability of 0.2, based
on the graph? This, however, would be a logical pitfall: if there are infinitely many
different wait times between 0 and 5 minutes, then the sum of all probabilities would be a sum of
infinitely many 0.2's, which is far more than 1. In other words, it is only possible for the wait
times to have individual probabilities of 0.2 if the times were discrete, with only five possible
values. When we deal with continuous random variables,
we should actually consider the vertical axis to be density instead of probability. In and of itself,
density is not a meaningful value; however, in conjunction with what we will mention next, it will
prove to be useful.

Without going into too much detail, an interval of densities is designed in such a way that the
area under the function is 1, or 100%. Let's reconsider the above graph:

[Figure: Time Spent Waiting in Line. The same horizontal line at a height of 0.2 from 0 to 5 minutes, with the vertical axis now labeled Density; horizontal axis: Minutes.]
We notice that the density is 0.2 for every wait time from 0 to 5 minutes. The region underneath
the blue line is rectangular.
To find the area of a rectangle, we simply take
$\text{Area} = \text{length} \times \text{width} = 5 \times 0.2 = 1$
And so we are able to confirm that the total area of 1, or 100%, represents all possible wait
times this particular store has experienced.
As you might guess, if we wish to find the probability of a range of values, we would simply find
the area between those two values of time.
One question does remain, however: what is the probability that the wait time is exactly
2.5 minutes?
The answer might not come as too much of a surprise: the probability is 0!
The probability of a single value in a continuous distribution is 0, since there are infinitely many
possible values. Thus, 2.5 represents 1 of infinitely many values. Informally, take $1/\infty$ and you get 0!
Equivalently, a single value spans an interval of width 0, so the area above it is 0.
We can only find the probability of a non-zero range of values for a continuous random variable!
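
One way to see why a single value has probability 0 is to shrink an interval around 2.5 minutes and watch the area shrink with it. The sketch below assumes the constant density of 0.2 from the wait-time graph; the helper name `interval_probability` is invented for this illustration.

```python
DENSITY = 0.2  # height of the wait-time density between 0 and 5 minutes


def interval_probability(lower, upper, density=DENSITY):
    """Area of the rectangle under a constant density between lower and upper."""
    return (upper - lower) * density


center = 2.5
widths = [1.0, 0.1, 0.01, 0.001, 0.0]
for width in widths:
    p = interval_probability(center - width / 2, center + width / 2)
    print(f"interval width {width:<6} -> probability {p:.6f}")
```

As the interval narrows toward the single point 2.5, the rectangle loses its width and the probability falls to 0.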
Continuous Random Variables
A continuous random variable is a random variable that has infinitely many possible values
within a range of real numbers.
As a result, the probability that a continuous random variable takes on any one specific value is
0.

Probability Density Function (PDF)
The PDF of a continuous random variable is a non-negative function such that the total area
between the function and the horizontal axis is 1. The function's input values are the values of
the random variable, while the output values are densities. Densities are individually meaningless
values designed so that the total area equals 1.
Reconsider the above wait-times example:

[Figure: Time Spent Waiting in Line. A horizontal line at a density of 0.2 from 0 to 5 minutes; horizontal axis: Minutes; vertical axis: Density.]
Suppose we wish to find $P(2.5 \le X \le 3.5)$, that is, the probability that the waiting time is
between 2.5 and 3.5 minutes. To find this, we simply find the area under the PDF between 2.5
and 3.5 minutes.
The area of the rectangular region is:
$(3.5 - 2.5)(0.2) = 0.2$
Thus,
$P(2.5 \le X \le 3.5) = 0.2$
We can expect to wait between 2.5 and 3.5 minutes with a 20% chance. Thus, in approximately one
in five visits, our wait time will be somewhere within this interval.
Similarly, suppose we wish to know:
$P(0.3 \le X \le 4.4)$
This is the probability that the wait time is between 0.3 and 4.4 minutes.
The area of this region is:
$(4.4 - 0.3)(0.2) = (4.1)(0.2) = 0.82$
Thus, there is an 82% chance that the wait time is between 0.3 and 4.4 minutes.
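
The two rectangle areas above can be reproduced with a small helper. This is a minimal sketch, assuming the wait time is uniform on 0 to 5 minutes; the function name `uniform_probability` and its defaults are chosen for this illustration only.

```python
def uniform_probability(lower, upper, a=0.0, b=5.0):
    """P(lower <= X <= upper) for X uniform on [a, b], as a rectangle area."""
    # Clip the requested range to [a, b]; the density is 0 outside it.
    left = max(lower, a)
    right = min(upper, b)
    width = max(right - left, 0.0)
    height = 1.0 / (b - a)  # density: 1/5 = 0.2 for the wait-time example
    return width * height


print(round(uniform_probability(2.5, 3.5), 4))  # 0.2
print(round(uniform_probability(0.3, 4.4), 4))  # 0.82
print(round(uniform_probability(0.0, 5.0), 4))  # 1.0
```

Clipping the range to the interval where the density is non-zero keeps the helper from reporting probabilities above 1 when a requested range extends past 5 minutes.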
5.1.2 Uniform Distribution

Continuous Uniform Distribution
When the PDF of a random variable is a constant, we call this a uniform distribution. That is,
values of the random variable are uniformly distributed.
The PDF of a random variable, $X$, whose values are in the interval $a \le x \le b$
is:
$f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$
The expected value of this random variable is:
$E[X] = \dfrac{a + b}{2}$
The variance of this random variable is:
$\mathrm{Var}(X) = \dfrac{(b - a)^2}{12}$
Resulting in a standard deviation of:
$\sigma = \sqrt{\dfrac{(b - a)^2}{12}}$
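
The formulas for the mean and variance come from calculus, which this text does not develop. As a rough numerical check, and not a derivation, the sketch below approximates the relevant areas with thin rectangles (a midpoint rule) for the wait-time example with $a = 0$ and $b = 5$. The helper names `uniform_pdf` and `integrate` are assumptions made for this illustration.

```python
def uniform_pdf(x, a, b):
    """Density of a uniform random variable on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(func, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the area under func on [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


a, b = 0.0, 5.0  # the wait-time example

total_area = integrate(lambda x: uniform_pdf(x, a, b), a, b)
mean = integrate(lambda x: x * uniform_pdf(x, a, b), a, b)
variance = integrate(lambda x: (x - mean) ** 2 * uniform_pdf(x, a, b), a, b)

print(f"total area = {total_area:.4f}")
print(f"mean       = {mean:.4f}  (formula: {(a + b) / 2:.4f})")
print(f"variance   = {variance:.4f}  (formula: {(b - a) ** 2 / 12:.4f})")
```

The numerical values should agree with $(a + b)/2$ and $(b - a)^2/12$ to several decimal places.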
Example 1: The amount of revenue that a farmers market generates on a given Saturday is
uniformly distributed between $5,000 and $22,000.
a. Find the PDF for this random variable.
b. Find the probability that between $6,000 and $8,000 is generated.
c. Find the expected value of this random variable and explain its real-world
meaning.
d. Find the standard deviation of this random variable and explain its real-world
meaning.
SOLUTION:
a. The lower limit is $a = 5000$ and the upper limit is $b = 22000$. Thus,
$f(x) = \dfrac{1}{22000 - 5000} = \dfrac{1}{17000}$
This constant function is only valid for values between 5000 and 22000. It is valued as
0 everywhere else.
[Figure: Revenue PDF. A horizontal line at a density of 1/17000 (about 0.00006) from 5000 to 22000; horizontal axis: Revenue ($); vertical axis: Density.]
b. We want $P(6000 \le X \le 8000)$. The probability will be the length times the width.
We get:
$P(6000 \le X \le 8000) = (8000 - 6000) \cdot \dfrac{1}{17000} \approx 0.1176$
There is about a 12% chance that revenue earned will fall between $6,000 and $8,000.
c. The expected value will be:
$E[X] = \dfrac{5000 + 22000}{2} = 13500$
This is a simple average. Thus, on average, the farmers market will make $13,500 on a
given Saturday.
d. The standard deviation will be:
$\sigma = \sqrt{\dfrac{(22000 - 5000)^2}{12}} \approx 4907$
On average, revenue will vary by about $4,907 less or more than the mean.
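
The arithmetic in Example 1 can be double-checked with a short script. It is a sketch that applies the uniform-distribution formulas directly; the variable names are chosen for this illustration.

```python
import math

a, b = 5_000, 22_000  # revenue limits from Example 1, in dollars

density = 1 / (b - a)                       # part a: constant height of the PDF
prob_6k_to_8k = (8_000 - 6_000) * density   # part b: length times width
expected_value = (a + b) / 2                # part c
std_dev = math.sqrt((b - a) ** 2 / 12)      # part d

print(f"density        = {density:.7f}")
print(f"P(6000..8000)  = {prob_6k_to_8k:.4f}")
print(f"expected value = {expected_value:,.0f}")
print(f"std deviation  = {std_dev:,.2f}")
```

Changing `a` and `b` is enough to rework the example for any other uniformly distributed quantity.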
5.1.3 Other Distributions

Without going into detail here, continuous random variables have PDFs with area between the
function and the horizontal axis equal to 1. Clearly, densities cannot be negative, as it is not
possible to have negative probabilities.
As an example, a distribution might look like this:

[Figure: a triangular density that rises from 0 at a value of 0 to a peak of 1 at a value of 1, then falls back to 0 at a value of 2; horizontal axis: Random Variable Values; vertical axis: Density.]
Practically speaking, it appears to be most probable that the random variable will take on a value
around 1. It is less likely that the random variable will take on values close to 0 or close to 2.
This might be handy in situations where such behavior is desired.
Notice that the area is also 1. If you divide the triangle into 2 and use the area of a triangle
formula $\left(\frac{1}{2}bh\right)$, each half has a base of 1 and a height of 1:
$\frac{1}{2}(1)(1) = \frac{1}{2}$
Then the sum of the two triangular areas is:
$\frac{1}{2} + \frac{1}{2} = 1$
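
The same area idea gives probabilities for this triangular shape. The sketch below assumes the density rises in a straight line from 0 at 0 to a height of 1 at 1 and falls back to 0 at 2, which is how the figure is read here. The helpers `triangular_pdf` and `area_under` are hypothetical names, and the areas are approximated with thin rectangles.

```python
def triangular_pdf(x):
    """Assumed triangular density on [0, 2] with its peak of 1 at x = 1."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def area_under(func, lower, upper, steps=10_000):
    """Midpoint-rule approximation of the area under func on [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


total_area = area_under(triangular_pdf, 0.0, 2.0)
near_one = area_under(triangular_pdf, 0.5, 1.5)
near_zero = area_under(triangular_pdf, 0.0, 0.5)

print(f"total area           = {total_area:.4f}")
print(f"P(0.5 <= X <= 1.5)   = {near_one:.4f}")
print(f"P(0   <= X <= 0.5)   = {near_zero:.4f}")
```

Under this assumed shape, an interval of width 1 centered on the peak carries far more probability than an interval of width 0.5 at the edge, which matches the observation that values around 1 are the most likely.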
In the next section, we will focus our attention on the most commonly used continuous random
variable: the normally distributed random variable.
Homework Problems – 5.1
The first two questions below involve discrete random variables. The aim of these questions is to
get you thinking in terms of the probabilities of ranges of values.
1. A pizza shop sells pizzas in four different sizes. The 1000 most recent orders for a single
pizza gave the following proportions for the various sizes:
With $X$ denoting the size of a pizza in a single-pizza order, the given table is an
approximation to the population distribution of $X$.
a. Construct a probability (relative frequency) histogram to represent the
approximate distribution of this variable.
b. Approximate ( ).
c. Approximate ( ).
d. Find the expected value of $X$. What does this value mean?
e. What is the approximate probability that $X$ is within 2 in. of this expected (mean)
value?
2. Airlines sometimes overbook flights. Suppose that for a plane with 100 seats, an airline
takes 110 reservations. Define the variable $X$ as the number of people who actually show
up for a sold-out flight. From past experience, the population distribution of $X$ is given in
the following table:
a. What is the probability that the airline can accommodate everyone who shows up
for the flight?
b. What is the probability that not all passengers can be accommodated?
3. A particular professor never dismisses class early. Let $X$ denote the amount of time past
the hour (in minutes) that elapses before the professor dismisses class. Suppose that the
density curve shown in the following figure is an appropriate model for the probability
distribution of $X$:
[Figure: density curve for the dismissal time; vertical axis ticks at 0.05, 0.10, 0.15, and 0.20; horizontal axis ticks at 2, 4, 6, 8, and 10 minutes.]
a. Find the probability density function (PDF) for this random variable.
b. What is the probability that at most 5 minutes elapse before dismissal?
c. Find ( ). Explain what your answer means.
d. Find the expected value of this distribution and explain its real-world meaning.
e. Find the standard deviation of this distribution and explain its real-world meaning.
f. What is the probability that the instructor lets out class within one standard deviation
of the average overtime?
4. A delivery service charges a special rate for any package that weighs less than 1 lb. Let
$X$ denote the weight of a randomly selected parcel that qualifies for this special rate. The
probability distribution of $X$ is specified by the following density curve:
[Figure: density curve labeled Density = 0.5 + x; vertical axis ticks from 0.5 to 1.5; horizontal axis ticks from 0.0 to 1.2.]
Use the fact that the figure can be broken up into the area of a rectangle and the area of a
triangle, where area of a triangle = $\frac{1}{2}$(base)(height) and the area of a rectangle =
(length)(width).
a. What is the probability that a randomly selected package of this type weighs at
most 0.5 lb.?
b. What is the probability that a randomly selected package of this type weighs
between 0.25 lb. and 0.5 lb.?
c. What is the probability that a randomly selected package of this type weighs at
least 0.75 lb.?
d. The probability is defined on the interval $0 \le x \le 1$. Verify that the area under
the curve in this region is 1.
5. A plumbing service's response time to off-site emergency calls is uniformly distributed
between 15 and 45 minutes.
a. Find the PDF for this random variable, $X$.
b. Find ( )
c. Find ( )
d. Why are both of the above probabilities the same?
e. Find ( ).
f. Find and interpret the real-world meaning of the expected value.
g. Find and interpret the real-world meaning of the standard deviation.
h. What is the probability that the service responds within 1.5 standard deviations of
the expected time?
5.2 The Normal Distribution
5.2.1 The Normal Distribution As a Natural Phenomenon

The normal distribution (pictured above), much like the uniform distribution, is a continuous
distribution. In fact, this distribution is defined for all real numbers. The curve runs from $-\infty$ to
$\infty$. However, as you might observe, the most likely values occur close to where the density
function peaks. Values that occur in either one of the “tails” are highly unlikely and, as it
appears, the density function is very close to the horizontal axis as it extends farther to the left
and to the right.

Why do we use this distribution? Much like the infamous appears in many natural places,
many random variables tend to be normally distributed. That is to say, the bulk of values tend to
occur near the mean and median (both of which are located directly in the center of the
distribution, since it is perfectly symmetric). For instance, heights of individuals in the United
States (roughly) follow a normal distribution: there are many people whose heights are near
average. There are fewer extremely short and extremely tall people in the United States. Thus,
we would say that the bulk of people are “normal” with respect to their heights.

While certainly not all random variables are normally distributed, many are. Weights, IQ, and
new-vehicle gas mileages (to name just a few) are variables that have been known to follow a
normal distribution. As we will later see, any distribution can “become” a normal distribution.
This is a beautiful phenomenon that allows us to make some important conclusions (more on this
idea in a later section).
As before, the overall area under the normal curve is 1 (50% on either side of the mean/median,
as in the image). To find the area, we would need to use some rather unusual shapes in order to
apply the same methodology as before. The idea of an integral in calculus would, in principle, allow
us to find the area; however, the normal curve is modeled by the following PDF:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
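
To make the formula less abstract, the sketch below codes the PDF directly and approximates the area under it with thin rectangles. This is an illustration only: the function names are made up for this sketch, and the range from -8 to 8 stands in for the whole real line, on the assumption that the area beyond 8 standard deviations is negligible.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


def area_under(func, lower, upper, steps=20_000):
    """Midpoint-rule approximation of the area under func on [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


total_area = area_under(normal_pdf, -8.0, 8.0)
left_half = area_under(normal_pdf, -8.0, 0.0)

print(f"peak density at the mean = {normal_pdf(0.0):.4f}")
print(f"total area               = {total_area:.4f}")
print(f"area left of the mean    = {left_half:.4f}")
```

The total area comes out to about 1 and the area to the left of the mean to about 0.5, in line with the 50% on either side described above.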
As you can see, this is a difficult function to work with. Historically, tables have been developed
with calculated areas, as the calculus was once quite difficult to do. In order to use such a table, it was
often necessary to first convert the desired range of values to $z$-scores. Since every normal
distribution has a different mean and standard deviation, it would be impossible to create a table
for every possible combination. Instead, since each normal distribution is of the same shape, it
made sense to create just one table that represented a mean of $\mu = 0$ and a standard deviation of
$\sigma = 1$. That is, we can think about every normal distribution in terms of the number of standard
deviations each score is from the mean. The mean is 0 standard deviations away from the mean (it is
the mean!) and each unit represents 1 standard deviation. We can think about any normal distribution this way!
Normal Distribution Expected Value and Variance
A normal probability distribution can be modeled by the function
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
where the
expected value is $\mu$, defined as a standard mean,
$\mu = \dfrac{\sum x_i}{N}$
and the variance is
$\sigma^2$, defined as a standard variance,
$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
IMPORTANT NOTE: $\mu$ and $\sigma^2$ represent the population mean and variance. $N$ represents the
population size. Recall that the sample variance has a divisor of $n - 1$, so that it is an unbiased
estimator of the population variance.
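
The difference between the two divisors is easy to see on a small data set. The sketch below uses Python's standard `statistics` module, where `pvariance` divides by $N$ and `variance` divides by $n - 1$. The eight data values are hypothetical and chosen only so the arithmetic is clean.

```python
import statistics

# Hypothetical data set used only to illustrate the two divisors.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by n - 1

print(f"mean                = {mean}")
print(f"population variance = {population_variance}")
print(f"sample variance     = {sample_variance:.4f}")
```

If the eight values are the entire population, the first figure applies; if they are a sample drawn from a larger population, the second, slightly larger figure is the appropriate estimate.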
Below is an example of what a typical table would look like. We call this a standard normal
table, since it requires that the values between which we would like to know areas are
“standardized.” This means they are converted to $z$ scores prior to using the table:
As we notice, this table only shows positive $z$ scores. A similar table exists for negative $z$ scores,
that is, for values that are less than the mean. The image tells us that each of the entries in the
center of the table corresponds to the area to the left of the $z$ score we would look up.
1. In an Arizona town, suppose the heights of adult males are such that the mean is $\mu$ inches and
the variance is $\sigma^2$ (so the standard deviation is the square root of this value, $\sigma$). What is the
probability that a male is shorter than 72 inches (6 feet tall)?
SOLUTION: We wish to find $P(X < 72)$, where $X \sim N(\mu, \sigma^2)$. The normal
distribution would look like the following:
We wish to know the area of the shaded region below:
We first convert the value of 72 to a $z$ score:
$z = \dfrac{x - \mu}{\sigma} = \dfrac{72 - \mu}{\sigma} \approx 1.14$
We round to two decimal places, since the standard normal table can handle up to two decimal
places. Any additional decimal places would not make a substantial difference.
We locate $z = 1.14$ by first locating 1.1 along the rows and 0.04 along the columns (since 1.1 +
0.04 = 1.14).
The value we find is 0.8729. This means that $P(X < 72) = 0.8729$. There is an 87.29% chance
that a randomly selected individual will be less than 72 inches in height.
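
The table lookup can be cross-checked without a table. The sketch below computes the area to the left of a $z$ score using the error function `math.erf` from Python's standard library; the helper name `standard_normal_cdf` is chosen for this illustration.

```python
import math


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = 1.14  # z score for a height of 72 inches, taken from the example
area_left = standard_normal_cdf(z)

print(f"P(Z < {z}) = {area_left:.4f}")
```

Rounded to four decimal places, the result matches the table entry of 0.8729.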
What if we wanted to know an area to the right, such as $P(X > 72)$? The table does not provide
these values. However, if we know that $P(X < 72) = 0.8729$, then the probability of a height
greater than 72 must be the remaining area, $1 - 0.8729 = 0.1271$.
Similarly, if we wish to find the area between two points, we must get creative.
Suppose we wish to know the probability that a height falls between two values. We first need to
convert both endpoints to $z$ scores:
$z_1 = 0.57$ and $z_2 = 1.00$
We can easily find that the probability of a $z$ score less than 0.57 is: 0.7157
The probability of a $z$ score less than 1.00 is: 0.8413
The area between them is the difference in their areas:
$0.8413 - 0.7157 = 0.1256$
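
Both tricks, subtracting from 1 for a right-hand area and subtracting two left-hand areas for a region in between, can be sketched with the same error-function approach. The helper name `standard_normal_cdf` is an assumption of this sketch, and because the areas are computed directly rather than read from a rounded table, the last decimal place may differ slightly from a table-based answer.

```python
import math


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Area to the right of z = 1.14: whatever is left after removing the left area.
area_right = 1.0 - standard_normal_cdf(1.14)

# Area between z = 0.57 and z = 1.00: difference of the two left areas.
area_between = standard_normal_cdf(1.00) - standard_normal_cdf(0.57)

print(f"P(Z > 1.14)        = {area_right:.4f}")
print(f"P(0.57 < Z < 1.00) = {area_between:.4f}")
```

The order of subtraction matters: the left area of the larger $z$ score comes first, so the result is never negative.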
As technology progresses, there is a much lesser need <strong>for</strong> by-hand computations of the sort<br />
above. Instead, let us use the web applet from which the above pdf‟s came:<br />
http://www.rossmanchance.com/applets/NormalCalcs/NormalCalculations.html<br />
As you can see, we enter the mean and standard deviation <strong>in</strong> the first section. If we would like to<br />
plot two functions over one another, we could check the box and enter a second mean and<br />
standard deviation.<br />
<strong>Statistics</strong> <strong>for</strong> <strong>Decision</strong>-<strong>Mak<strong>in</strong>g</strong> <strong>in</strong> Bus<strong>in</strong>ess © Milos Podmanik Page 178
In the second section, we can check up to two boxes, in the event that we would like to find an<br />
area between two points. We can enter values either as z-scores or as raw data values (\(x\)). To find<br />
the probability of a value greater than the one entered, we click the grey box to select the “greater than” option.<br />
The probability of such an event is displayed in the “prob” box. If we have two values entered<br />
and both boxes checked, then the “probability between” these two values is displayed. Isn't this<br />
much more intuitive and convenient than using tables?<br />
NOTE: One limitation of the above applet is that values rounded to two decimal places require<br />
a bit of finagling.<br />
Homework Problems – 5.2<br />
Use the applet mentioned in this section to complete these exercises. You are not required to use<br />
the standard normal table.<br />
1. In the United States, IQs are normally distributed with and .<br />
a. What is the probability that a person has an IQ lower than 130?<br />
b. What is the probability that a person has an IQ between 80 and 110?<br />
c. What is the probability that a person has an IQ between 50 and 70?<br />
d. What is the probability that a person has an IQ above 120?<br />
2. In the UK, birth weights are approximately normally distributed with lbs. and<br />
lbs. (SOURCE: http://www.healthknowledge.org.uk).<br />
a. Find and explain the real-world meaning of ( ).<br />
b. Find and explain the real-world meaning of ( ).<br />
c. Find and explain the real-world meaning of ( ).<br />
d. Find and explain the real-world meaning of ( ).<br />
e. What weight is such that 20% of infants weigh less than this amount? (HINT:<br />
You can still use the calculator applet.)<br />
3. In a recent year, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in<br />
the United States were such that points and points (SOURCE:<br />
http://www.collegeboard.com).<br />
a. 50% of students scored less than how many points?<br />
b. 50% of students scored more than how many points?<br />
c. In order to be in the top 10% of SAT-takers, what score would one have to<br />
achieve?<br />
d. The lowest 10% of students scored between what two values?<br />
e. The middle 50% of students scored between what two values?<br />
4. Sketch a normal distribution and . Label the mean, \(\pm 1\) standard deviation,<br />
\(\pm 2\) standard deviations, and \(\pm 3\) standard deviations.<br />
a. Determine the probability that an observation falls within each of these standard<br />
deviation ranges.<br />
b. The Empirical Rule describes the probability of scores within 1, 2, and 3 standard<br />
deviations of the mean. Do a web search on this topic and compare it to your<br />
answer in the above part. Are the results the same?<br />
5. Suppose a distribution is such that and .<br />
a. What would happen to the distribution if \(\mu\) was changed to 60?<br />
b. What would happen to the distribution if \(\sigma\) was changed to 10? There are two<br />
effects to describe. Discuss why it makes practical sense that these two things<br />
should happen to the curve.<br />
c. What would happen to the distribution if \(\sigma\) was changed to 2? There are two<br />
effects to describe. Discuss why it makes practical sense that these two things<br />
should happen to the curve.<br />
d. Describe the effects, in general, of \(\mu\) and \(\sigma\) on the shape and location of a normal<br />
distribution.<br />
Chapter 6<br />
Sampling Distributions and Estimation<br />
When it is only our dataset that is of interest, we use descriptive statistics. This is precisely the<br />
trouble we have been up to so far! Oftentimes, however, we cannot collect all elements in the<br />
population. Take, for example, a poll to gauge Americans' opinion of a candidate in office.<br />
Certainly, you cannot sample all voting-age adults. This is easily resolved with a manageable<br />
random sample, but is further complicated by the following idea: sampling variability!<br />
We will work to answer the following question:<br />
How do we estimate true population parameters using a random sample, all the while taking into<br />
account the fact that our sample statistic is variable from sample to sample?<br />
This is the purpose of inferential statistics and is a very important aspect of understanding the<br />
structure of an underlying population. With many advances in statistics, it is possible to make<br />
precise claims about our population.<br />
6.1 Sampling Distribution for \(\bar{x}\)<br />
6.1.1 What is a Sampling Distribution?<br />
The cold, hard truth is that, when working with statistical inference, we likely have no idea what<br />
the underlying probability distribution for the population looks like. If we did, then we wouldn't<br />
have to draw a random sample and would be nearly done with this course. Since we don't, we<br />
can't in good conscience assume that the distribution is normal. So, why spend time studying<br />
such a distribution? We will soon see why.<br />
Let's start with an example that is concrete.<br />
Suppose we roll a die. Without too much effort, we can produce the probability distribution for<br />
the population of all possible outcomes. Here it is:<br />
[Figure: Probability Distribution for Single Die Roll. Bar chart with Die Value (1 to 6) on the horizontal axis and Probability (0.00 to 0.18) on the vertical axis.]<br />
In words, the probability of getting any one face value on a die roll is 1/6, or about 0.17. The<br />
distribution is uniform.<br />
If we found the expected value (the average), we would get:<br />
\[E[X] = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = 3.5\]<br />
(NOTE: This is the same as \(\frac{1 + 2 + 3 + 4 + 5 + 6}{6}\), since each event is equally likely.)<br />
The variance of this population requires us to use the population variance formula<br />
(remember, division by \(n - 1\) occurs if we are dealing with a sample, so that we have an<br />
unbiased estimate for the population variance). That is:<br />
\[Var[X] = \sigma^2 = \frac{\sum (x - \mu)^2}{N}\]<br />
Using Excel, with the values 1 through 6 entered in cells A2:A7, we find that:<br />
=VAR.P(A2:A7)<br />
which gives:<br />
2.916666667<br />
Thus, the standard deviation would be \(\sqrt{2.9167} \approx 1.708\), meaning that, on average, we would<br />
expect the die value to deviate by 1.708, or nearly 2 units, from the average (1.5 to 5.5, which is<br />
pretty much 1 to 6).<br />
Thus, we have that:<br />
\[\mu = 3.5 \quad \text{and} \quad \sigma \approx 1.708\]<br />
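As a cross-check on the spreadsheet result, here is a minimal sketch in Python (an assumption; the text itself uses Excel). The standard-library functions `pvariance` and `pstdev` divide by the population size, which matches what `VAR.P` does.

```python
from statistics import mean, pvariance, pstdev

faces = [1, 2, 3, 4, 5, 6]    # every face is equally likely

mu = mean(faces)              # expected value of a single roll
variance = pvariance(faces)   # population variance (divides by N)
sigma = pstdev(faces)         # population standard deviation

print(mu, round(variance, 4), round(sigma, 3))
```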
In reality, keep in mind that we would often not know much about our population. Here we get the<br />
luxury of studying something we can fully explain. This is all in an effort to better understand<br />
sampling distributions.<br />
Suppose we conducted an experiment of rolling the die 10 times. For one random sequence, we<br />
might obtain the following result:<br />
4, 6, 3, 4, 4, 1, 3, 4, 1, 2<br />
Not surprisingly, we get a fairly even spread of values from 1 to 6. If we compute the average,<br />
we obtain 3.2. That is, if the same total were shared equally among all 10 rolls, each roll would be 3.2.<br />
Suppose we asked 19 other people to roll a die 10 times and to then report back to us the mean.<br />
Here is what we might find (based on a computer simulation of rolls):<br />
First off, we notice there is sampling variability. Not every person obtained the same average<br />
outcome from 10 tosses each. This is expected, since the process is a random one.<br />
The distribution of these means is called a sampling distribution.<br />
Sampling Distribution<br />
The distribution of sample statistics (such as \(\bar{x}\)) computed from repeated sampling is called a<br />
sampling distribution.<br />
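To make the definition concrete, the following sketch simulates 20 people who each roll a fair die 10 times and report their mean. It assumes Python's standard `random` module; the function names and the seed are illustrative choices, and the simulated means will not match the ones listed in the text.

```python
import random
from statistics import mean

def sample_mean_of_rolls(n_rolls, rng):
    """Roll a fair die n_rolls times and return the average face value."""
    return mean(rng.randint(1, 6) for _ in range(n_rolls))

def sampling_distribution(n_people, n_rolls, seed=None):
    """Collect one sample mean per person."""
    rng = random.Random(seed)
    return [sample_mean_of_rolls(n_rolls, rng) for _ in range(n_people)]

means = sampling_distribution(n_people=20, n_rolls=10, seed=1)
print([round(m, 1) for m in means])
```

Changing the seed gives a different set of 20 means, which is exactly the sampling variability described above.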
6.1.2 The Central Limit Theorem<br />
20 Means: 3.1, 3.3, 2.4, 3.5, 2.7, 2.9, 2.9, 3.6, 3, 4.7, 3.6, 3.2, 3.9, 2.8, 3.2, 3.3, 3.9, 3.3, 3.5, 3.1<br />
We do notice that the means tend to gravitate towards 3.5. Some, as expected, deviate from this<br />
value.<br />
Let us now consider a histogram for this sampling distribution of sample means:<br />
[Figure: Sampling Distribution of x-bar. Histogram of the 20 sample means, with bins of width 0.25 running from 2.4 to above 5.15 on the horizontal axis and frequencies from 0 to 6 on the vertical axis.]<br />
This is quite interesting: we have obtained a distribution (of means) that appears somewhat<br />
bell-shaped.<br />
Suppose now that we had a total of 1000 people roll a die 10 times each and then compute the<br />
sample mean. Here is what a simulation of this process would look like:<br />
[Figure: Sampling Distribution of x-bar. Histogram of 1000 simulated sample means (10 rolls each), with frequencies from 0 to 100 on the vertical axis.]<br />
Wow! Our distribution of means for 1000 individuals, each running an experiment of 10 rolls, produces<br />
something remarkably like a normal distribution. Additionally, it appears that the mean of this<br />
distribution is around 3.5!<br />
Let's try this again, but now let's say that 1000 individuals each roll a die 20 times, and each<br />
individual computes a sample mean. This simulated event would produce the following<br />
distribution of die-roll averages:<br />
[Figure: Sampling Distribution of x-bar. Histogram of 1000 simulated sample means (20 rolls each), with bins of width 0.1 running from 2.2 to above 4.7 on the horizontal axis and frequencies from 0 to 120 on the vertical axis.]<br />
The distribution looks a bit more normal. Upon closer inspection, we also see that the variability<br />
of these averages is smaller. That is:<br />
Approximate Range for Means of 10 Tosses: 2.1 to 5.2<br />
Approximate Range for Means of 20 Tosses: 2.5 to 4.6<br />
We notice that increasing the sample size (\(n\)) has decreased the sampling distribution's<br />
variability.<br />
In fact, the standard deviations for the distributions of means computed from 10 and 20 tosses are<br />
about 0.52 and 0.38, respectively.<br />
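The shrinking spread can be reproduced with a short simulation. This is a sketch under assumptions: it uses Python's `random` module with an arbitrary seed, so the exact range and standard deviation will differ somewhat from the figures quoted in the text, which came from the author's own simulation.

```python
import random
from statistics import mean, stdev

def simulated_means(n_rolls, n_people=1000, seed=0):
    """Each of n_people rolls a fair die n_rolls times and reports the mean."""
    rng = random.Random(seed)
    return [mean(rng.randint(1, 6) for _ in range(n_rolls))
            for _ in range(n_people)]

for n_rolls in (10, 20):
    means = simulated_means(n_rolls)
    # Smallest mean, largest mean, and spread of the 1000 means.
    print(n_rolls, round(min(means), 2), round(max(means), 2),
          round(stdev(means), 2))
```

The printed standard deviation for 20 rolls should come out noticeably smaller than the one for 10 rolls.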
Let's do one more experiment. Let's say that 1000 individuals each roll a die 30 times, and each<br />
individual computes the mean of his/her rolls. The sampling distribution of means would look<br />
like this (based on simulation):<br />
[Figure: Sampling Distribution of x-bar. Histogram of 1000 simulated sample means (30 rolls each), with bins of width 0.1 running from 2.4 to above 4.9 on the horizontal axis and frequencies from 0 to 140 on the vertical axis.]<br />
Again, we notice the bell-curved shape and the decreased range of means (about 2.6 to 4.4)!<br />
Let's summarize:<br />
Distribution | Type | Distribution Mean | Distribution Standard Deviation<br />
Original Die Values | UNIFORM | 3.5 | 1.7<br />
Sampling Distribution of 10-Roll Means | NORMAL | 3.5 | 0.52<br />
Sampling Distribution of 20-Roll Means | NORMAL | 3.5 | 0.38<br />
Sampling Distribution of 30-Roll Means | NORMAL | 3.5 | 0.32<br />
We can very easily see that the expected value of the sampling distribution is the same as \(\mu\), the<br />
expected value of the population distribution. That is:<br />
\[E[\bar{x}] = \mu\]<br />
But what is the relationship between the standard deviations of the means and the standard<br />
deviation of the population of die roll values?<br />
This is not so clear. Statisticians, after much research, found that the standard deviation of each<br />
of the sampling distributions is related to the sample size in the following way:<br />
\[SD[\bar{x}] = \frac{\sigma}{\sqrt{n}}\]<br />
For example, for our sample of size 10,<br />
\[SD[\bar{x}] = \frac{1.708}{\sqrt{10}} \approx 0.540\]<br />
That is very close to the 0.52 we obtained!<br />
Similarly, for our sample of size 20,<br />
\[SD[\bar{x}] = \frac{1.708}{\sqrt{20}} \approx 0.382\]<br />
This one happens to be fairly spot-on!<br />
And finally, for our sample of size 30,<br />
\[SD[\bar{x}] = \frac{1.708}{\sqrt{30}} \approx 0.312\]<br />
This is again very close to our obtained 0.32!<br />
The small differences between the simulated and calculated values are simply due to randomness, and the estimates can be improved<br />
(if desired) by increasing the number of “individuals rolling the die.”<br />
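The arithmetic for the three sample sizes can be checked in a few lines. This is a sketch in Python (an assumption, since the text works by hand and in Excel); `standard_error` is a name chosen here for the quantity \(\sigma / \sqrt{n}\).

```python
from math import sqrt
from statistics import pstdev

def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)

sigma = pstdev([1, 2, 3, 4, 5, 6])   # population SD of one die roll

for n in (10, 20, 30):
    print(n, round(standard_error(sigma, n), 3))
```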
What we have observed here is formally known as the Central Limit Theorem.<br />
Central Limit Theorem<br />
Regardless of the distribution of a random variable, \(X\), if we take repeated random samples of size \(n \geq 30\) from<br />
this distribution and compute the mean, \(\bar{x}\), for each sample, then the following will<br />
hold:<br />
1.) The distribution of \(\bar{x}\) will be approximately normal<br />
2.) \(E[\bar{x}] = \mu\)<br />
3.) \(SD[\bar{x}] = \frac{\sigma}{\sqrt{n}}\)<br />
(NOTE: A sample size of at least 30 is a rule of thumb and can vary slightly depending on the<br />
severity of skews and abnormalities in the distribution. Even for severely skewed distributions,<br />
the approximate shape is typically normal.)<br />
6.1.3 Why the Central Limit Theorem?<br />
The Central Limit Theorem (CLT) has some very powerful, but subtle, results.<br />
First of all, we do not need to understand the shape of the underlying distribution from which we<br />
are sampling. This is an amazing result in and of itself, since we usually have little to no<br />
information about the population itself (again, if we did, we wouldn't be wasting our time with<br />
any of this!).<br />
Secondly, since the resulting sampling distribution is approximately normally distributed, we can<br />
proceed to calculate probabilities using the normal distribution. This is also great, since we<br />
already have the background in that process!<br />
Example 1: After experimentation, researchers believe that the mean lifespan of a strain of<br />
bacteria is days with days. Due to the complexity of the bacteria, the shape<br />
of the distribution of bacteria lifespans is unknown. A sample of 60 bacteria strains is<br />
collected.<br />
a. Does the CLT apply here?<br />
b. Calculate the probability that the sample mean lifespan, \(\bar{x}\), is less than 3 days.<br />
SOLUTION:<br />
a. Since the sample size is 60, we should be safe in assuming that the sampling distribution<br />
of all means is normally distributed with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{60}}\).<br />
b. We want \(P(\bar{x} < 3)\), which we find using our probability calculator.<br />
Given the very small level of variability in the sampling distribution of lifespan means,<br />
we would consider the probability of observing an average smaller than 3 to be practically 0.<br />
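The calculation pattern in this example can be sketched as a small function. The code assumes Python's `statistics.NormalDist`, and the numbers passed in at the bottom are hypothetical values chosen only for illustration; they are not the values from Example 1.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(x_bar, mu, sigma, n):
    """P(sample mean < x_bar), using the CLT normal approximation."""
    standard_error = sigma / sqrt(n)
    return NormalDist(mu, standard_error).cdf(x_bar)

# Hypothetical inputs (not from the example): mu = 50, sigma = 12, n = 36.
p = prob_sample_mean_below(x_bar=48, mu=50, sigma=12, n=36)
print(round(p, 4))
```

With these inputs the standard error is 2, so a sample mean of 48 sits one standard error below the mean.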
6.1.4 Limitations of the CLT<br />
One major oversight in our excitement with this idea is the notion that we would actually know<br />
the true population mean, \(\mu\), and the true population standard deviation, \(\sigma\). If we have limited<br />
information about our population, then we certainly would not know these values. In the next<br />
parts of this chapter, we will learn how to use our sample to make these predictions about the<br />
population. Though similar in conceptual nature, it is not as straightforward as replacing \(\mu\) with \(\bar{x}\)<br />
and \(\sigma\) with \(s\).<br />
Homework Problems – 6.1<br />
1. In your own words, what does the Central Limit Theorem tell us?<br />
2. In your own words, why is the Central Limit Theorem a very powerful practical result?<br />
3. A sample of size 36 is taken from a population distribution of unknown shape, though the<br />
mean is believed to be 100 with a standard deviation of 18. What is the probability that<br />
the sample mean is:<br />
a. Greater than 102?<br />
b. Less than 98?<br />
c. Between 95 and 105?<br />
d. Between what two values will the middle 90% of means be?<br />
4. A stained glass company produces panes of glass with a mean thickness of 0.42 inches<br />
and a standard deviation of 0.04 inches, if produced properly. Suppose a random sample<br />
of windows reveals a sample mean of 0.43.<br />
a. What is the probability of this average, or a larger average?<br />
b. Given the probability you have computed, what can be said about recent<br />
production standards?<br />
5. Promote Marketing has a research team to research new marketing tactics to propose to<br />
potential clients. A group of 40 clients has been invited to a conference to be put on by<br />
the marketing firm. The research team usually generates<br />
in revenues for<br />
each member of the team with .<br />
a. What will be the shape of the distribution of \(\bar{x}\)? How do you know?<br />
b. What is the probability that average sales will exceed $420,000 for this particular<br />
event?<br />
c. How would your answer change if 100 clients were to show up?<br />
d. If the team (300 people) has an average revenue that is in the 90th percentile of<br />
revenues, they will earn 4 days of paid vacation. What average sales would be<br />
required for this?<br />
6. A computer simulation reveals that a distribution of average incomes in a sample of 500<br />
has a standard deviation of $130. What is the standard deviation for the population of all<br />
incomes? Interpret the result you get in real-world terms.<br />
7. Use the Excel Sampling Distribution Applet to address this problem. In a population, it is<br />
found that 30% of homes have 5 rooms, 40% have 4 rooms, and 30% have 3 rooms. You<br />
can set this up in our applet by having a “die” with 10 values: three 5's, four 4's, and<br />
three 3's.<br />
a. What is the average number of rooms a home has in this population? What is the<br />
standard deviation in the number of rooms in this population?<br />
b. Now, suppose you take a sample of size 30 from this population. What shape will<br />
the distribution of \(\bar{x}\) have and how do you know?<br />
c. Take 1,000 random samples each of size and compute the 1,000 sample<br />
means. According to the applet, what is the average of the average rooms in the<br />
sample? What is the standard deviation in the average number of rooms in a<br />
house? Compare these two results to what the Central Limit Theorem says we<br />
should come up with. That is, find \(E[\bar{x}]\) and \(SD[\bar{x}]\).<br />
d. Take 1,000 random samples each of size and compute the 1,000 sample<br />
means. According to the applet, what is the average of the average rooms in the<br />
sample? What is the standard deviation in the average number of rooms in a<br />
house? Compare these two results to what the Central Limit Theorem says we<br />
should come up with. That is, find \(E[\bar{x}]\) and \(SD[\bar{x}]\).<br />
e. Take 1,000 random samples each of size and compute the 1,000 sample<br />
means. According to the applet, what is the average of the average rooms in the<br />
sample? What is the standard deviation in the average number of rooms in a<br />
house? Compare these two results to what the Central Limit Theorem says we<br />
should come up with. That is, find \(E[\bar{x}]\) and \(SD[\bar{x}]\).<br />
f. Why do the values in the population have the highest standard deviation when<br />
compared with the distributions of means in the last three parts?<br />
g. What is the probability that, in a sample of 100 homes, the average number of<br />
rooms is greater than 5?<br />
h. Explain in practical terms why the standard deviation of any \(\bar{x}\) distribution<br />
decreases as the sample size increases.<br />
6.2 Confidence Interval for \(\bar{x}\)<br />
6.2.1 Confidence Interval for \(\bar{x}\) Using Sampling Distributions<br />
As discussed previously, our ultimate goal is to make inferences about the population parameter<br />
\(\mu\). Again, keep in mind that this is the only reason why we are spending time on this! Otherwise,<br />
we would have completed our semester early!<br />
When we generate our sampling distribution for \(\bar{x}\), we see very vividly that our sample means are<br />
subject to sampling variability, depending on which “die values” are “rolled” for each individual<br />
sample of size \(n\). Thus, we should be very skeptical of concluding that any single \(\bar{x}\) is representative<br />
of the true population mean. However, if we have many, many “individuals roll the die,” we<br />
should get a fairly reasonable understanding of a range of values for the true value of \(\mu\). Let's<br />
consider an example.<br />
Suppose we want to better understand a population of ages of people in a town.<br />
1 1 18 22 25 27 30 18 21 2<br />
3 19 20 32 20 25 29 32 33 40<br />
29 25 29 24 23 29 29 26 27 1<br />
31 32 31 31 35 33 30 32 31 33<br />
19 20 22 21 20 20 19 22 22 9<br />
\(\mu = 23.46\)<br />
\(\sigma = 9.250319\)<br />
But, wait! Let's pretend that we actually don't have access to the entire population of values<br />
(yes, we clearly see them in the table above, but we normally do not have that luxury). Due to<br />
limited time and money, you are only able to sample 30 of these values. After taking a random<br />
sample, here is what you have chosen:<br />
32 31 31 35 19 20 22 21 20 20<br />
20 25 29 32 33 19 19 19 18 22<br />
25 27 30 18 21 33 30 32 31 33<br />
\(\bar{x} = 25.56667\)<br />
\(s = 5.870342\)<br />
Again, at this point, we would have no way of telling how close we are to the actual mean of<br />
23.46.<br />
To get a good estimate of \(\mu\), we will come up with a confidence interval. A confidence<br />
interval is a range of values such that there is a specified probability that the true population mean, \(\mu\),<br />
is between those values.<br />
How do we calculate this? Here is our motivation for what is to come:<br />
There are two ways to think about inferential statistics:<br />
1) Use theoretical results and make conclusions using them<br />
2) Build a sampling distribution for the statistic of choice (\(\bar{x}\) or \(\hat{p}\)) using the Bootstrap<br />
Method and make conclusions using this empirical data.<br />
We will draw parallels between the two regularly.<br />
Here is the basic idea of Bootstrap Sampling:<br />
1) From the population, take a random sample, preferably of size 30 or greater. The larger<br />
the random sample, the more power we have in making inferences about the population.<br />
2) If this is a truly representative sample, then we can think of it as a “mini” population that<br />
acts and behaves according to the population as a whole. This is a key ingredient!<br />
3) We cannot use this sample to calculate the corresponding parameter because of sampling<br />
variability. However, if this sample behaves like the population, then we can resample<br />
from it and get an idea of the overall variability. That is, draw a sample of the same<br />
size from this “mini” population, but do so with replacement. This is the same<br />
idea as rolling a die a fixed number of times: we are sampling with replacement from<br />
the population {1, 2, 3, 4, 5, 6}. What will this do? It will account for sampling variability, if<br />
repeated.<br />
4) Calculate the statistic from this sample and record it.<br />
5) Repeat steps 3) and 4) 1,000 to 10,000 times. We now have a sampling distribution and<br />
can make estimates about the true population parameter. And guess what this distribution<br />
will look like? You guessed it: it will be approximately normal, by the Central Limit<br />
Theorem.<br />
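Steps 3) through 5) can be sketched in a few lines of code. This is an illustration only, assuming Python's standard library (the text itself uses an Excel applet); `bootstrap_means` and the seed are hypothetical choices, and the data are the 30 sampled ages listed above.

```python
import random
from statistics import mean, stdev

# The 30 ages from the random sample in this section.
sample = [32, 31, 31, 35, 19, 20, 22, 21, 20, 20,
          20, 25, 29, 32, 33, 19, 19, 19, 18, 22,
          25, 27, 30, 18, 21, 33, 30, 32, 31, 33]

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample with replacement (same size as data) and record each mean."""
    rng = random.Random(seed)
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

boot = bootstrap_means(sample)
print(round(mean(sample), 5), round(stdev(sample), 6))  # sample mean and s
print(round(mean(boot), 2), round(stdev(boot), 2))      # centre and spread of the bootstrap means
```

`random.Random.choices` draws with replacement, so the same age can appear several times in one resample, just as the same face can come up repeatedly when rolling a die.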
Below is a diagrammatic representation of steps 1) – 5):<br />
[Diagram: Population → Random Sample → resamples drawn with replacement: Sample 1, Sample 2, Sample 3, Sample 4, ..., Sample 10,000.]<br />
Some of the assumptions we make are indeed dangerous. For example, do we really have a mini<br />
population? If the answer is “no,” then theoretical results are equally worthless, since they, too,<br />
assume that the sample is representative.<br />
Now, back to our example…<br />
If we have truly collected a random sample, then we should be able to think about the sample as<br />
a small population. If this is a small population, then we should be able to sample from it. We<br />
will draw random samples of size $n = 30$ from the small “population,” which is also of size<br />
$n = 30$. It sounds strange, but we will sample with replacement, so it is possible to resample the<br />
same value multiple times.<br />
We will draw 1,000 samples of size $n = 30$ from this “population” and, as you might have<br />
figured, we will calculate the mean of each and build the sampling distribution for $\bar{x}$.<br />
[Figure: histogram titled “Sampling Distribution of x-bar”; horizontal axis: sample-mean bins of width 0.5 from about 22.27 to 29.77; vertical axis: frequency, 0 to 200]<br />
As we should expect based on the CLT, the distribution of these 1,000 means is approximately<br />
normal.<br />
Let's suppose that we want an interval within which we can be 95% confident that the<br />
true population mean, $\mu$, lies. This is the same as looking for the middle 95% of means!<br />
Thus, we need to find the lower and upper limits for this interval by finding the 2.5th percentile<br />
and the 97.5th percentile. In Excel, we can do this by using the PERCENTILE() function. We get:<br />
Upper (97.5th percentile): 27.50<br />
Lower (2.5th percentile): 23.60<br />
Thus, we can say that we are 95% confident that the true population mean is between 23.6 years<br />
and 27.5 years. In other words, if our sample is representative, the method we used should “trap” the population<br />
mean between the lower and upper limits about 95% of the time. Said one other way, 95% of the resampled means, once<br />
the variability from sample to sample is taken into account, are between these lower and upper<br />
limits. If the sample is representative of the population, then we should believe that 95% of the time we<br />
will have means between these two values.<br />
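The percentile step can also be sketched in Python. The helper names `percentile` and `percentile_interval` and the ages below are hypothetical; the interpolation rule is intended to mimic what Excel's PERCENTILE() does, and the bootstrap means are regenerated here only so that the block runs on its own.<br />

```python
import random
import statistics


def percentile(values, p):
    """p-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower_index = int(rank)
    upper_index = min(lower_index + 1, len(ordered) - 1)
    fraction = rank - lower_index
    return ordered[lower_index] + fraction * (ordered[upper_index] - ordered[lower_index])


def percentile_interval(values, confidence=0.95):
    """Middle `confidence` share of the values: cut an equal tail off each end."""
    tail = (1 - confidence) / 2 * 100
    return percentile(values, tail), percentile(values, 100 - tail)


# Hypothetical sample and its bootstrap means, for illustration only.
ages = [22, 31, 25, 19, 28, 24, 35, 27, 21, 26]
rng = random.Random(7)
boot_means = [statistics.mean(rng.choices(ages, k=len(ages))) for _ in range(1000)]

low95, high95 = percentile_interval(boot_means, 0.95)
low99, high99 = percentile_interval(boot_means, 0.99)
print("95%:", round(low95, 2), "to", round(high95, 2))
print("99%:", round(low99, 2), "to", round(high99, 2))
```

Asking for more confidence moves both cut points further into the tails, so the 99% interval can never be narrower than the 95% interval built from the same bootstrap means.<br />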
What if we wanted to be 99% certain? We would need to find lower and upper limits so that<br />
there is only 1% in the tails:<br />
Thus, we would like 0.01/2 = 0.005 (or 0.5%) in each of the two tails. To find the lower and upper<br />
limits, we would need to find the 0.5th percentile and the 100 - 0.5 = 99.5th percentile. We get:<br />
Upper (99.5th percentile): 28.17<br />
Lower (0.5th percentile): 22.83<br />
Thus, we are 99% confident that the true population mean age, $\mu$, is between 22.83 years and<br />
28.17 years. In other words, we can be 99% confident that the true mean age is between 22.83<br />
and 28.17 years.<br />
If we want to be more confident, we need to expand our interval of values!<br />
Note that in only one of our confidence intervals (99%) have we captured the true mean within<br />
our range. This is very likely, since our confidence percentage is very high. BUT, keep in mind<br />
that we never know what the true mean is! Thus, we cannot say that it would have been better to<br />
stick with the wider 99% interval. After all, there is a 1% chance we might have made an error.<br />
The level of confidence that we desire depends on the situation and the interval width we<br />
are willing to tolerate. More confidence means wider possibilities. In general, we never know<br />
whether or not we have captured the true mean in our interval. On the upside, there is a<br />
probability associated with it!<br />
As a final note, it is interesting that we actually missed the true mean in our 95% confidence<br />
interval, since there is only a 5% chance of error. Keep in mind, however, that this interval was<br />
based on simulation. It is based on 1,000 samples, and it may have been better to increase the<br />
number of samples.<br />
6.2.2 Confidence Interval for $\bar{x}$ Using Theoretical Results – When $\mu$ and $\sigma$ are Unknown<br />
In the previous section, we found that the sampling distribution of $\bar{x}$ for a sufficiently large sample size $n$ is<br />
approximately normal with $E[\bar{x}] = \mu$ and $SD[\bar{x}] = \sigma/\sqrt{n}$. As a bit of notation, if a random variable<br />
has a normal distribution with a given mean and standard deviation, we would write:<br />
\[ \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \]<br />
This reads, “$x$-bar is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.”<br />
This, however, assumes that we know something that we probably don't: the population mean<br />
and standard deviation!<br />
As you might guess, we will use $\bar{x}$ and $s/\sqrt{n}$ to approximate these. This poses a problem: we are<br />
introducing more error. To account for this, the normal distribution is not appropriate.<br />
When using these approximations, we must use the theoretical Student's $t$-distribution. This<br />
distribution looks much like the normal distribution, but its shape is determined by the sample size, not by a<br />
mean and standard deviation. Below is a comparison of the $t$-distribution with the<br />
standard normal distribution for one particular sample size.<br />
We see that the standard deviation of the $t$-distribution (in red) is just slightly larger than that of the standard normal<br />
(in blue): it is about 1.0339. So, as the sample size gets greater, the $t$-distribution begins to look<br />
more like a standard normal. BUT, look at the one below, where the sample size is 10:<br />
The variability is nearly 14% greater.<br />
As we mentioned, this distribution's shape relies on the sample size. The relationship is expressed through<br />
the degrees of freedom, which can be calculated as $df = n - 1$; that is, the degrees of freedom are equal<br />
to one less than the sample size.<br />
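To see how quickly the extra variability fades as the sample grows, here is a small Python sketch. It relies on a standard fact that is not derived in this book: a $t$-distribution with $df > 2$ degrees of freedom has standard deviation $\sqrt{df/(df-2)}$. The function name is made up for the example.<br />

```python
import math


def t_standard_deviation(sample_size):
    """Standard deviation of the t-distribution with df = sample_size - 1."""
    df = sample_size - 1
    if df <= 2:
        raise ValueError("the standard deviation is only defined for df > 2")
    return math.sqrt(df / (df - 2))


for n in (10, 30, 100, 1000):
    print(n, round(t_standard_deviation(n), 4))
```

For a sample of size 10 this gives about 1.134, that is, roughly 13% more spread than the standard normal, which is close to the figure quoted above. For large samples the value approaches 1.<br />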
So, in our previous example, we had a sample size of 30, so $df = 30 - 1 = 29$.<br />
In a probability calculator, we would enter 29 for the degrees of freedom:<br />
This will work much like the standard normal distribution. It, too, is measured in<br />
standard deviations. That is, the mean is 0 standard deviations away from the mean. We want to<br />
know the number of standard deviations to the left and to the right of the mean we need to travel<br />
in order to “trap” 95% of the distribution.<br />
We use the calculator:<br />
Thus, we would expect 95% of sample means to be within 2.045 standard deviations of the<br />
mean. In other words:<br />
\[ \bar{x} - 2.045\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 2.045\,\frac{s}{\sqrt{n}} \]<br />
With $n = 30$, the lower limit is:<br />
\[ \bar{x} - 2.045\,\frac{s}{\sqrt{30}} \approx 23.4 \]<br />
And the upper limit is:<br />
\[ \bar{x} + 2.045\,\frac{s}{\sqrt{30}} \approx 27.8 \]<br />
Thus, we are 95% confident that the true average age in this town is between 23.4 and 27.8.<br />
Notice that this is not very different from our simulated confidence interval of 23.6 to 27.5.<br />
So, which is more precise? This is arguable, but it is difficult to argue with empirical data.<br />
Personally, I prefer the bootstrap confidence interval we ran earlier. My reasoning is that a<br />
distribution of means is asymptotically normal, meaning that, under infinitely many sampled<br />
units, the distribution would be exactly normal. This is very theoretical and not always valid.<br />
For now, we will compare both.<br />
For the 99% confidence interval, we would now simply adjust the number of standard deviations to 2.756:<br />
\[ \bar{x} \pm 2.756\,\frac{s}{\sqrt{30}} \]<br />
Lower limit:<br />
\[ \bar{x} - 2.756\,\frac{s}{\sqrt{30}} \approx 22.6 \]<br />
Upper limit:<br />
\[ \bar{x} + 2.756\,\frac{s}{\sqrt{30}} \approx 28.5 \]<br />
Similarly, we are 99% confident that the population mean age is between 22.6 and 28.5.<br />
Compare this to our empirical result above of 22.8 to 28.2. We are, again, very close.<br />
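The theoretical interval is simple enough to compute by hand, but a short Python sketch makes the pieces explicit. The function name is made up, and the inputs are illustrative values rather than the age data from this example: a sample mean of 40, a sample standard deviation of 5, and $n = 100$, for which the 99% multiplier (99 degrees of freedom) is about 2.626. The multiplier itself still has to come from a $t$ probability calculator or table.<br />

```python
import math


def t_confidence_interval(sample_mean, sample_sd, n, t_multiplier):
    """Interval: sample mean plus or minus t_multiplier standard errors."""
    standard_error = sample_sd / math.sqrt(n)  # standard deviation of x-bar
    margin = t_multiplier * standard_error
    return sample_mean - margin, sample_mean + margin


# Illustrative inputs (not the age sample used in this section).
lower, upper = t_confidence_interval(40, 5, 100, 2.626)
print(round(lower, 3), round(upper, 3))  # 38.687 41.313
```

Swapping in a larger multiplier (for more confidence) or a smaller $n$ widens the interval; a larger sample narrows it.<br />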
Homework Problems – 6.2<br />
1. Describe, in your own words, what a bootstrap distribution is and why we would want to<br />
use one. Be sure to mention the logical process behind building one, as well as the<br />
assumptions we are making when we do so.<br />
2. What is a confidence interval? Explain in your own words.<br />
3. The following is a random sample of 10 labor costs associated with farming for civilian<br />
consumers (in billions of dollars) since 1970.<br />
Labor Costs (bill. $)<br />
229.9 303.7<br />
137.9 58.3<br />
81.5 196.6<br />
36.6 168.4<br />
122.9 347.4<br />
(SOURCE: Data randomly sampled from U.S. Statistical Abstract, Table 847)<br />
a. Does the Central Limit Theorem apply for this data? Why or why not?<br />
b. Using a bootstrap distribution, calculate a 95% confidence interval for $\mu$, the true<br />
population average labor cost.<br />
c. In a complete sentence, interpret the real-world meaning of this interval.<br />
d. Using the bootstrap distribution and percentiles, how likely is it that a sample of<br />
labor costs has a mean greater than $190,000,000,000?<br />
4. In Arizona, primarily in the Phoenix metropolitan area, the issue of red-light cameras used<br />
to catch red-light runners and speeders was a prominent one for much of the early 2000s.<br />
Many studies were carried out over this period of debate to determine whether or not they<br />
were effective, and whether or not they used taxpayer money appropriately. Suppose the<br />
following data was collected on the daily revenue generated by randomly sampled red-light cameras<br />
across the valley. The goal is to have, on average, each camera generate $750 and no less<br />
than $640 per day.<br />
883 522 590 779 887 615 690 771 843 509<br />
872 840 536 892 880 588 547 770 687 842<br />
832 840 676 555 884 617 517 586 505 552<br />
a. Can the state be 95% confident that the desired average is possible?<br />
b. Generate a 99% confidence interval for $\mu$, the population average daily revenue<br />
per camera. Explain in a complete sentence what this means.<br />
c. Is the CLT valid in this problem? Explain.<br />
d. Using the assumption that $\bar{x}$ is normally distributed, calculate a<br />
theoretical 95% confidence interval for $\mu$ (you will need to use $s/\sqrt{n}$ to estimate the<br />
standard deviation of the $\bar{x}$'s and $\bar{x}$ to estimate $\mu$).<br />
e. In reality, anytime we estimate parameters, like you did above in part d), we<br />
actually shouldn't assume a normal distribution. Instead, we should assume what<br />
is known as a $t$-distribution, which is symmetrical, though it has more variability to<br />
account for the uncertainty in our estimates.<br />
Watch this brief informative video:<br />
http://www.youtube.com/watch?v=yV-0ReCXW64<br />
Pull up the following applet: http://www.stat.tamu.edu/~west/applets/tdemo.html.<br />
You can type in the percentile corresponding to the means you want to consider.<br />
$df$ stands for “degrees of freedom” and can be calculated by taking the sample size<br />
minus 1 ($n - 1$). (From the video, we know that, if the sample size is really, really<br />
big, then the normal distribution and the $t$-distribution<br />
become indistinguishable.) The output of this applet will give you the number of<br />
standard deviations your endpoints will be on either side of the mean.<br />
For example, you will find that a 99% confidence interval for a sample of size 100<br />
has endpoints that are 2.626 standard deviations from the mean (left and right).<br />
Let's say your sample mean is $\bar{x} = 40$ and your standard deviation is $s = 5$. Then, the<br />
confidence interval will be an interval around the sample mean. That is, one<br />
standard deviation is $s/\sqrt{n} = 5/\sqrt{100} = 0.5$<br />
(remember, the standard deviation of means<br />
requires that we take the standard deviation among individual $x$'s and divide by<br />
the square root of the sample size). So, 2.626 standard deviations would be<br />
2.626(0.5) = 1.313 units away from the mean. The endpoints would be 40 - 1.313<br />
and 40 + 1.313, or 38.687 to 41.313.<br />
Formulaically, we found:<br />
\[ \hat{\mu} = \bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} \]<br />
where $t_{\alpha/2}$ is the number of standard deviations to the endpoints for a confidence<br />
interval with $\alpha$ total area in the tails, i.e., $\alpha = 1 - \text{confidence level}$.<br />
Using this “crash course” in theoretical confidence interval-finding, compute the<br />
95% confidence interval using these ideas. Do you get a similar result? How close?<br />
6.3 Confidence Interval for $\hat{p}$<br />
6.3.1 Confidence Interval for $\hat{p}$ Using Sampling Distributions<br />
Suppose that it is of interest to estimate the proportion of recent customers that say they would<br />
come back and shop at your store. You take a sample and determine that, of 30 people, 20 said<br />
they would and 10 said they wouldn't. You would like to make an inference about the population<br />
of all of your customers. In your sample, you know that:<br />
\[ \hat{p} = \frac{20}{30} \approx 0.667 \]<br />
is the proportion of your sampled customers that will come back and purchase from you again. You are<br />
looking to find a confidence interval for $\hat{p}$. How do we do that with the simulator if we have no<br />
data?<br />
In reality, we do. We just have to make it numerical. In reality, 20/30 is an average. It is the<br />
average of 30 responses. If we let:<br />
\[ x = \begin{cases} 1 & \text{if the customer says they would come back} \\ 0 & \text{if not} \end{cases} \]<br />
then we have a set of twenty 1's and ten 0's. We enter these into our simulator.<br />
We run the bootstrap sample on these 1's and 0's 1,000 times. We will get a variety of sample<br />
proportions:<br />
[Figure: histogram titled “Sampling Distribution of p-hat”; horizontal axis: sample-proportion bins of width 0.05 from about 0.43 to 0.98; vertical axis: frequency, 0 to 350]<br />
We see that this distribution is approximately normal. No surprise there!<br />
We calculate the 2.5th and 97.5th percentiles to get the middle 95% of sample proportions<br />
generated in the bootstrap sample:<br />
Percentile (as %)   Result<br />
97.5   0.833<br />
2.5   0.500<br />
Thus, we are 95% confident that the proportion of the population of customers that will shop at<br />
your store again will be between 0.50 and 0.83. This is quite a wide interval! At least you know what to<br />
expect with 95% confidence!<br />
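The same experiment can be sketched in Python by coding the 30 responses as 1's and 0's. This is an illustration rather than the simulator used above: the seed is arbitrary, and `statistics.quantiles` with `n=40` is used simply because its first and last cut points are the 2.5th and 97.5th percentiles. Since resampling is random, the endpoints will only be close to, not necessarily equal to, the values reported above.<br />

```python
import random
import statistics

# Twenty customers who would come back (1) and ten who would not (0).
responses = [1] * 20 + [0] * 10
n = len(responses)

rng = random.Random(2024)  # arbitrary seed so the run is repeatable
boot_props = []
for _ in range(1000):
    resample = rng.choices(responses, k=n)  # sample with replacement
    boot_props.append(sum(resample) / n)    # proportion of 1's in the resample

# 39 cut points at 2.5%, 5%, ..., 97.5%; keep the first and the last.
cuts = statistics.quantiles(boot_props, n=40, method="inclusive")
lower, upper = cuts[0], cuts[-1]
print(round(lower, 3), round(upper, 3))
```

Each resampled proportion is a multiple of 1/30, so the endpoints move in steps of about 0.033 from run to run.<br />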
DULY CAUTIONED: The assumptions here are the same as for bootstrapping with $\bar{x}$: a<br />
random sample is drawn from the population and is representative of the population. If not, the<br />
sample is worthless, in any case.<br />
6.3.2 Confidence Interval for $\hat{p}$ Using Theoretical Results<br />
Without providing the intuition for this method, we will simply state the results for the CLT<br />
pertaining to the sampling distribution of $\hat{p}$:<br />
Central Limit Theorem for $\hat{p}$<br />
The sampling distribution of $\hat{p}$ (which is really just an average of 0's and 1's) is approximately<br />
normal just as long as $n\hat{p}$ and $n(1-\hat{p})$ are both large enough (a similar idea to the sample-size requirement of the standard CLT),<br />
with<br />
\[ E[\hat{p}] = \pi \qquad \text{and} \qquad SD[\hat{p}] = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]<br />
where $\pi$ denotes the true population proportion.<br />
NOTE: the margin of error reported in polls is a multiple of this standard deviation (about 1.96 standard deviations for 95% confidence).<br />
The results above state that<br />
1. The average proportion of the sampling distribution is the true population proportion.<br />
2. The standard deviation of proportions of the sampling distribution is the above, complex,<br />
calculation.<br />
AS LONG AS $n\hat{p} = 30(20/30) = 20$ and $n(1-\hat{p}) = 30(10/30) = 10$ are both large enough, which<br />
they are here. We can now proceed:<br />
Here, we get to use the standard normal distribution to calculate the number of standard<br />
deviations corresponding to the desired interval. So, we know that:<br />
\[ \hat{p} = \frac{20}{30} \approx 0.667 \]<br />
\[ SD[\hat{p}] = \sqrt{\frac{(20/30)(10/30)}{30}} \approx 0.086 \]<br />
The number of standard deviations corresponding to the middle 95% of a standard normal<br />
distribution is found with a probability calculator.<br />
The endpoints are approximately 1.96 standard deviations away from the mean. So, our<br />
confidence interval would be:<br />
\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]<br />
In our case:<br />
Lower limit: $0.667 - 1.96(0.086) \approx 0.498$<br />
Upper limit: $0.667 + 1.96(0.086) \approx 0.835$<br />
These limits are nearly identical to the simulation values!<br />
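For comparison with the simulation, the theoretical calculation can be sketched in Python. The function name is made up; `NormalDist` from the standard library supplies the number of standard deviations (about 1.96 for 95%), so no probability calculator is needed for the normal case.<br />

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation interval: p-hat plus or minus z standard deviations."""
    p_hat = successes / n
    standard_deviation = math.sqrt(p_hat * (1 - p_hat) / n)
    # z leaves (1 - confidence) / 2 in each tail of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_deviation
    return p_hat - margin, p_hat + margin


lower, upper = proportion_interval(20, 30)
print(round(lower, 3), round(upper, 3))  # 0.498 0.835
```

Passing `confidence=0.99` instead widens the interval, just as it did for the bootstrap percentiles.<br />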
Homework Problems – 6.3<br />
1. In a sample of 55 students from Arizona State University taking a political science class,<br />
30 say they would be interested in taking another political science class. The university is<br />
interested in determining the proportion of all its students that are interested in taking<br />
another political science class.<br />
a. What is the population of interest in this study?<br />
b. Construct a 90% bootstrap confidence interval for $\pi$, the true proportion.<br />
c. Interpret the real-world meaning of your confidence interval.<br />
2. A software company takes a random sample of recent orders and finds that, of the 250<br />
sampled, 42 resulted in the return of a piece of purchased software.<br />
a. What is the population of interest in this study?<br />
b. Construct a 99% bootstrap confidence interval for $\pi$, the true proportion.<br />
c. Interpret the real-world meaning of your confidence interval.<br />
3. A batch of apples was inspected prior to shipment for any defects. Each apple was<br />
marked as either pass (P), re-inspect (R) or fail (F). The following results were reported.<br />
F P P P P P P P R R<br />
P P R P R R P R P P<br />
P R P R P F R R P P<br />
P P P P P P P R P P<br />
P P P F P R P P P R<br />
a. What is the population of interest in this study?<br />
b. Construct a 95% bootstrap confidence interval for $\pi$, the true proportion of<br />
passing apples.<br />
c. Interpret the real-world meaning of your confidence interval.<br />
d. Using the CLT for $\hat{p}$'s, construct a 95% confidence interval (see blue box in this<br />
section). How does it compare to the bootstrap confidence interval?<br />
Chapter 7<br />
Hypothesis Test<strong>in</strong>g<br />
We are often faced with uncertainty. Specifically, we often want to know whether one product is<br />
better than the other, whether one group outperforms another in some type of task, or how one<br />
manufacturing process compares to another, among many other things. How can we ever know?<br />
The first step would be to conduct a study and collect data. The data must then be compared.<br />
But, how do we do so if there exists variability from one sample to the next? This chapter will<br />
address this question.<br />
7.1 The Concept Behind Hypothesis Testing<br />
So, you have a research question… what now? The answer might at first seem obvious: let's<br />
run a study. The question, however, needs some special treatment before anything else happens,<br />
especially if the study comes at a significant cost.<br />
For instance, suppose we're interested in determining whether pesticides damage the soil in<br />
which we grow the majority of our food. This is a loaded curiosity. We first need to fully define<br />
how it is that we would conduct such a study. For instance, will we be comparing two regions, one<br />
that has been sprayed with pesticides and one that hasn't been sprayed? What is it, exactly, that<br />
we will measure in order to determine the level of soil damage?<br />
First and foremost, we need to formulate a hypothesis, or a belief about what it is that we expect<br />
to see. For example,<br />
Our hypothesis is that pesticides inflict serious damage on sprayed soils.<br />
Great, so we know what we believe. Did we just state what we wanted to happen? Probably not.<br />
We'll usually formulate a hypothesis based on some existing observations. Perhaps we're seeing<br />
that plants aren't producing as many edibles as previously thought. Or, maybe we're finding<br />
rising levels of cancers. (By the way, all of the above are becoming prominent public concerns in<br />
the U.S. and beyond.) So, based on these observations, we're forming an educated belief on the<br />
effect of pesticides.<br />
The next critical question:<br />
How will we measure “soil damage”?<br />
This can be a controversial question and may lack a consensus answer. Will it be measured<br />
by the quantities of beneficial microbes present in the soil? By the soil's pH level? By the<br />
amount of nitrogen it contains?<br />
However we choose to measure “soil damage,” we want to be sure that we are being accurate.<br />
That is, we need to be sure that we are actually measuring what we say we're measuring. This<br />
sounds infantile, but it happens all the time that researchers say they're measuring something that<br />
they're not actually measuring.<br />
So, suppose we do some research and conclude that we will test for soil damage by determining the<br />
weight of vegetables harvested from these plants and comparing the average weight per plant for<br />
the experimental group (some determined quantity of pesticides sprayed) with that of healthy plants. We find that healthy<br />
plants produce about 30 lbs. of some vegetable across their seasonal life span. Will the average<br />
plant yield for plants sprayed with pesticides be lower?<br />
Since this is a mathematical question, we would want to formulate our hypothesis into<br />
mathematical statements.<br />
Since we are dealing with an average in this scenario, the statistical symbol often used to<br />
represent the average plant yield for the entire population of this particular vegetable is the<br />
Greek letter Mu, $\mu$.<br />
Now, our experimental hypothesis is that pesticides damage the soil, measured by the pounds of<br />
vegetables yielded from these plants. If that is the case, we would expect to see a yield of less<br />
than 30 lbs. of vegetables per plant. That is, our hypothesis is that<br />
\[ \mu < 30 \]<br />
Since this is the experimental hypothesis, we do not yet have evidence to conclude that this is true. Thus,<br />
we should probably assume that there is no difference between the yields of pesticide-sprayed<br />
and non-sprayed plants. Thus, begin by assuming that:<br />
\[ \mu = 30 \]<br />
This second hypothesis is called the null hypothesis, that is, the hypothesis that is assumed until<br />
there is sufficient evidence otherwise. Symbolically, this hypothesis is written $H_0$ and is typically<br />
read as “null hypothesis,” or “h-naught.”<br />
The hypothesis that we believe is called the alternative hypothesis, and is written<br />
$H_a$, or “h-ay.”<br />
To write these two hypotheses, we would write:<br />
\[ H_0: \mu = 30 \qquad H_a: \mu < 30 \]<br />
When evidence is insufficient, we say<br />
“Based on sample data, we fail to reject $H_0$ in favor of $H_a$.”<br />
When evidence is sufficient to conclude that the average is really below 30, we say<br />
“Based on sample evidence, we reject $H_0$ in favor of $H_a$.”<br />
We are cautious to make these conclusions based on sample data. Certainly, we may have<br />
obtained an oddball sample that doesn't represent the population.<br />
Let's practice writing some hypotheses. First off, let's make note of the variety of population<br />
characteristics, called population parameters, that we can seek to describe in a study.<br />
Population Parameters<br />
In a study, we seek to gain information about the target population. There are a number of things<br />
we can test about the population parameters, the actual values that describe the population. Two common ones are:<br />
1) Population average, denoted by Greek Mu (“mew”), $\mu$<br />
2) Population percentage, denoted by Greek Pi (“pie”), $\pi$<br />
Unfortunately, we do not know the true values for $\mu$ and $\pi$ and realistically cannot, unless we<br />
sample the entire population. We can only estimate them based on the sample we collect. The<br />
values we collect from the sample are sample statistics and are estimators for the respective<br />
population parameters. These estimators for the values above, respectively, are notated:<br />
1) $\hat{\mu}$ (“mew-hat”)<br />
2) $\hat{\pi}$ (“pie-hat”)<br />
Example 1: Because of variation in the manufacturing process, tennis balls produced by a<br />
particular machine do not have identical diameters. Let $\mu$ denote the true average diameter<br />
for tennis balls currently being produced. Suppose that the machine was initially calibrated to<br />
achieve the design specification $\mu = 3$ in. However, the manufacturer is now concerned that<br />
the diameters no longer conform to this specification. If sample evidence suggests that the<br />
true average diameter for tennis balls is not 3 inches, the production process will have to be<br />
halted while the machine is recalibrated. Because stopping the production is costly, the<br />
manufacturer wants to be quite sure that the true average diameter is not 3 inches before<br />
undertaking recalibration. What are the competing hypotheses?<br />
SOLUTION:<br />
Under the original assumption, $\mu = 3$. The researcher wants to test whether $\mu \neq 3$. So:<br />
\[ H_0: \mu = 3 \qquad H_a: \mu \neq 3 \]<br />
Example 2: A long-used chemical in a particular carpet-cleaning product has been known to<br />
successfully remove dark stains 70% of the time. After extensive research, the product's<br />
formula is modified. The head of production must decide whether or not to sell the new<br />
product. Write null and alternative hypotheses for conducting an experiment that might help<br />
him decide.<br />
SOLUTION:<br />
Under original specifications, the proportion of time the product works is $\pi = 0.70$. He is<br />
concerned that $\pi < 0.70$. If it is truly less effective, then he will not sell the new product. That is,<br />
$H_0: \pi = 0.70$<br />
$H_a: \pi < 0.70$<br />
Example 3: Many older homes have electrical systems that use fuses rather than circuit<br />
breakers. A manufacturer of 40-amp fuses wants to make sure that the mean amperage at<br />
which its fuses burn out is in fact 40. If the mean amperage is lower than 40, customers will<br />
complain because the fuses require replacement too often. If the mean amperage is higher<br />
than 40, the manufacturer might be liable for damage to an electrical system as a result of<br />
fuse malfunction. To verify the mean amperage of the fuses, a random sample of fuses is<br />
selected and tested. If a hypothesis test is performed using the resulting data, what null and<br />
alternative hypotheses would be of interest to the manufacturer?<br />
SOLUTION:<br />
The fuse is designed and assumed to be 40 amps. That is, on average, $\mu = 40$. He wants to make<br />
sure it is not the case that $\mu \neq 40$. So,<br />
$H_0: \mu = 40$<br />
$H_a: \mu \neq 40$<br />
So Your Average IS Different!<br />
In our pesticide experiment, our target population is all plants of this particular variety. Thus, we<br />
will take a random sample of plants from the pesticide group. Once we have that, we will find<br />
the sample mean, which is called a sample statistic. That is, we can’t possibly keep track of all<br />
the plants in the population, so we will use the mean of the sample to help us describe the entire<br />
population. Usually, this sample statistic is written as $\hat{\mu}$ (“mew-hat”). Suppose that you find,<br />
from the pesticide group, that<br />
̂<br />
The claim has been proven, right? Maybe, maybe not.<br />
We must remember that this is just one random sample from all plants. Certainly, this sample<br />
average is lower, but can it not just be due to random variation that we’re seeing a difference?<br />
After all, not all no-pesticide plants will produce exactly 30 lbs. of the vegetable.<br />
What if we collect a sample and<br />
̂<br />
Without some sort of analysis, we might be tempted to say this is sufficiently lower. However,<br />
we need some formal way to determine:<br />
When is “low” low enough? Or, more generally:<br />
The Big Question<br />
When making conclusions about the population based on sample data, we must first ask the<br />
question,<br />
When do we conclude that an “extreme” is extreme enough to reject $H_0$?<br />
As you might guess, there is probability involved.<br />
That is, if the probability of observing what we have just seen, or something more extreme, is small<br />
“enough” when $H_0$ is assumed true, then we will reject $H_0$ and conclude that $H_a$ might be a more valid conclusion.<br />
Punchline: We shouldn’t reject the null hypothesis unless the probability of seeing something as<br />
or more extreme is very small.<br />
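To make “as or more extreme” concrete, here is a small sketch using a coin-flip setting that is not one of this chapter’s examples:<br />
suppose the null hypothesis says a coin is fair and we observe only 2 heads in 10 flips. The function name `lower_tail_probability` is hypothetical, and the calculation assumes the flips are independent.<br />

```python
from math import comb

def lower_tail_probability(observed, n, p):
    """P(X <= observed) when X counts successes in n independent trials,
    each with success probability p (the value assumed under H0)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed + 1))

# Probability of seeing 2 or fewer heads in 10 flips of a fair coin.
prob = lower_tail_probability(observed=2, n=10, p=0.5)
print(f"P(2 or fewer heads in 10 flips) = {prob:.4f}")
```

The result is about 0.055. Such a sample would be unusual for a fair coin, but whether it is unusual “enough” depends on a threshold chosen in advance.<br />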
What Happens If I Reject<br />
When the Data Provides Insufficient Evidence<br />
Imagine a medical test to determine whether or not you have some disease. Let’s call this<br />
disease Disease X.<br />
As for having the condition, there are two possibilities: you have it or you don’t.<br />
As for the test, it will either say that you have it or that you don’t.<br />
Now, realistically, we know that there is no way to be omniscient and really know whether or not<br />
you have the condition. However, let’s imagine that we are all-knowing and can judge the<br />
validity of the test. There are four possibilities:<br />
1) The test is positive, and you do have X (accurate)<br />
2) The test is positive, and you don’t have X (inaccurate)<br />
3) The test is negative, and you do have X (inaccurate)<br />
4) The test is negative, and you don’t have X (accurate)<br />
It is evident that possibilities 2) and 3) represent scenarios where there is an inaccurate result.<br />
That is, it would be invalid for the test to tell you that you have the condition when, in fact, you<br />
don’t. It would also be invalid for the test to tell you that you don’t have the condition when, in<br />
fact, you do.<br />
In contrast, we do want the test to tell us positive when we do have the condition and negative<br />
when we don’t.<br />
Medical researchers usually give these four instances names, as summarized in the following<br />
table:<br />
Test Says | Truth: Have | Truth: Don’t Have<br />
Positive | True Positive | False Positive (Type II Error)<br />
Negative | False Negative (Type I Error) | True Negative<br />
As can be seen, the green cells represent accurate results (true results) and the red cells represent<br />
inaccurate results (false results).<br />
As a patient, you would probably be quite upset (devastated, even) if you received false results<br />
for a terrible condition, such as X!<br />
In a hypothesis test, we are up against the same dilemma: our test result can be either positive or<br />
negative. The truth may or may not be accurately represented. Let’s modify our table slightly to<br />
represent the hypothesis test scenario:<br />
Hypothesis Test Conclusion | Truth: $H_0$ True | Truth: $H_0$ False<br />
Don’t Reject $H_0$ | True Positive | False Positive (Type II Error)<br />
Reject $H_0$ | False Negative (Type I Error) | True Negative<br />
In reality, we shouldn’t reject $H_0$ (make it appear false) when it is true. If we do, we have a false<br />
negative on our hands. Similarly, we shouldn’t fail to reject $H_0$ (make it appear true) when it is<br />
false. These are labeled Type I and Type II errors, respectively.<br />
How Do We Avoid Erroneous Conclusions?<br />
Unfortunately, we are not omniscient. Thus, we can never be sure that our conclusions are<br />
accurate. If we knew, there would be no testing necessary!<br />
On the flipside, we can decide how large an error rate we are willing to accept. Earlier, we mentioned<br />
that we will reject $H_0$ when the probability of observing something as or more extreme than what<br />
we have observed is “small.” This value of small fully determines our probability of a Type I<br />
error. As researchers, it is our duty to set this value. This probability of a Type I error is called<br />
the criterion, or alpha-level, and is denoted with the Greek letter alpha, $\alpha$.<br />
Criterion/Alpha-Level<br />
Our chosen risk of a Type I error is called the criterion or alpha-level, and is denoted by the Greek letter $\alpha$.<br />
Typical values for $\alpha$ are: $\alpha = 0.01$, $\alpha = 0.05$, and $\alpha = 0.10$<br />
That is, rarely will we choose a very small or considerably large alpha-level.<br />
Suppose that we reject $H_0$ when the probability of observing something as or more extreme than<br />
what we have observed is 5% (or smaller). We have that $\alpha = 0.05$.<br />
This means that, even when $H_0$ is true, there is still a 5% chance that we observe a value (sample mean,<br />
sample proportion, etc.) extreme enough to lead us to reject it. That is, when $H_0$ is true, there is a 5% chance<br />
that we will falsely reject the null hypothesis. Probabilistically,<br />
$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$<br />
$P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha = 0.05$<br />
To visualize this, consider the table below. Recall that a conditional probability statement<br />
limits us to the event after the “pipe,” |, and then asks the question, “What percentage of the time<br />
can we expect the event to occur, out of the times the specified condition occurs?” The modified<br />
table below shows that.<br />
Hypothesis Test Conclusion | Truth: $H_0$ True<br />
Don’t Reject $H_0$ | True Positive, 95%<br />
Reject $H_0$ | False Negative (Type I Error), 5%<br />
Total | 100%<br />
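The 5% entry in this column can be checked by simulation. The sketch below repeatedly draws samples from a population in which $H_0$ really is true and counts how often a test with $\alpha = 0.05$ rejects anyway.<br />
It assumes a two-sided z-test with a known standard deviation, which this section has not introduced; the null mean of 30, the standard deviation of 5, the sample size, and the function name are all illustrative assumptions.<br />

```python
import random
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, n=30, trials=20_000, seed=1):
    """Estimate how often a two-sided z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)
    mu_0, sigma = 30.0, 5.0                        # hypothetical null mean and known std. dev.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # Every sample is drawn with mean mu_0, so H0 is true in every trial.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu_0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1                        # a false rejection: Type I error
    return rejections / trials

rate = type_one_error_rate()
print(f"Share of true-H0 samples that were rejected: {rate:.3f}")
```

The share of false rejections should land close to the chosen $\alpha$, with the remainder of the column (about 95%) being correct decisions not to reject.<br />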
At this point we might wonder: why shouldn’t we set $\alpha$ extremely small so that we minimize the<br />
Type I error risk?<br />
Good question. Imagine that your alpha is 0.0001. This means you will only reject $H_0$ 0.01% of the<br />
time (1 out of 10,000 times) when it is true. Certainly, your risk of a Type I error is<br />
extremely small.<br />
However, your decision criterion, the numerical figure that we later calculate to decide whether or<br />
not to reject, will be extremely stringent and difficult to meet. If this is the case, then you will<br />
almost never reject the null hypothesis!<br />
Okay, so if you very rarely reject the null hypothesis, then you are also potentially committing<br />
another error: not rejecting the null hypothesis even though it is false. That is, you<br />
increase the likelihood of a Type II error. Recall that,<br />
$P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$<br />
We can see here that rarely rejecting $H_0$ results in potentially failing to reject it even when it<br />
should be rejected! Unfortunately, there is no free lunch in hypothesis testing.<br />
Hypothesis Test Conclusion | Truth: $H_a$ True (so $H_0$ is False)<br />
Don’t Reject $H_0$ | False Negative (Type II Error)<br />
Reject $H_0$ | True Positive<br />
Though we cannot yet easily provide numerical support for this claim (which certainly makes<br />
sense), we will make the following preliminary conclusion:<br />
Type II Error - $\beta$<br />
The probability of a Type II error, denoted $\beta$, is inversely related to $\alpha$, the probability of a<br />
Type I error. That is, all else being equal, decreasing $\alpha$ will increase $\beta$.<br />
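A hedged numerical illustration of this trade-off: the sketch below assumes a left-tailed test on a count of successes out of $n = 60$, a null value of $\pi = 0.30$, and a hypothetical true value of $\pi = 0.20$.<br />
For each $\alpha$ it finds the largest cutoff whose Type I error risk does not exceed $\alpha$ and then computes $\beta$ exactly from the binomial distribution. The helper names and the chosen numbers are assumptions for illustration only.<br />

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def left_tail_cutoff(n, p0, alpha):
    """Largest c with P(X <= c | p0) <= alpha, or -1 if no such c exists."""
    c = -1
    while c + 1 <= n and binom_cdf(c + 1, n, p0) <= alpha:
        c += 1
    return c

def type_two_error(n, p0, p_true, alpha):
    """beta = P(fail to reject H0 | the true proportion is p_true)."""
    c = left_tail_cutoff(n, p0, alpha)
    return 1 - binom_cdf(c, n, p_true)   # we fail to reject whenever X > c

alphas = [0.10, 0.05, 0.01]
betas = [type_two_error(60, 0.30, 0.20, a) for a in alphas]
for a, b in zip(alphas, betas):
    print(f"alpha = {a:.2f}  ->  beta = {b:.3f}")
```

As $\alpha$ shrinks, the cutoff moves further into the tail, so more samples from the hypothetical true population fail to reach it and $\beta$ grows.<br />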
Important Caution<br />
Students are often confused that the probability of rejecting $H_0$ when $H_0$ is true and the<br />
probability of failing to reject $H_0$ when $H_0$ is true sum to 1. After all, these two possibilities are<br />
only two of the four possible results in a test decision.<br />
However, keep in mind that these are the percentages of time we reject and fail to reject out of all<br />
the times that $H_0$ is true! This is out of only one column total, not the entire sample space.<br />
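The important caution brings up the following idea:<br />
If the probability of a Type I error is $\alpha$, then,<br />
$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha$<br />
$P(\text{fail to reject } H_0 \mid H_0 \text{ is true}) = 1 - P(\text{reject } H_0 \mid H_0 \text{ is true}) = 1 - \alpha$<br />
Similarly,<br />
If the probability of a Type II error is $\beta$, then,<br />
$P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) = \beta$<br />
$P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$<br />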
The probability that we reject the null hypothesis when it is false is referred to as the power of<br />
the test. We summarize these in the table below:<br />
Hypothesis Test Conclusion | Truth: $H_0$ True | Truth: $H_0$ False<br />
Don’t Reject $H_0$ | $1 - \alpha$ | $\beta$ (Type II Error)<br />
Reject $H_0$ | $\alpha$ (Type I Error) | $1 - \beta$ (Power)<br />
Example 4: The college dropout rate for a particular county is known to be 30%. The<br />
educational board of a city within the county believes its dropout rate is significantly lower.<br />
The board follows 60 students and, of them, 15 drop out. The board wants to run a statistical<br />
hypothesis test with $\alpha = 0.05$ to determine whether their belief is true. Describe the<br />
hypothesis test by:<br />
a. Writing competing hypotheses<br />
b. A decision rule for rejecting $H_0$<br />
c. A decision criterion rule<br />
d. A generic conclusion statement<br />
SOLUTION:<br />
a.) Under the null hypothesis, $\pi = 0.30$. We want to test to see if $\pi < 0.30$. Thus:<br />
$H_0: \pi = 0.30$<br />
$H_a: \pi < 0.30$<br />
b.) We will reject $H_0$ if the probability of observing something as or more extreme than 15 out of<br />
60 dropouts ($\hat{\pi} = 0.25$) under the assumption of the null hypothesis is less than or equal to<br />
0.05. That is:<br />
$P(\hat{\pi} \le 0.25 \mid \pi = 0.30) \le 0.05$<br />
c.) We will reject $H_0$ if the observed value of $\hat{\pi}$ is smaller than some cutoff value of $\hat{\pi}$. That<br />
is, it might be the case that the number of dropouts would have to be smaller than, say, 13 in order for us to<br />
reject the null hypothesis.<br />
d.) Based on sample evidence, we (choose from below)<br />
a. Reject $H_0$ in favor of $H_a$.<br />
b. Fail to reject $H_0$. We do not accept $H_0$ as true, but we don’t have evidence to<br />
conclude otherwise.<br />
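One way to carry out parts b and c numerically is sketched below. It treats the number of dropouts as a binomial count and computes the tail probability exactly.<br />
This is an assumption about method, since the section does not say which calculation the book uses later; the function and variable names are hypothetical.<br />

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, observed, pi_0, alpha = 60, 15, 0.30, 0.05

# Part b: probability of 15 or fewer dropouts if the true rate is 30%.
p_value = binom_cdf(observed, n, pi_0)
reject = p_value <= alpha

# Part c: largest dropout count that would still lead to rejection.
cutoff = -1
while binom_cdf(cutoff + 1, n, pi_0) <= alpha:
    cutoff += 1

print(f"P(X <= {observed} | pi = {pi_0}) = {p_value:.4f}")
print(f"Reject H0? {reject}")
print(f"Reject only if the number of dropouts is at most {cutoff}")
```

Under these assumptions the tail probability comes out well above 0.05, so 15 dropouts out of 60 would not be extreme enough to reject $H_0$.<br />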
As we see from the above example, our hypothesis test needs to have a structured layout. We<br />
need to know ahead of time what we’ll do.<br />
It is tempting, but we cannot determine our rejection criterion based on what the sample data<br />
tells us! In practice, you can follow this type of philosophy, but you increase the error rate.<br />
Consider, for example, the scenario wherein you take an exam for a biology class. You get the<br />
results back and look at what you missed. You say, “Oh, of course I should have put that! I knew<br />
that!” If you told that to the instructor, she may say, “Sorry, you didn’t demonstrate that on the<br />
exam.” We expect this response. Why? Because it is the test that helps to<br />
determine our level of understanding! It is not the other way around. If the instructor allowed<br />
you to change your answer, then the test wouldn’t really be demonstrating what you knew at the<br />
time of the test. A hypothesis test is quite analogous. We carry one out because we have a hunch.<br />
Always think back to this statement:<br />
If you dig long enough in your data, you will find something!<br />
This, however, looks upon the digging process as a negative thing since it does not justify the<br />
decision questions. In fact, it creates a high likelihood that we are observing a coincidence and<br />
not a solid finding at all! Thus, every additional look at the data sharply increases the probability of error!<br />
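To see how quickly digging inflates the error rate, consider a simplified model that the text does not state explicitly: you run $k$ separate tests, each at level $\alpha$, every null hypothesis is actually true, and the tests are independent.<br />
The chance of at least one false rejection is then $1 - (1 - \alpha)^k$. The function name below is hypothetical.<br />

```python
def chance_of_false_finding(alpha, k):
    """P(at least one Type I error) across k independent tests at level alpha,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_finding(0.05, k):.3f}")
```

With 20 such tests at $\alpha = 0.05$, the chance of at least one coincidental “finding” is already about 64%.<br />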
Structure of a Hypothesis Test<br />
The following should be included in all hypothesis tests:<br />
1. A statement of competing hypotheses ($H_0$ vs. $H_a$)<br />
2. A decision rule for rejecting $H_0$ (based on $\alpha$)<br />
3. A decision criterion rule (the physical value of the random variable that represents the<br />
required “extremeness” of our observed sample value)<br />
4. A conclusion statement (what the sample data tells you to conclude)<br />
As an important note: we never say, “accept $H_0$ as true.” Instead, we remain accurate and say<br />
that there is simply not enough evidence to reject it. Think about this as “innocent” vs. “not<br />
guilty.” Just because a court cannot prove that someone is guilty, they don’t say that he is<br />
innocent. Instead, they give the verdict of “not guilty.”<br />
Homework Problems – 7.1<br />
1. In your own words, explain the difference between the null and alternative hypotheses.<br />
Also, explain how to identify each in a research study.<br />
2. Explain why we assume that the null hypothesis is true before testing a hypothesis.<br />
3. It is believed that 7% ($\pi = 0.07$) of an organic corn crop is lost to insect infestations. An<br />
organic farmer has devised a system that may result in less insect destruction. He would<br />
like to test this idea with a hypothesis test. Write the competing hypotheses.<br />
4. A high school statistics class typically gets an average of  scores out of 5 on an<br />
Advanced Placement (AP) exam. Over the recent several years, the teacher has found that his<br />
students’ scores were higher. He would like to test this hypothesis. Write the competing<br />
hypotheses.<br />
5. A snack dispenser has a failure rate of  over a 5-year span. After changes to the<br />
machine, the manufacturer would like to know whether or not this has changed. Write<br />
competing hypotheses.<br />
6. What does it mean to say that  when describing a Type I error?<br />
7. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test<br />
for the scenario in question 3, assuming  and that he finds that only 52 out of<br />
1000 bushels of his crop are lost to insect infestations.<br />
8. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test<br />
for the scenario in question 4, assuming  and that he finds his students have<br />
been averaging $\bar{x}$ on the test.<br />
9. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test<br />
for the scenario in question 5, assuming  and that she finds the failure rate is 16<br />
out of 1000 machines.<br />
10. In real-world terms, describe what Type I and II errors would mean for each of questions<br />
3, 4, and 5.<br />
11. Why does the risk of a Type II error increase as we decrease $\alpha$?<br />
APPENDIX A<br />
Answers to Select Problems<br />
1.1 Data and Their Uses<br />
1.<br />
a. Nominal; ice cream names cannot be ordered, in general.<br />
b. Interval; temperatures have order and the differences in temperature can be<br />
reasonably discussed. For example, to talk about a difference is meaningful.<br />
c. Ratio; absolute 0 exists since there can be no balance at all. Additionally, it<br />
makes sense to talk about ratios. For instance, accounts receivable balances can<br />
be, say, 20% higher this month as compared to last.<br />
d. Ordinal; there is an ordering, though we can’t talk about the number 1 candidate<br />
as being 2 better than the number 3 candidate. This is because the difference of 1<br />
might not necessarily be the same from 1 to 2 as it would be from 2 to 3. Maybe<br />
candidate 3 is a far third.<br />
2.<br />
a. 2,121 elements in the sample<br />
b. Length of time is a quantitative variable, since it is a numerical measure.<br />
3.<br />
a. 15,000 elements in the sample<br />
b. A proportion is a quantitative variable, since it is a ratio.<br />
4.<br />
a. Observational; the number of animals a family has is not being assigned.<br />
Instead, families are simply being asked about how many animals they have.<br />
b. The study might have considered families with horses. People with horses likely<br />
live on the outskirts of a big city, perhaps being exposed to less pollen. Also,<br />
maybe more families have pets because their children do not seem to have<br />
allergies to them.<br />
5.<br />
a. Observational; the researchers are looking at preexisting habits. They are not<br />
attempting to alter the habits to determine what effect doing so might have on<br />
measures of reading ability and short-term memory.<br />
b. No; perhaps those who watch more television also have other habits that lead<br />
them to score poorly on such assessments.<br />
6.<br />
a. Observational; the opinions of the doctors are not being altered in any way.<br />
b. There is a nonresponse bias since not all participants responded. Thus, it might be<br />
the case that those with the strongest opinions decided to come forward, whereas<br />
the other 17,000 who didn’t respond might have influenced the poll in a different<br />
way.<br />
1.2 Descriptive VS. Inferential <strong>Statistics</strong><br />
1.<br />
a. $4 million/day<br />
b. If all days had the same gross revenue, $4 million would be earned each day.<br />
c. $7.6 million<br />
d. The gross revenue earned on one day can differ from that of another day by as much as $7.6<br />
million.<br />
e. The film has generated an average of $4 million/day. There is much instability in<br />
this average in that the actual gross revenue has varied from $1.6 million to $9.2<br />
million, a range of $7.6 million. It is dangerous to place too many bets on what<br />
might happen next, due to the extreme variability in revenues.<br />
2.<br />
a. 18 randomly selected college students<br />
b. All college students<br />
c. Answers vary; spending on clothing, style preference, etc.<br />
d. Inferential; they wish to make conclusions about the population of all college<br />
students<br />
3.<br />
a. 250 packages of cheese selected<br />
b. All packages of cheese produced by the company<br />
c. 248 or more must pass<br />
4. Consider the following two datasets with a range of 30:<br />
0, 1, 2, 2, 3, 2, 28, 29, 30<br />
0, 1, 2, 3, 4, 3, 4, 2, 1, 30<br />
While both have a range of 30, the first dataset has a cluster of values at each end<br />
of the dataset. In the second dataset, there appears to be tightly spaced data, followed by one<br />
outlier of 30. The second dataset is, overall, less spread out.<br />
5. The researchers are trying to use CGCC students as a representative sample of all<br />
college students. This presents a bias, in that CGCC probably does not accurately<br />
represent all college students.<br />
2.4 Descriptive <strong>Statistics</strong> – Variability<br />
1.<br />
a. Standard deviation = 5.9; on average, beers in this sample are within 5.9 calories<br />
of the average calorie content.<br />
b. Q3 – Q1 = 4.75. The middle 50% of beer calories in this sample have a range of<br />
4.75 calories. Specifically, they range from 29 calories (first quartile) to 33.75<br />
calories (third quartile).<br />
c. The skewness value is 0.14. This means the distribution is slightly skewed to the<br />
right.<br />
2.<br />
a. Range = 64.3; Interquartile Range = 27.7 (71.9 – 44.2); Standard Deviation =<br />
18.7. The difference between the highest and lowest percentage is 64.3%, telling<br />
us that the percentage of school enrollees varies greatly across Central Africa.<br />
However, this does not ensure that there is not a single outlier creating this wide<br />
spread. The interquartile range is 27.7%, telling us that the middle 50% of<br />
percentages span from 44.2% to 71.9%, still a considerable spread. The standard<br />
deviation verifies that percentages are quite variable, since, on average, the<br />
percentage of school enrollees varies by 18.7 percentage points about the mean.<br />
b. The interquartile range is 27.7%, telling us that the middle 50% of percentages<br />
span from 44.2% to 71.9%, still a considerable spread. The standard deviation<br />
verifies that percentages are quite variable, since, on average, the percentage of<br />
school enrollees varies by 18.7 percentage points about the mean.<br />
c.<br />
Enrollment<br />
Mean 60.9<br />
Standard Error 3.9<br />
Median 61.9<br />
Mode 61.9<br />
Standard Deviation 18.7<br />
Sample Variance 351.2<br />
Kurtosis -0.4<br />
Skewness 0.4<br />
Range 64.3<br />
M<strong>in</strong>imum 34.6<br />
Maximum 98.9<br />
Sum 1401.2<br />
Count 23.0<br />
d. Yes, it is skewed to the right, since the skewness value is 0.4, a positive value.<br />
e.<br />
[Histogram: Percent Enrolled; x-axis: Percentage; y-axis: Relative Frequency, 0% to 35%]<br />
The majority of people in Central Africa are not enrolled in school, since it is<br />
predominantly the case that fewer than 50% of people in each nation attend school.<br />
f. We know that $\bar{x} = 60.9$ and $s = 18.7$. A percentage of 79.6% is<br />
$\frac{79.6 - 60.9}{18.7} = 1$ standard deviation from the mean. We would expect that at least<br />
( )<br />
of all enrollment percentages would be within one standard deviation of the mean.<br />
This is considered to be a very normal percentage (it is still within the “average”<br />
spread).<br />
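Several answers in this section quote a guaranteed minimum share of the data within some number of standard deviations of the mean, citing C.T.<br />
Assuming C.T. refers to Chebyshev’s theorem, that minimum is $1 - 1/k^2$ for $k$ standard deviations. The sketch below evaluates it for $k = 1.3$ and $k = 2.7$, the distances that appear in the answers to problems 3 and 5; the function name is hypothetical.<br />

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set lying within k standard deviations
    of its mean, according to Chebyshev's theorem (informative for k > 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k**2)

for k in (1.3, 2.7):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```

For $k = 2.7$ the bound is about 86%, which leaves at most about 14% of the data farther out than that.<br />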
3.<br />
a. The range is 5750, which tells us that there is a difference of 5,750 feet from the<br />
shortest street to the longest street. The interquartile range is 2170, telling us that<br />
the middle 50% of all street lengths range from 980 feet to 3,150 feet. The<br />
standard deviation is 1634, telling us that, on average, a street varies by 1,634 feet<br />
from the mean street length.<br />
b. The interquartile range is 2170, telling us that the middle 50% of all street lengths<br />
range from 980 feet to 3,150 feet. The standard deviation is 1634, telling us that,<br />
on average, a street varies by 1,634 feet from the mean street length.<br />
c.<br />
Street Lengths<br />
Mean 2231.4<br />
Standard Error 238.4<br />
Median 2100.0<br />
Mode 960.0<br />
Standard Deviation 1634.1<br />
Sample Variance 2670328.9<br />
Kurtosis -0.2<br />
Skewness 0.8<br />
Range 5750.0<br />
M<strong>in</strong>imum 100.0<br />
Maximum 5850.0<br />
Sum 104874.0<br />
Count 47.0<br />
d. The distribution is strongly skewed to the right.<br />
e.<br />
[Histogram: Street Length; x-axis: Feet, in bins 100-1099, 1100-2099, 2100-3099, 3100-4099, 4100-5099, 5100-6099; y-axis: Relative Frequency, 0% to 35%]<br />
f.<br />
$z = \frac{79.6 - 2231.4}{1634.1} \approx -1.3$<br />
This means that a street length of 79.6 feet would be about 1.3 standard deviations<br />
below the mean.<br />
By C.T., at least $\left(1 - \frac{1}{1.3^2}\right) \approx 41\%$ of all street lengths in the sample are guaranteed to<br />
fall within 1.3 standard deviations of the mean. This is not unusual.<br />
4. Answers vary;<br />
Symmetric:<br />
[Histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]<br />
Bimodal (two peaks):<br />
[Histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]<br />
Right Skewed:<br />
[Histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]<br />
Left Skewed:<br />
[Histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]<br />
5.<br />
a.<br />
<strong>Statistics</strong> <strong>for</strong> <strong>Decision</strong>-<strong>Mak<strong>in</strong>g</strong> <strong>in</strong> Bus<strong>in</strong>ess © Milos Podmanik Page 226
Repair Cost<br />
Mean 971<br />
Standard Error 382<br />
Median 738<br />
Mode -<br />
Standard Deviation 1,207<br />
Sample Variance 1,455,875<br />
Kurtosis 7<br />
Skewness 2<br />
Range 4,194<br />
Minimum -<br />
Maximum 4,194<br />
Sum 9,707<br />
Count 10<br />
Due to the great variability in repair costs, it would be most appropriate to use the<br />
median as the measure of center. It also reflects the fact that most repair costs, if there<br />
are any, tend to be between $600 and $1000. Since the standard deviation<br />
describes movement about the mean, it is not appropriate to use it in<br />
combination with a median. Thus, we should probably use the interquartile range<br />
to describe the middle 50% of repair costs.<br />
b. \( z = \frac{4194 - 971}{1207} \approx 2.7 \)<br />
The repair cost of $4,194 is nearly 3 standard deviations above the mean. This<br />
means that it is an outlier cost.<br />
c. According to C.T., at least \( 1 - \frac{1}{2.7^2} \approx 86\% \) of the data in this data set should be<br />
within 2.7 standard deviations of the mean. Thus, there is at most about a 14% chance that<br />
we have a score outside of 2.7 standard deviations of the mean. This tells us that a<br />
repair cost of $4,194 is fairly unusual.<br />
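The two calculations used in parts (b) and (c) can be sketched in a few lines of Python. This is only an illustration: the function names are ours (hypothetical), and the inputs are the mean, standard deviation, and maximum copied from the Repair Cost summary table.<br />

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def chebyshev_within(k):
    """Chebyshev's Theorem: at least 1 - 1/k^2 of any data set lies within
    k standard deviations of the mean (only informative when k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


# Summary values copied from the Repair Cost table
z = z_score(4194, mean=971, sd=1207)
within = chebyshev_within(round(z, 1))   # k = 2.7

print(f"z = {z:.2f}")                                    # z = 2.67
print(f"at least {within:.0%} within {round(z, 1)} SD")  # at least 86% within 2.7 SD
print(f"at most {1 - within:.0%} outside")               # at most 14% outside
```

Note that Chebyshev's bound makes no assumption about the shape of the distribution, which is why it is safe to use on skewed data like these repair costs.<br />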
6.<br />
CC Ratios<br />
Mean 12.35<br />
Standard Error 0.62<br />
Median 12.91<br />
Mode #N/A<br />
Standard Deviation 1.97<br />
Sample Variance 3.90<br />
Kurtosis -0.50<br />
Skewness -0.60<br />
Range 6.03<br />
Minimum 8.81<br />
Maximum 14.84<br />
Sum 123.47<br />
Count 10.00<br />
There do not appear to be extreme outliers, since the mean and median are close. However,<br />
based on the mean being smaller than the median, and the skewness value being negative, there<br />
is a slight left-skew to the distribution. The standard deviation tells us that, on average, CC ratios are<br />
within 1.97, or 197% points, of the mean. We verify these notions by considering the histogram.<br />
[Figure: relative frequency histogram "CC Ratio Distribution"; horizontal axis: CC Ratio; vertical axis: relative frequency, 0% to 45%]<br />
We should also be careful to note that there is not very much data available, which is why we<br />
don't distinctly see a skew.<br />
7.<br />
Nitrous Oxide (thous. Tons)<br />
Mean 46.35<br />
Standard Error 9.395205<br />
Median 36<br />
Mode 40<br />
Standard Deviation 42.01663<br />
Sample Variance 1765.397<br />
Kurtosis 0.09474<br />
Skewness 0.949789<br />
Range 136<br />
Minimum 0<br />
Maximum 136<br />
Sum 927<br />
Count 20<br />
[Figure: relative frequency histogram "Nitrous Oxide Distribution"; horizontal axis: Nitrous Oxide (thous. Tons); vertical axis: relative frequency, 0% to 30%]<br />
The distribution of nitrous oxide emissions is skewed to the right, indicating that most states have<br />
relatively low emissions, whereas fewer states have relatively high emissions. We note that the<br />
median is a good measure, indicating that 36 thousand tons is the 50th percentile. There are two<br />
outliers of 136 thousand tons. For this value,<br />
\( z = \frac{136 - 46.35}{42.02} \approx 2.1 \), indicating that at least around<br />
75% of all values in the data set are within 2.1 standard deviations of the mean. Thus, 136 can be<br />
considered a mild outlier.<br />
3.2 Joint Probability<br />
1. See Video Solution<br />
2.<br />
a. About 85% of all the past calls were for medical assistance.<br />
b. P(call is not for medical assistance) = 1 - 0.85 = 0.15.<br />
c. P(two successive calls are both for medical assistance) = (0.85)(0.85) = 0.7225.<br />
d. P(first call is for medical assistance and second call is not for medical assistance)<br />
= (0.85)(0.15) = 0.1275<br />
e. P(exactly one of two calls is for medical assistance) = P(first call is for medical<br />
assistance and the second is not) + P(first call is not for medical assistance but the<br />
second is) = (0.85)(0.15) + (0.15)(0.85) = 0.255.<br />
f. Probably not. There are likely to be several calls related to the same event:<br />
several reports of the same accident or fire that would be received close together<br />
in time.<br />
3. (“ ” “ ” “ ”) . / . / . /<br />
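The arithmetic in answer 2 above uses only the complement, multiplication, and addition rules. A minimal Python sketch of those steps follows; the variable names are ours, and the multiplication steps assume that successive calls are independent, which part (f) of that answer warns is probably not realistic.<br />

```python
p_medical = 0.85               # P(call is for medical assistance)
p_other = 1 - p_medical        # complement rule

# Multiplication rule: valid only if successive calls are independent
p_both = p_medical * p_medical
p_first_only = p_medical * p_other

# "Exactly one" can happen in two mutually exclusive orders, so we add them
p_exactly_one = p_medical * p_other + p_other * p_medical

print(round(p_other, 4), round(p_both, 4), round(p_first_only, 4), round(p_exactly_one, 4))
# 0.15 0.7225 0.1275 0.255
```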
4. See Video Solution<br />
5.<br />
a. The "expert" assumed that the positions of the two valves were independent.<br />
b. The positions of the two valves are not independent but rather dependent. The<br />
effect of the error makes the probability much smaller. The actual probability is<br />
compared to .<br />
6.<br />
a. Assuming that whether Jeanie forgets to do one of her “to do” list items is<br />
independent of whether or not she forgets any other of her “to do” list items, the<br />
probability that she forgets all three errands = (0.1)(0.1)(0.1) = 0.001.<br />
b. ( )<br />
( )<br />
c. P(remembers the first errand, but not the second or the third) = (0.9)(0.1)(0.1) =<br />
0.009.<br />
5.1 The Ideas Behind the Continuous Distribution<br />
1.<br />
a. [Figure: "Pizza Size Distribution"; horizontal axis: Size (inches), with values 12, 14, 16, 18; vertical axis: Probability, 0 to 0.6]<br />
b. ( )<br />
c. ( )<br />
d. , - ( ) ( ) ( ) ( ) inches per pizza, on<br />
average.<br />
e. ( ) (doesn't include the 12-inch pizza!)<br />
2.<br />
3.<br />
4.<br />
a. ( )<br />
b. ( )<br />
a. , so ( ) for<br />
b. ( )<br />
c. ( )<br />
d. \( \mu = 5 \); on average, the professor dismisses class 5 minutes after the hour.<br />
e. \( \sigma \approx 2.9 \); on average, the amount of time after the hour at which the professor<br />
dismisses the class varies by 2.9 minutes about the mean.<br />
f. ( ) ( )<br />
a. ( )<br />
b. ( ) ( )<br />
c. ( ) ( ) ( )<br />
d. ( )<br />
5.<br />
a. , so ( ) for<br />
b. ( )<br />
c. ( )<br />
d. Both ( ) ( ) because, in a continuous distribution, the<br />
probability that is 0.<br />
e. ( )<br />
f. \( \mu = 26 \); the average response time is 26 minutes.<br />
g. \( \sigma \approx 4.6 \); on average, wait times deviate from the mean wait time by 4.6 minutes.<br />
h. . Thus, we want ( ) (<br />
) .<br />
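The means and standard deviations quoted in these answers are consistent with continuous uniform distributions. As a hedged illustration only, the sketch below assumes dismissal times are uniform on 0 to 10 minutes and response times are uniform on 18 to 34 minutes. Those endpoints are not stated in this answer key; they are assumptions chosen because they reproduce the reported values (5 and 2.9, 26 and 4.6). The function names are hypothetical.<br />

```python
import math


def uniform_summary(a, b):
    """Mean and standard deviation of a continuous uniform distribution on [a, b]."""
    mean = (a + b) / 2
    sd = (b - a) / math.sqrt(12)
    return mean, sd


def uniform_prob(a, b, lo, hi):
    """P(lo <= X <= hi) for X uniform on [a, b]: rectangle area = width * height."""
    lo, hi = max(lo, a), min(hi, b)       # clip the interval to the support
    return max(hi - lo, 0) / (b - a)


# Assumed endpoints (not given in the text), picked to match the stated answers
dismissal = uniform_summary(0, 10)    # minutes after the hour
response = uniform_summary(18, 34)    # minutes

print([round(v, 1) for v in dismissal])   # [5.0, 2.9]
print([round(v, 1) for v in response])    # [26.0, 4.6]

# A single point has zero width, so P(X <= c) and P(X < c) are equal
print(uniform_prob(0, 10, 4, 4))          # 0.0
```

The last line is the reason an answer can say that "less than" and "less than or equal to" give the same probability in a continuous distribution.<br />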
5.2 The Normal Distribution<br />
1.<br />
a.<br />
b.<br />
c.<br />
d.<br />
2.<br />
a. The long-run proportion of all children born in the U.K. expected to weigh more<br />
than 10 lbs. is 0.0186.<br />
b. The long-run proportion of all children born in the U.K. expected to weigh at<br />
most 10 lbs. is 0.9814.<br />
c. The long-run proportion of all children born in the U.K. expected to weigh<br />
between 5 and 6.5 lbs. is 0.1837.<br />
d. The long-run proportion of all children born in the U.K. expected to weigh<br />
between 1 and 2 lbs. is 0.0000.<br />
e. 20% of all children are expected to be born weighing less than 6.5 lbs.<br />
6. In a recent year, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in<br />
the United States had a mean of \( \mu \) points and a standard deviation of \( \sigma \) points (SOURCE:<br />
http://www.collegeboard.com).<br />
a. 50% of students scored less than how many points?<br />
b. 50% of students scored more than how many points?<br />
c. In order to be in the top 10% of SAT-takers, what score would one have to<br />
achieve?<br />
d. Below what score do the lowest 10% of students fall?<br />
e. The middle 50% of students scored between what two values?<br />
3.<br />
a. 50% of students score less than 1518 on the test.<br />
b. By complementary probability, 50% of students should score more than 1518.<br />
c. You would have to score about 1913 points.<br />
d. About 1123.<br />
e. The middle 50% score between about 1310 and 1726.<br />
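These cutoffs come from working backwards from a percentile to a score. A minimal sketch using Python's built-in `statistics.NormalDist` is shown below. The mean of 1518 is taken from answer (a); the standard deviation of 308 is an assumption that we inferred from the other answers, not a value stated here.<br />

```python
from statistics import NormalDist

# mu is from answer (a); sigma = 308 is an inferred assumption, not a stated value
sat = NormalDist(mu=1518, sigma=308)

median = sat.inv_cdf(0.50)             # 50% of students score below this
top_10_cutoff = sat.inv_cdf(0.90)      # 90% score below, so 10% score above
bottom_10_cutoff = sat.inv_cdf(0.10)   # 10% score below
q1, q3 = sat.inv_cdf(0.25), sat.inv_cdf(0.75)   # bounds of the middle 50%

print(round(median), round(top_10_cutoff), round(bottom_10_cutoff), round(q1), round(q3))
# 1518 1913 1123 1310 1726
```

Because the normal distribution is symmetric, the top and bottom 10% cutoffs sit the same distance from the mean, as do the two quartiles.<br />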
4.<br />
a.<br />
b. The Empirical Rule is a summary of what we have done above. It is a nice rule-of-thumb.<br />
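As a quick check of that rule-of-thumb, the areas within 1, 2, and 3 standard deviations of the mean can be computed directly. This sketch uses the standard normal distribution; the helper name `within_k_sd` is ours.<br />

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def within_k_sd(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```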
5.<br />
a. The distribution would maintain its exact shape, though it would be shifted 10 units<br />
to the right.<br />
b. The distribution would become wider and have a lower peak. This must happen to<br />
make sure the area is still 1 when the distribution becomes wider.<br />
c. The distribution would become narrower and have a higher peak. If a distribution<br />
becomes narrower, its height must increase to maintain an area of 1.<br />
d. The mean, \( \mu \), determines where the distribution is centered without altering its<br />
shape. The standard deviation, \( \sigma \), will make a distribution wide and low-peaked if<br />
it is large, and will make a distribution narrow and high-peaked if it is small.<br />
6.1 The Sampling Distribution for \( \bar{x} \)<br />
1. Answers vary<br />
2. Answers vary – emphasis on the ability to have a population distribution with any<br />
unknown shape.<br />
3.<br />
a. 0.2525<br />
b. 0.2514<br />
c. 0.9044<br />
d. 95.1 and 104.9<br />
4.<br />
a. 0.0272<br />
b. This might indicate that the production process is outside of the norm. This type<br />
of average is unlikely for a sample of this size. The company<br />
should investigate why the average thickness of its glass samples is so high.<br />
5.<br />
a. It should be approximately normal, regardless of the distribution of revenues.<br />
b. 0.3869<br />
c. The standard deviation of means would change from $6,957 to $4,400. This<br />
would lower \( P(\bar{x} > 420{,}000) \). This makes sense, since the<br />
distribution of means is less spread out, and so there will be fewer mean sales<br />
amounts beyond $420,000.<br />
d. $421,255.50; if the team averages more than this amount for each team member,<br />
then they will receive the paid vacation days.<br />
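The drop from $6,957 to $4,400 in part (c) is what the standard error formula predicts when the sample size grows. The sketch below is an illustration under assumptions: a population standard deviation of $22,000, sample sizes of 10 and 25, and a population mean of $418,000 are not given in this answer key. We inferred them because they reproduce the quoted standard deviations and the 0.3869 in part (b).<br />

```python
import math
from statistics import NormalDist


def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)


# Assumed inputs: chosen only because they reproduce the figures quoted above
sigma = 22_000     # population SD of revenue, in dollars (assumption)
mu = 418_000       # population mean revenue, in dollars (assumption)
cutoff = 420_000   # the $420,000 mentioned in the answer

tails = {}
for n in (10, 25):             # assumed sample sizes
    se = standard_error(sigma, n)
    tails[n] = 1 - NormalDist(mu, se).cdf(cutoff)   # P(sample mean > cutoff)
    print(n, round(se), round(tails[n], 4))
```

The larger sample gives the smaller standard error, so less of the distribution of means lies beyond the cutoff.<br />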
6. We know that \( SD[\bar{x}] = \frac{\sigma}{\sqrt{n}} \), so \( \sigma = SD[\bar{x}] \cdot \sqrt{n} \approx 2906.89 \). A person's<br />
income varies, on average, by about $2,906.89 from the population average of incomes.<br />
7.<br />
a. rooms and rooms (NOTE: be sure to use Excel's STDEV.P() since this is a<br />
population standard deviation we want)<br />
b. It should be approximately normal based on the Central Limit Theorem; the<br />
sample size of 30 satisfies the minimum required sample size to meet normality<br />
assumptions.<br />
c. Answers will vary slightly due to sampling variability of the simulation process.<br />
We see that \( E[\bar{x}] \approx \mu \), as expected. We also see that<br />
\( SD[\bar{x}] \approx \frac{\sigma}{\sqrt{n}} \), which is what we obtained via simulation.<br />
d. Answers will vary slightly due to sampling variability of the simulation process.<br />
We see that \( E[\bar{x}] \approx \mu \), as expected. We also see that<br />
\( SD[\bar{x}] \approx \frac{\sigma}{\sqrt{n}} \), which is what we obtained via simulation.<br />
e. Answers will vary slightly due to sampling variability of the simulation process.<br />
We see that \( E[\bar{x}] \approx \mu \), as expected. We also see that<br />
\( SD[\bar{x}] \approx \frac{\sigma}{\sqrt{n}} \), which is what we obtained via simulation.<br />
f. The population standard deviation can be thought of as the standard deviation of the distribution of means<br />
from samples of size \( n = 1 \). That is, \( SD[\bar{x}] = \frac{\sigma}{\sqrt{1}} = \sigma \). Since it is the smallest<br />
possible sample size, it will have the highest degree of variability.<br />
g. 0.000 or about 0% chance<br />
h. As with tossing a coin repeatedly, when something is repeated over and over<br />
again, the amount of variation in the outcomes becomes relatively small. That is,<br />
any mild outliers get averaged into a large sample of typical values, and their effect<br />
is dispersed. In small samples, the opposite holds: deviant values are highly<br />
corrosive to the sample mean.<br />
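The simulation idea in this exercise can be sketched in Python. This is not the spreadsheet simulation the exercise uses, and the population below is hypothetical (a made-up right-skewed population, not the rooms data). It only illustrates that the mean of the sample means lands near \( \mu \) and that their standard deviation lands near \( \frac{\sigma}{\sqrt{n}} \).<br />

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (not the rooms data from the exercise)
population = [random.expovariate(1 / 5) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population SD, like Excel's STDEV.P


def simulate_means(n, reps=2_000):
    """Draw `reps` samples of size n (with replacement) and return their means."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(reps)]


for n in (5, 30, 100):
    means = simulate_means(n)
    print(
        n,
        round(statistics.fmean(means), 2),    # close to mu
        round(statistics.pstdev(means), 3),   # close to sigma / sqrt(n)
        round(sigma / n ** 0.5, 3),
    )
```

As the sample size grows, the last two columns shrink together, which is the "averaging out" effect described in part (h).<br />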
6.2 Confidence Interval for \( \bar{x} \)<br />
1. Answers vary<br />
2. Answers vary<br />
3.<br />
a. No, the sample size is 10, which is less than the minimum required (30).<br />
b. ( )<br />
c. We are 95% confident that the population average labor cost is between $109.6<br />
billion and $227.6 billion.<br />
d. About 0.213<br />
4.<br />
a. Yes, since they can be 95% confident that the average revenue per camera will be<br />
between $654.51 and $752.44.<br />
b. No, since they can be 99% confident that the average revenue per camera will be<br />
between $637.42 and $768.01, which includes the possibility of the average being<br />
lower than $640.<br />
c. Yes, the sample size is 30, which is the minimum required sample size for the<br />
CLT results to be applied.<br />
d. We know that \( E[\bar{x}] = \mu \), which we are estimating by \( \bar{x} \). That is, we are assuming<br />
the sample mean is the population mean for the basis of our interval.<br />
Similarly, \( SD[\bar{x}] = \frac{\sigma}{\sqrt{n}} \). We are using \( s \) to estimate \( \sigma \). Thus, our<br />
estimate of \( SD[\bar{x}] \) is \( \frac{s}{\sqrt{n}} \). Using our probability calculator, we find that<br />
our 95% confidence interval would be 652.1 to 755.1, which is close to our<br />
bootstrap confidence interval. It is a bit wider than we would like.<br />
e. Here we have that \( df = n - 1 = 29 \). We have 5% to split between the tails.<br />
Thus, 2.5% in each tail. We find that \( t \approx \pm 2.045 \) (same number of standard<br />
deviations from the mean to each tail, since the distribution is symmetric).<br />
So our interval will be \( \bar{x} \pm t \cdot \frac{s}{\sqrt{n}} \).<br />
This is a bit wider, accounting for the extra variability in estimating \( \mu \) and \( \sigma \).<br />
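Both intervals in parts (d) and (e) have the same form: the sample mean plus or minus a critical value times the estimated standard deviation of the mean. The sketch below is an illustration under assumptions. The sample mean of 703.6 and sample standard deviation of 143.9 are hypothetical values that we picked because they reproduce the 652.1 to 755.1 interval; they are not given in this answer key. We also assume that the wider interval in part (e) uses a t critical value with 29 degrees of freedom, taken here from a t-table.<br />

```python
import math
from statistics import NormalDist


def mean_interval(xbar, s, n, critical):
    """Two-sided interval: xbar +/- critical * s / sqrt(n)."""
    margin = critical * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical summary values, consistent with the 652.1 to 755.1 interval
xbar, s, n = 703.6, 143.9, 30

z_crit = NormalDist().inv_cdf(0.975)   # 2.5% in each tail of the normal curve
t_crit = 2.045                         # t-table value: 29 df, 2.5% in each tail

z_low, z_high = mean_interval(xbar, s, n, z_crit)
t_low, t_high = mean_interval(xbar, s, n, t_crit)

print(round(z_low, 1), round(z_high, 1))   # 652.1 755.1
print(round(t_low, 1), round(t_high, 1))   # 649.9 757.3
```

The t-based interval is slightly wider because the t critical value is larger than the normal one for the same tail area.<br />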
6.3 Confidence Interval for \( \hat{p} \)<br />
7.1 The Concept Behind Hypothesis Testing<br />
1. The null hypothesis is assumed to be true and is usually based on what has been observed<br />
before. The alternative hypothesis is what we would like to test, which is something that<br />
would challenge past observations or assumptions about a population.<br />
2. We assume it is true because it is based on past observations or research. For example, if<br />
the Census Bureau finds that 35% of Americans enjoy hypothesis testing, then this is<br />
typically based on some fairly extensive research. If a researcher believes this rate is<br />
greater in his community, then he can test his alternative hypothesis.<br />
3.<br />
4.<br />
5.<br />
6. This is the probability that we reject the null hypothesis when it is, in fact, true. That is,<br />
\( \alpha = P(\text{reject } H_0 \text{ when } H_0 \text{ is true}) = 0.05 \). This allows us to be 95% confident that we fail to reject \( H_0 \)<br />
when it is true, a correct decision.<br />
7.<br />
1) Hypotheses: \( H_0: p = 0.07 \) and \( H_a: p < 0.07 \)<br />
2) Decision Rule: We will reject the null hypothesis when the likelihood of<br />
observing something as small or smaller than 52 out of 1000 bushels is no<br />
larger than a 1% probability, under the assumption of the null hypothesis. That is,<br />
\( P(\hat{p} \le 0.052) \le 0.01 \)<br />
3) We will reject \( H_0 \) if the observed value of \( \hat{p} \) is smaller than some cutoff<br />
value of \( \hat{p} \).<br />
4) Based on the sample evidence, we will either:<br />
a. Reject \( H_0 \) in favor of \( H_a \): less than 7% insect-related crop destruction for the<br />
farmer's new method.<br />
b. Fail to reject \( H_0 \). We do not have sufficient evidence to conclude that the<br />
farmer's new method is better than his old method.<br />
8.<br />
1) Hypotheses:<br />
2) Decision Rule: We will reject the null hypothesis when the likelihood of<br />
observing something as large or larger than the observed \( \bar{x} \) is no larger than a 5%<br />
probability, under the assumption of the null hypothesis. That is,<br />
\( P(\bar{x} \ge \text{observed average}) \le 0.05 \)<br />
3) We will reject \( H_0 \) if the observed average \( \bar{x} \) is larger than some cutoff<br />
value of \( \bar{x} \).<br />
4) Based on the sample evidence, we will either:<br />
a. Reject \( H_0 \) in favor of \( H_a \) regarding how many out of 5 questions are answered correctly by<br />
his students (as of recent observations).<br />
b. Fail to reject \( H_0 \). We do not have sufficient evidence to conclude that the<br />
instructor's more recent students do better on the AP exam than his former<br />
students.<br />
9.<br />
1) Hypotheses: \( H_0: p = 0.015 \) and \( H_a: p \ne 0.015 \)<br />
2) Decision Rule: We will reject the null hypothesis when the likelihood of<br />
observing something as small/large or smaller/larger than 16 out of 1000<br />
machines is no larger than a 1% probability, under the assumption of the null<br />
hypothesis. That is,<br />
( )<br />
3) We will reject \( H_0 \) if the observed value of \( \hat{p} \) is smaller or larger than some<br />
cutoff values of \( \hat{p} \). That is, if it is smaller than some value, say \( \hat{p}_1 \), or larger than<br />
some value, say \( \hat{p}_2 \), then we will reject \( H_0 \). Remember, we set up a hypothesis<br />
first, then do the test. Even though 16 is larger than 15 out of 1000, we did not<br />
know this to begin with. We are still testing whether or not this value is<br />
significantly different and do not care about the direction of the difference.<br />
4) Based on the sample evidence, we will either:<br />
a) Reject \( H_0 \) in favor of \( H_a \): a proportion other than 1.5% of machines fail. That is, either<br />
significantly fewer of them fail, or significantly more<br />
of them fail.<br />
b) Fail to reject \( H_0 \). We do not have sufficient evidence to conclude that new<br />
machines fail more or less when compared to the old machine.<br />
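The decision rules above compare a tail probability, computed under the null hypothesis, with the chosen significance level. The sketch below shows one way to compute such tail probabilities exactly with the binomial distribution. It is a sketch under assumptions: the null proportions 0.07 (farmer) and 0.015 (machines) are read from the "7%" and "15 out of 1000" wording in these answers, the function names are hypothetical, and doubling the smaller tail is just one common convention for a two-sided probability. The text itself may have used a normal approximation or a probability calculator instead.<br />

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def left_tail(k, n, p):
    """P(X <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def right_tail(k, n, p):
    """P(X >= k)."""
    return 1 - left_tail(k - 1, n, p)


def two_sided(k, n, p):
    """Double the smaller tail, capped at 1 (one common convention)."""
    return min(1.0, 2 * min(left_tail(k, n, p), right_tail(k, n, p)))


alpha = 0.01
farmer_p = left_tail(52, 1000, 0.07)      # as small or smaller than 52 of 1000
machine_p = two_sided(16, 1000, 0.015)    # as extreme as 16 of 1000, either direction

for label, p_value in (("farmer, left-tailed", farmer_p),
                       ("machines, two-sided", machine_p)):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"{label}: p-value = {p_value:.4f} -> {decision}")
```

The hypotheses and the significance level are fixed before the data are examined; the code only automates the final comparison.<br />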
10.<br />
1) Type I: We conclude the farmer's method reduces crop destruction, when there is<br />
no difference; Type II: We conclude the farmer's method is no different than the<br />
old method, when in fact there is less than 7% crop destruction with his new<br />
method.<br />
2) Type I: We conclude the instructor's students perform better than his former<br />
students, when in fact there is no difference; Type II: We conclude that his new<br />
students perform just as well as his former students, when in fact they do better.<br />
3) Type I: We conclude that the new machines fail more or less than the former<br />
machines, when in fact there is no difference; Type II: We conclude that there is<br />
no difference between the failure rates of the new and old machines, when in fact<br />
there is a significant difference.<br />
11. Increasing the confidence level \( 1 - \alpha \) (that is, decreasing \( \alpha \)) means we will reject \( H_0 \) less often, as we set more stringent conditions upon<br />
the rejection process. If we reject less often, then there is an elevated likelihood that we<br />
may fail to reject, when in fact we should. This is precisely what a Type II error is.<br />