Introduction & Summary Measures of Data

Biostatistics for the Clinician

Lesson 1: Introduction and Summary Measures of Data

Table of Contents

Pretest
1. Biostatistics: How Much, Why, When and What

1.1 How Much: The Six Year Old Biostatistician

Professional Benefits
Conceptual Understanding
Needs of Clinicians

1.2 Why
1.3 When

Whole Populations
Intuitive Biostatistics
Huge Samples
Samples from Populations
Clinical Relevance
Groups vs. Individuals
Who Decides?

1.4 What Statistics

2. Variables and Measures

2.1 Types of Variables
2.2 Qualitative vs. Quantitative Variables
2.3 C.R.A.P. Detector #1.1
2.4 C.R.A.P. Detector #1.2

3. Central Tendency

Why Important?
3.1 Mean
3.2 Median
3.3 Mode
3.4 Summary Principle

4. Variability

Why Important?
4.1 Standard Deviation
4.2 Interquartile Range
4.3 Range

5. Exploratory Data Analysis (EDA)

Why Important?
5.1 Hinges
5.2 Ranges
5.3 Outliers
5.4 Box & Whisker plots

6. Standard Scores

Why Important?
6.1 z-Scores
6.2 General z-Score Properties
6.3 Gaussian z-Score Properties

7. Distributions

Why Important?
7.1 Gaussian
7.2 Binomial
7.3

Posttest

Lesson 1: Summary Measures of Data

Biostatistics for the Clinician

Biostatistics: How Much, Why, When and What
A critical distinction between the scientific approach and other methods of inquiry lies in the emphasis placed on real world validation. Where research has shown that particular approaches are appropriate and effective for specific applications, clinicians are wise to select those approaches. Clinicians are then called upon to defend decisions on the basis of empirical research evidence. Consequently, the clinician must be an intelligent consumer of medical research outcomes, able to understand, interpret, critically evaluate and apply valid results from the latest medical research.
How Much: The Six Year Old Biostatistician
Norman & Streiner (1986) tell an old story about three little French boys who happened to see a man and woman naked on a bed in a basement apartment. The four year old said, "Look, that man and woman are wrestling!" The five year old said, "You silly, they're not wrestling, they're making love!" The six year old said, "Yes! And very poorly too!!" The four year old did not understand. The five year old had achieved a conceptual understanding. The six year old understood it well enough, presumably without actual experience, to be a critical evaluator. The purpose of these lessons is to make you a critical evaluator of medical research, a "six year old biostatistician".

Biostatistics: How Much, Why, When and What
Practice
Exercise 1: Given achievement of the objectives of these lessons you should:
No Response
Be a competent biostatistician
Be a critical evaluator of medical research
Be a competent producer of medical research
None of the above

Professional Benefits

You are about to overview the most frequently used and most important descriptive and inferential biostatistical methods as they are relevant for the clinician. The goal is that you will appreciate how the application of the theories of measurement, statistical inference, and decision trees contributes to better clinical decisions and ultimately to improved patient care and outcomes.

Biostatistics: How Much, Why, When and What
Practice
Exercise 2: Being well informed about biostatistics contributes to (check all that apply):

No Response
Better clinical decisions
Improved patient outcomes
Improved patient care
All except "No Response"

Conceptual Understanding

Conceptual understanding, rather than computational ability, will be the focus. Development of an adequate vocabulary, an examination of fundamental principles and a survey of the widely used procedures or tools to extract information from data, will form a basis for fruitful collaboration with a professional biostatistician when appropriate.

Biostatistics: How Much, Why, When and What
Practice
Exercise 3: Computation is a focus of these lessons.

No Response
True
False

Biostatistics: How Much, Why, When and What
Practice
Exercise 4: Conceptual understanding is a goal of these lessons.

No Response
True
False

Needs of Clinicians

The object is to help you understand the tools and procedures that are used in statistics and, when you really want to do statistics, to ask a biostatistician to help you. So the objective here is not to make you into biostatisticians, but into people who appreciate what biostatistics can contribute to the appropriate care of your patients and who seek the appropriate help when necessary.
The needs of practicing physicians, not the skills to be a biostatistician or for sophisticated medical research, will inform the presentations.

Biostatistics: How Much, Why, When and What
Practice
Exercise 5: Biostatistical issues emphasized will be based on the needs of:

No Response
biostatisticians
evaluators
practicing physicians
medical researchers

Why
Now if you had to boil it all down, the goal of biostatistics is very straightforward: to prove that the treatment, and only the treatment, caused the effect. That is really the job of biostatistics. When you have a complex being like a human animal and you're performing experiments on that human animal, you try to control as many variables as you can. You may take a sample that has only one gender. You may take a sample that has only a narrow range of ages.
OK, we're going to go back and look at what biostatistics is going to do for us here. It's going to limit the effects of chance, and that's the main thing that it does. We talked about this with small samples in particular, so it's going to take care of things like the false positives we might get in a small sample.
It helps us determine sample size. Someone mentioned you have to have a big enough sample to find effects. So it's going to help you figure out whether the sample size is big enough to detect the results that you think are clinically important. To give you an example, what if you have a new formula that increases the weight gain per week of newborns, and you want to know whether this formula is useful or not? Well, you first have to answer a question for the statistician: what do you, as a clinician, consider a useful increase in weight per week? So if statisticians are going to help you with these things, they are going to ask you what you think is clinically important. In order to design the experiment with enough people in the sample to find small effects, you'll have to define the smallest effect that is useful to you. Is 5 years useful in longevity? Is 1 pound more than formula X important in this formula for newborns?
We're also going to try to control for confounding variables: gender and other kinds of things that might confound results. We're going to try to design alternative ways of measuring humans, because humans are not laboratory chemicals and controls are very difficult in humans. I hope that in your epidemiology lectures you'll have talked about ways other than randomized trials. Randomized trials are a perfect experimental design, but they have tremendous flaws, because humans don't always agree to be in a randomized trial. You have special kinds of people who say, "I'm going to be in a randomized trial," and others who don't. So you already have a special kind of population that's skewed a bit. The question is whether you can design alternative ways of measuring effects without going to the extreme of a randomized clinical trial.
But, still there are many other kinds of things that might affect the outcomes that you'd see if you apply a treatment to that individual. So that when you boil it all down what biostatistics is trying to do is to eliminate or to minimize anything that might interfere with your being able to prove that the treatment and only the treatment caused the effect. These issues are summarized below.

BIOSTATISTICS
Goal of Experimental Method:

To prove that the treatment and only the treatment caused the effect.

Usefulness:

Place limits on effects of chance in small sample experiments - (Alpha or False Positives).

Determine sample size needed to detect clinically relevant effects - (Beta or False Negatives).

Control for effects of one or more confounding variables.

Assist in developing alternative designs for human experiments.

Use maximum information content measurement.

Measure intangibles such as intelligence, depression, and well-being.
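
To make the sample size point concrete, here is a minimal sketch of how the smallest clinically useful effect drives the number of subjects. It assumes the common normal-approximation formula for comparing two group means with a two-sided test; the function name, the default alpha of 0.05 and the power of 0.80 are illustrative assumptions, not values taken from this lesson.

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a mean difference `delta`.

    Assumes two equal-sized independent groups, a common standard deviation
    `sigma`, and a two-sided test: n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # limits false positives (alpha)
    z_beta = NormalDist().inv_cdf(power)           # limits false negatives (beta = 1 - power)
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Effects expressed in units of the standard deviation (hypothetical values).
for delta in (1.0, 0.5, 0.25):
    print(delta, sample_size_per_group(delta, sigma=1.0))
```

Halving the smallest effect you care about roughly quadruples the required sample, which is why the statistician needs the clinician to say what counts as a useful difference before the study is designed.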

When

Whole Populations
Now, biostatistics is useful in some areas but not in others. You need to know when it makes sense to use biostatistics and when it doesn't make sense. To help illustrate let's look at some questions.
Let's say you want to know whether there are more women or men in your biostatistics class. How would you calculate the answer?
And the next question is: if you find a difference in the number of men and women, is the difference statistically significant? More specifically, what test would you use to try to establish significance?
Well, for the first question all you have to do is count them. No need for any fancy statistics because the number is relatively small and it's easy to count them.
For the second part of the question, once you have counted them, how would you know whether the result was statistically significant? Say there are 150 total students and you find out there are 80 women and 70 men. What would you do?
The answer is, no significance test is needed here. The reason is you have all the data.
On the other hand, let's suppose you took a small sample from the larger population of the whole class, counted the men and women in that small group, and then wanted to extrapolate to the whole class. Then you need inferential statistics, because you're inferring from that small group what the whole population is like. When you use the entire population, inferential statistics is inappropriate, unnecessary and irrelevant. Inferential statistics is only needed when you're trying to infer from a small group to a larger group. So inferential statistics is worth zip, zero when you have the entire population and can count.
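
The contrast between counting a whole population and inferring from a sample can be sketched in a few lines. The class of 80 women and 70 men comes from the example above; the sample size of 30, the fixed random seed and the simple normal-approximation interval are assumptions made only for illustration.

```python
import math
import random

# The whole class from the example: 80 women and 70 men.
population = ["F"] * 80 + ["M"] * 70

# With the entire population in hand, the answer is a simple count.
share_women_exact = population.count("F") / len(population)

def estimate_share(sample, label="F", z=1.96):
    """Estimate a proportion from a sample, with a rough 95% interval."""
    n = len(sample)
    p_hat = sample.count(label) / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # normal approximation
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

rng = random.Random(0)                # fixed seed so the sketch is repeatable
sample = rng.sample(population, 30)   # hypothetical sample of 30 students
p_hat, low, high = estimate_share(sample)

print(f"exact share of women: {share_women_exact:.3f}")
print(f"sample estimate: {p_hat:.3f} (roughly {low:.3f} to {high:.3f})")
```

The exact share needs no interval at all, while the sample estimate is only meaningful together with its margin of uncertainty. That margin is the part inferential statistics supplies.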

Biostatistics: How Much, Why, When and What
Practice
Exercise 6: Inferential statistics are needed when you:

No Response
summarize data from a group
generalize from samples to populations
compute statistical measures
have huge samples

Intuitive Biostatistics
The point is that when you think biostatistics, don't think of complicated technical jargon and computations. Don't think of the names of the tests: t-test, chi-squared and so on. Instead, try to approach it from the fresh point of view of a person who's kind of naive, who walks in and says, "I think I'll try to figure out how to do this sort of thing." That's the way you'll develop a conceptual feel for the subject, rather than relying on faded memories of times past when you were in some biostatistics course.

Biostatistics: How Much, Why, When and What
Practice
Exercise 7: You will do better with biostatistics if you focus on the underlying concepts rather than the jargon and the notation.

No Response
True
False

Huge Samples
Now let's take a second problem. Let's say you perform a test of fitness on people, and you want to examine whether their fitness affects their longevity. In other words, you're trying to find evidence concerning the hypothesis that longevity is associated with fitness. You want to be able to determine whether fitness causes a longevity effect. Let's say you get 300,000 males, measure their fitness, and track them for 50 years to see when they die, and you get the following results. For those who are fit the mean longevity is 75 years, and for those who are not fit it is 70 years. Is the difference in longevity statistically significant, and what test would you use to find out?
Before you go to the statistics book and look through the lists of statistical significance tests, though, first do a little thinking. The last time you read the New England Journal of Medicine, how many studies did you read that had 300,000 males in the trial?
Probably some have had tens of thousands or even scores of thousands. But there are probably no medical studies that involve several hundred thousand. When you get very large numbers, experimental error becomes negligibly small. So for large numbers you don't need statistical tests. If you're able to go out and get 300,000 volunteers, you don't need inferential statistics, because the numbers are there and they are overwhelming, and the error associated with such large samples is made very small by the fact that you have huge groups.
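
One way to see why chance error fades in huge groups is the standard error of a mean, which shrinks with the square root of the sample size. The sketch below assumes a spread of 10 years in lifespans; that figure is hypothetical and is not taken from the example.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd_years = 10.0  # hypothetical standard deviation of lifespans, in years
for n in (50, 5_000, 300_000):
    print(n, round(standard_error(sd_years, n), 3))
```

Under that assumption, a group of 50 leaves an uncertainty of more than a year in its mean, while a group of 300,000 leaves less than 0.02 years, which is tiny next to a 5 year difference.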

Biostatistics: How Much, Why, When and What
Practice
Exercise 8: Inferential statistics are not needed when you (check all that apply):

No Response
summarize data from a group
generalize from samples to populations
compute statistical measures
have huge samples

Samples from Populations
Now let's start with the last question but this time let's say you have 50 males. Let's say you get the same ratio of females to males. What tests should you use and are the results statistically significant?
Let's be more precise. Does increased fitness increase longevity? That's the question.
You would probably use a t-test. You are comparing the differences in the means of 2 independent groups. You don't have to worry about the kind of test here now really. The point is that this is a situation where you would want to realize that the way to approach the problem is with inferential statistics. You have samples from larger populations and you want to use data from the sample to generalize to the large populations.
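
For readers who want to see what such a comparison computes, here is a minimal sketch of one common variant, Welch's t statistic for two independent groups with possibly unequal variances. The text only says "a t-test", so the choice of Welch's version is an assumption, and the ages below are invented purely for illustration.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1 divisor)
    se2 = va / na + vb / nb                        # squared standard error of the difference
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical ages at death, not real data.
fit = [78, 74, 81, 69, 77, 80, 72, 76]
unfit = [70, 66, 73, 68, 75, 64, 71, 69]

t, df = welch_t(fit, unfit)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The statistic is simply the difference in means divided by its standard error; a statistical table or software then converts t and the degrees of freedom into a p-value.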
(Ask Dr. Oser for clarification on the following.)
Clinical Relevance
Now let's go to a last question. Let's suppose you have results like the fitness and longevity research just described. You have applied a significance test and found that the results are statistically significant. Is that useful to you as a physician? What do you call the measure of usefulness to physicians when you look at results like that? So I've told you the statistics are significant. Does that mean something to you? And if it does, what do you call the kind of analysis you go through? In other words, I've combed the journal articles and picked out only those that show statistically significant results, so you know the results were probably not due to chance. That's one of the mistakes that biostatistics protects you from: accepting a result that arose just by chance, from having a funny group or something like that. So given a result that's statistically significant, is that the end of your analysis? Do you care about anything else?
What kinds of questions do you ask as a physician once you have results from a statistician that have been validated as statistically significant?
Group A: Is it relevant? Can we implement it to make patients' lives better? Is it feasible?
Another: Can you apply it to your population?
So in this case we have 50 males, so we've left out one group in the population anyway. The question is whether we should go back and look at another group, so that we can apply the result to the population that may be in our practice.
Another suggestion: Measure of fitness? Return to basal rate after 5 min step test?

Groups vs. Individuals
Other suggestions for what a physician should look at: am I going to be applying this to patients as a whole or to individual patients, because if I apply it to individuals it's not appropriate? Can you apply results that have some statistical importance to individual patients?
Does anyone have an alternative?
Modern medicine is based upon this leap of faith: that the best evidence you have is epidemiological evidence. The one distraction physicians have to be careful about is that, because they've seen results in 10 patients of their own, they tend to neglect the population kinds of statistics.
Now appropriateness is certainly a question. Can you apply everything you know to every person? You'll have to answer that based upon individual characteristics. But knowing nothing else about a person other than that you had this result, I would suggest that medicine as a discipline would strongly suggest that you have no alternative but to use it. Otherwise you're distracted by all kinds of bias from patients that you have your own thoughts about, etc.

Biostatistics: How Much, Why, When and What
Practice
Exercise 9: Can the physician appropriately apply results obtained from patient groups to his own individual patients?

No Response
Yes
No
It depends

Who Decides?
Quality of tests, significance level of test.
Is it clinically significant? And does it make sense to my patients? Is this result worth whatever it requires to be fit throughout my life? Look at things like smoking cessation. There are many countries around the world where smoking cessation is nothing like it is in the United States. They have made the decision that their lifestyle, which includes smoking, is more important than heart and lung disease, the cost to the general population and all the terrible things that smoking causes. Now, I wouldn't suggest that's a very wise clinical decision. The point, though, is that you have to make a decision based upon a broader array of things than the numbers that the statistician gives you. If, for example, you have a treatment with a small effect, but a sample size large enough to establish statistical significance, the treatment may not be worth it. The change in your patients' lives may be so small that it is not worth the burdens of the treatment. So fundamentally what you should do is go to the literature and find out whether somebody has applied the appropriate tests and found the results statistically significant. If they are, that's where your job begins: to say whether or not this is clinically significant or appropriate for my patients. So collaborate with a biostatistician if you're doing the analysis, but in the final go-round it is the physician who determines whether or not to apply the result to particular patients, taking into account all the individual aspects that you know about them and whether or not the result is worth it.

Biostatistics: How Much, Why, When and What
Practice
Exercise 10: Who decides whether a biostatistical result is clinically relevant?

No Response
Biostatistician
Patient
Physician
Other

What Statistics
You are going to start by focusing on summary measures of data. That is, how you take data from a population or a group and somehow express it in a few summary measures (e.g., the mean or standard deviation).
(Insert transparency of lesson outlines)
We're going to look at several ways of doing that. First looking at kinds of variables in Section 2. Then, measures of central tendency and variability and how they are used. We're going to take a peek at exploratory data analysis which is an area that would be very useful if you were going to be collecting data yourself. And, lastly we'll be looking at distributions of information, Gaussian, binomial and Poisson distributions.

End Lesson 1 - Section 1
Biostatistics: How Much, Why, When and What

Variables and Measures
Now we'll move into some more familiar territory. When you start to measure the impact of a treatment you have to ask yourself, "What kinds of variables am I dealing with here? What are my choices of variables?"
Now, you might ask, why do I need to know about types of variables or measures? You need to know, in order to evaluate the appropriateness of the statistical techniques used, and consequently whether the conclusions derived from them are valid. In other words, you can't tell whether the results in a particular medical research study are credible unless you know what types of variables or measures have been used in obtaining the data.

1.1 Types of Variables
Look at Figure 1.1 below. On the left hand side you see that there are two classifications of variables: qualitative variables and quantitative variables. Now one isn't necessarily better than the other. One we're a little more used to working with. With quantitative variables, for example, we can do averages; we know they are numbers, so you can add them up, divide and things like that. It's a little trickier sometimes with qualitative variables. But in human experiments there's no way you can get around them. There are two classes of those: nominal and ordinal.

Figure 1.1: Types of Variables

Four Types of Scales
There are four types of measurement scales (nominal, ordinal, interval and ratio; see Figure). Each of the four scales, respectively, provides higher information levels about variables measured with the scale.
Nominal Scales
Where does nominal come from? Name. So nominal comes from name, and the important thing is that there is no measure of distance between the categories. You're either married or not married; gender is determined; yes or no. There is no question of how far apart in a quantitative sense those categories are, so they're just names. Nominal scales name and that is all that they do. Some examples of nominal scales are sex (male, female), race (black, hispanic, oriental, white, other), political party (democrat, republican, other), blood type (A, B, AB, O), and pregnancy status (pregnant, not pregnant; see Figure).
Ordinal Scales
In the next group we have a little more sophistication than naming. What does ordinal imply? Ranking. So they're in some order: higher and lower. We don't rank gender as higher and lower. But we do rank stages of cancer, for example, as higher and lower. We have pain ratings that are higher and lower. So we're now at a more sophisticated, more finely tuned level of measurement. But we've added only one element: we know that something is higher or lower than something else, or more painful or less painful than something else. Ordinal scales both name and order. Some examples of ordinal scales are rankings (e.g., football top 20 teams, pop music top 40 songs), order of finish in a race (first, second, third, etc.), cancer stage (stage I, stage II, stage III), and hypertension categories (mild, moderate, severe; see Figure). In the next group we have a quantitative group, and people divide these into interval and ratio variables. I wouldn't worry about the division so much.
[Insert Figure 1.1 about here]

Interval Scales
What about interval variables? Why is a variable like temperature called an interval variable? What is the difference between 36 degrees and 37 degrees compared to the difference between 40 degrees and 41 degrees? The difference is the same, so the intervals are the same all along the scale. So now we know not only that one value is higher than the other but also that the distances, or intervals, on the scale are the same. So again it's a higher level of information that we have. Interval scales name, order and have the property that equal intervals in the numbers on the scale represent equal quantities of the variable being measured. Some examples of interval scales are Fahrenheit and Celsius temperature, SAT, GRE and MAT scores, and IQ scores. The zero on an interval scale is arbitrary. On the Celsius scale, 0 is the freezing point of water. On the Fahrenheit scale, 0 is 32 degrees below the freezing point of water (see Figure).
[Insert Figure 1.1 about here]

Ratio Scales
Ratio scales have all the properties of interval scales plus a meaningful, absolute zero. That is, zero represents the total absence of the variable being measured. Some examples of ratio scales are length measures in the English or metric systems, time measures in seconds, minutes, hours, etc., blood pressure measured in millimeters of mercury, age, and our common measures of mass, weight, and volume (see Figure).
[Insert Figure 1.1 about here]

They are called ratio scales because ratios are meaningful with this type of scale. It makes sense to say 100 feet is twice as long as 50 feet because length measured in feet is a ratio scale. Likewise it makes sense to say a Kelvin temperature of 100 is twice as hot as a Kelvin temperature of 50 because it represents twice as much thermal energy (unlike Fahrenheit temperatures of 100 and 50).
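
The Fahrenheit caveat can be checked with a short calculation. The sketch below converts both readings to Kelvin, a scale with a true zero, using the standard conversion; the function name is an illustrative choice.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit reading to Kelvin."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

ratio_f = 100.0 / 50.0
ratio_k = fahrenheit_to_kelvin(100.0) / fahrenheit_to_kelvin(50.0)

print(f"ratio of Fahrenheit readings: {ratio_f:.2f}")
print(f"ratio on the Kelvin scale:    {ratio_k:.2f}")
```

The Fahrenheit numbers suggest "twice as hot", but on the Kelvin scale the ratio is only about 1.10. The ratio of 2 is an artifact of where the Fahrenheit zero happens to sit.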
[Insert Figure 1.1 about here]

1.2 Qualitative vs. Quantitative Variables
Nominal and ordinal scales are called qualitative measures. Interval and ratio scales are called quantitative measures (see Figure).
[Insert Figure 1.1 about here]
With ratio variables the only difference is that there is a true zero, so that you can actually talk about ratios. That is, a person's lung capacity can be twice somebody else's lung capacity. In order to form ratios like that you have to have a true zero. But really, for all statistical purposes it makes no difference. The important thing is that if you have measures like these interval measures, you should keep them at the finest level of measurement you have. Don't put, say, temperature measures into categories such as less than this, or between this and that, and so on. Don't cluster or group them and make them into ordinal variables. If you do, then you're throwing away information. So if you have information at the interval level, record it at the interval level. If it's at the ordinal level, record it at that level. And of course if you're at the nominal level, you're stuck with recording it at that level. So never cluster your variables together when you begin your experiments in a way that loses information.
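
A small sketch shows what is lost when interval data are grouped into categories. The cutoffs and the readings below are hypothetical, chosen only to illustrate the point, and are not clinical definitions.

```python
def to_category(temp_c):
    """Collapse a Celsius temperature into a hypothetical three-level ordinal code."""
    if temp_c < 37.5:
        return "normal"
    if temp_c < 39.0:
        return "fever"
    return "high fever"

readings = [37.6, 38.9, 39.0, 40.8]  # invented readings for illustration
categories = [to_category(t) for t in readings]
print(categories)  # ['fever', 'fever', 'high fever', 'high fever']
```

After grouping, 37.6 and 38.9 are indistinguishable, while 38.9 and 39.0 land in different categories despite being almost identical. The original readings can always be grouped later, but the grouped codes can never be turned back into readings.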
Now, when statistical analyses are applied, the statistics must take into account the nature of the underlying measurement scale, because there are fundamental differences in the types of information imparted by the different scales (see Figure). Consequently, nominal and ordinal scales must be analyzed using what are called non-parametric or distribution free statistics. On the other hand, interval and ratio scales are analyzed using parametric statistics. Parametric statistics typically require that the interval or ratio variables have distributions shaped like bell (normal) curves, a reasonable assumption for many of the variables frequently encountered in medical practice.

(Show kinds of variables transparency) Dependent variables and independent variables. Basically, what you're generally trying to do is look at an outcome, like gastric ulcers, and determine which other variables may or may not affect that outcome. So the independent variables are the ones you manipulate, the treatments, and the dependent variables are the measures that are the outcomes of those.
[Insert Figure 1.1 about here]

[Insert Figure C.R.A.P Detector about here] There are a couple of little C.R.A.P. detectors you can use here. C.R.A.P. detectors find circular reasoning and other kinds of stuff.
1.3 C.R.A.P. Detector #1.1
Dependent variables should be sensible. Ideally, they should be clinically important, but also related to the independent variable.

1.4 C.R.A.P. Detector #1.2
In general, the amount of information increases as one goes from nominal to ratio. Classifying good ratio measures into large categories is akin to throwing away data.

Biostatistics Introduction Quick Quiz
Question 1: Given achievement of the objectives of these lessons you should be:
No Response
A competent biostatistician
A critical evaluator of medical research
A competent producer of medical research
None of the above

Biostatistics Introduction Quick Quiz
Question 2: Being knowledgeable about biostatistics can contribute to (check all that apply):

No Response
Better clinical decisions
Improved patient outcomes
Improved patient care
All except "No Response"

Biostatistics Introduction Quick Quiz
Question 3: Computation is a focus of these lessons.

No Response
True
False

Biostatistics Introduction Quick Quiz
Question 4: Conceptual understanding is a goal of these lessons.

No Response
True
False

Biostatistics Introduction Quick Quiz
Question 5: Biostatistical issues emphasized will be based on the needs of:

No Response
biostatisticians
evaluators
practicing physicians
medical researchers

Biostatistics Introduction Quick Quiz
Question 6: Inferential statistics are needed when you:

No Response
summarize data from a group
generalize from samples to populations
compute statistical measures
have huge samples

Biostatistics Introduction Quick Quiz
Question 7: Inferential statistics are not needed when you (check all that apply):

No Response
summarize data from a group
generalize from samples to populations
compute statistical measures
have huge samples

Biostatistics Introduction Quick Quiz
Question 8: You will do better with biostatistics if you focus on the underlying concepts rather than the jargon and the notation.

No Response
True
False

Biostatistics Introduction Quick Quiz
Question 9: Can the physician appropriately apply results obtained from patient groups to his own individual patients?

No Response
Yes
No
It depends

Biostatistics Introduction Quick Quiz
Question 10: Who decides whether a biostatistical result is clinically relevant?

No Response
Biostatistician
Patient
Physician
Other

End Lesson 1 - Section 2: Why Biostatistics

2: Central Tendency:

Why Important?
Why do you need to know about measures of central tendency? You need to be able to understand summaries of large amounts of data that use simple measures to best represent the location of the data as a whole. Collectively, such measures or values are referred to as measures of central tendency. Measures of central tendency are ubiquitous in the medical research literature. The most frequently used measures of central tendency are the mean, median and mode.
(Show Central Tendency transparency) We do try to display information in some graphical form. And this is a display of physicians' salaries in 1999, after all the health plans have come forward and... No, this was old data actually. But the point of this slide is to show you that there are various ways to represent a distribution of data. The mode is the most frequent value, the median has equal numbers above and below it, and the mean is the average value. And if the distribution is a nice symmetric distribution, all three of those collapse into one.

2.1 Mean
The most frequently used measure of central tendency is the mean. The mean, or more formally, the arithmetic mean, is simply the average of the group. That is, the mean is obtained by summing all the numbers for the subjects in the group and dividing by the number of subjects in the group. The mean is useful only for quantitative variables (see Figure).
[Insert Figure 2.3 about here]

2.2 Median
The median is the middle score. That is, the median is the score for which half the subjects have lower scores and half have higher scores. Another way to say this is that the median is the score at the fiftieth percentile in the distribution of scores (see Figure).
[Insert Figure 2.3 about here]

2.3 Mode
The mode is the most frequent score, that is, the score that occurs most often (see Figure).
[Insert Figure 2.3 about here]

2.4 Summary Principle
In a symmetric distribution with one mode, like the normal distribution, the mean, median and mode all have the same value. In a non-symmetric distribution their values will generally differ. As the distribution becomes more lopsided, the mean and the median move away from the mode. With extremely skewed distributions the mean can be misleading as a measure of central tendency, because it is heavily influenced by extreme scores. For example, in a distribution of doctors' incomes, some doctors make huge sums of money. The median or the mode is then more representative of doctors' incomes as a whole than the mean, because the very high incomes of a few doctors inflate the average (see Figure 2.3).
[Insert Figure 2.3 about here]
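The effect of skew on the three measures can be checked with a small sketch. The income figures below are hypothetical values chosen only for illustration, and the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical annual incomes in thousands of dollars (illustrative only).
# One very high income makes the distribution strongly skewed.
incomes = [120, 150, 150, 150, 160, 170, 180, 200, 250, 1200]

income_mean = mean(incomes)      # sum of the values divided by their count
income_median = median(incomes)  # middle of the ranked values
income_mode = mode(incomes)      # most frequent value

print("mean  :", income_mean)
print("median:", income_median)
print("mode  :", income_mode)
```

In this made-up sample the single very high income pulls the mean well above both the median and the mode, which is the pattern the summary principle describes.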

End Lesson 1 - Section 3: Why Biostatistics

3: Variability

Why Important?
Why do you need to know about measures of variability? You also need to be able to understand summaries of large amounts of data that use simple measures to best represent the variability in the data. Measures of variability also occur very frequently in the medical research literature. If all data values are the same, then, of course, there is zero variability. If all the values lie very close to each other there is little variability. If the numbers are spread out all over the place there is more variability. Again there are many measures of variability. Some of the most frequently used measures of variability are the standard deviation, interquartile range and the range.

3.1 Standard Deviation
(Show standard deviation formula transparency) We can measure the spread of the data by looking at the standard deviation, and it starts from the mean value that we extract from the data. We subtract the mean from each individual value, square each difference, sum the squares, divide by the number of values and take the square root. The reason that we subtract and square is pretty clear. Whether the value is above the mean or below the mean, it comes out the same when we square it, so positive and negative make no difference here. We divide by the number of values to get an average, and we take the square root of the whole thing, because we squared up here, to get back to the original measures. So by squaring to get rid of the negative and positive values we get squared measures, and we take the square root to get back to the original kinds of measures like feet, cubic inches or whatever else it might be.
The standard deviation can be thought of as roughly the average distance of the values from the mean of the distribution (see Figure). This means that you must be able to compute a meaningful mean to be able to compute a standard deviation. Consequently, computation of the standard deviation requires interval or ratio variables. In a distribution having a bell (normal) curve, approximately 68% of the values lie within 1 standard deviation of the mean. On the other hand, approximately 2.3% of the values lie in each tail of the distribution beyond 2 standard deviations from the mean (see Figure).
[Insert Figure 2.5 Normal Distribution about here]

[Insert Figure SD Formula Figure about here]
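The steps described for the standard deviation can be traced one at a time in a short sketch. The data values are hypothetical, and the calculation divides by the number of values, as in the description above. Many software packages instead divide by one less than the number of values when working with a sample (in Python, `statistics.stdev`), so results from other tools may differ slightly.

```python
import math
from statistics import pstdev

# Hypothetical data values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean_value = sum(values) / n

# Subtract the mean from each value and square the difference,
# so that values above and below the mean count equally.
squared_deviations = [(x - mean_value) ** 2 for x in values]

# Average the squared deviations, then take the square root
# to return to the original units of measurement.
variance = sum(squared_deviations) / n
sd = math.sqrt(variance)

print("mean:", mean_value)
print("standard deviation:", sd)
print("library check (pstdev):", pstdev(values))
```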

3.2 Interquartile Range
Remember that the median is the point in the distribution where 50% of the sample values are below and 50% are above. The quartiles are defined in the same way at the 25th, 50th and 75th percentiles, dividing the ranked values into four equal parts. The interquartile range is the distance between the 25th percentile and the 75th percentile, so the interval it spans includes the middle 50% of the values in the sample. The interquartile range is a measure of variability that can be appropriately applied with ordinal variables and therefore may be used especially in conjunction with non-parametric statistics (see Figure).
[Insert Figure for Interquartile Range & Range about here]
(Show EDA transparency) Another way to display data that's been proposed by exploratory data analysis is to rank the data from low to high, then measure the median and then the quartile values, that is, the values between which one half of the data resides. The rest of the data is out in the wings here. And here you find the interquartile range, which is the range from the lower to the upper quartile. And the range is the distance between the extreme values (max - min).
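As a rough illustration of the quartiles and the interquartile range, the sketch below uses `statistics.quantiles` from Python's standard library on hypothetical data. Statistical packages use slightly different conventions for locating quartiles, so the exact cut points shown here are one common choice rather than the only one.

```python
from statistics import quantiles

# Hypothetical sample values (illustrative only).
data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40]

# Three cut points split the ranked data into four parts:
# the 25th percentile, the median and the 75th percentile.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1                       # spread of the middle half of the data
data_range = max(data) - min(data)  # distance between the extreme values

print("quartiles:", q1, q2, q3)
print("interquartile range:", iqr)
print("range:", data_range)
```

Note how the single large value (40) stretches the range but leaves the interquartile range untouched.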

3.3 Range
The range is simply the difference between the highest and lowest values in the sample (see Figure).
[Insert Figure for Interquartile Range & Range about here]

4: Exploratory Data Analysis (EDA)

Why Important?
Why do you need to know about exploratory data analysis (EDA)? The purpose of EDA is to provide a simple way to obtain a big picture look at the data and a quick way to check data for mistakes to prevent contamination of subsequent analyses. Exploratory data analysis can be thought of as a preliminary to a more in depth analysis of the data (see Figure).
A primary tool in exploratory data analysis is the box plot (see Figure). What does a box plot tell you? You can, for example, determine the central tendency, the variability, the quartiles, and the skewness for your data. You can also quickly compare data from multiple groups visually. A small rectangular box is drawn with a line representing the median, while the top and bottom of the box represent the 75th and 25th percentiles, respectively. If the median is not in the middle of the box, the distribution is skewed. If the median is closer to the bottom, the distribution is positively skewed. If the median is closer to the top, the distribution is negatively skewed. Outliers and extreme values are often represented with circles and asterisks, respectively (see Figure).
[Insert Box and Whisker plot about here]

4.1 Hinges
The top and bottom edges of the box in a box plot are referred to as hinges or Tukey's hinges (see Figure).
[Insert Box and Whisker plot about here]

4.2 Ranges
? (see Figure).
[Insert Ranges Figure about here]

[Insert Box and Whisker plot about here]

4.3 Outliers
Outliers and extreme values are often represented with circles and asterisks, respectively. Outliers are values that lie from 1.5 to 3 box lengths (the box length represents the interquartile range) outside the hinges. Extreme values lie more than 3 box lengths outside the hinges. In a box and whisker plot the actual values of the scores will typically lie adjacent to the outlier and extreme value symbols to facilitate examination and interpretation of the data (see Figure).
[Insert Box and Whisker plot about here]
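The outlier and extreme value rules above can be written out as a small sketch. The data are hypothetical, and percentile-based quartiles stand in for the hinges, which is an assumption made here for simplicity (Tukey's hinges can differ slightly from them).

```python
from statistics import quantiles

# Hypothetical sample values (illustrative only).
data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 30, 60]

# Quartiles are used in place of the hinges (an assumption, see above).
q1, _, q3 = quantiles(data, n=4)
box_length = q3 - q1  # the interquartile range


def classify(value):
    """Label a value by how many box lengths it lies outside the box."""
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * box_length:
        return "extreme value"
    if distance >= 1.5 * box_length:
        return "outlier"
    return "not an outlier"


labels = {value: classify(value) for value in data}
for value, label in labels.items():
    print(value, label)
```

With these made-up numbers the box runs from 5 to 14, so 30 is flagged as an outlier and 60 as an extreme value.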

4.4 Box & Whisker plots
Box and whisker plots represent the range of values in the data more completely by extending vertical lines (the whiskers) from the box to the largest and smallest values that are not outliers. Short horizontal segments at the ends of these lines make more apparent the values beyond which outliers begin (see Figure).
[Insert Box and Whisker plot about here]
(Show infant mortality transparency) I have an example here, which is a plot of infant mortality in Houston, Texas in 1978, and this shows two congressional districts, 15 and 18. And what you see here is the median (about 20 deaths per thousand live births). And here you see some census tracts (on the left), and these are the numbers of live births (on the right) in those census tracts, and this is the ratio of those infant mortalities. So there were a hundred deaths per thousand live births in census tract 126 (apparently 176 live births). Here in Houston, Texas we have infant mortality rates that approach those in Bangladesh and other undeveloped countries. And this kind of diagram shows you those extremes very dramatically.
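Where the whiskers of a box and whisker plot end can be worked out numerically. The sketch below uses hypothetical data, treats percentile-based quartiles as the hinges (an assumption), and applies the 1.5 box length rule from the section on outliers.

```python
from statistics import quantiles

# Hypothetical sample values (illustrative only).
data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 18, 60]

q1, median_value, q3 = quantiles(data, n=4)
box_length = q3 - q1

# Values farther than 1.5 box lengths outside the box count as outliers.
lower_fence = q1 - 1.5 * box_length
upper_fence = q3 + 1.5 * box_length

# The whiskers stop at the most extreme values that are not outliers.
inside = [x for x in data if lower_fence < x < upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)
beyond_whiskers = [x for x in data if x not in inside]

print("box:", q1, "to", q3, "with median", median_value)
print("whiskers:", lower_whisker, "to", upper_whisker)
print("plotted individually:", beyond_whiskers)
```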

End Lesson 1 - Section 4: Why Biostatistics

5. Standard Scores

Why Important?
Why do you need to understand standard scores or z-scores? Again, they appear frequently in the medical literature. A natural question to ask about a given value from a sample is, "How many standard deviations is it from the mean?". The z-score answers the question. The question is important because it addresses not only the value itself, but also the relative position of the value. For example, if the value is 3 standard deviations above the mean, you know it is three times the average distance above the mean and represents one of the higher scores in the distribution. On the other hand, if the value is one standard deviation below the mean, then you know it is on the low end of the midrange of the values from the sample. But there is much more that is important about z-scores.

5.1 z-Scores
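For every value from a sample, a corresponding z-score can be computed. The z-score is simply the signed distance of the sample value from the mean, expressed in standard deviation units. There is a simple formula for computing z-scores, namely the value minus the mean, divided by the standard deviation (see Figure):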
[Show z-score formula transparency here]
And lastly we want to talk about how we take a whole bunch of data... Originally, you know, we talked about the mean value of how long people are going to live, and we had to somehow be able to take that value and compare it to other values from other experiments. How do we collapse all that stuff? Well, we have a way of normalizing or collapsing it. Generally what you do is take the mean value, subtract it from each one of the values and divide by the standard deviation. That then parameterizes the distribution so that it has a mean of 0 all the time and a standard deviation of 1, and so you can build tables for it. So no matter what your data looks like, no matter what the mean value is, you can reduce it to a table if you reformulate your data like this. So we can take all kinds of experiments and build tables for them, because we can normalize or reduce the data by doing things like forming a z value.
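The normalizing step described above can be sketched in a few lines. The data are hypothetical, and the standard deviation is computed by dividing by the number of values (`statistics.pstdev`), matching the description in the section on standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical sample values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)       # mean of the sample
sigma = pstdev(values)  # standard deviation, dividing by the number of values

# z-score: subtract the mean from each value, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

for x, z in zip(values, z_scores):
    print(f"value {x} has z-score {z:+.2f}")
```

Here the value 9 has a z-score of +2, meaning it lies two standard deviations above the mean, while the value 2 lies one and a half standard deviations below it.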

5.2 General z-Score Properties
Because every sample value has a corresponding z-score, it is possible to graph the distribution of z-scores for every sample. The z-score distributions share a number of common properties that are valuable to know. The mean of the z-scores is always 0. The standard deviation of the z-scores is always 1. The graph of the z-score distribution always has the same shape as the original distribution of sample values. The sum of the squared z-scores is always equal to the number of z-score values (when the standard deviation is computed by dividing by the number of values). Furthermore, z-scores above 0 represent sample values above the mean, while z-scores below 0 represent sample values below the mean (see Figure).
[Insert z-score graph about here]
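These general properties can be checked numerically on any data set. The sketch below uses a deliberately skewed, hypothetical sample to show that the properties do not depend on the shape of the distribution. It assumes the standard deviation is computed by dividing by the number of values (`statistics.pstdev`).

```python
from statistics import mean, pstdev

# Hypothetical, strongly skewed sample (illustrative only).
values = [120, 150, 150, 150, 160, 170, 180, 200, 250, 1200]

mu = mean(values)
sigma = pstdev(values)
z_scores = [(x - mu) / sigma for x in values]

z_mean = mean(z_scores)                          # expected to be 0
z_sd = pstdev(z_scores)                          # expected to be 1
sum_of_squares = sum(z * z for z in z_scores)    # expected to equal the count

print("mean of z-scores:", round(z_mean, 10))
print("standard deviation of z-scores:", round(z_sd, 10))
print("sum of squared z-scores:", round(sum_of_squares, 10))
print("number of values:", len(values))
```

The sign of each z-score also tells you on which side of the mean the original value falls.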

5.3 Gaussian z-Score Properties
If the sample values have a Gaussian (normal) distribution then the z-scores will also have a Gaussian distribution. The distribution of z-scores having a Gaussian distribution has a special name because of its fundamental importance in statistics. It is called the standard normal distribution. All Gaussian or normal distributions can be transformed using the z-score formula to the standard normal distribution. Statisticians know a great deal about the standard normal distribution. Consequently, they also know a great deal about the entire family of normal distributions. All of the previous properties of z-score distributions hold for the standard normal distribution. But, in addition, probabilities for all z-score values are known and tabulated. So, for example, it is known that approximately 68% of values lie within one standard deviation of the mean. Approximately 95% of values lie within 2 standard deviations of the mean. Approximately 2.3% of values lie more than 2 standard deviations below the mean. Approximately 2.3% of values lie more than 2 standard deviations above the mean (see Figure).
[Insert standard normal z-score graph about here]
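The tabulated probabilities quoted above can be reproduced with `statistics.NormalDist` from Python's standard library (available from Python 3.8). The second distribution, with a mean of 100 and a standard deviation of 15, is a hypothetical example used only to show that every normal distribution gives the same answer once distances are expressed in standard deviations.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the standard normal distribution

within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
below_minus_2 = std_normal.cdf(-2)
above_plus_2 = 1 - std_normal.cdf(2)

print(f"within 1 SD of the mean: {within_1:.1%}")
print(f"within 2 SD of the mean: {within_2:.1%}")
print(f"more than 2 SD below:    {below_minus_2:.1%}")
print(f"more than 2 SD above:    {above_plus_2:.1%}")

# Hypothetical normal distribution with mean 100 and SD 15.
other = NormalDist(mu=100, sigma=15)
other_within_1 = other.cdf(115) - other.cdf(85)
print(f"within 1 SD (mean 100, SD 15): {other_within_1:.1%}")
```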

End Lesson 1 - Section 5: Why Biostatistics

6. Distributions

Why Important?
Why do you need to know about distributions? Again the primary answer is that various kinds of distributions occur repeatedly in the medical research literature. Any time a set of values is obtained from a sample, each value may be plotted against the number or proportion of times it occurs in a graph having the values on the horizontal axis and the counts or proportions on the vertical axis. Such a graph is one way in which a frequency distribution may be displayed, since a frequency distribution is simply a table, chart or graph which pairs each different value with the number or proportion of times it occurs.
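A frequency distribution of this kind is easy to build by counting. The sketch below uses hypothetical scores and `collections.Counter` from Python's standard library to pair each distinct value with its count and its proportion.

```python
from collections import Counter

# Hypothetical scores recorded for ten patients (illustrative only).
scores = [3, 1, 2, 3, 3, 2, 4, 3, 2, 1]

counts = Counter(scores)
n = len(scores)

# Each row pairs a value with the number and proportion of times it occurs.
table = [(value, counts[value], counts[value] / n) for value in sorted(counts)]

print("value  count  proportion")
for value, count, proportion in table:
    print(f"{value:>5}  {count:>5}  {proportion:>10.2f}")
```

Plotting the values on the horizontal axis against the counts or proportions on the vertical axis gives the graph described above.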
It turns out that some distributions are particularly important because they naturally occur frequently in clinical situations. Some of the most important distributions are Gaussian, Binomial and Poisson distributions.

6.1 Gaussian
The Gaussian distribution or bell curve (also known as the normal distribution) is by far the most important, because it occurs so frequently and is the basis for the parametric statistical tests. When values are obtained by summing over a number of random outcomes, the sum tends to assume a Gaussian distribution. The Gaussian distribution gives a precise mathematical formulation to the "law of errors". The idea is that when measurements are made, most of the errors will be small, so most measurements fall close to the actual value. Some measurements will have greater error, but as the size of the error increases, the number of such errors decreases (see Figure).
[Insert Gaussian Figure about here]
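The tendency of sums of random outcomes to look Gaussian can be illustrated without simulation. The sketch below computes the exact distribution of the total shown by ten fair dice, a hypothetical example chosen for convenience, and checks how much probability lies within one standard deviation of the mean.

```python
import math


def sum_distribution(n_dice):
    """Exact probabilities for the total shown by n_dice fair six-sided dice."""
    dist = {0: 1.0}
    for _ in range(n_dice):
        new_dist = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new_dist[total + face] = new_dist.get(total + face, 0.0) + p / 6
        dist = new_dist
    return dist


dist = sum_distribution(10)

mu = sum(t * p for t, p in dist.items())
sigma = math.sqrt(sum((t - mu) ** 2 * p for t, p in dist.items()))
within_1_sd = sum(p for t, p in dist.items() if abs(t - mu) <= sigma)

print("mean of the total:", round(mu, 4))
print("standard deviation:", round(sigma, 4))
print("probability within 1 SD of the mean:", round(within_1_sd, 4))
```

Although a single die is equally likely to show any face, the total of ten dice is already bell shaped, and the probability within one standard deviation lands near the Gaussian figure of about 68%.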

6.2 Binomial
The family of binomial distributions is relevant whenever independent trials occur which can be categorized as having two possible outcomes, and known probabilities are associated with each of the outcomes. For example, if a student guesses on true-false questions without knowing the correct answers, each answer has an equal probability of being right or wrong. The binomial distribution would describe the probabilities associated with various numbers of right and wrong answers on such a true-false test. As another example, assume that we want to determine the probability that a genetically based defect will occur in the children of families of various sizes, given the presence of the characteristic in one of the parents. The binomial distribution would describe the probabilities that any given number of children in each family would inherit the defect. These are both examples of dichotomous variables, whose counts over multiple trials can be expected to follow a binomial distribution (see Figure).
[Insert Binomial Figure about here]
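The true-false example can be worked through numerically. The sketch below assumes a hypothetical 10-question test answered purely by guessing, with a probability of 0.5 of getting each question right, and uses `math.comb` (Python 3.8 or later) to count the ways of getting a given number right.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_questions = 10  # hypothetical test length
p_correct = 0.5   # pure guessing on a true-false question

pmf = [binomial_pmf(k, n_questions, p_correct) for k in range(n_questions + 1)]
p_at_least_8 = sum(pmf[8:])

for k, probability in enumerate(pmf):
    print(f"{k:>2} correct: {probability:.4f}")
print(f"8 or more correct by guessing: {p_at_least_8:.4f}")
```

The same function applies to the inheritance example by changing the number of trials to the number of children and the probability to the chance that each child inherits the defect.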

6.3 Poisson
Another important family of discrete distributions is the Poisson distribution. It is useful to think of the Poisson distribution as a special case of the binomial distribution, where the number of trials is very large and the probability is very small. More specifically, the Poisson is often used to model situations where the number of trials is indefinitely large, but the probability of a particular event at each trial approaches zero. The number of bacteria on a petri plate can be modeled as a Poisson distribution. Tiny areas on the plate can be viewed as trials, and a bacterium may or may not occur in such an area. The probability of a bacterium being within any given area is very small, but there are a very large number of such areas on the plate. Similar cases would be encountered when counting the number of red cells that fall in a square on a hemocytometer grid, looking at the distribution of the number of individuals in America killed by lightning strikes in one year, or counting the occurrences of HIV-associated needle sticks in US hospitals each year. The Poisson approximation to the binomial distribution is good enough to be useful even when N is only moderately large (say N > 50) and p only relatively small (p < .2) (Hayes, 1981) (see Figure).
[Insert Poisson Figure about here]
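How closely the Poisson distribution tracks the binomial can be seen in a short comparison. The numbers below are hypothetical: 1000 tiny areas on a plate, each with a probability of 0.002 of containing a bacterium, so that the expected count is 2.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam**k / factorial(k)


n_areas = 1000       # hypothetical number of tiny areas (trials)
p_bacterium = 0.002  # hypothetical chance of a bacterium in each area
expected_count = n_areas * p_bacterium

differences = []
for k in range(11):
    b = binomial_pmf(k, n_areas, p_bacterium)
    q = poisson_pmf(k, expected_count)
    differences.append(abs(b - q))
    print(f"{k:>2} bacteria: binomial {b:.5f}  Poisson {q:.5f}")

max_difference = max(differences)
print("largest difference:", round(max_difference, 5))
```

With many trials and a small probability, the two sets of probabilities agree closely, which is why the Poisson distribution is convenient for counts of rare events.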

End Lesson 1 - Section 6: Why Biostatistics
