You may have encountered questions on Biostatistics in your various PGME exams or while preparing for them. The MBBS curriculum does not give this field much importance. But if you are reading a study, or anything related to R&D, it is important that you understand it to the core rather than feel smug after reading only the abstract and conclusion. For that, understanding Biostatistics is essential. Much of the data in your reference textbooks, from Harrison's to Chaurasia's, is analysed using some fairly simple mathematical tests. You only need a very basic knowledge to get the gist of it.
As far as Biostatisticians are concerned, the Biostatistics questions asked in PGME exams are not knotty at all. For example, AIPGMEE 2010 contained this :
Que : Z-Score is used in?
a) Chi-Square test
b) Normal distribution
c) Binomial distribution
d) Student's t-test
The very first thing that will come to your mind (if you have learned and understood Biostatistics properly) is that the question is rather murky, yet still pretty simple because of the options. The answer is (b), Normal distribution.
What is this normal distribution? Almost everything in Biostatistics revolves around this continuous distribution. Wait, what is a continuous distribution? Let's begin with the two types of variables. A variable can be anything : height, weight, number of people with lung cancer, CRP level, etc. There are generally two types of variables.
a) Discrete - A variable which can take only some specific predefined values, such as integers. Examples :- "number of people who have undergone CABG in the age group of 50-70", "number of people with BMI more than 30.0", etc. Here the variable can take only integer values. It doesn't make sense to say that 15.33 patients have a BMI of more than 30.0. If it does, then it's the end of the world.
b) Continuous - There are infinitely many possible values that this variable can take over an interval, which can be finite or infinite. Examples :- "blood glucose level", "time taken to complete LSCS", etc.
Now let's focus on the term "distribution" or, more specifically, "probability distribution". To put it simply, probability is the chance of an event occurring. If we plot the probability of the values of a "random variable" on the y-axis and its values on the x-axis, we get a graph which is known as a probability distribution. There are discrete and continuous probability distributions for discrete and continuous random variables respectively. But let's bother about continuous probability distributions alone for the moment.
First of all, the normal distribution is the most studied and the most useful of all continuous probability distributions. "Normal Distribution" is the name given to all probability distributions that look like the one in the figure below (thankfully, many things in life follow this distribution and, in that respect, it's "normal") :-
I am sure that you are all familiar with the diagram above, and many of you know what the percentages written inside the shaded regions depict. But for beginners, let me point it out once more. This is the normal distribution curve. The y-axis shows how probable the values plotted on the x-axis are (strictly speaking, for a continuous variable this is the probability density). "σ" is the standard deviation and "μ" is the mean; but let's talk about them later.
I told you that the x-axis contains the values of the continuous random variable. As an example, let's just say that the values denote the heights of children in the age group of 10-15. As you may already know, the mean is the average, the mode is the value (here, height in cm) that occurs the greatest number of times in the sample, and the median is the value in the middle when you arrange the values in ascending or descending order. Intuitively, there will be more children with heights close to the mean than far from it. The number of children with extreme heights will be considerably smaller. We are talking about exactly that probability, and that is exactly what is shown in the figure above! The centre of the curve is the mean, and you can see that on both sides the probability is relatively higher for values closer to the mean than for values farther away.
"σ" is a measure of the dispersion of the values of the sample around their mean. You can easily see that 68.3% of the values are within the range "μ-σ" to "μ+σ", 95.4% in the range from "μ-2σ" to "μ+2σ" and almost all (99.7%) will be found in the range from "μ-3σ" to "μ+3σ".
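If you would like to check these percentages yourself, here is a minimal Python sketch. It relies on the standard result that, for any normal distribution, the fraction of values within k standard deviations of the mean equals erf(k/√2); the function name `coverage_within` is just an illustrative choice.

```python
import math

def coverage_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For any normal distribution: P(mu - k*sigma <= X <= mu + k*sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma of the mean: {coverage_within(k) * 100:.1f}%")
```

The result does not depend on the particular mean or standard deviation, which is why the same three percentages appear on every normal curve.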
Can we define a formula for the probability of a normal distribution given its mean and standard deviation? Yes :
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where f(x) is the value plotted on the y-axis and 'x' is the value on the x-axis. Don't worry about f(x) and the terms inside it. It's just the standard notation for describing normal distributions. You can see that f(x) is a function depending on the mean "μ" and the standard deviation "σ" of the values represented by 'x'. This function is termed the probability density function. Just remember the term, as you may encounter it at least once in your life.
If you want to find f(x) for a child's height of 160 cm in a sample with mean equal to 155 and standard deviation equal to 10, simply substitute 160 for "x", 155 for the mean and 10 for the standard deviation in the above formula and voila, you will get the answer. Strictly speaking, because height is a continuous variable, this answer is a probability density and not the probability of a height of exactly 160 cm; actual probabilities correspond to areas under the curve over a range of heights. Now what about a child from this sample with a height of 155 cm? 155 is actually the mean of the sample and I am asking you for the value corresponding to it. Here both "x" and the mean are 155, the exponent becomes zero and the equation becomes much simpler. Try it out yourself.
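The substitution described above can be sketched in a few lines of Python. This assumes the usual normal density, f(x) = exp(-(x-μ)²/(2σ²)) / (σ√(2π)); the function name `normal_pdf` is only illustrative.

```python
import math

def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Height of the normal curve (probability density) at x."""
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / (sd * math.sqrt(2 * math.pi))

# Heights from the example in the text: mean 155 cm, standard deviation 10 cm.
density_160 = normal_pdf(160, mean=155, sd=10)
density_155 = normal_pdf(155, mean=155, sd=10)  # exponent is 0 here, so exp(...) is 1

print(f"f(160) = {density_160:.4f}")
print(f"f(155) = {density_155:.4f}")
```

As expected from the shape of the curve, the value at the mean (155) is the largest one the function can take for this sample.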
Now that we have covered the basics of the normal distribution and you are ready to follow the curve, let's introduce a new term: the "standard normal distribution" (SND). Is it any different from the normal distribution? First of all, I would say that all standard normal distributions are normal distributions, but not all normal distributions are standard normal distributions. That's right. A standard normal distribution is simply a normal distribution with mean equal to zero and standard deviation equal to 1. Okay, fine. But why do we need the SND?
Well, for almost all purposes in the testing of hypotheses, we need the SND and its applications.
The formidable Central Limit Theorem (CLT) tells us that the sample mean of most of the important variables that you will be concerned with (height, weight, CRP level, PEFR, etc.) approximately follows a normal distribution, especially if your sample size is large. The CLT has profound applications, and without it much of hypothesis testing would be meaningless. Since we know the mean and standard deviation of the SND, it is possible to find its values and their corresponding probabilities easily. These values are put in a table, which is available for reference at the end of many textbooks such as your beloved Park. This is the Standard Normal Distribution Table. But wait! These values are NOT the values of f(x) calculated using the above equation. They are actually "cumulative probabilities". This is an important concept to understand clearly, and it is explained next.
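Before moving on, here is a small simulation sketch of the Central Limit Theorem mentioned above. Everything in it is an illustrative assumption rather than data from a real study: the raw values are drawn from a heavily skewed exponential distribution (mean 1, standard deviation 1), the sample size is 50, and the experiment is repeated 5000 times with a fixed seed. If the sample means behave normally, roughly 95.4% of them should fall within two standard deviations of the sample mean, i.e. within 2σ/√n of the population mean.

```python
import math
import random

def simulate_sample_means(n: int, repeats: int, seed: int = 0) -> list:
    """Draw `repeats` samples of size n from a skewed distribution and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(repeats):
        sample = [rng.expovariate(1.0) for _ in range(n)]  # exponential, mean 1, sd 1
        means.append(sum(sample) / n)
    return means

n = 50
means = simulate_sample_means(n=n, repeats=5000)

population_mean, population_sd = 1.0, 1.0
sd_of_sample_mean = population_sd / math.sqrt(n)

inside = sum(
    1 for m in means if abs(m - population_mean) <= 2 * sd_of_sample_mean
) / len(means)
print(f"fraction of sample means within 2 sd of the population mean: {inside:.3f}")
```

Even though the individual values are far from normal, the fraction printed should come out close to the 95.4% that a normal curve predicts.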
f(x) was the probability density at a value "x" of the normal distribution (or the SND, in this case). The cumulative distribution function is denoted by "F(x)" and is defined as the probability that a value you select from the sample is less than or equal to "x". For a continuous variable, the probability of a value being exactly equal to "x" is zero; f(x) is only the height of the curve at "x", while F(x) is the area under the curve to the left of "x". Simple, but you should always keep in mind that the probability values given in the Standard Normal Distribution Table are F(x) and not f(x). Clearly, F(x) can be calculated from f(x), but that's not very important in this context.
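For the curious, here is a rough sketch of how F(x) can be obtained from f(x) for the SND: add up the area under the curve to the left of "x". The trapezoidal rule, the lower cut-off of -10 and the number of steps are my own illustrative choices, not how the printed tables were produced.

```python
import math

def std_normal_pdf(x: float) -> float:
    """f(x) for the standard normal distribution (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def std_normal_cdf_numeric(x: float, lower: float = -10.0, steps: int = 20000) -> float:
    """Approximate F(x) as the area under f between `lower` and x (trapezoidal rule)."""
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    total = 0.5 * (std_normal_pdf(lower) + std_normal_pdf(x))
    for i in range(1, steps):
        total += std_normal_pdf(lower + i * h)
    return total * h

print(f"F(0)    = {std_normal_cdf_numeric(0.0):.4f}")
print(f"F(1.96) = {std_normal_cdf_numeric(1.96):.4f}")
```

Half of the area lies to the left of the mean, so F(0) should come out as 0.5, which is a handy sanity check.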
Standard Normal Distribution Table : Positive Z Scores Negative Z Scores
How to read the Standard Normal Distribution table?
The topmost row and the leftmost column give the values of "x" of the SND. There are negative values in the table because the mean of the SND is zero, so there have to be values on either side of the mean, and symmetrically at that. Otherwise it wouldn't be "normal" at all! All the other values in this table are cumulative probabilities; none of them is negative or greater than one (probabilities lie in the range [0, 1]).
The topmost row adds precision to the main values in the leftmost column. For example, if you want to find the probability of getting a value less than 1.96, first look in the leftmost column and find 1.9. Now look for 0.06 in the topmost row, which contains values such as 0.01, 0.02, 0.03, etc., because 1.96 = 1.9 + 0.06. The cell under "0.06" in the row of "1.9" contains the probability that we wanted. It comes to 0.9750.
Similarly :
F(-1.96) = 0.024998
F(+3.90) = 0.999952
Do these values make any sense to you? As you can infer from the SND graph, 3.9 is quite far from the mean, 0. F(3.9) means the probability of getting a value less than or equal to 3.9. You already know that 99.7% of the values in the SND lie between -3 and 3 (since the standard deviation here is 1 and the mean is 0). Here we are considering something even farther out than 3, i.e. 3.9. Obviously F(3.9) would be very close to 1. That is exactly what we got from the table : 0.999952. Similarly, F(-1.96) is comparatively small (0.024998), because only the left tail of the curve lies below -1.96.
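If you do not have a printed table at hand, the same cumulative probabilities can be reproduced with a short Python sketch. It uses the standard identity F(z) = (1 + erf(z/√2))/2 for the SND; the function name `standard_normal_cdf` is only illustrative.

```python
import math

def standard_normal_cdf(z: float) -> float:
    """Cumulative probability F(z) of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The three look-ups discussed in the text.
for z in (1.96, -1.96, 3.90):
    print(f"F({z:+.2f}) = {standard_normal_cdf(z):.6f}")
```

Note that F(1.96) and F(-1.96) add up to 1, which is the symmetry of the curve showing up in the numbers.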
Why SND?
In most real-life cases, whether it's the height of children or the blood glucose level of men aged between 35 and 60, the values can be brought onto the scale of the SND by a process known as standardisation. This is done by subtracting the population mean from the sample mean and dividing by the standard deviation of the sample mean. The population is the whole data. A sample is taken from a population. Hence the population mean is the "real" mean and the sample mean is an estimate of the population mean. If you take the mean of all possible sample means from a population, it will be equal to the population mean. The sample mean is represented by :-
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where x_1, ..., x_n are the 'n' values in the sample.
Standardisation of the sample mean is given by the formula :
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
By this, we get another variable 'Z'. Here x̄ is the sample mean, "μ" is the population mean, "σ" is the population standard deviation and 'n' is the sample size (e.g. the number of children whose heights are measured to get the sample mean). Since 'Z' is obtained by standardising the sample mean, it follows the Standard Normal Distribution (approximately, when 'n' is large)! This is the catch, and many of the advanced procedures in medical hypothesis testing are built entirely upon the above formula and concept. Let's not delve too much into that; curb your awe for now. The purpose here is to understand 'Z'. Since 'Z' follows the SND, we know everything about 'Z' (from the table). Indirectly, we know everything about the sample too. A Z-score is nothing but a specific value of 'Z' that you get when you standardise your sample mean, and it is used for creating confidence intervals, testing hypotheses and various other purposes.
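To make the Z-score concrete, here is a small worked sketch in Python using Z = (sample mean - population mean) / (σ/√n). All the numbers are hypothetical and chosen only for illustration: a sample of 25 children with a mean height of 158 cm, taken from a population assumed to have a mean of 155 cm and a standard deviation of 10 cm.

```python
import math

def z_score_of_sample_mean(sample_mean: float, population_mean: float,
                           population_sd: float, n: int) -> float:
    """Standardise a sample mean: Z = (x_bar - mu) / (sigma / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

def standard_normal_cdf(z: float) -> float:
    """Cumulative probability F(z) of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical values, for illustration only.
z = z_score_of_sample_mean(sample_mean=158.0, population_mean=155.0,
                           population_sd=10.0, n=25)
print(f"Z    = {z:.2f}")
print(f"F(Z) = {standard_normal_cdf(z):.4f}")
```

Here Z works out to 1.5, and F(1.5) is what you would read from the row "1.5" and column "0.00" of the Standard Normal Distribution Table.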
Let me stop here for the time being. As I mentioned earlier, the AIPGMEE 2010 question was rather vague. Now you may understand why. A quick glance at the basics of probability and some reading on discrete probability distributions will be highly interesting for those who are curious. I believe that the basics of the normal distribution and the relation between the Z-score and the normal distribution are clear to readers by now. Last but not least, the normal distribution is also known as the Gaussian distribution.