Monday, November 19, 2012

Demystifying the Normal Distribution and Z-score for Medicos

You may have encountered questions on Biostatistics in your various PGME exams or while preparing for them. During the MBBS curriculum, not much importance is given to this field. But if you are going through a study or anything related to R&D, it is very important that you understand it to the core and are not content with just reading the abstract and the conclusion. For that, understanding Biostatistics is essential. All the data in your reference textbooks, from Harrison's to Chaurasia's, are arrived at using some simple statistical tests. You only need very basic knowledge to get the gist of it.


As far as biostatisticians are concerned, the Biostatistics questions asked in PGME exams are not knotty at all. For example, AIPGMEE 2010 contained this one:

Que : Z-Score is used in?
          a) Chi-Square test
          b) Normal distribution
          c) Binomial distribution
          d) Student's t-test

The very first thing that will come to your mind (if you have learned and understood Biostatistics properly) is that the question is somewhat vaguely worded, yet still pretty simple because of the options. The answer is (b), Normal distribution.

What is this normal distribution? Almost everything in Biostatistics revolves around this continuous distribution. Wait, what is a continuous distribution? So let's begin with the two types of variables. A variable can be anything: height, weight, number of people with lung cancer, CRP level, etc. There are generally two types of variables.
a) Discrete - A variable which can take only certain specific, predefined values, such as integers. Examples would be "number of people who have undergone CABG in the age group of 50-70", "number of people with BMI more than 30.0", etc. Here the variable can take only integer values. It doesn't make sense to say that 15.33 patients have a BMI of more than 30.0; if it does, then it's the end of the world.
b) Continuous - There is an infinite number of possible values that this variable can take over an interval, which itself may be finite or infinite. Examples: "blood glucose level", "time taken to complete an LSCS", etc.

Now let's focus on the term "distribution", or more specifically, "probability distribution". To put it simply, probability is the chance of an event occurring. If we plot the probability of the values of a "random variable" on the y-axis and the values themselves on the x-axis, we get a graph known as a probability distribution. There are discrete and continuous probability distributions for discrete and continuous random variables respectively. But let's concern ourselves with continuous probability distributions alone for the moment.
First of all, the normal distribution is the most studied and the most useful among all continuous probability distributions. "Normal Distribution" is the name given to all probability distributions that look like the one in the figure below (thankfully, many things in life follow this distribution and, in that respect, it is "normal"):
[Figure: the normal distribution curve, with the percentages of values lying within 1, 2 and 3 standard deviations (σ) of the mean (μ) marked on the shaded regions]
I am sure that you are all familiar with the diagram above, and many of you know what the percentages written inside the shaded regions depict. But for beginners, let me point it out once more. The above is the normal distribution curve. The y-axis is the probability of the values plotted correspondingly on the x-axis. "σ" is the standard deviation and "μ" is the mean; we will talk about them later. I told you that the x-axis contains the values of the continuous random variable. As an example, let's just say that the values denote the heights of children in the age group of 10-15. As you may already know, the mean is the average, the mode is the value (here, height in cm) that occurs the maximum number of times in the sample, and the median is the value in the middle when you arrange the values in ascending or descending order. Intuitively, there will be more children with heights close to the mean than far from it; the number of children with extreme heights will be considerably smaller. We are talking about probability, and that is exactly what is shown in the figure above! The centre of the curve is at the mean, and you can see that on both sides the probability is relatively higher for values closer to the mean than for those farther away.
"σ" is a measure of the dispersion of the values of the sample from the mean of the values. You can easily see that 68.1% of the values are within the range "μ-σ" to "μ+σ", 95.4 % in the range from "μ-2σ" to "μ+2σ"  and almost all (99.7%) will be found in the range from "μ-3σ" to "μ+3σ". Can we define a formula for the probability of a normal distribution given its mean and standard deviation?
Can we define a formula for the probability of a normal distribution, given its mean and standard deviation? Yes, and it looks like this:

f(x) = [1 / (σ√(2π))] × e^( −(x − μ)² / (2σ²) )
where f(x) is the probability (strictly speaking, the probability density) plotted on the y-axis and 'x' is the value on the x-axis. Don't worry about f(x) and the terms inside it; it is just the standard notation for describing normal distributions. You can see that f(x) is a function of the mean and the standard deviation of the values represented by 'x'. This function is termed the probability density function. Just remember the term, as you may encounter it at least once in your life.
If you want to find the probability of a child having a height of 160 cm in a sample with mean = 155 and standard deviation = 10, simply substitute 160 for "x", 155 for the mean and 10 for the standard deviation in the above formula and voila, you will get the answer. Now what about the probability of a child from this sample having a height of 155 cm? 155 is actually the mean of the sample, and I am asking you for the probability corresponding to it. Here both "x" and the mean are 155, and the equation becomes much simpler. Try it out yourself.
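Here is a minimal sketch of that calculation in Python (the values 160, 155 and 10 are just the example figures above):

# Normal probability density for a height of 160 cm, with mean = 155 and SD = 10.
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    # f(x) = 1 / (sd * sqrt(2*pi)) * exp(-(x - mean)^2 / (2 * sd^2))
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

print(normal_pdf(160, 155, 10))   # roughly 0.0352
print(normal_pdf(155, 155, 10))   # the peak of the curve, roughly 0.0399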
Now that we have covered the basics of the normal distribution and you are ready to follow the curve, let's introduce a new term: the "standard normal distribution" (SND). Is it any different from the normal distribution? First of all, I would say that all standard normal distributions are normal distributions, but not all normal distributions are standard normal distributions. That's right. A standard normal distribution is simply a normal distribution with mean equal to zero and standard deviation equal to 1. Okay, fine. But why do we need the SND? Well, for almost all purposes in testing of hypotheses, we need the SND and its applications.
It is proven by the formidable Central Limit Theorem (CLT) that the sample mean of most of the important variables that you will be concerned with, such as height, weight, CRP level, PEFR, etc., follows a normal distribution, especially if your sample size is large. The CLT has profound applications, and without it much of hypothesis testing would be meaningless. Since we know the mean and standard deviation of the SND, it is possible to find out its values and their corresponding probabilities easily. These values are put in a table that is available for reference at the end of many textbooks, such as your beloved Park. This is the Standard Normal Distribution Table. But wait! These values are NOT the probability values [f(x)] calculated using the above equation. They are actually "cumulative probabilities". This is an important concept to understand clearly, and it is described in the next paragraph.
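To see the CLT at work, here is a rough simulation sketch in Python (the exponential distribution, the sample size of 50 and the 2000 repetitions are arbitrary choices made purely for illustration):

# Individual values come from a clearly non-normal (exponential) distribution
# with true mean 1.0, yet the means of repeated samples cluster around 1.0
# in a roughly bell-shaped way, with spread close to 1/sqrt(50).
import random
from statistics import mean, stdev

random.seed(1)
n = 50                                   # size of each sample
sample_means = [mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(2000)]

print("mean of the sample means  :", round(mean(sample_means), 3))   # close to 1.0
print("spread of the sample means:", round(stdev(sample_means), 3))  # close to 0.141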
f(x) was the probability of getting the value "x" from the normal distribution, or from the SND in this case. The cumulative distribution function is denoted by "F(x)" and is defined as the probability that a value you select from the sample is less than or equal to "x". A mere "equal to x" probability is nothing but f(x). Simple, but it is very important to always keep in mind that the probability values given in the Standard Normal Distribution Table are F(x) and not f(x). Clearly, F(x) can be calculated from f(x), by accumulating the area under the curve up to x, but that's not very important in this context.
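Although you will never need to do this by hand, the following rough sketch in Python (a crude rectangle-sum approximation, with the lower limit and step size chosen arbitrarily) shows how F(x) for the SND is built up from f(x):

# F(x) is the area under f accumulated from the far left up to x.
from math import exp, pi, sqrt

def f(x):                                  # SND density: mean 0, SD 1
    return exp(-x * x / 2) / sqrt(2 * pi)

def F(x, lower=-10.0, step=0.001):         # cumulative probability up to x
    total, t = 0.0, lower
    while t < x:
        total += f(t) * step
        t += step
    return total

print(round(F(0.0), 3))    # about 0.5 -- half the values lie below the mean
print(round(F(1.96), 3))   # about 0.975, matching the table reading below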

Standard Normal Distribution Table: Positive Z-Scores | Negative Z-Scores
How to read the Standard Normal Distribution table?
The topmost row and the leftmost column together give the values of "x" of the SND. There are negative values in the table because the mean of the SND is zero, so there have to be values on either side of the mean, and symmetrically so. Otherwise it wouldn't be "normal" at all! All the other values in the table are cumulative probabilities; they are all positive (probabilities can't be negative!) and no more than one (probabilities lie in the range [0, 1]).
The topmost row adds precision to the main values in the leftmost column. For example, if you want to find the probability of getting a value less than 1.96, first look down the leftmost column and find 1.9. Now look for 0.06 in the topmost row (it contains values such as 0.01, 0.02, 0.03, and so on); 1.96 = 1.9 + 0.06. The cell under "0.06" in the row of "1.9" contains the probability we wanted. It comes to 0.9750.
In the same way:
F(-1.96) = 0.024998
F(+3.90) = 0.999952
Do these values make any sense to you? As you can infer from the SND graph, 3.9 is quite far from the mean of 0. F(3.9) means the probability of getting a value less than or equal to 3.9. You already know that 99.7% of the values in the SND lie between -3 and 3 (since the standard deviation here is 1 and the mean is 0). Here we are considering something even farther out than 3, i.e. 3.9. Obviously F(3.9) will be very close to 1, and that is exactly what we get from the table: 0.999952. Similarly, F(-1.96) is comparatively much smaller (0.024998).
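These table readings also match what a quick computation gives. Here is a minimal check in Python (math.erf is the standard library's error function; for the SND, F(x) = 0.5 × (1 + erf(x/√2))):

# Reproducing the table values above without a table.
from math import erf, sqrt

def F(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

print(round(F(1.96), 4))    # 0.975
print(round(F(-1.96), 6))   # 0.024998
print(round(F(3.90), 6))    # 0.999952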

Why SND?   
In most real-life cases, whether it's the height of children or the blood glucose level of men aged between 35 and 60, the sample mean can be converted to a variable that follows the SND by a process known as standardisation. This is done by subtracting the population mean from the sample mean and dividing by the standard deviation of the sample mean (which is σ/√n, as in the formula below). The population is the whole data; a sample is taken from a population. Hence the population mean is the "real" mean and the sample mean is an estimate of the population mean. If you take the mean of all possible sample means from a population, it will be equal to the population mean. The sample mean is represented by x̄ (read as "x bar"), the average of the n observed values:

x̄ = (x₁ + x₂ + … + xₙ) / n
Standardisation of the sample mean is given by the formula:

Z = (x̄ − μ) / (σ / √n)

where μ and σ are the population mean and the population standard deviation.
By this, we get another variable, 'Z'. 'n' is the sample size (e.g. the number of children whose heights are measured to get the sample mean). Since 'Z' is obtained by standardising the sample mean, it follows the Standard Normal Distribution! This is the catch, and many of the advanced procedures in medical hypothesis testing are built entirely upon the above formula and concept. Let's not dwell too much on that; the purpose here is to understand 'Z'. Since 'Z' follows the SND, we know everything about 'Z' (from the table). Indirectly, we know a great deal about the sample too. A Z-score is nothing but the specific value of 'Z' that you get when you standardise your sample mean, and it is used for constructing confidence intervals, testing hypotheses and in various other situations.
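Here is a minimal sketch in Python putting all of this together (the numbers are invented purely for illustration, not from any real study: suppose the population mean height is 150 cm with a standard deviation of 10 cm, and a sample of 25 children has a sample mean of 154 cm):

# Standardising a hypothetical sample mean and looking up its cumulative probability.
from math import erf, sqrt

pop_mean, pop_sd, n = 150.0, 10.0, 25
sample_mean = 154.0

z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))   # the Z-score
print("Z =", z)                                     # 2.0

# Since Z follows the SND, the table (or erf) tells us how unusual this sample mean is:
print("F(Z) =", round(0.5 * (1 + erf(z / sqrt(2))), 4))   # about 0.9772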
Let me stop here for the time being. As I mentioned earlier, the AIPGMEE 2010 question was kind of a vague one; now you may understand why. A quick glance at the basics of probability and a read on discrete random variable distributions will be highly interesting for those who are curious. I believe the basics of the normal distribution and the relation between the Z-score and the normal distribution should be clear to readers by now. Last but not least, the normal distribution is also known as the Gaussian distribution.
