In data science, one of the key concepts that shows up in many forms is hypothesis testing. For example, in regression, if we want to know whether a coefficient is statistically significant, we perform a hypothesis test. So before we dive in and understand what a hypothesis test is, it helps to look at the big picture: why we need this concept and what problem it solves for us.
The holy grail (so to speak) of statistics is to draw accurate inferences about a population: for example, the average IQ of Indians, or the proportion of Indians who are right-handed. But it is not difficult to see that we cannot give an IQ test to every Indian, or survey everyone to find out the proportion of people who are right- or left-handed. It is simply not feasible; it is prohibitively expensive. This is why we work with samples. So we are always on the lookout for techniques that allow us to draw inferences about a population based on samples taken from it. This quest leads us to two (of many) techniques:
- Estimate the population parameters (mean, variance, etc.) based on a sample
- Make decisions concerning the values of those parameters
We will look at the various estimation techniques in another post. The second category, in which we make decisions concerning the values of parameters, is handled through hypothesis testing.
If hypothesis testing is a technique, the natural question is: for which problems is it the right solution? More specifically, given a problem, should we estimate the population parameter or should we perform a hypothesis test? Well, like everything, it depends. It depends on the question you are trying to answer. If the question is "what is the average IQ?", we estimate; if it is "is the average IQ different from 100?", we test.
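To make the difference between the two questions concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the data are simulated (made-up IQ-like scores), the function name `estimate_and_test` is hypothetical, and the large-sample normal approximation (a z-test) is our choice for the example rather than the only way to do this.

```python
import random
from statistics import NormalDist, mean, stdev


def estimate_and_test(sample, mu_0, alpha=0.05):
    """Estimate the population mean and test H0: mean == mu_0.

    Illustrative only: relies on a large-sample normal approximation.
    """
    n = len(sample)
    sample_mean = mean(sample)
    std_error = stdev(sample) / n ** 0.5

    # Estimation: a (1 - alpha) confidence interval for the population mean.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * std_error, sample_mean + z_crit * std_error)

    # Decision: two-sided z-test of the claim "population mean equals mu_0".
    z = (sample_mean - mu_0) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {"mean": sample_mean, "ci": ci, "z": z,
            "p_value": p_value, "reject_h0": p_value < alpha}


# Made-up data: 400 simulated IQ-like scores standing in for a real sample.
rng = random.Random(42)
sample = [rng.gauss(102, 15) for _ in range(400)]

result = estimate_and_test(sample, mu_0=100)
print(f"Estimated mean: {result['mean']:.1f}")
print(f"95% confidence interval: {result['ci'][0]:.1f} to {result['ci'][1]:.1f}")
print(f"p-value for H0 (mean = 100): {result['p_value']:.4f}")
print("Reject H0" if result["reject_h0"] else "Fail to reject H0")
```

The first half of the function answers "what is the value?", the second half answers "is the value equal to 100 or not?". Both use the same sample; only the question changes.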
Hypothesis testing boils down to a process of drawing conclusions from sample data about the entire, but unknown, group (the population) from which the sample was randomly taken. What should the key features of such a process be? Among other things, it should provide a systematic and robust way to either:
- Believe that the relationship (between the dependent and independent variables) seen in the sample is real, and that we would see the same relationship if our test were run on the entire population, or
- Conclude that the relationship we see in the sample is merely due to sampling error, and that no such relationship would be seen if we ran the test on the entire population.
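One way to get a feel for the "merely sampling error" option is a permutation test, sketched below in Python. This is our own illustration, not a procedure the post prescribes: the helper names `pearson_r` and `permutation_p_value` are hypothetical, the data are simulated, and the correlation coefficient is just one possible measure of a relationship. The idea is to shuffle one variable many times, which destroys any real relationship, and see how often chance alone produces a correlation as strong as the one observed.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself as one permutation.
    return (hits + 1) / (n_permutations + 1)


# Made-up sample of 30 points in which y really does depend on x.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

p = permutation_p_value(xs, ys)
print(f"Observed r: {pearson_r(xs, ys):.3f}")
print(f"Permutation p-value: {p:.4f}")
```

A small p-value means that shuffled data almost never look like our sample, so sampling error alone is an unlikely explanation. A large p-value means that chance produces such a relationship quite often.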
Once we have such a process in place, we generally adapt it into a more specific inferential procedure based on the data we have, the scales of the variables, and so on. These procedures can be classified into two broad categories: parametric and non-parametric.
Parametric procedures allow us to draw inferences only when specific assumptions about our population are true. Specifically, the following two:
- The population of the dependent variable is normally distributed
- The scores in our data are on an interval or ratio scale
Non-parametric procedures do not require any such assumptions about the population or the variables. They can be used with nominal- or ordinal-scale variables, and also when the population is skewed.
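As a rough illustration of why a non-parametric procedure does not care about the shape of the population, here is a small Python sketch of one rank-based test, the Mann-Whitney U test. The choice of test is ours, and the helper names `average_ranks` and `mann_whitney_u` are hypothetical. The p-value uses the normal approximation without tie or continuity corrections, which is a simplification; the data are made up.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for group a, two-sided p-value via normal approximation)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # no tie correction
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Made-up scores for two groups; the second version has an extreme outlier.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
group_b_skewed = [6, 7, 8, 9, 1000]

print(mann_whitney_u(group_a, group_b))
print(mann_whitney_u(group_a, group_b_skewed))
```

Both calls give the same result, because the test only looks at the order of the scores, not at how far apart they are. That is why such procedures remain usable with ordinal data and skewed populations, where a mean-based parametric test would be affected by the outlier.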
In future posts we will get into more details to understand each of these categories.