How do we check if a variable follows the normal distribution?
- Plot a histogram of the sampled data. If the bell-shaped normal curve fits the histogram well, there is no visual evidence against normality. This is an informal check rather than a formal test, and its outcome depends on the bin width and the sample size.
- Check the skewness and kurtosis of the sampled data. A normal distribution has skewness 0 and kurtosis 3 (excess kurtosis 0), so the farther the sample values are from these, the less normal the distribution looks.
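- Use the Kolmogorov-Smirnov and/or Shapiro-Wilk tests for normality. Kolmogorov-Smirnov compares the empirical cumulative distribution function of the sample with the normal CDF, while Shapiro-Wilk measures how well the ordered sample values agree with the order statistics expected under normality. Neither test is built on skewness and kurtosis; tests that combine the two are D'Agostino's K-squared and Jarque-Bera.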
- Check the Quantile-Quantile (Q-Q) plot. It is a scatterplot created by plotting two sets of quantiles against one another, here the sample quantiles against the theoretical normal quantiles. If the data is normal, the points fall roughly on a straight line.
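The numeric checks from the list above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and SciPy are available; the helper name `normality_report` and the two synthetic samples are made up for the example and are not part of the original answer.

```python
import numpy as np
from scipy import stats

def normality_report(sample):
    """Collect a few normality diagnostics for a 1-D sample (illustrative helper)."""
    x = np.asarray(sample, dtype=float)

    # Skewness ~ 0 and (Pearson) kurtosis ~ 3 are expected under normality.
    skew = stats.skew(x)
    kurt = stats.kurtosis(x, fisher=False)

    # Shapiro-Wilk: a small p-value is evidence against normality.
    _, shapiro_p = stats.shapiro(x)

    # Kolmogorov-Smirnov against a normal with mean/std estimated from the
    # sample. Estimating the parameters from the same data makes this
    # p-value only approximate (it tends to be too large).
    _, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

    # Q-Q plot data: probplot also fits a line through the points and
    # returns its correlation r; r close to 1 means a nearly straight line.
    (_, _), (_, _, qq_r) = stats.probplot(x, dist="norm")

    return {"skew": skew, "kurtosis": kurt,
            "shapiro_p": shapiro_p, "ks_p": ks_p, "qq_r": qq_r}

# Synthetic data, assumed purely for demonstration.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=5, size=500)
skewed_sample = rng.lognormal(mean=3, sigma=1.0, size=500)

normal_report = normality_report(normal_sample)
skewed_report = normality_report(skewed_sample)

for name, report in [("normal", normal_report), ("skewed", skewed_report)]:
    print(name, {k: round(float(v), 4) for k, v in report.items()})
```

To draw the Q-Q plot itself rather than only compute its correlation, `stats.probplot(x, dist="norm", plot=plt)` can be called with `plt` being `matplotlib.pyplot`.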
What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
Usually not. Real-world datasets, especially uncleaned ones, are rarely exactly normal and almost always show some skewness, and prices are no exception. The price of a house, or of anything else under consideration, depends on a number of factors, so there is a good chance that the data contains extreme values, i.e. outliers in data science terms, that skew the distribution.
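Yes, you may need to do pre-processing. A common step is to remove the outliers so that the distribution gets closer to normal; applying a log transform to the price is another widely used option when the distribution is skewed to the right.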
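As a rough illustration of such pre-processing, the sketch below compares the skewness of a price column before and after two treatments: dropping outliers with the 1.5 x IQR rule and taking a log transform. The prices are synthetic (drawn from a log-normal distribution purely for demonstration), and the helper `iqr_filter` and the fence multiplier `k=1.5` are assumptions of this example, not a prescribed recipe.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices", assumed only for this example.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.6, size=1000)

def iqr_filter(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR] (illustrative helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[mask]

# Option 1: drop the outliers.
trimmed = iqr_filter(prices)

# Option 2: keep every row but compress the long right tail.
log_prices = np.log1p(prices)

skew_raw = stats.skew(prices)
skew_trimmed = stats.skew(trimmed)
skew_log = stats.skew(log_prices)

print(f"rows kept after IQR filter: {len(trimmed)} of {len(prices)}")
print(f"skewness raw={skew_raw:.2f} trimmed={skew_trimmed:.2f} log={skew_log:.2f}")
```

If a model is trained on `log_prices`, its predictions are on the log scale and can be mapped back to prices with `np.expm1`. Whether to trim, transform, or do neither depends on the model: the checks listed earlier can be rerun on the processed column to see whether it helped.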