How do we check if a variable follows the normal distribution?

  1. Plot a histogram out of the sampled data. If you can fit the bell-shaped “normal” curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.
  2. Check Skewness and Kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution.
  3. Use Kolmogorov-Smirnov or/and Shapiro-Wilk tests for normality. They take into account both Skewness and Kurtosis simultaneously.
  4. Check for Quantile-Quantile plot. It is a scatterplot created by plotting two sets of quantiles against one another. Normal Q-Q plot place the data points in a roughly straight line.

What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?

Data is not normal. Specially, real-world datasets or uncleaned datasets always have certain skewness. Same goes for the price prediction. Price of houses or any other thing under consideration depends on a number of factors. So, there’s a great chance of presence of some skewed values i.e outliers if we talk in data science terms.

Yes, you may need to do pre-processing. Most probably, you will need to remove the outliers to make your distribution near-to-normal.

Speak Your Mind