In simple words, EDA refers to the process of analysing data to get maximum insight into a data set, extract important variables, spot anomalies and uncover the underlying structure. Most EDA techniques are graphical, since the purpose of EDA is to explore the data with an open mind. These techniques may consist of simple statistical plots, such as box plots and mean plots, and plots of the raw data, such as histograms.
In this article, I would like to discuss the basic challenges in EDA. Two of the most basic and important ones are Missing Values and Outliers, and those are what we will be discussing here.
1) Missing Values:
Missing values can arise because of a fault in the data extraction or data collection process. They can occur at random or depend on a specific observation.
Missing values can lead to a biased model, since the affected features are not analysed correctly, which in turn can lead to wrong predictions.
Deletion: If the training data has a large number of observations, the observations containing missing values can be deleted. After deletion the dataset should still have sufficient data points and should not become biased.
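Here is a rough sketch of this approach using pandas; the column names and values are just for illustration.

```python
import pandas as pd

# Hypothetical dataset with a few missing entries
df = pd.DataFrame({
    "age": [25, None, 31, 40, 29],
    "income": [52000, 48000, None, 61000, 58000],
})

# Drop every row that contains at least one missing value
cleaned = df.dropna()
print(f"Kept {len(cleaned)} of {len(df)} rows after deletion")
```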
Imputation: Another method of treating missing values is to impute the mean, median or mode. Replacing the missing values with the mean/median/mode is a crude but common method. Imputation can be done in two ways: generalized imputation and similar-case imputation.
In the generalized approach, we calculate the mean/median/mode over all the non-missing values of a column and use it to replace the missing values. In the similar-case approach, the same statistic is calculated only within a group of similar observations (for example, within each category) before being imputed.
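A minimal pandas sketch of both variants; the columns `income` and `city` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "income": [52000, None, 61000, None, 58000],
})

# Generalized imputation: one statistic computed over the whole column
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Similar-case imputation: the statistic is computed within each group
df["income_group_filled"] = df["income"].fillna(
    df.groupby("city")["income"].transform("median")
)
print(df)
```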
Predictive Model: In this method, a predictive model is used to estimate the missing values. The observations where the variable is not missing are taken as training data, the remaining observations form the prediction set, and the variable with missing values is used as the target.
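One way to sketch this for a numeric target, using scikit-learn; the features, target and values here are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "experience": [2, 7, 20, 25, 12, 5],
    "salary": [40000, 55000, None, 90000, None, 50000],
})

known = df[df["salary"].notna()]    # rows used as training data
unknown = df[df["salary"].isna()]   # rows whose target we predict

model = LinearRegression()
model.fit(known[["age", "experience"]], known["salary"])

# Fill the missing targets with the model's predictions
df.loc[df["salary"].isna(), "salary"] = model.predict(unknown[["age", "experience"]])
print(df)
```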
2) Outliers:
Outliers are data points that appear far away from, and diverge from, the overall dataset. These data points can heavily distort dataset statistics and can result in wrong predictions.
If the outliers are non-random, they can reduce the normality of the dataset. They can violate the assumptions of statistical models such as regression and ANOVA, and they can introduce bias.
The causes of outliers can be natural or non-natural. Natural outliers are data points that are not errors; they have been collected correctly. Non-natural outliers, on the other hand, are actual errors, which might be due to data entry errors, measurement errors, sampling errors or data processing errors.
Outliers can be detected using visualization tools such as box plots, scatter plots and histograms. Apart from these visualizations, practitioners often rely on a few rules of thumb. A value more than three standard deviations away from the mean is considered an outlier. Values outside the 5th and 95th percentiles are considered outliers. A value above Q3 + 1.5*IQR or below Q1 - 1.5*IQR is considered an outlier, where Q1 and Q3 are the first and third quartiles and IQR (the interquartile range) is Q3 - Q1.
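A small sketch of the three-sigma and 1.5*IQR rules with pandas; the sample values are made up, with 120 as the obvious outlier.

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 120, 14, 15, 13])

# Three-sigma rule: with a small sample the extreme value inflates the
# standard deviation, so this rule may fail to flag it
z = (s - s.mean()) / s.std()
sigma_outliers = s[z.abs() > 3]

# 1.5 * IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print("3-sigma rule flags:", sigma_outliers.tolist())
print("IQR rule flags:", iqr_outliers.tolist())
```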
Deletion: If the outliers are due to errors and are few in number, those data points can simply be deleted.
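Deletion is then just filtering on the same bounds, as in this small sketch (again with made-up values):

```python
import pandas as pd

df = pd.DataFrame({"value": [12, 14, 13, 15, 14, 13, 120, 14, 15, 13]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df_clean = df[mask]  # keep only the non-outlier rows
print(f"Dropped {len(df) - len(df_clean)} outlier row(s)")
```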
Imputation: Outliers can also be treated by imputing values. If the outliers are non-natural (i.e. errors), they can be replaced with mean/median/mode values.
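A rough sketch of replacing flagged values with the median; the column name, values and the use of the IQR rule as the flagging criterion are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"value": [12.0, 14.0, 13.0, 15.0, 14.0, 13.0, 120.0, 14.0, 15.0, 13.0]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Replace the flagged values with the median of the non-outlier values
median = df.loc[~is_outlier, "value"].median()
df.loc[is_outlier, "value"] = median
print(df)
```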
Treat Separately: If the outliers are significant in number and natural, the group of outliers can be treated separately. A separate statistical model can be built for them, and the outputs can then be combined.
Value Transformation: Transforming values can help in eliminating the effect of outliers. One way is to apply the natural log to the data, which reduces the variation caused by extreme values.
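A quick NumPy sketch of this idea; the values are made up, and log1p (log of 1 + x) is my own choice here so that zeros would also be handled.

```python
import numpy as np

values = np.array([12.0, 14.0, 13.0, 15.0, 120.0])

logged = np.log1p(values)  # compresses large values much more than small ones
print(values.std(), logged.std())  # the spread shrinks markedly after the transform
```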
Written By: Vishwas Anandani