There are multiple types of missing data, e.g. Missing at Random (MAR) and Missing Completely at Random (MCAR). For details on the classification and examples, readers are referred to Little and Rubin1. Under both MAR and MCAR, whether a record or observation is missing should not be related to specific information. For example, when a housewife fills in the information, the income field may be missing, but for a reason.
There are multiple approaches available for missing value treatment and analysis. One important point to keep in mind is to review the variables and understand the rationale behind the missing values. There may be a business reason for a missing value, and that reason can be helpful, e.g. in understanding customer behavior. A few years back we were building a customer churn model for a telecom client and found that one variable had around 80% missing values. Typically, an analyst would exclude variables with over 30-40% missing values. When we looked at the variables, we found that this variable was “Value of international calls”. Of course, not all customers are expected to be international callers.
- Deletion of missing observations: This approach can be adopted under the assumption of Missing at Random (MAR) or Missing Completely at Random (MCAR); otherwise the sample could be biased.
- Replacing with zero, mean or median values: This approach can also bias mean or variance estimates.
- Using Multiple Imputation2,3 techniques
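The first two approaches above can be sketched in a few lines of pandas. The data frame and column names here are hypothetical, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income": [45000.0, np.nan, 52000.0, 61000.0, np.nan],
})

# Option 1: deletion of missing observations (only safe under MAR/MCAR)
dropped = df.dropna(subset=["income"])

# Option 2: replacing with the median (can bias variance estimates)
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

print(len(dropped))                    # 3 rows remain after deletion
print(imputed["income"].isna().sum())  # 0 missing values after imputation
```

Note the trade-off made explicit by the two options: deletion shrinks the sample, while imputation keeps all rows but compresses the variable's spread around the median.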
In the graph, some of the values appear to be outliers, but they are actually missing values. Analysts have to be careful about such values. Sometimes missing values are denoted with codes such as 99999. In this case, for a missing date of birth (DOB) or start date, a default date is populated; hence, when age and years with the organization are calculated, the results show patterns with exceptionally high values.
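A practical first step is to convert such sentinel codes and default dates back to true missing values before computing any statistics. A minimal sketch, assuming a 99999 code and a 1900-01-01 default date (both hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical records where missing spend was coded as 99999
# and a missing DOB was filled with a default date of 1900-01-01
df = pd.DataFrame({
    "monthly_spend": [120.0, 99999.0, 85.0, 99999.0],
    "dob": pd.to_datetime(["1985-06-01", "1900-01-01",
                           "1990-03-15", "1900-01-01"]),
})

# Replace sentinel codes with NaN and default dates with NaT
df["monthly_spend"] = df["monthly_spend"].replace(99999.0, np.nan)
df.loc[df["dob"] == pd.Timestamp("1900-01-01"), "dob"] = pd.NaT

print(df["monthly_spend"].mean())  # 102.5, no longer inflated by 99999
```

Had the codes been left in place, the mean spend would have been dominated by the 99999 values, and ages derived from the default DOB would show the exceptionally high pattern described above.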
Outlier data points are observations whose values lie significantly beyond the typical values of a variable. For example, the income of a successful businessman or CEO may be significantly higher than typical values. Including such observations may bias estimates such as the mean or variance. The impact can be more pronounced in a sample, depending on whether these observations are selected into the sample.
In statistical or predictive modeling, outliers can be of two types: outlier values in the dependent variable, and outlier values in a predictor. Outliers in predictor variables are also called leverage points. Residual analysis for regression and graphical analysis are some of the ways to identify outliers.
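Both kinds of outliers can be surfaced numerically: large standardized residuals flag outliers in the dependent variable, while the diagonal of the hat matrix flags leverage points in a predictor. A small NumPy sketch on made-up data:

```python
import numpy as np

# Illustrative data: y[5] is a response outlier, x[6] is a leverage point
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 30.0, 40.2])

# Ordinary least-squares fit y = a + b*x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Standardized residuals flag outliers in y; leverage flags outliers in x
std_resid = residuals / residuals.std(ddof=2)
print(np.argmax(np.abs(std_resid)))  # 5: the response outlier
print(np.argmax(leverage))           # 6: the leverage point at x = 20
```

Note the distinction: the point at x = 20 has high leverage but a small residual, while the point at x = 6 has a huge residual but ordinary leverage. A point that scores high on both is the influential kind that most distorts the fit.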
Why are outliers important? How are outliers different from influential points? How can outliers be detected? How can robust regression help?
WOE variable transformation for tackling missing and outlier observations
One practical approach adopted by many practitioners when building a predictive model using binary logistic regression is transforming variables into Weight of Evidence (WOE) variables. WOE transformation can tackle both missing values and outliers: missing or outlier classes are grouped with other classes based on their Weight of Evidence, using fine and coarse classing.
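The grouping above can be illustrated with a toy WOE calculation in which missing values form their own class; here WOE is computed as the log ratio of the class's share of non-events to its share of events. The data and band labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical binned predictor; missing values get their own class
df = pd.DataFrame({
    "income_band": ["low", "low", "low", "mid", "mid", "mid",
                    "missing", "missing"],
    "default":     [1, 1, 0, 1, 0, 0, 1, 0],
})

# WOE per class: ln(% of non-events in class / % of events in class)
grouped = df.groupby("income_band")["default"].agg(events="sum",
                                                   total="count")
grouped["non_events"] = grouped["total"] - grouped["events"]
grouped["woe"] = np.log(
    (grouped["non_events"] / grouped["non_events"].sum())
    / (grouped["events"] / grouped["events"].sum())
)

# Map each class back to its WOE value for use in the regression
df["income_band_woe"] = df["income_band"].map(grouped["woe"])
print(grouped["woe"])
```

In this toy sample the “missing” class has a WOE near zero, i.e. it behaves like the average; had it leaned strongly toward events or non-events, coarse classing would merge it with the band of similar WOE rather than discard those rows.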
1 Little, R.J.A. & Rubin, D.B. (1987). Statistical analysis with missing data. New York: Wiley.