Ways to Handle Missing Values in Machine Learning

Missing values in machine learning

Missing data may arise from a variety of circumstances, including an interviewer’s errors in administering questions, a respondent’s refusal to answer specific questions, or human error in data coding and data entry. For researchers this is a significant issue that requires careful assessment. During analysis, they must decide what to do with missing values: whether to leave instances with missing data out of the analysis, or to impute values before analysis.

Little and Rubin (1987) focused on the process of data imputation, emphasizing that it depends on the pattern or mechanism that creates the missing values. They distinguish three sorts of missing values:

(a) values that are not missing at random (NMAR),

(b) values missing at random (MAR), and

(c) values missing completely at random (MCAR).

In the case of MCAR and MAR, researchers may ignore the mechanism that produced the missing values, but in the case of NMAR, explicit missing-data modelling approaches are required.

Little and Rubin (1987) proposed three methods for dealing with missing values in data: (a) complete case analysis (list-wise deletion), (b) available case methods (pair-wise deletion), and (c) filling in the missing values with predicted scores (imputation).

List-wise Deletion

In list-wise deletion, every instance with a missing value on any variable of interest is eliminated, and the remaining complete cases are analyzed using traditional data analysis techniques. The following are some of the benefits of this method:

a) comparability of univariate statistics, since they are all generated from the same sample base of cases, and

b) simplicity, because standard analyses may be run without modification.

However, there are drawbacks, notably the possible loss of information when incomplete instances are discarded.
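As a minimal sketch of list-wise deletion, assuming pandas is available and using a small hypothetical survey table, dropping every row that contains at least one missing value looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing entries
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 42, 29],
    "income": [48000, 52000, np.nan, 61000, 45000],
})

# List-wise deletion: keep only rows observed on every variable
complete_cases = df.dropna()
```

Note how the sample shrinks from five cases to three, which is exactly the information-loss drawback mentioned above.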

Pair-wise Deletion

Unlike list-wise deletion, pair-wise deletion (PD) estimates each statistic (each “moment”, such as a mean, variance, or correlation) independently, using the cases that have values for the variables involved in that statistic. Pair-wise deletion thus makes use of every case in which the relevant variables are present. It has the benefit of being straightforward while retaining a larger sample for each estimate. Its drawback, however, is that the sample base shifts from variable to variable depending on the pattern of missing data.
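The shifting sample base can be seen in a short sketch, again with hypothetical data and assuming pandas, which computes correlations over pairwise-complete observations by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [1.0, 1.0, 2.0, np.nan, 3.0],
})

# Pair-wise deletion: each correlation uses the cases complete for that pair
corr = df.corr()

# Each column mean likewise uses a different sample base (4 cases per column here)
means = df.mean()
```

Every statistic is computed from a different subset of cases, which is why pair-wise results from different variables are not always directly comparable.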


Imputation

Imputation is the process of substituting an estimated value for a missing value using a mathematical or statistical model. It is worth noting, however, that an imputation model should be selected according to the missing-value pattern and the planned data analysis strategy. In particular, the model should be flexible enough to retain the correlations or links among variables that will be the subject of later analyses.

Because it may be used for a range of post-imputation analyses, a flexible imputation model that preserves a large number of relationships is preferred. The following are some of the most commonly used imputation techniques.

a) Mean substitution: Mean substitution used to be the most popular imputation technique for missing data, but it is no longer recommended. Each missing value is simply replaced with the mean of the observed values for that variable, which preserves the sample size but artificially reduces the variance and weakens correlations.

b) Regression imputation: As the name implies, this approach fits a regression model to the data. A regression is estimated on the cases with no missing data and then used to predict values for the cases with missing data. However, this technique suffers from the same flaw as mean substitution: all instances with the same independent-variable values receive the same imputed value. That is why some researchers favour stochastic substitution, which runs the regression and then adds a random residual to the predicted value.
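Both the plain and stochastic variants can be sketched with NumPy alone, on hypothetical data where y depends roughly linearly on x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x, with two y values missing
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, np.nan, 8.1, np.nan, 12.2])

observed = ~np.isnan(y)

# Fit a simple linear regression on the complete cases
slope, intercept = np.polyfit(x[observed], y[observed], 1)
predicted = slope * x + intercept

# Plain regression imputation: fill each gap with its fitted value
y_reg = np.where(observed, y, predicted)

# Stochastic variant: add residual-scale noise so imputed values vary realistically
residual_sd = np.std(y[observed] - predicted[observed])
noise = rng.normal(0.0, residual_sd, size=x.shape)
y_stochastic = np.where(observed, y, predicted + noise)
```

The stochastic variant avoids the artificial regularity of identical imputed values for identical predictors.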

c) Interpolation: Interpolation estimates unknown values from the known values around them. Investors, for example, use interpolation to estimate prices between observed data points, because it creates new data points from existing data. Several interpolation approaches, such as the linear interpolation method, the polynomial interpolation method, and the nearest-neighbour method, may be used in research.
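A minimal linear-interpolation sketch, assuming pandas and a hypothetical series with evenly spaced gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Linear interpolation: each gap is filled on the straight line
# between its known neighbours (here 20.0 and 40.0)
linear = s.interpolate(method="linear")

# pandas also supports method="nearest" and method="polynomial"
# (both require SciPy), matching the approaches named above
```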

The K-nearest-neighbour method: Here, the researcher picks a distance metric and a number of neighbours K, and the missing value is imputed from those neighbours: typically the mean of their values for a numeric variable, or the most common value among them for a categorical variable.
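A sketch of KNN imputation, assuming scikit-learn is available and using a small hypothetical matrix; `KNNImputer` averages the K nearest rows under a NaN-aware Euclidean distance:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: row 1 is missing its second feature
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Impute the gap as the mean of the 2 nearest neighbours (rows 0 and 2),
# giving (2.0 + 6.0) / 2 = 4.0
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```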

d) Time-series approach: Another alternative for imputing data is to use the time-series structure itself. Imputation approaches for time-series data presume that neighbouring observations resemble the missing data. When that assumption holds, these strategies work effectively.
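Two common time-series fills, last observation carried forward and next observation carried backward, sketched with pandas on a hypothetical daily series:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([100.0, np.nan, np.nan, 103.0, 104.0], index=idx)

# Last observation carried forward: assume the gap resembles the previous value
locf = s.ffill()

# Next observation carried backward: assume the gap resembles the next value
nocb = s.bfill()
```

Both are only sensible when the series changes slowly relative to the gap length, which is exactly the neighbouring-observations assumption above.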

e) Maximum likelihood estimation (MLE): Maximum likelihood estimation handles missing data by estimating model parameters directly from all of the observed data, without filling in individual values. Under an assumed distribution for the data (commonly multivariate normality) and the MAR assumption, it yields efficient estimates, which makes it one of the most widely used approaches.

f) Multiple imputation (MI): Multiple imputation is a robust imputation technique. As the name implies, MI replaces each missing value with a set of plausible values, producing several completed data sets. The analyses of these data sets are then combined (for example, using Rubin’s rules) into inferences whose p-values and intervals reflect the uncertainty due to the missing data.
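The draw-several-completed-data-sets step can be sketched with scikit-learn’s experimental `IterativeImputer` (a chained-equations-style imputer, used here as a stand-in for a full MI workflow); the pooling step is only indicated in a comment:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with one gap per column
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# Draw three completed data sets with different seeds; sample_posterior=True
# makes each imputation a random draw rather than a point prediction.
# A full MI analysis would fit the model on each set and pool the results.
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]
```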

g) Hot-deck technique: In this method, the researcher first chooses the variables that define the strata. Donor values are then drawn from the observed cases within each stratum to fill in the missing values of the target variable for that stratum.

h) Complex research design adjustments: It is essential to make adjustments in the case of complex research designs. In such circumstances, independent samples are chosen from the population using repeated, systematic sampling techniques, and sample variance is assessed from the variability of the sub-sample estimates.
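The hot-deck idea above, drawing donor values within strata, can be sketched with pandas on a hypothetical survey (the `region` stratum and income figures are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey: impute income within each region stratum
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "income": [40.0, 42.0, np.nan, 60.0, np.nan, 58.0],
})

def hot_deck(group):
    # Donors are the observed values within this stratum
    donors = group.dropna()
    # Replace each missing value with a randomly drawn donor value
    fill = rng.choice(donors.to_numpy(), size=group.isna().sum())
    out = group.copy()
    out[out.isna()] = fill
    return out

df["income"] = df.groupby("region")["income"].transform(hot_deck)
```

Because donors come from the same stratum, each imputed income is an actually observed value from a similar respondent.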

Kultar Singh – Chief Executive Officer, Sambodhi
