Data Augmentation Techniques in Machine Learning


Preparing high-quality tabular data for model training using noise injection and interpolation methods

Introduction

Machine learning models require training on substantial amounts of high-quality, relevant data.

Yet, real-world data presents significant challenges due to its inherent imperfections.

Data augmentation is a key strategy to tackle these challenges and provide robust training for the model.

In this article, I’ll explore major data augmentation techniques for tabular data:

  • noise injection and
  • interpolation methods, including SMOTE algorithms,

along with practical implementation examples.

What is Data Augmentation?

Data augmentation is a data enhancement technique in machine learning that expands an original dataset through targeted data transformations, helping to address data scarcity and class imbalance.

Its major techniques include noise injection, where the model is trained on a dataset with intentionally added noise, and interpolation methods, where the algorithm estimates new data points from the original dataset.

Because this approach expands the original dataset, a sufficiently large and accurate dataset that reflects the true underlying data distribution is a prerequisite for data augmentation to work well.

Otherwise, noise and outliers in the original dataset that the model shouldn’t learn are also replicated as new data, badly misleading the model.

Why Data Augmentation is Important: The Challenges of Real-World Data

For a model to be effective, it must be trained on data that accurately reflects patterns likely to recur in the future.

Lack of high-quality, relevant data prevents models from learning effectively, leading to poor performance.

In practice, two primary challenges arise when dealing with real-world datasets: data quantity issues and data quality issues.

Data Quantity Issues:

Acquiring sufficient data can be a significant hurdle when relevant events are extremely rare (e.g., predicting rare diseases).

Insufficient data leads to two major problems:

  • Underfitting, where the model fundamentally fails to learn patterns from the data, resulting in high bias, and
  • Class imbalance in classification tasks, where certain classes in the target variable have far fewer samples than others, biasing the model toward the dominant classes.

Data Quality Issues:

Even with sufficient data, imperfections like missing values, noise, or inconsistencies can severely mislead a model.

This causes a common problem: overfitting, where the model learns incorrect patterns from the training data and fails to generalize to unseen data, resulting in high variance.

Choosing the Right Data Enhancement Approach

Data enhancement collectively refers to machine learning strategies that expand datasets and improve their quality for model training, boosting the model’s generalization capabilities.

Primary approaches include imputation, synthetic data generation, and data augmentation, each of which addresses a different type of data limitation.

Data Enhancement Techniques by Data Limitation Types

Imputation:

This technique addresses missing values within existing datasets.

Importantly, it doesn’t increase the number of samples; instead, it fills in gaps in the original data points.

Depending on the type of missing data, imputation approaches vary:

  • Statistical: Mean, Median, Mode Imputation
  • Model-based: KNN Imputation, Regression Imputation
  • Deep learning based: GAIN (Generative Adversarial Imputation Networks)
  • Time series specific: Forward Fill/Backward Fill
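To make these concrete, here is a minimal sketch (the toy feature matrix and parameter choices below are illustrative assumptions, not from a real dataset) of the statistical, model-based, and time-series approaches using scikit-learn and pandas:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# toy feature matrix with missing entries (np.nan)
X = np.array([[25.0, 50000.0],
              [32.0, np.nan],
              [np.nan, 61000.0],
              [41.0, 72000.0]])

# statistical: fill each column with its median
X_median = SimpleImputer(strategy="median").fit_transform(X)

# model-based: estimate missing entries from the 2 nearest neighbors
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# time series specific: forward fill, then backward fill for any leading gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s_filled = s.ffill().bfill()

Note that none of these calls add rows; they only fill gaps in the existing samples.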

Synthetic Data Generation:

This approach is ideal when we are facing limitations in data quantity, privacy concerns, or data sharing restrictions.

It involves creating entirely new datasets from scratch, meticulously designed to reflect the statistical properties of real data without using actual sensitive information.

Advanced techniques like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) can generate high-fidelity synthetic data, which is particularly useful when real data is scarce, sensitive, or contains significant imperfections.
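A full GAN or VAE is too long for a short snippet, but the core idea, creating brand-new rows that mimic the statistics of the real data, can be sketched with a deliberately naive parametric stand-in (a multivariate Gaussian fit, not a deep generative model; all values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

# stand-in for a real (possibly sensitive) dataset: 200 rows, 3 features
X_real = rng.normal(loc=[35.0, 60000.0, 0.5], scale=[8.0, 12000.0, 0.1], size=(200, 3))

# captures simple statistics of the real data
mean = X_real.mean(axis=0)
cov = np.cov(X_real, rowvar=False)

# draws entirely new rows that mimic the real data's mean/covariance structure
X_synthetic = rng.multivariate_normal(mean, cov, size=500)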

Data Augmentation:

This method tackles limitations related to data quantity by expanding original datasets (key difference from synthetic data generation).

It involves applying various transformations to the original data (e.g., rotating images, adding noise to audio) without collecting new raw data.

This process helps the model generalize better to unseen examples, lowering variance.

With this overview of data enhancement in mind, I’ll explore two major data augmentation techniques in the next sections: noise injection and interpolation methods.

Noise Injection

Noise injection is a data augmentation technique that deliberately introduces controlled random perturbations into continuous features during model training.

This method is applicable to both regression and classification tasks, but the noise must be injected into continuous features only.

For example:

  • Original Data Point: [age: 35, income: 60000, gender: 1]

Applying noise injection by adding a small, random value to each continuous feature:

  • Augmented Data Point 1:
    [age: 35 + 1.2 = 36.2, income: 60000 - 550 = 59450, gender: 1]
  • Augmented Data Point 2:
    [age: 35 - 0.8 = 34.2, income: 60000 + 720 = 60720, gender: 1]

In this example, the noise for age and income is drawn randomly from the ranges -10 to 10 and -1,000 to 1,000, respectively.

The discrete feature gender is out of scope, so it remains unchanged.

Although noise injection does not increase the number of samples in the dataset, it implicitly expands the feature space by perturbing the continuous feature values.
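Here is a minimal sketch of the example above, assuming the column order [age, income, gender] and the noise ranges used in this section (uniform noise is one reasonable choice for this illustration):

import numpy as np

rng = np.random.default_rng(0)

# original data point: [age, income, gender]
original = np.array([35.0, 60000.0, 1.0])

def add_noise(row, n_copies=2):
    augmented = []
    for _ in range(n_copies):
        new_row = row.copy()
        new_row[0] += rng.uniform(-10, 10)      # noise for age
        new_row[1] += rng.uniform(-1000, 1000)  # noise for income
        # gender (discrete) is left unchanged
        augmented.append(new_row)
    return np.array(augmented)

augmented_points = add_noise(original)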

Major techniques applicable for tabular data include:

  • Gaussian Noise Injection: Adds random values sampled from a Gaussian distribution to the original dataset, and
  • Jittering: Applies small, random perturbations (often Gaussian) to individual data points in time series/sequential data (see the short sketch below).
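For jittering, a comparable sketch on a made-up time series (a plain sine wave standing in for sensor readings) simply adds a small Gaussian perturbation at every time step:

import numpy as np

rng = np.random.default_rng(1)

# toy sensor signal: 100 time steps of a sine wave
signal = np.sin(np.linspace(0, 4 * np.pi, 100))

# jittering: a small Gaussian perturbation at each time step
jittered = signal + rng.normal(loc=0.0, scale=0.05, size=signal.shape)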

Now, let’s take a look at how a common noise injection method, Gaussian Noise Injection, works.

Demonstration: Gaussian Noise Injection

I created a scenario in which a Linear Regression model is trained on extremely noisy data because the deployment environment is expected to be noisy (e.g., sensor readings with measurement errors).

This scenario is challenging for the model because, by its nature, Linear Regression needs abundant data with a clear linear relationship between features and target to learn an accurate linear approximation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# underfit due to limited samples
n_samples, n_features = 100, 10

# creates true X
X_true = np.random.rand(n_samples, n_features)

# creates true y (extremely noisy)
true_coefficients = np.random.randn(n_features)
true_bias = 100
y_true_noise = np.random.rand(n_samples) * 10000
y_true = np.dot(X_true, true_coefficients) + true_bias + y_true_noise

# splits and scales the data
X_train, X_test, y_train, y_test = train_test_split(X_true, y_true, test_size=30, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# trains the model on the scaled data and makes predictions
model = LinearRegression().fit(X_train_s, y_train)
y_pred_train = model.predict(X_train_s)
y_pred_test = model.predict(X_test_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

Results from Original Data

Without noise injection, the model failed to learn the pattern, ending up with significantly high errors (e.g., generalization MSE: 48,429.01).

  • MSE: Train 21,232.91 → Generalization on test set: 48,429.01
  • MAE: Train 3,472.48 → Generalization on test set: 5,943.21
  • R2 Score: Train -1.00 → Generalization on test set: -4.5368

Then, I added Gaussian Noise to the training dataset and retrained the model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# adds gaussian noise to training dataset (before scaling)
gaussian_noise = np.random.normal(loc=0, scale=1, size=X_train.shape)
X_train_noise = X_train + gaussian_noise

# scale the dataset
scaler = StandardScaler()
X_train_noise_s = scaler.fit_transform(X_train_noise)
X_test_noise_s = scaler.transform(X_test)

# retrains the model and makes predictions
model = LinearRegression().fit(X_train_noise_s, y_train)
y_pred_train_noise = model.predict(X_train_noise_s)
y_pred_test_noise = model.predict(X_test_noise_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train_noise)
mae_train = mean_absolute_error(y_train, y_pred_train_noise)
r2_train = r2_score(y_train, y_pred_train_noise)
mse_test = mean_squared_error(y_test, y_pred_test_noise)
mae_test = mean_absolute_error(y_test, y_pred_test_noise)
r2_test = r2_score(y_test, y_pred_test_noise)

Results from Data with Gaussian Noise

The model’s performance improved significantly, with the generalization MSE dropping from 48,429 to 8,962.

  • MSE: Train 9,240.38 → Generalization on test set: 8,962.52
  • MAE: Train 2,632.58 → Generalization on test set: 2,610.19
  • R2 Score: Train 0.13 → Generalization on test set: -0.0247

These results indicate that the model became more robust to noisy real-world data after being trained on the Gaussian-noise-augmented dataset.

Note: Situations in which noise injection should be avoided include:

  • When interpretability is crucial: The added noise obscures the relationship between input features and predictions.
  • When the model is sensitive to small input perturbations: Especially in safety-critical systems, even small changes to the input could lead to inaccurate outputs.
  • When training time is extremely limited: The process of injecting noise could increase the computational cost and training time when implemented at scale.

Otherwise, noise injection is a useful method to combat moderate overfitting by forcing the model to learn from varied versions of the data.

Now, I’ll explore interpolation methods and SMOTE algorithms in the next section.

Interpolation

Interpolation is a data augmentation technique that expands the original dataset by estimating unknown values between data points randomly chosen from it, filling in the underlying data distribution.

Because of this estimation process, the method requires the original dataset to be accurate and sufficiently robust.

It’s not suitable for:

  • Very limited dataset because it cannot estimate new data correctly, or
  • Noisy datasets as the noise is also expanded to new data, misleading the model.
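To make the idea concrete, here is a minimal sketch of plain linear interpolation between two randomly chosen rows (class-agnostic, unlike SMOTE; the dataset below is random dummy data):

import numpy as np

rng = np.random.default_rng(7)

# original dataset: 100 rows, 4 continuous features
X = rng.random((100, 4))

def interpolate(X, n_new=50):
    new_points = []
    for _ in range(n_new):
        i, j = rng.choice(len(X), size=2, replace=False)  # two random rows
        lam = rng.random()                                # position between the two points
        new_points.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(new_points)

# augmented dataset: original rows plus interpolated rows
X_augmented = np.vstack([X, interpolate(X)])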
