Machine learning is full of challenges, and like many before me, I made mistakes — some obvious, others painfully costly. Each mistake taught me something critical, whether it was about data preprocessing, model selection, or real-world deployment. In this article, I'll break down 13 mistakes I learned the hard way: why they were mistakes, what they taught me, and tips to avoid them.
If you’re a beginner, consider this a survival guide. If you’re experienced, you might see reflections of your own struggles. Either way, let’s dive deep into these lessons and decode the technical missteps that can make or break an ML project.
1. Skipping Exploratory Data Analysis (EDA)
In the early days, I was all about jumping straight into model training. I thought the sooner I started training, the quicker I’d get results. But this led to some painful discoveries.
I encountered unexpected behaviors, poor model performance, and issues like missing values, outliers, and incorrect distributions — all of which could have been detected earlier through thorough exploration. The problem? I never really got to know my data before jumping into the models.
Pro Tips
1) Visualization Is Key: A picture is worth a thousand words. Visualizing your data can help uncover hidden patterns that aren't obvious from the raw numbers. Here are a few tools that really helped me with this (a plotting sketch follows the list):
- Histograms: Great for understanding the distribution of numerical data.
- Box plots: Perfect for identifying outliers.
- Pair plots: Help visualize the relationship between multiple features and detect patterns.
- Heatmaps: These help visualize correlations between features.
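Here's a minimal sketch of all four, assuming a placeholder your_data.csv with at least a few numerical columns:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('your_data.csv')  # placeholder dataset
numeric_df = df.select_dtypes('number')
# Histograms: one per numerical column
numeric_df.hist(bins=30, figsize=(10, 8))
plt.show()
# Box plots: outliers show up as points beyond the whiskers
sns.boxplot(data=numeric_df)
plt.show()
# Pair plot: pairwise relationships between features
sns.pairplot(numeric_df)
plt.show()
# Heatmap: feature-to-feature correlations
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.show()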
2) Summary Statistics, a Quick Check: df.describe() is another handy pandas function for quickly inspecting your data. It provides essential statistics (mean, std, min, max, etc.) for numerical columns, revealing potential issues like skew, imbalanced values, or unexpected distributions.
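A quick sketch, again with a placeholder your_data.csv:
import pandas as pd
df = pd.read_csv('your_data.csv')
# Numerical columns: count, mean, std, min, quartiles, max
print(df.describe())
# Non-numerical columns: count, unique values, top category, frequency
print(df.describe(include='object'))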
3) Use Automated EDA Tools: If you're short on time, you can use powerful libraries to automate EDA and generate comprehensive reports:
- pandas-profiling (since renamed to ydata-profiling): generates a detailed EDA report with visualizations, distributions, and correlations.
- sweetviz: a simpler, more visual alternative for generating EDA reports, with features for comparing train and test sets.
# Install pandas-profiling if you haven't already
!pip install pandas-profiling
# Import the necessary libraries
import pandas as pd
from pandas_profiling import ProfileReport
# Load your dataset
df = pd.read_csv('your_data.csv')
# Generate the EDA report
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
# Save the report as an HTML file
profile.to_file("eda_report.html")
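And if you'd rather try sweetviz, here's a minimal sketch along the same lines (train_df and test_df are placeholders for your own split):
# Install sweetviz if you haven't already
!pip install sweetviz
import sweetviz as sv
# Generate a standalone EDA report for a single DataFrame
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
# Or compare train and test sets side by side
comparison = sv.compare([train_df, "Train"], [test_df, "Test"])
comparison.show_html("train_vs_test.html")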

2. Not Handling Missing Data Properly
My go-to solutions for missing values were either dropping them with df.dropna() or filling them with zeros, assuming it wouldn't cause much harm.
Unfortunately, I couldn’t have been more wrong. This seemingly simple approach ended up distorting my models, especially when dealing with categorical data. The results weren’t pretty, and I quickly realized that missing data is not something you can just ignore or handle arbitrarily.
The Common Mistakes
- Dropping Missing Values Arbitrarily: By dropping rows or columns that contain missing values without considering the reason behind the missingness, I ended up losing valuable data — sometimes even entire rows that had useful information for my model.
- Filling with Zeros or Constants: This was my go-to strategy for numerical features, but in many cases filling missing values with zeros makes no sense. It introduces bias, especially when the missing values were supposed to reflect something else, like a lack of measurement or unavailable data.
- Ignoring Context: Sometimes I treated all missing values equally, but categorical features should be handled differently than numerical ones. Filling missing categories with zeros can confuse the model by introducing an "artificial" category, as the sketch after this list shows.
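A tiny sketch of why the zero-fill hurts categoricals, using a made-up color column:
import pandas as pd
# A hypothetical categorical column with one missing entry
color = pd.Series(['red', 'blue', None, 'red'])
# Zero-fill invents a category that never existed in the data
print(color.fillna(0).value_counts())
# Filling with the mode stays inside the original category space
print(color.fillna(color.mode()[0]).value_counts())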
Best Practices for Handling Missing Data
- Do Not Drop Data Arbitrarily: Dropping rows or columns containing missing data should be your last resort. If the missing values are a small percentage of the dataset, it might not affect your results. But if there’s a substantial amount of missing data, you’ll lose valuable information. Instead of dropping, focus on imputation methods.
- Imputation with Mean, Median, or Mode: For numerical data, using the mean or median to fill missing values might be a good starting point. But remember, this only works well if the data is missing at random and doesn’t exhibit a pattern. For categorical data, filling with the mode (most frequent value) is a better choice, though it assumes that the missing data follows a certain distribution.
- Advanced Imputation Techniques:
- KNN Imputation: K-Nearest Neighbors imputes a missing value from the rows most similar to it, which often beats a global mean or mode because it respects local structure in the data (see the KNNImputer sketch after the code below).
- Machine Learning-Based Imputation: Using models like random forests or regression models to predict missing values is another advanced option. These methods can model the relationships between features and predict missing values based on other available information.
- Imputation Libraries: Use SimpleImputer from sklearn.impute for straightforward imputation methods (mean, median, most frequent); it's ideal for basic strategies. If you want to go beyond basic imputation and leverage a predictive approach, try IterativeImputer from the same module, which uses the other features to predict missing values iteratively, providing a more context-aware imputation.
from sklearn.impute import SimpleImputer
# Example for numerical data imputation (using mean)
imputer = SimpleImputer(strategy='mean') # Can also use 'median', 'most_frequent', or 'constant'
X_imputed = imputer.fit_transform(X)
# Example for categorical data imputation (using the most frequent category)
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_imputed_categorical = categorical_imputer.fit_transform(X_categorical)
For more advanced imputation using a predictive approach:
# IterativeImputer is still experimental, so it must be explicitly enabled first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Using IterativeImputer for more sophisticated predictions of missing values
iterative_imputer = IterativeImputer()
X_imputed_iterative = iterative_imputer.fit_transform(X)
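And for the KNN imputation mentioned above, scikit-learn provides a ready-made KNNImputer; a minimal sketch on the same X:
from sklearn.impute import KNNImputer
# Fill each missing value with the average of its 5 nearest neighbors
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed_knn = knn_imputer.fit_transform(X)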

3. Ignoring Data Leakage
When I first started working with machine learning, I had one of those “I thought I had it all figured out” moments.
Everything was going great during training. My model was performing exceptionally well, achieving high accuracy — but when I deployed it to production, it failed miserably.
It was a total disaster. I couldn’t understand what went wrong until I discovered I had made a rookie mistake: I had unintentionally included target variable information in my training data.
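My particular bug was target information leaking into the features, but the most common flavor of the same disease is preprocessing that gets to see the test set. A minimal sketch with scikit-learn, assuming generic X and y:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# LEAKY: the scaler learns statistics from rows that later become the test set
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# SAFE: split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)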