Last Updated on May 29, 2026 by Rajeev Bagra

Machine learning models are built to find patterns in data and make predictions. However, a model can become either too simple or too complex. These two common problems are known as underfitting and overfitting.
A useful lesson on this topic is available in Kaggle’s Intro to Machine Learning course by Dan Becker:
Kaggle Lesson:
https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting
What is Underfitting?
Underfitting happens when a model is too simple to capture important patterns in the data.
For example, imagine trying to predict house prices using a decision tree that makes only a few splits. The model may ignore many useful relationships between features and prices.
Characteristics of underfitting:
- High training error
- High validation error
- Model misses important patterns
- Predictions are often inaccurate
A model that underfits has not learned enough from the available data.
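As a minimal sketch of underfitting, assume a made-up dataset with a clear upward trend and compare a model that always predicts the training mean against a simple straight-line fit. The data, the even/odd split, and the helper names (`mae`, `fit_line`) are illustrative assumptions, not taken from the Kaggle lesson.

```python
import random

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Synthetic data (an assumption for illustration): a linear trend plus noise.
rng = random.Random(0)
xs = [float(x) for x in range(40)]
ys = [3.0 * x + 50.0 + rng.gauss(0, 2) for x in xs]

# Even positions are used for training, odd positions for validation.
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

# Too-simple model: ignore x and always predict the training mean.
mean_prediction = sum(train_y) / len(train_y)
simple_train_mae = mae(train_y, [mean_prediction] * len(train_y))
simple_val_mae = mae(val_y, [mean_prediction] * len(val_y))

# Model with enough capacity for this data: a fitted straight line.
slope, intercept = fit_line(train_x, train_y)
line_train_mae = mae(train_y, [slope * x + intercept for x in train_x])
line_val_mae = mae(val_y, [slope * x + intercept for x in val_x])

print(f"constant model: train MAE={simple_train_mae:.2f}, validation MAE={simple_val_mae:.2f}")
print(f"line model:     train MAE={line_train_mae:.2f}, validation MAE={line_val_mae:.2f}")
```

The constant model shows the underfitting signature: its error is high on the training data and on the validation data alike, because it cannot represent the trend at all.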
What is Overfitting?
Overfitting happens when a model becomes too complex and starts memorizing the training data instead of learning general patterns.
The model ends up fitting noise, exceptions, and random fluctuations that do not carry over to new data.
Characteristics of overfitting:
- Very low training error
- High validation error
- Excellent performance on training data
- Poor performance on unseen data
A model that overfits may appear impressive during training but often fails in real-world prediction tasks.
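The extreme case of memorization can be sketched with a model that stores every training example and answers with the target of the closest stored input. This is a nearest-neighbor lookup used as a stand-in for an over-complex model, not a decision tree, and the synthetic data and function names are assumptions made for illustration.

```python
import random

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def memorizer_predict(train_x, train_y, x):
    """Return the stored target of the training input closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Synthetic data (an assumption): a linear trend plus random noise.
rng = random.Random(1)
xs = [float(x) for x in range(60)]
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]

# Even positions are used for training, odd positions for validation.
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

train_pred = [memorizer_predict(train_x, train_y, x) for x in train_x]
val_pred = [memorizer_predict(train_x, train_y, x) for x in val_x]

train_mae = mae(train_y, train_pred)
val_mae = mae(val_y, val_pred)

print(f"train MAE={train_mae:.2f}, validation MAE={val_mae:.2f}")
```

Every training point is reproduced exactly, noise included, so the training error is zero. On the validation points the memorized noise is passed along as if it were signal, and the error is clearly larger.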
Understanding the Tree Depth Graph
Consider a graph where:
- The horizontal axis represents Tree Depth.
- The vertical axis represents Mean Absolute Error (MAE).
As tree depth increases:
- Training error generally keeps decreasing.
- Validation error decreases at first.
- Eventually validation error begins increasing.
This behavior indicates the transition from underfitting to overfitting.
The general pattern can be visualized as:
Underfitting → Best Fit → Overfitting
The best model is usually located near the lowest point of the validation error curve.
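The curve described above can be approximated with a short scikit-learn sketch that fits one tree per depth limit and records both errors. The synthetic sine-shaped dataset, the list of depths, and the helper name `mae_for_depth` are assumptions for illustration; the Kaggle lesson works with housing data instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption, not the lesson's dataset).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 400).reshape(-1, 1)
y = 10 * np.sin(X[:, 0]) + rng.normal(0, 2, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def mae_for_depth(max_depth):
    """Fit a tree with the given depth limit; return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# None removes the depth limit, so the tree grows until its leaves are pure.
depths = [1, 2, 3, 4, 6, 8, 12, None]
scores = {depth: mae_for_depth(depth) for depth in depths}

for depth, (train_mae, val_mae) in scores.items():
    print(f"max_depth={depth!s:>4}  train MAE={train_mae:6.3f}  validation MAE={val_mae:6.3f}")

# Pick the depth whose validation MAE is lowest.
best_depth = min(scores, key=lambda d: scores[d][1])
print("Depth with lowest validation MAE:", best_depth)
```

On data like this, training MAE should fall toward zero as the depth limit is relaxed, while validation MAE is expected to bottom out somewhere in between. The exact numbers and the winning depth depend on the data and the random seed.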
Why Does Validation Error Matter?
Training data is the data used to build the model.
Validation data is a separate portion of the data that is held out and never used to fit the model.
A model should perform well on both datasets. If performance improves only on training data while becoming worse on validation data, the model is likely overfitting.
This idea is central to modern machine learning because the real goal is not to memorize known examples but to make accurate predictions on new data.
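Holding out validation data can be sketched in a few lines of plain Python. The function name `train_validation_split`, the 25% validation fraction, and the stand-in rows are assumptions chosen for illustration.

```python
import random

def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(20))  # stand-in for 20 labelled examples
train_rows, validation_rows = train_validation_split(rows)

print(len(train_rows), "training rows,", len(validation_rows), "validation rows")
```

The important property is that no row appears in both lists, so the validation error is measured on examples the model never used during fitting. In scikit-learn, `train_test_split` from `sklearn.model_selection` covers the same job.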
Mean Absolute Error (MAE)
MAE measures the average size of prediction errors.
The formula is:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Where:
$y_i$ = actual value
$\hat{y}_i$ = predicted value
$n$ = number of observations
Lower MAE values indicate better predictions.
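MAE is simple enough to compute by hand: take the absolute difference between each actual and predicted value, then average. The sketch below does that in plain Python; the house prices are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices in thousands (illustrative values only).
actual = [200, 150, 320]
predicted = [210, 140, 300]

# The absolute errors are 10, 10 and 20, so the MAE is 40 / 3.
print(mean_absolute_error(actual, predicted))
```

scikit-learn ships the same metric as `sklearn.metrics.mean_absolute_error`, which is the usual choice in practice.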
Decision Trees and Model Complexity
In decision trees, one important parameter is tree depth.
A shallow tree:
- Learns fewer rules
- May underfit
A deep tree:
- Learns many detailed rules
- May overfit
The challenge is finding a depth that balances learning and generalization.
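To see how depth translates into the number of learned rules, the following sketch fits a shallow and an unrestricted scikit-learn tree on the same data and reports how large each one grew. The synthetic dataset and the choice of `max_depth=2` for the shallow tree are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 10 * np.sin(X[:, 0]) + rng.normal(0, 2, size=200)

# A shallow tree is capped at two levels of splits; the deep tree has no cap.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)

for name, tree in [("shallow", shallow), ("deep", deep)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

Each leaf corresponds to one prediction rule, so a tree limited to depth 2 can hold at most four rules, while the unrestricted tree keeps splitting until it has carved out the noise as well. `max_leaf_nodes` is another scikit-learn parameter that limits complexity in a similar way.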
Practical Goal
The objective of machine learning is not the lowest training error. Instead, the objective is the best performance on unseen data.
This usually happens at a moderate level of model complexity rather than at the maximum possible complexity.
Key Takeaways
- Underfitting occurs when the model is too simple.
- Overfitting occurs when the model is too complex.
- Training error usually decreases as model complexity increases.
- Validation error helps identify the optimal model complexity.
- The best model is often found near the minimum validation error.
- Decision tree depth is a common example of the balance between underfitting and overfitting.
Further Learning
Kaggle: Underfitting and Overfitting
https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting
Wikipedia: Overfitting
https://en.wikipedia.org/wiki/Overfitting
Kaggle Intro to Machine Learning Course
https://www.kaggle.com/learn/intro-to-machine-learning