Machine learning beginners often experience a moment of excitement and confusion when they first train a model and see a large error value printed on the screen.
For example:
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# X (input features) and y (target house prices) are assumed to be
# loaded beforehand, e.g. from the Melbourne housing dataset.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```
Output:
265806.91478373145
At first glance, this number may look shocking.
But what exactly does it mean?
Let us understand the entire machine learning workflow behind this code step by step.
What Is Happening in This Code?
This code performs a complete beginner-level machine learning pipeline:
Data
→ Split Data
→ Train Model
→ Make Predictions
→ Measure Error
The model used here is a Decision Tree regressor from the scikit-learn library.
Step 1: Splitting the Dataset
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
In machine learning:
- X contains the input features
- y contains the target values
For a house-price dataset:
| Features (X) | Target (y) |
|---|---|
| Rooms, Area, Location | House Price |
The train_test_split() function randomly divides the dataset into two parts.
| Dataset Portion | Purpose |
|---|---|
| Training Data | Used to teach the model |
| Validation Data | Used to test the model |
Think of it like studying for an exam:
- Training set → study materials
- Validation set → final exam
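By default, train_test_split() holds out 25% of the rows for validation; the test_size parameter controls this. A minimal sketch on a toy list (not the housing data):

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for 100 samples
train, val = train_test_split(rows, random_state=0)
print(len(train), len(val))  # 75 25 (default: 25% held out)
```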
Why Do We Split the Data?
Suppose a student memorizes answers instead of understanding concepts.
They may perform well on practice questions but fail on new questions.
Machine learning models behave similarly.
If we test the model on the same data used for training, it may simply memorize the dataset instead of learning useful patterns.
The validation set helps us answer the real question:
Can the model predict unseen data accurately?
This idea is fundamental in Machine Learning.
Understanding random_state=0
random_state=0
Data splitting involves randomness.
Without a fixed random state:
- the split changes every time
- results become inconsistent
Setting random_state=0 ensures:
- the same training/validation split every run
- reproducible experiments
This is very important in real-world data science workflows.
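A quick way to convince yourself: run the split twice with the same random_state and compare the results. A minimal sketch using a toy list rather than the housing data:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
split_a = train_test_split(data, random_state=0)
split_b = train_test_split(data, random_state=0)
print(split_a == split_b)  # True: identical split on every run
```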
Step 2: Creating the Decision Tree Model
melbourne_model = DecisionTreeRegressor()
This creates a Decision Tree regression model.
A decision tree works like a series of questions.
Example:
- Is number_of_rooms > 3?
  - YES → Is area > 2000 sq ft?
    - YES → expensive house
    - NO → medium-priced house
  - NO → cheaper house
The algorithm automatically discovers such rules from the data.
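If you want to see the questions a fitted tree actually asks, scikit-learn can print them with export_text(). The snippet below is a small sketch on made-up data; the feature names and values are illustrative, not from the real dataset:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: [rooms, area] → price (illustrative values only)
X_toy = [[2, 900], [3, 1500], [4, 2500], [5, 3000]]
y_toy = [300_000, 450_000, 850_000, 1_000_000]

model = DecisionTreeRegressor(max_depth=2)
model.fit(X_toy, y_toy)
print(export_text(model, feature_names=["rooms", "area"]))
```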
Step 3: Training the Model
melbourne_model.fit(train_X, train_y)
This is where learning happens.
The model studies relationships between:
Input Features → Output Values
Example:
| Rooms | Area | Price |
|---|---|---|
| 2 | 900 | 300000 |
| 4 | 2500 | 850000 |
The model gradually learns patterns such as:
- larger houses tend to cost more
- more rooms often increase price
- location impacts valuation
Training is essentially the model discovering mathematical patterns hidden inside data.
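As a toy illustration, fitting on just the two rows from the table above already gives the model a feature-to-price mapping (illustrative values, not a real training run):

```python
from sklearn.tree import DecisionTreeRegressor

# The two rows from the table above: [rooms, area] → price
X_train = [[2, 900], [4, 2500]]
y_train = [300_000, 850_000]

model = DecisionTreeRegressor()
model.fit(X_train, y_train)        # the tree stores the learned splits
print(model.predict([[3, 1800]]))  # returns one of the learned prices
```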
Step 4: Making Predictions
val_predictions = melbourne_model.predict(val_X)
Now the trained model predicts prices for houses it has never seen before.
Example:
| Actual Price | Predicted Price |
|---|---|
| 500000 | 470000 |
| 900000 | 850000 |
The closer the predictions are to actual values, the better the model.
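A convenient way to eyeball this is to line up actual and predicted values side by side. A sketch, assuming val_X and val_y are pandas objects and melbourne_model is the fitted model from the earlier code:

```python
import pandas as pd

# First five validation houses: actual price vs model prediction
comparison = pd.DataFrame({
    "actual": val_y.head(),
    "predicted": melbourne_model.predict(val_X.head()),
})
print(comparison)
```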
Step 5: Measuring Prediction Error
mean_absolute_error(val_y, val_predictions)
This calculates:
Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

Where:
- $y_i$ = actual value
- $\hat{y}_i$ = predicted value
- $n$ = number of predictions
Plain English Meaning of MAE
The calculation process is:
- Find each prediction error (actual − predicted)
- Ignore the sign (take the absolute value)
- Add up all the absolute errors
- Divide by the number of predictions
Example:
| Actual | Predicted | Absolute Error |
|---|---|---|
| 500000 | 470000 | 30000 |
| 800000 | 850000 | 50000 |
Average these errors: (30,000 + 50,000) / 2 = 40,000 → MAE.
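Reproducing the table in code makes the definition concrete (a toy calculation, not the real dataset):

```python
from sklearn.metrics import mean_absolute_error

actual = [500_000, 800_000]
predicted = [470_000, 850_000]

# |500000 - 470000| = 30000, |800000 - 850000| = 50000, mean = 40000
print(mean_absolute_error(actual, predicted))  # 40000.0
```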
Interpreting Your Result
Output:
265806.91478373145
This means:
On average, the model’s predictions differ from actual house prices by about $265,807.
Whether this is good or bad depends entirely on the dataset.
For example:
| Average House Price | MAE Quality |
|---|---|
| $300,000 | Very poor |
| $5 million | Reasonable |
Machine learning metrics always need business or real-world context.
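One quick way to add that context is to express the MAE as a fraction of the typical price. A sketch, assuming val_y is a pandas Series of actual validation prices from the earlier code:

```python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(val_y, val_predictions)
print(f"MAE is {mae / val_y.mean():.0%} of the average house price")
```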
The Hidden Problem: Overfitting
A Decision Tree can become too complex.
Instead of learning general patterns, it memorizes training data.
This phenomenon is called:
Overfitting
An overfitted model:
- performs extremely well on training data
- performs poorly on unseen validation data
This is one of the most important concepts in machine learning.
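A simple diagnostic is to compare the error on training data with the error on validation data; a large gap between the two is the classic overfitting signature. A sketch reusing the variables from the main example:

```python
from sklearn.metrics import mean_absolute_error

# Error on data the model has already seen vs data it has not
train_mae = mean_absolute_error(train_y, melbourne_model.predict(train_X))
val_mae = mean_absolute_error(val_y, melbourne_model.predict(val_X))
print(f"train MAE: {train_mae:,.0f}  validation MAE: {val_mae:,.0f}")
# An unrestricted tree often scores near 0 on training data
# while the validation MAE stays large
```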
How Developers Reduce Overfitting
One common solution is limiting tree complexity.
Example:
DecisionTreeRegressor(max_leaf_nodes=100)
This prevents the tree from becoming excessively detailed.
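In practice, developers often try several values of max_leaf_nodes and keep the one with the lowest validation MAE. A sketch reusing train_X, val_X, train_y, val_y from earlier; the candidate values are arbitrary:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

for leaf_nodes in [5, 50, 500, 5000]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"max_leaf_nodes={leaf_nodes}: validation MAE = {mae:,.0f}")
```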
Other advanced methods include:
- Random Forest
- Gradient Boosting
These techniques usually produce more accurate and stable predictions.
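A Random Forest, for example, averages many decision trees and often lowers the validation error with a one-line model swap. A sketch with the same variables as before:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest = RandomForestRegressor(random_state=0)
forest.fit(train_X, train_y)
print(mean_absolute_error(val_y, forest.predict(val_X)))
```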
Why This Workflow Matters
This simple example represents the foundation of modern AI systems.
The same workflow powers:
- recommendation engines
- fraud detection systems
- medical diagnosis tools
- stock forecasting
- ad targeting
- demand prediction
- pricing systems
Almost every practical AI system follows this structure:
Collect Data
→ Train Model
→ Predict Outcomes
→ Evaluate Accuracy
→ Improve Model
Final Thoughts
Although the code is short, it introduces several foundational concepts in Data Science and Machine Learning:
- dataset splitting
- model training
- prediction generation
- error evaluation
- overfitting
- model generalization
Understanding these ideas deeply is far more important than simply running the code.
Once these fundamentals become clear, advanced topics like neural networks, deep learning, and large language models become much easier to understand.