Last Updated on June 4, 2026 by Rajeev Bagra
When starting machine learning on Kaggle, you may see code like:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
At first glance, this looks like a collection of mysterious imports. In reality, each line represents an important concept in the machine learning workflow.
Let’s understand them one by one.
What Is a Library?
A library is a collection of pre-written code that developers can reuse.
Instead of writing thousands of lines of code yourself, you import libraries created by experts.
Example:
import pandas as pd
This imports the Pandas library.
Without Pandas, reading and manipulating datasets would be much more difficult.
What Is Pandas?
Pandas is one of the most widely used Python libraries for working with tabular data.
It helps you:
- Read CSV files
- Filter rows
- Select columns
- Calculate statistics
- Clean data
Example:
import pandas as pd
Here pd is simply a nickname (an alias) for pandas.
Now instead of writing:
pandas.read_csv()
you can write:
pd.read_csv()
What Is a Dataset?
Machine learning learns from data.
A dataset is simply a collection of information.
Example:
| House Size | Bedrooms | Price |
|---|---|---|
| 1500 | 3 | 250000 |
| 1800 | 4 | 320000 |
| 2200 | 5 | 450000 |
This table is a dataset.
What Is a CSV File?
CSV stands for:
Comma-Separated Values
Example:
size,bedrooms,price
1500,3,250000
1800,4,320000
2200,5,450000
The Kaggle code loads data from:
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
This file contains housing data used for prediction.
What Is Reading Data?
The dataset is loaded into memory using:
home_data = pd.read_csv(iowa_file_path)
This tells Pandas:
Open the CSV file and convert it into a table that Python can work with.
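To try this without the Kaggle file, here is a minimal sketch that reads the small CSV sample shown earlier from a string. The csv_text variable is a hypothetical stand-in for a file on disk; on Kaggle you would pass the file path instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (same rows as the sample above).
csv_text = """size,bedrooms,price
1500,3,250000
1800,4,320000
2200,5,450000
"""

# read_csv accepts a file path or any file-like object.
home_data = pd.read_csv(io.StringIO(csv_text))

print(home_data.shape)          # (number of rows, number of columns)
print(list(home_data.columns))  # column names taken from the header line
```

The first line of the CSV becomes the column names, and every later line becomes one row.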
What Is a DataFrame?
A DataFrame is Pandas’ table structure.
Example:
home_data
might display:
| LotArea | YearBuilt | SalePrice |
|---|---|---|
| 8450 | 2003 | 208500 |
| 9600 | 1976 | 181500 |
Think of a DataFrame as a spreadsheet inside Python.
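As a rough illustration, the sketch below builds a DataFrame by hand from the two sample rows above and tries a few spreadsheet-like operations. The variable name newer is only for this example.

```python
import pandas as pd

# The two sample rows shown above, typed in by hand.
home_data = pd.DataFrame({
    "LotArea": [8450, 9600],
    "YearBuilt": [2003, 1976],
    "SalePrice": [208500, 181500],
})

print(home_data.head())               # show the first rows
print(home_data.columns.tolist())     # list the column names
print(home_data["SalePrice"].mean())  # 195000.0

# Filter rows: keep only houses built after 2000.
newer = home_data[home_data["YearBuilt"] > 2000]
print(len(newer))                     # 1
```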
Features and Target Variables
Machine learning learns relationships.
Suppose:
| Size | Bedrooms | Price |
|---|---|---|
| 1500 | 3 | 250000 |
The inputs:
Size
Bedrooms
are called features.
The output:
Price
is called the target variable.
Features (X)
Features are information used to make predictions.
Example:
features = ['LotArea', 'YearBuilt', '1stFlrSF']
X = home_data[features]
Target (y)
The value we want to predict.
Example:
y = home_data.SalePrice
Here the model tries to predict:
SalePrice
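Putting features and target together, a minimal sketch might look like this. The three rows are a hypothetical mini-dataset that only borrows column names from the Kaggle housing data.

```python
import pandas as pd

# Hypothetical mini-dataset using column names from the Kaggle housing data.
home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "1stFlrSF": [856, 1262, 920],
    "SalePrice": [208500, 181500, 223500],
})

features = ["LotArea", "YearBuilt", "1stFlrSF"]
X = home_data[features]      # inputs: a DataFrame with one column per feature
y = home_data["SalePrice"]   # output: a single column (a Series)

print(X.shape)  # (3, 3)
print(y.name)   # SalePrice
```

The bracket form home_data["SalePrice"] does the same job as home_data.SalePrice, and it also works for column names such as 1stFlrSF that are not valid Python attribute names.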
What Is Machine Learning?
Traditional programming:
Rules + Data = Answers
Machine learning:
Data + Answers = Rules
The algorithm discovers patterns automatically.
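The contrast can be sketched in a few lines. Everything here is an assumption made for illustration: the rule of 150 per unit of size is invented, and LinearRegression is used only because it recovers such a simple rule cleanly.

```python
from sklearn.linear_model import LinearRegression

# Traditional programming: a person writes the rule by hand.
def price_from_rule(size):
    return 150 * size  # hypothetical rule invented for this example

# Machine learning: supply data and answers, let the algorithm find the rule.
sizes = [[1000], [1500], [2000], [2500]]         # data
prices = [price_from_rule(s[0]) for s in sizes]  # answers

model = LinearRegression()
model.fit(sizes, prices)

learned_rate = model.coef_[0]
print(round(learned_rate, 2))  # 150.0
```

The model was never told the number 150. It recovered it from the examples.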
What Is a Model?
A model is the learned pattern.
Example:
Larger house → Higher price
Newer house → Higher price
The model learns such relationships from data.
What Is Scikit-Learn?
Scikit-learn is one of Python’s most widely used machine learning libraries.
Import example:
from sklearn.tree import DecisionTreeRegressor
It provides:
- Decision Trees
- Linear Regression
- Random Forests
- Clustering
- Metrics
and much more.
What Is a Decision Tree?
The Kaggle code imports:
from sklearn.tree import DecisionTreeRegressor
A Decision Tree makes decisions by asking questions.
Example:
Is house size > 2000?
├── Yes → Price > 300000
└── No → Price < 300000
The model repeatedly splits data into smaller groups.
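To see the questions a tree actually asks, here is a small sketch on four made-up houses. The data is hypothetical, and max_depth=1 limits the tree to a single question so the output stays readable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Four made-up houses: two small and cheaper, two large and pricier.
X = pd.DataFrame({"size": [1500, 1800, 2200, 2600]})
y = [250000, 280000, 450000, 500000]

# max_depth=1 allows exactly one question (one split).
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

# Print the learned question and the value predicted on each side of it.
print(export_text(tree, feature_names=["size"]))

small_house = pd.DataFrame({"size": [1600]})
print(tree.predict(small_house))  # average price of the two smaller houses
```

With this data the tree splits at a size of 2000, halfway between 1800 and 2200, and predicts the average price of the houses on each side.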
Why Is It Called a Regressor?
There are two major ML tasks.
Regression
Predict a number.
Examples:
- House price
- Temperature
- Salary
Example output:
$350,000
Classification
Predict a category.
Examples:
- Spam or not spam
- Cat or dog
- Pass or fail
Example output:
Spam
Since house prices are numerical values, Kaggle uses:
DecisionTreeRegressor
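The difference shows up in what each model returns. This sketch uses made-up sizes, prices, and labels, and brings in DecisionTreeClassifier (scikit-learn's classification counterpart, not part of the Kaggle code) purely for comparison.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: the same houses with a numeric target and a category target.
sizes = [[1500], [1800], [2200], [2600]]
prices = [250000, 280000, 450000, 500000]
labels = ["cheap", "cheap", "expensive", "expensive"]

regressor = DecisionTreeRegressor(random_state=0).fit(sizes, prices)
classifier = DecisionTreeClassifier(random_state=0).fit(sizes, labels)

print(regressor.predict([[2200]]))   # a number
print(classifier.predict([[2200]]))  # a category label
```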
What Is Training a Model?
Training means allowing the model to learn from data.
Example:
model.fit(X, y)
The model studies:
Features → SalePrice
relationships.
What Is Prediction?
After training:
model.predict(X)
The model returns its estimated output for each row of features.
Example:
Actual Price: 250000
Predicted Price: 245000
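Training and prediction together can be sketched as follows. The four rows are hypothetical values that only borrow column names from the housing data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical rows; only the column names come from the housing data.
X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "YearBuilt": [2003, 1976, 2001, 1915],
})
y = pd.Series([208500, 181500, 223500, 140000], name="SalePrice")

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)                 # training: learn Features → SalePrice

predictions = model.predict(X)  # prediction: one estimate per row
print(predictions)
```

Here the predictions match the actual prices exactly, because an unrestricted tree can memorize every row it was trained on. That is why predictions on the training rows say little about how good the model really is.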
Why Split Data?
The Kaggle code imports:
from sklearn.model_selection import train_test_split
Why?
Because testing on the same data used for learning is misleading.
A student who memorizes answers is not necessarily knowledgeable.
Similarly, a model must be tested on unseen data.
What Is Training Data?
Training data teaches the model.
Example:
80% of data
What Is Validation Data?
Validation data tests the model.
Example:
20% of data
The model has never seen this data before.
train_test_split()
Example:
train_X, val_X, train_y, val_y = train_test_split(
X,
y,
random_state=1
)
This automatically creates four pieces (with no test_size given, scikit-learn holds out 25% of the rows for validation):
- Training features
- Validation features
- Training targets
- Validation targets
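A quick way to check what the split produces is to count rows. The 20-row dataset below is hypothetical and exists only to make the sizes visible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row dataset, just to make the split sizes visible.
X = pd.DataFrame({"LotArea": range(8000, 8020), "YearBuilt": range(1990, 2010)})
y = pd.Series(range(200000, 200020), name="SalePrice")

# Default split: 75% training, 25% validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print(len(train_X), len(val_X))    # 15 5

# An explicit 80/20 split uses the test_size parameter.
train_X2, val_X2, train_y2, val_y2 = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(len(train_X2), len(val_X2))  # 16 4
```

Features and targets are shuffled together, so each row of train_X still lines up with the matching value in train_y.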
What Is random_state?
Example:
random_state=1
This makes results reproducible.
Without it:
Split #1 → Different rows
Split #2 → Different rows
With it:
Always same split
Useful for debugging and learning.
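The effect is easy to check with a small sketch. The list of ten row numbers is made up for the example.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # stand-in for ten rows of data

# Two separate calls with the same random_state pick the same validation rows.
_, val_a = train_test_split(rows, random_state=1)
_, val_b = train_test_split(rows, random_state=1)

print(list(val_a) == list(val_b))  # True
```

The value 1 itself has no special meaning. Any fixed integer gives a repeatable split.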
What Are Metrics?
The Kaggle code imports:
from sklearn.metrics import mean_absolute_error
A metric measures model performance.
Think:
Exam Score for ML Models
Mean Absolute Error (MAE)
MAE is the average of the absolute differences between actual and predicted values.
Example:
| Actual | Predicted |
|---|---|
| 100 | 105 |
| 200 | 190 |
| 300 | 310 |
Errors:
5
10
10
Average:
8.33
MAE = 8.33
Lower is better.
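The same arithmetic can be checked in code. This sketch computes the MAE for the table above by hand and then with scikit-learn's mean_absolute_error.

```python
from sklearn.metrics import mean_absolute_error

actual = [100, 200, 300]
predicted = [105, 190, 310]

# By hand: average of the absolute differences (5 + 10 + 10) / 3.
manual_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# With scikit-learn: same result.
library_mae = mean_absolute_error(actual, predicted)

print(round(manual_mae, 2), round(library_mae, 2))  # 8.33 8.33
```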
Underfitting
A model that is too simple.
Example:
All houses cost $200,000
Clearly unrealistic.
The model misses important patterns.
Overfitting
A model that memorizes training data.
Example:
House #1 → exactly $253,871
House #2 → exactly $311,422
Works perfectly on training data.
Fails on new data.
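Both problems can be seen by comparing training error with validation error. The sketch below uses made-up data (a price that rises with size, plus random noise) and two trees: a very shallow one and an unrestricted one. The exact numbers depend on the random data, so none are shown here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: price rises with size, plus random noise no model can predict.
rng = np.random.default_rng(0)
size = rng.uniform(800, 3000, 200)
price = 150 * size + rng.normal(0, 30000, 200)
X = size.reshape(-1, 1)

train_X, val_X, train_y, val_y = train_test_split(X, price, random_state=1)

results = {}
for name, depth in [("shallow (max_depth=1)", 1), ("unrestricted", None)]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[name] = (train_mae, val_mae)
    print(f"{name}: train MAE {train_mae:,.0f}, validation MAE {val_mae:,.0f}")
```

The shallow tree has a large error even on the data it trained on, which is the sign of underfitting. The unrestricted tree reaches a training error of essentially zero by memorizing every row, yet its validation error stays far above zero, which is the sign of overfitting.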
The Goal: Generalization
A good model should:
- Learn patterns
- Avoid memorization
- Predict accurately on unseen data
This ability is called:
Generalization
and it is the ultimate goal of machine learning.
The Entire Kaggle Workflow in One Diagram
CSV Dataset
│
▼
Pandas DataFrame
│
▼
Select Features (X)
Select Target (y)
│
▼
train_test_split()
│
├── Training Data
└── Validation Data
│
▼
DecisionTreeRegressor
│
▼
Train Model
│
▼
Predict Prices
│
▼
mean_absolute_error()
│
▼
Evaluate Performance
│
▼
Tune Model
│
▼
Avoid Underfitting & Overfitting
This workflow is the foundation not only of Kaggle’s introductory course but also of most real-world machine learning projects.
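The diagram can be sketched end to end in code. This is an illustration under stated assumptions, not the course's exact notebook: the data is synthetic (a real run would load train.csv with pd.read_csv), the pricing formula is invented, and get_mae and the candidate leaf counts are hypothetical choices for the tuning step.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the CSV dataset (invented pricing formula plus noise).
rng = np.random.default_rng(0)
n = 300
home_data = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, n),
    "YearBuilt": rng.integers(1950, 2010, n),
})
home_data["SalePrice"] = (
    10 * home_data["LotArea"]
    + 1000 * (home_data["YearBuilt"] - 1950)
    + rng.normal(0, 10000, n)
)

# Select features (X) and target (y).
features = ["LotArea", "YearBuilt"]
X = home_data[features]
y = home_data["SalePrice"]

# Split into training and validation data.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

def get_mae(max_leaf_nodes):
    """Train a tree of the given size and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Tune: try several tree sizes and keep the one with the lowest validation MAE.
candidates = [5, 25, 50, 100]
scores = {leaves: get_mae(leaves) for leaves in candidates}
best_leaves = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes:", best_leaves)
```

Too few leaves tends to underfit and too many tends to overfit, so the lowest validation MAE usually sits somewhere in between.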