Modeling MPG - Linear Regression

Modeling MPG: A Case Study in Stable Linear Regression for Car Fuel Efficiency
Introduction: Why Predict Fuel Efficiency?
For a recent project, I built a linear regression model to predict a car's fuel efficiency (MPG) based on core features like displacement, horsepower, and weight. The goal was not just prediction, but rigorously testing the model's stability and the impact of different data preparation choices.
This post walks through the process, sharing the Python code used for each step—from finding missing values to the final test set evaluation.
Phase 1: Data Acquisition and Exploratory Analysis (EDA)
We focused on five key columns and started by analyzing the target variable, 'fuel_efficiency_mpg'.
1. Initial Load and Missing Value Check
The first step was loading the data and identifying missing values (NaNs), since any column that contains them needs imputation later.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load data and select relevant columns
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')
df = df[['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']]
# Check for missing values
missing_counts = df.isnull().sum()
print("Missing values per column:\n", missing_counts)
# Calculate the median for the identified missing column (Q2)
median_horsepower = df['horsepower'].median()
print(f"\nMedian Horsepower: {median_horsepower}")
EDA Result: The 'horsepower' column was the only one with missing values. Its median was 93.0.
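To make the missing-value check concrete, here is a minimal sketch on a made-up toy frame (the values are hypothetical and are not taken from the car dataset). It shows how to list only the columns that contain NaNs, and that `median()` skips NaNs by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 150.0, 120.0],
    "vehicle_weight": [2500.0, 3000.0, 3200.0, 2800.0],
})

# Count NaNs per column and keep only the columns that have any.
missing = toy.isnull().sum()
cols_with_missing = missing[missing > 0].index.tolist()

# Series.median() ignores NaN by default (skipna=True).
toy_median = toy["horsepower"].median()

print(cols_with_missing, toy_median)
```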
Target Variable Distribution: The target variable, 'fuel_efficiency_mpg', showed a right-skewed distribution with a long tail. For a skewed target, a logarithmic transformation often helps linear models, although we did not apply one in the final evaluation.
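As an illustration of the log transformation mentioned here, the sketch below uses a synthetic lognormal target (an assumption for demonstration, not the real MPG column). It shows that `np.log1p` reduces the skew and that `np.expm1` maps values back to the original scale. The `skewness` helper is a hypothetical name introduced only for this example.

```python
import numpy as np

# Synthetic right-skewed target (lognormal); NOT the car dataset.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

def skewness(a):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (a - a.mean()) / a.std()
    return (z ** 3).mean()

# A model would be trained on log1p(y) instead of y ...
y_log = np.log1p(y)

# ... and its predictions mapped back to the original scale with expm1.
y_back = np.expm1(y_log)

print(f"skew before: {skewness(y):.2f}, after log1p: {skewness(y_log):.2f}")
```

If the transform were adopted, RMSE would have to be computed after applying `np.expm1` to the predictions to stay comparable with the numbers reported in this post.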
Phase 2: Data Splitting and Preparation Functions
Before any modeling, we defined functions for the core linear regression algorithm (with and without regularization) and an evaluation metric (RMSE). We also set up our data splitting process.
2.1 Model & Evaluation Functions
def train_linear_regression(X, y):
    """Trains a simple linear regression model."""
    # Add the bias term (intercept)
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # Normal equation: w = (X^T * X)^-1 * (X^T * y)
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w[0], w[1:]  # Return bias and weights

def rmse(y, y_pred):
    """Calculates the Root Mean Squared Error."""
    error = y - y_pred
    mse = (error ** 2).mean()
    return np.sqrt(mse)

def train_linear_regression_reg(X, y, r=0.0):
    """Trains a regularized linear regression model (Ridge)."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # Regularization term (Identity matrix)
    XTX = X.T.dot(X)
    reg_term = r * np.eye(XTX.shape[0])
    # Normal equation: w = (X^T * X + r*I)^-1 * (X^T * y)
    XTX = XTX + reg_term
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w[0], w[1:]  # Return bias and weights
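A quick way to gain confidence in a hand-written normal-equation solver is to compare it with NumPy's least-squares routine on synthetic data. The sketch below does that. Note two assumptions: the data and coefficients are made up, and the hypothetical helper `fit_normal_equation` uses `np.linalg.solve` rather than the explicit inverse used in the functions above, which is a substitution for numerical robustness and not the post's original code.

```python
import numpy as np

def fit_normal_equation(X, y, r=0.0):
    """Closed form w = (X^T X + r I)^-1 X^T y, with a bias column prepended."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    # solve() avoids forming the explicit inverse
    w = np.linalg.solve(XTX, X.T.dot(y))
    return w[0], w[1:]

# Synthetic data with known coefficients (hypothetical values).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 + X.dot(np.array([1.5, -0.5, 3.0])) + rng.normal(scale=0.1, size=200)

# Unregularized fit
w0, w = fit_normal_equation(X, y)

# Reference solution from NumPy's least-squares solver
X_aug = np.column_stack([np.ones(len(X)), X])
w_ref, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# With r > 0 the coefficient vector (bias included) is shrunk toward zero
w0_r, w_r = fit_normal_equation(X, y, r=10.0)

print(np.round(np.concatenate([[w0], w]), 3))
print(np.round(w_ref, 3))
```

As in the functions above, the penalty `r * I` here also applies to the bias term, so the intercept is shrunk along with the weights.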
2.2 Data Split (60/20/20)
We used a 60%/20%/20% split for train/validation/test sets and fixed the seed to 42 for reproducibility in the intermediate steps.
def prepare_data(df, seed=42):
    """Shuffles and splits the data into train, validation, and test sets."""
    df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
    df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)  # 0.25 of 80% = 20%
    df_train = df_train.reset_index(drop=True)
    df_val = df_val.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)
    y_train = df_train['fuel_efficiency_mpg'].values
    y_val = df_val['fuel_efficiency_mpg'].values
    y_test = df_test['fuel_efficiency_mpg'].values
    # Drop target variable from feature matrices
    X_train = df_train.drop('fuel_efficiency_mpg', axis=1)
    X_val = df_val.drop('fuel_efficiency_mpg', axis=1)
    X_test = df_test.drop('fuel_efficiency_mpg', axis=1)
    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = prepare_data(df, seed=42)
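The two-step split is easy to get wrong, so it can be worth checking the resulting sizes. This minimal sketch repeats the same two `train_test_split` calls on a hypothetical 1000-row stand-in frame and confirms that the result is a 60/20/20 partition with no overlap.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame with 1000 rows; not the car dataset.
toy = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 2.0})

# Step 1: hold out 20% as the test set.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)
# Step 2: 25% of the remaining 80% becomes validation (20% overall).
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)

# The three index sets should not share any rows.
no_overlap = (
    set(train.index).isdisjoint(val.index)
    and set(train.index).isdisjoint(test.index)
    and set(val.index).isdisjoint(test.index)
)
```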
Phase 3: Imputation and Regularization Experiments
3.1 Imputation Comparison (Q3)
We compared filling the missing 'horsepower' values with 0 versus the mean of the training data.
# Impute with 0
X_train_0 = X_train.fillna(0).values
X_val_0 = X_val.fillna(0).values
w0, w = train_linear_regression(X_train_0, y_train)
y_pred_0 = w0 + X_val_0.dot(w)
rmse_0 = round(rmse(y_val, y_pred_0), 2)
print(f"RMSE (Impute with 0): {rmse_0}")
# Impute with Mean
mean_hp_train = X_train['horsepower'].mean()
X_train_mean = X_train.fillna(mean_hp_train).values
X_val_mean = X_val.fillna(mean_hp_train).values
w0, w = train_linear_regression(X_train_mean, y_train)
y_pred_mean = w0 + X_val_mean.dot(w)
rmse_mean = round(rmse(y_val, y_pred_mean), 2)
print(f"RMSE (Impute with Mean): {rmse_mean}")
Result: Both RMSE scores were 4.90. We chose to proceed with 0-imputation.
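The important detail in mean imputation is that the statistic comes from the training split only and is then reused for validation, so nothing leaks from the held-out data. The sketch below shows this on hypothetical toy splits. It also passes a dict to `fillna`, which restricts the fill to the named column. That is a slightly stricter variant than the frame-wide `fillna` used above, which is equivalent here only because 'horsepower' is the sole column with NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits; values are illustrative only.
train = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "vehicle_weight": [2500.0, 2700.0, 3100.0],
})
val = pd.DataFrame({
    "horsepower": [np.nan, 90.0],
    "vehicle_weight": [2600.0, 2400.0],
})

# The statistic is computed on the training split only (mean skips NaN).
hp_mean = train["horsepower"].mean()

# A dict restricts the fill to the named column.
train_filled = train.fillna({"horsepower": hp_mean})
val_filled = val.fillna({"horsepower": hp_mean})

print(val_filled)
```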
3.2 Regularization Tuning (Q4)
Using 0-imputation, we tested a range of regularization strengths (r).
r_values = [0, 0.01, 0.1, 1, 5, 10, 100]
best_rmse = float('inf')
best_r = -1
print("\n--- Regularization Results ---")
for r in r_values:
    # Train using the 0-imputed training data
    w0, w = train_linear_regression_reg(X_train_0, y_train, r=r)
    # Predict on the 0-imputed validation data
    y_pred = w0 + X_val_0.dot(w)
    score = round(rmse(y_val, y_pred), 2)
    print(f"r={r:<4} | RMSE: {score}")
    # Strict '<' keeps the first (smallest) r when scores tie,
    # because r_values is sorted in ascending order
    if score < best_rmse:
        best_rmse = score
        best_r = r
print(f"\nBest r (smallest value achieving best RMSE): {best_r}")
Result: The best rounded RMSE (4.90) was shared by several values of r. Breaking the tie toward the smallest value gives r=0, or r=0.01 if only nonzero regularization is considered.
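When several values of r tie on the rounded score, the tie-breaking rule should be explicit rather than a side effect of loop order. One possible way to express "lowest score first, then smallest r" is to take the minimum over `(score, r)` pairs. The scores in this sketch are hypothetical placeholders, not the numbers from this experiment.

```python
# Hypothetical rounded validation scores keyed by r; placeholders only.
scores = {0: 0.52, 0.01: 0.52, 0.1: 0.52, 1: 0.53, 10: 0.55}

# Tuples compare element by element: score first, then r.
# Ties on score therefore resolve to the smallest r.
best_score, best_r = min((s, r) for r, s in scores.items())

print(best_r, best_score)
```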
3.3 Model Stability Check (Q5)
We tested model stability by running the whole process (0-imputation, no regularization) across 10 different random seeds.
seed_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rmse_scores = []
for seed in seed_values:
    X_train, X_val, _, y_train, y_val, _ = prepare_data(df, seed=seed)
    X_train_0 = X_train.fillna(0).values
    X_val_0 = X_val.fillna(0).values
    # Train (r=0, no regularization)
    w0, w = train_linear_regression(X_train_0, y_train)
    # Evaluate
    y_pred = w0 + X_val_0.dot(w)
    score = rmse(y_val, y_pred)
    rmse_scores.append(score)
std_rmse = round(np.std(rmse_scores), 3)
print(f"\nStandard Deviation of RMSE across seeds: {std_rmse}")
Result: The standard deviation was approximately 0.006. A value this low indicates that the validation RMSE is not sensitive to how the data is split.
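Two details help when reading this number. `np.std` computes the population standard deviation by default (`ddof=0`), and a standard deviation is easier to judge relative to the mean score. The sketch below illustrates both with hypothetical RMSE values, not the ones produced by the loop above.

```python
import numpy as np

# Hypothetical RMSE values from ten splits; illustrative only.
scores = np.array([0.521, 0.519, 0.523, 0.520, 0.518,
                   0.522, 0.524, 0.517, 0.521, 0.520])

std_pop = np.std(scores)             # population std, ddof=0 (NumPy default)
std_sample = np.std(scores, ddof=1)  # sample std, divides by n - 1

# Relative spread: std as a fraction of the mean score.
rel_spread = std_pop / scores.mean()

print(f"pop={std_pop:.4f} sample={std_sample:.4f} relative={rel_spread:.2%}")
```

With only ten seeds the two estimates differ by a factor of sqrt(10/9), about 5%, which does not change the conclusion here.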
Phase 4: Final Model Evaluation (Q6)
For the final test, we combined the training and validation sets, used seed 9, filled NAs with 0, and trained with a slight regularization, r=0.001.
# 1. Prepare data with seed 9
X_train, X_val, X_test, y_train, y_val, y_test = prepare_data(df, seed=9)
# 2. Combine train and validation sets
X_full_train = pd.concat([X_train, X_val]).fillna(0).values
y_full_train = np.concatenate([y_train, y_val])
# 3. Impute test set with 0
X_test_0 = X_test.fillna(0).values
# 4. Train final model with r=0.001
r_final = 0.001
w0_final, w_final = train_linear_regression_reg(X_full_train, y_full_train, r=r_final)
# 5. Evaluate on test set
y_pred_test = w0_final + X_test_0.dot(w_final)
final_rmse = round(rmse(y_test, y_pred_test), 2)
print("\n--- Final Model ---")
print(f"Training on combined set (seed 9), r={r_final}, 0-imputation.")
print(f"Final RMSE on the Test Dataset: {final_rmse}")
Final Result: The RMSE on the unseen test dataset was 5.15.
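When interpreting this figure, keep in mind that RMSE is not the same as the average absolute error. Squaring weights large misses more heavily, so RMSE is always at least as large as the mean absolute error (MAE). The sketch below shows the difference on hypothetical targets and predictions, which are illustrative and not from the test set.

```python
import numpy as np

# Hypothetical targets and predictions; illustrative only.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([21.0, 24.0, 30.0, 41.0])

err = y_true - y_pred
rmse_val = np.sqrt((err ** 2).mean())  # penalizes large errors more heavily
mae_val = np.abs(err).mean()           # plain average absolute error

print(f"RMSE={rmse_val:.3f}  MAE={mae_val:.3f}")
```

A single large miss (6 units in this toy example) pulls RMSE well above MAE, which is why the two numbers should not be read interchangeably.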
Conclusion
This project was a close look at how stable a linear regression model is under different data preparation choices. After testing imputation methods, regularization strengths, and data splits, we arrived at a model that predicts car fuel efficiency with a test RMSE of about 5.15 MPG.
Tags: #DataScience #MachineLearning #Python #Regression #CodeTutorial #ZoomCamp #DataTalksClub