Modeling MPG: A Case Study in Stable Linear Regression for Car Fuel Efficiency

Introduction: Why Predict Fuel Efficiency?

For a recent project, I built a linear regression model to predict a car's fuel efficiency (MPG) based on core features like displacement, horsepower, and weight. The goal was not just prediction, but rigorously testing the model's stability and the impact of different data preparation choices.

This post walks through the process and shares the Python code used for each step, from finding missing values to the final test set evaluation.

Phase 1: Data Acquisition and Exploratory Analysis (EDA)

We focused on five key columns and started by analyzing the target variable, 'fuel_efficiency_mpg'.

1. Initial Load and Missing Value Check

The first step was loading the data and identifying any missing values (NaNs), since these determine which columns need imputation later.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data and select relevant columns
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')
df = df[['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']]

# Check for missing values
missing_counts = df.isnull().sum()
print("Missing values per column:\n", missing_counts)

# Calculate the median for the identified missing column (Q2)
median_horsepower = df['horsepower'].median()
print(f"\nMedian Horsepower: {median_horsepower}")

EDA Result: The 'horsepower' column was the only one with missing values. Its median was 93.0.

Target Variable Distribution: The target variable, 'fuel_efficiency_mpg', exhibited a right-skewed distribution (long tail). For a skewed target, a logarithmic transformation often helps linear models, although we did not apply one in the final evaluation.
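
As a rough illustration of the log transformation mentioned above (not part of the evaluated pipeline), the sketch below applies np.log1p to a few made-up target values and inverts it with np.expm1. The array y is hypothetical and only shows the round trip.

```python
import numpy as np

# Illustrative target values only (not taken from the dataset).
y = np.array([8.0, 12.5, 15.0, 18.2, 40.0])

# log1p computes log(1 + y), which stays defined even when y is 0.
y_log = np.log1p(y)

# A model trained on y_log predicts in log space;
# expm1 is the exact inverse of log1p and maps predictions back to MPG.
y_back = np.expm1(y_log)

print("Range before:", y.max() - y.min())
print("Range after: ", y_log.max() - y_log.min())
print("Round trip recovers y:", np.allclose(y, y_back))
```

If a model were trained on the transformed target, RMSE would need to be computed after converting predictions back with np.expm1 to stay comparable with the scores reported in this post.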

Phase 2: Data Splitting and Preparation Functions

Before any modeling, we defined functions for the core linear regression algorithm (with and without regularization) and an evaluation metric (RMSE). We also set up our data splitting process.

2.1 Model & Evaluation Functions

def train_linear_regression(X, y):
    """Trains a simple linear regression model."""
    # Add the bias term (intercept)
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Normal equation: w = (X^T * X)^-1 * (X^T * y)
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:] # Return bias and weights

def rmse(y, y_pred):
    """Calculates the Root Mean Squared Error."""
    error = y - y_pred
    mse = (error ** 2).mean()
    return np.sqrt(mse)

def train_linear_regression_reg(X, y, r=0.0):
    """Trains a regularized linear regression model (Ridge)."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Regularization term (Identity matrix)
    XTX = X.T.dot(X)
    reg_term = r * np.eye(XTX.shape[0])

    # Normal equation: w = (X^T * X + r*I)^-1 * (X^T * y)
    XTX = XTX + reg_term

    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:] # Return bias and weights
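
One way to sanity-check the normal-equation approach is to compare it with NumPy's least-squares solver on synthetic data. The sketch below is an assumption-laden check, not part of the original project: the data, the true weights, and the noise level are all made up.

```python
import numpy as np

# Synthetic data with known coefficients (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=200)

# Normal equation with an explicit bias column, as in the functions above.
X_b = np.column_stack([np.ones(len(X)), X])
w_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Normal equation:", w_normal)
print("lstsq:          ", w_lstsq)
print("Solutions agree:", np.allclose(w_normal, w_lstsq))
```

On well-conditioned data the two agree closely. The explicit inverse becomes fragile when columns are nearly collinear, which is where the regularized variant is useful.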

2.2 Data Split (60/20/20)

We used a 60%/20%/20% split for train/validation/test sets and fixed the seed to 42 for reproducibility in the intermediate steps.

def prepare_data(df, seed=42):
    """Shuffles and splits the data into train, validation, and test sets."""
    df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
    df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed) # 0.25 of 80% = 20%

    df_train = df_train.reset_index(drop=True)
    df_val = df_val.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)

    y_train = df_train['fuel_efficiency_mpg'].values
    y_val = df_val['fuel_efficiency_mpg'].values
    y_test = df_test['fuel_efficiency_mpg'].values

    # Drop target variable from feature matrices
    X_train = df_train.drop('fuel_efficiency_mpg', axis=1)
    X_val = df_val.drop('fuel_efficiency_mpg', axis=1)
    X_test = df_test.drop('fuel_efficiency_mpg', axis=1)

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = prepare_data(df, seed=42)
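
To see why test_size=0.25 in the second call yields a 60/20/20 split, the two-stage split can be checked on a small placeholder frame. The demo DataFrame below is hypothetical and only used to count rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 1000 rows (not the car dataset).
demo = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 2.0})

# Stage 1: hold out 20% for testing.
full_train, test = train_test_split(demo, test_size=0.2, random_state=42)

# Stage 2: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

The three parts are also disjoint, since each row ends up in exactly one of them.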

Phase 3: Imputation and Regularization Experiments

3.1 Imputation Comparison (Q3)

We compared filling the missing 'horsepower' values with 0 versus the mean of the training data.

# Impute with 0
X_train_0 = X_train.fillna(0).values
X_val_0 = X_val.fillna(0).values

w0, w = train_linear_regression(X_train_0, y_train)
y_pred_0 = w0 + X_val_0.dot(w)
rmse_0 = round(rmse(y_val, y_pred_0), 2)
print(f"RMSE (Impute with 0): {rmse_0}")

# Impute with Mean
mean_hp_train = X_train['horsepower'].mean()
X_train_mean = X_train.fillna(mean_hp_train).values
X_val_mean = X_val.fillna(mean_hp_train).values

w0, w = train_linear_regression(X_train_mean, y_train)
y_pred_mean = w0 + X_val_mean.dot(w)
rmse_mean = round(rmse(y_val, y_pred_mean), 2)
print(f"RMSE (Impute with Mean): {rmse_mean}")

Result: Both RMSE scores were 4.90. We chose to proceed with 0-imputation.
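
A note on the mean imputation above: passing a scalar to fillna fills NaNs in every column, which works here only because 'horsepower' is the sole column with missing values. A minimal sketch of a column-specific alternative, using tiny made-up frames (the values are hypothetical), could look like this:

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the train and validation features.
train = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "vehicle_weight": [3000.0, 2500.0, 2800.0],
})
val = pd.DataFrame({
    "horsepower": [np.nan, 120.0],
    "vehicle_weight": [2700.0, 3100.0],
})

# Compute the fill value from the training set only, so nothing
# from the validation set leaks into the preparation step.
fill_values = {"horsepower": train["horsepower"].mean()}

# A dict restricts the fill to the named column.
train_filled = train.fillna(fill_values)
val_filled = val.fillna(fill_values)

print(val_filled)
```

The same fill_values dict would then be reused for the test set.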

3.2 Regularization Tuning (Q4)

Using 0-imputation, we tested a range of regularization strengths (r).

r_values = [0, 0.01, 0.1, 1, 5, 10, 100]
best_rmse = float('inf')
best_r = -1

print("\n--- Regularization Results ---")
for r in r_values:
    # Train using the 0-imputed training data
    w0, w = train_linear_regression_reg(X_train_0, y_train, r=r)

    # Predict on the 0-imputed validation data
    y_pred = w0 + X_val_0.dot(w)
    score = round(rmse(y_val, y_pred), 2)
    print(f"r={r:<4} | RMSE: {score}")

    # Strict '<' keeps the smallest r when rounded scores tie
    if score < best_rmse:
        best_rmse = score
        best_r = r

print(f"\nBest r (smallest value achieving best RMSE): {best_r}")

Result: Several values of r achieved the same best rounded RMSE (4.90). When scores tie, the smallest r is the natural choice, which here is r=0 (or r=0.01 if only nonzero values are considered).
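
Regularization made little difference on this dataset, but the term r*I matters when features are collinear. The sketch below uses a made-up matrix with a duplicated column to show how adding r to the diagonal restores full rank. Note that, like train_linear_regression_reg above, it also adds r to the bias entry.

```python
import numpy as np

# Made-up design matrix: a bias column plus the same feature twice.
col = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), col, col])

# The duplicated column makes X^T X rank-deficient, so it has no true inverse.
XTX = X.T @ X
rank_plain = np.linalg.matrix_rank(XTX)

# Adding r to the diagonal shifts every eigenvalue up by r,
# which makes the matrix invertible again.
r = 0.01
XTX_reg = XTX + r * np.eye(XTX.shape[0])
rank_reg = np.linalg.matrix_rank(XTX_reg)

print("Rank without regularization:", rank_plain)
print("Rank with r = 0.01:         ", rank_reg)
```

This is only an illustration of the mechanism; the car dataset in this post did not show this problem.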

3.3 Model Stability Check (Q5)

We tested model stability by running the whole process (0-imputation, no regularization) across 10 different random seeds.

seed_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rmse_scores = []

for seed in seed_values:
    X_train, X_val, _, y_train, y_val, _ = prepare_data(df, seed=seed)

    X_train_0 = X_train.fillna(0).values
    X_val_0 = X_val.fillna(0).values

    # Train (r=0, no regularization)
    w0, w = train_linear_regression(X_train_0, y_train) 

    # Evaluate
    y_pred = w0 + X_val_0.dot(w)
    score = rmse(y_val, y_pred)
    rmse_scores.append(score)

std_rmse = round(np.std(rmse_scores), 3)
print(f"\nStandard Deviation of RMSE across seeds: {std_rmse}")

Result: The standard deviation was approximately 0.006. This low value indicates that the model is stable and not overly sensitive to how the data is split.
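
One detail worth knowing when reading this number: np.std defaults to the population formula (ddof=0). With only 10 seeds, the sample formula (ddof=1) gives a slightly larger value. The scores below are made up for illustration and are not the results from this post.

```python
import numpy as np

# Made-up RMSE values for illustration only.
scores = np.array([0.52, 0.51, 0.53, 0.50, 0.52])

# NumPy default: divide by n (population standard deviation).
std_population = np.std(scores)

# Sample standard deviation: divide by n - 1.
std_sample = np.std(scores, ddof=1)

print("ddof=0:", round(std_population, 4))
print("ddof=1:", round(std_sample, 4))
```

Either choice is fine for comparing stability across experiments, as long as the same one is used throughout.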

Phase 4: Final Model Evaluation (Q6)

For the final test, we split the data with seed 9, combined the training and validation sets, filled NAs with 0, and trained with slight regularization, r=0.001.

# 1. Prepare data with seed 9
X_train, X_val, X_test, y_train, y_val, y_test = prepare_data(df, seed=9)

# 2. Combine train and validation sets
X_full_train = pd.concat([X_train, X_val]).fillna(0).values
y_full_train = np.concatenate([y_train, y_val])

# 3. Impute test set with 0
X_test_0 = X_test.fillna(0).values

# 4. Train final model with r=0.001
r_final = 0.001
w0_final, w_final = train_linear_regression_reg(X_full_train, y_full_train, r=r_final)

# 5. Evaluate on test set
y_pred_test = w0_final + X_test_0.dot(w_final)
final_rmse = round(rmse(y_test, y_pred_test), 2)

print("\n--- Final Model ---")
print(f"Training on combined set (seed 9), r={r_final}, 0-imputation.")
print(f"Final RMSE on the Test Dataset: {final_rmse}")

Final Result: The RMSE on the unseen test dataset was 5.15.

Conclusion

This project provided a deep dive into stable linear regression modeling. By testing imputation methods, regularization strengths, and data splits, we built a stable model that predicts car fuel efficiency with a test RMSE of about 5.15 MPG.

Tags: #DataScience #MachineLearning #Python #Regression #CodeTutorial #ZoomCamp #DataTalksClub
