Multiple Linear Regression (MLR)

📊 Overview

Multiple Linear Regression extends Simple Linear Regression by incorporating multiple independent variables to predict the dependent variable. In hydrology, this allows us to consider rainfall, temperature, and other factors simultaneously.

🎯 Mathematical Foundation

The MLR Equation

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon \]

Where:

  • \(y\) = Dependent variable (Discharge)
  • \(x_1, x_2, \dots, x_n\) = Independent variables (Rainfall, Tmax, Tmin, lag features)
  • \(\beta_0\) = Intercept
  • \(\beta_1, \beta_2, \dots, \beta_n\) = Coefficients for each variable
  • \(\varepsilon\) = Error term
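
As a quick numeric check, here is the equation evaluated for a hypothetical two-predictor case (all coefficient values are illustrative, not fitted):

import numpy as np

# Hypothetical model: y = 2.5 + 0.8*Rainfall - 0.3*Tmax (illustrative values only)
beta_0 = 2.5                    # intercept (m³/s)
betas = np.array([0.8, -0.3])   # coefficients for Rainfall (mm) and Tmax (°C)
x = np.array([12.0, 31.0])      # one day's Rainfall and Tmax

y_hat = beta_0 + betas @ x      # 2.5 + 9.6 - 9.3 = 2.80 (error term omitted)
print(f"Predicted discharge: {y_hat:.2f} m³/s")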

Visual Representation

graph LR
    A[Rainfall] --> E[MLR Model]
    B[Tmax] --> E
    C[Tmin] --> E
    D[Lag Features] --> E
    E --> F[Discharge Prediction]
    G[Error ε] --> E

🔨 Implementation

Step 1: Import Libraries and Load Data

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('Runoff_Data.csv', 
                 parse_dates=['Date'], 
                 index_col='Date')
df = df.sort_index()  # 'Date' is the index after read_csv, so sort by index

Step 2: Feature Engineering

# Separate inputs and output
INPUTS = df.iloc[:, :-1].copy()  # all columns except Discharge (.copy() avoids SettingWithCopyWarning)
OUTPUT = df.iloc[:, -1]          # Discharge column

# Create lag features based on CCF analysis
rain_lags = [1, 2, 3]
tmax_lags = [1, 2, 3]
tmin_lags = [1, 2, 3]

# Add lag features
for lag in rain_lags:
    INPUTS[f'Rainfall_lag{lag}'] = INPUTS['Rainfall'].shift(lag)

for lag in tmax_lags:
    INPUTS[f'Tmax_lag{lag}'] = INPUTS['Tmax'].shift(lag)

for lag in tmin_lags:
    INPUTS[f'Tmin_lag{lag}'] = INPUTS['Tmin'].shift(lag)

# Remove rows with NaN values
FEATURES = INPUTS.dropna()
TARGET = OUTPUT.loc[FEATURES.index]

print(f"Features shape: {FEATURES.shape}")
print(f"Target shape: {TARGET.shape}")
print(f"\nFeature columns:")
print(FEATURES.columns.tolist())

Step 3: Train-Test Split

# Split data: 80% training, 20% testing
X_mlr_train, X_mlr_test, y_mlr_train, y_mlr_test = train_test_split(
    FEATURES, TARGET, 
    test_size=0.2, 
    shuffle=False  # Maintain temporal order
)

print(f"Training samples: {len(X_mlr_train)}")
print(f"Testing samples: {len(X_mlr_test)}")

Step 4: Build and Train MLR Model

# Initialize and train the model
mlr_model = LinearRegression()
mlr_model.fit(X_mlr_train, y_mlr_train)

# Make predictions
y_mlr_pred = mlr_model.predict(X_mlr_test)

print("Model trained successfully!")

Step 5: Extract Model Equation

# Get coefficients and intercept
feature_names = FEATURES.columns
coefficients = mlr_model.coef_
intercept = mlr_model.intercept_

# Create equation string
equation = f"Discharge = {intercept:.3f}"
for name, coef in zip(feature_names, coefficients):
    if coef >= 0:
        equation += f" + {coef:.4f}×{name}"
    else:
        equation += f" - {abs(coef):.4f}×{name}"

print("Fitted Equation:")
print("="*50)
print(equation)
print("="*50)

# Create a DataFrame for better visualization
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values('Coefficient', key=lambda s: s.abs(), ascending=False)

print("\nFeature Coefficients (sorted by magnitude):")
print(coef_df)

📈 Model Evaluation

Performance Metrics
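
For observed values \(O_i\), simulated values \(S_i\), and observed mean \(\bar{O}\), the function below computes:

\[ NSE = 1 - \frac{\sum_i (O_i - S_i)^2}{\sum_i (O_i - \bar{O})^2}, \qquad PBIAS = 100 \cdot \frac{\sum_i (O_i - S_i)}{\sum_i O_i} \]

with \(R^2\) taken as the squared Pearson correlation between the observed and simulated series.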

def evaluate_model(obs, sim):
    """Calculate R², NSE, and PBIAS"""
    obs = np.array(obs)
    sim = np.array(sim)

    # R² as the squared Pearson correlation (note: this differs from
    # sklearn's r2_score, which is the coefficient of determination)
    r = np.corrcoef(obs, sim)[0, 1]
    r2 = r ** 2

    # NSE
    nse = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

    # PBIAS
    pbias = 100 * np.sum(obs - sim) / np.sum(obs)

    return {'R²': r2, 'NSE': nse, 'PBIAS': pbias}

# Evaluate the model
results = evaluate_model(y_mlr_test, y_mlr_pred)

print("Model Performance:")
print(f"  R² = {results['R²']:.4f}")
print(f"  NSE = {results['NSE']:.4f}")
print(f"  PBIAS = {results['PBIAS']:.2f}%")

Performance Comparison Table

# Compare with SLR (the SLR row uses illustrative values; substitute your own results)
comparison_df = pd.DataFrame({
    'Model': ['SLR (Rainfall only)', 'MLR (All features)'],
    'R²': [0.65, results['R²']],      # illustrative SLR value
    'NSE': [0.60, results['NSE']],    # illustrative SLR value
    'PBIAS': [5.2, results['PBIAS']]  # illustrative SLR value
})

print("\nModel Comparison:")
print(comparison_df)

📊 Visualizations

1. Predicted vs Observed Scatter Plot

plt.figure(figsize=(10, 8))

# Scatter plot
plt.scatter(y_mlr_test, y_mlr_pred, alpha=0.5, s=20, label='Predictions')

# Perfect prediction line
min_val = min(y_mlr_test.min(), y_mlr_pred.min())
max_val = max(y_mlr_test.max(), y_mlr_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', 
         label='Perfect Prediction', linewidth=2)

# Add metrics to plot
textstr = f'R² = {results["R²"]:.3f}\nNSE = {results["NSE"]:.3f}\nPBIAS = {results["PBIAS"]:.2f}%'
plt.text(0.05, 0.95, textstr, transform=plt.gca().transAxes,
         fontsize=12, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.xlabel('Observed Discharge (m³/s)', fontsize=12)
plt.ylabel('Predicted Discharge (m³/s)', fontsize=12)
plt.title('MLR: Predicted vs Observed Discharge', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

2. Time Series Comparison

plt.figure(figsize=(15, 6))

# Get test dates
test_dates = y_mlr_test.index

# Plot observed and predicted
plt.plot(test_dates, y_mlr_test.values, 'b-', label='Observed', 
         linewidth=1.5, alpha=0.7)
plt.plot(test_dates, y_mlr_pred, 'r-', label='Predicted', 
         linewidth=1.5, alpha=0.7)

# Fill between for error visualization
plt.fill_between(test_dates, y_mlr_test.values, y_mlr_pred, 
                 alpha=0.2, color='gray')

plt.xlabel('Date', fontsize=12)
plt.ylabel('Discharge (m³/s)', fontsize=12)
plt.title('MLR: Time Series Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

3. Feature Importance Visualization

# Create feature importance plot
plt.figure(figsize=(10, 8))

# Sort coefficients by absolute value
coef_sorted = coef_df.iloc[np.argsort(np.abs(coef_df['Coefficient'].values))[::-1]]

# Create horizontal bar plot
colors = ['green' if x > 0 else 'red' for x in coef_sorted['Coefficient']]
plt.barh(range(len(coef_sorted)), coef_sorted['Coefficient'], color=colors)
plt.yticks(range(len(coef_sorted)), coef_sorted['Feature'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Feature Importance in MLR Model', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')

# Add vertical line at x=0
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()

4. Residual Analysis

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Calculate residuals
residuals = y_mlr_test - y_mlr_pred

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_mlr_pred, residuals, alpha=0.5, s=20)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)

# 2. Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Histogram of Residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)

# Add normal distribution overlay
mu, std = residuals.mean(), residuals.std()
x = np.linspace(residuals.min(), residuals.max(), 100)
axes[1, 0].plot(x, stats.norm.pdf(x, mu, std) * len(residuals) * (residuals.max() - residuals.min()) / 30, 
                'r-', linewidth=2, label='Normal')
axes[1, 0].legend()

# 4. Residuals over Time
axes[1, 1].plot(test_dates, residuals, 'o-', alpha=0.5, markersize=4)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals over Time')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.suptitle('MLR Model Diagnostics', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

🔍 Model Interpretation

Understanding Coefficients

# Analyze coefficient signs and magnitudes
print("Model Interpretation:")
print("="*50)
print(f"Intercept: {intercept:.3f}")
print("  → Base discharge when all features = 0")
print()

# Each coefficient is the marginal effect of one predictor, holding the others constant
for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:.4f}")
    if 'Rainfall' in name:
        if coef > 0:
            print(f"  → 1 mm increase in {name} increases discharge by {coef:.4f} m³/s")
        else:
            print(f"  → 1 mm increase in {name} decreases discharge by {abs(coef):.4f} m³/s")
    elif 'Tmax' in name or 'Tmin' in name:
        if coef > 0:
            print(f"  → 1°C increase in {name} increases discharge by {coef:.4f} m³/s")
        else:
            print(f"  → 1°C increase in {name} decreases discharge by {abs(coef):.4f} m³/s")

🎯 Advantages and Limitations

✅ Advantages of MLR

  • Multiple Factors: Considers various influences simultaneously
  • Interpretable: Clear understanding of each variable's contribution
  • Fast: Quick to train and predict
  • No Hyperparameters: Simple to implement
  • Statistical Inference: Can provide confidence intervals (see the statsmodels sketch below)
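
Since scikit-learn's LinearRegression does not report standard errors, a common way to get confidence intervals is statsmodels' OLS. A minimal sketch, assuming the X_mlr_train / y_mlr_train split from Step 3:

import statsmodels.api as sm

# Refit the same regression with statsmodels to obtain inference statistics
X_sm = sm.add_constant(X_mlr_train)        # adds the intercept column
ols_model = sm.OLS(y_mlr_train, X_sm).fit()

print(ols_model.summary())                 # coefficients, std errors, p-values
print(ols_model.conf_int(alpha=0.05))      # 95% confidence interval per coefficient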

❌ Limitations

  • Linearity Assumption: Assumes linear relationships
  • Multicollinearity: Correlated features can cause issues (see the VIF check after this list)
  • No Interaction Capture: Doesn't model feature interactions automatically
  • Sensitive to Outliers: Can be skewed by extreme values
  • Assumes Independence: Residuals should be independent
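
Lagged weather variables are often strongly correlated with one another, so a variance inflation factor (VIF) check is a useful diagnostic. A minimal sketch using statsmodels (VIF above roughly 5–10 is a common rule of thumb for problematic collinearity):

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# VIF is computed on the design matrix including the intercept
X_vif = sm.add_constant(X_mlr_train)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
    name='VIF'
)
print(vif.drop('const').sort_values(ascending=False))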

🚀 Improvements and Extensions

1. Feature Selection

from sklearn.feature_selection import SelectKBest, f_regression

# Select top K features
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X_mlr_train, y_mlr_train)

# Get selected feature names
selected_features = FEATURES.columns[selector.get_support()].tolist()
print(f"Selected features: {selected_features}")

2. Regularization (Ridge/Lasso)

from sklearn.linear_model import Ridge, Lasso

# Ridge Regression (L2 penalty; alpha sets the regularization strength)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_mlr_train, y_mlr_train)

# Lasso Regression (L1 penalty; can shrink coefficients exactly to zero)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_mlr_train, y_mlr_train)
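
Both penalties are sensitive to feature scale, so it is common to standardize inside a pipeline. A minimal sketch (the alpha value here is illustrative; tune it, e.g. with cross-validation):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit the penalized regression
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipe.fit(X_mlr_train, y_mlr_train)
print(f"Ridge test R²: {ridge_pipe.score(X_mlr_test, y_mlr_test):.4f}")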

3. Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Create degree-2 polynomial and interaction terms
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_mlr_train)
X_poly_test = poly.transform(X_mlr_test)  # apply the same transformation to the test set

# Fit a linear model on the expanded feature set
poly_model = LinearRegression().fit(X_poly, y_mlr_train)

📚 Next Steps

Now that you've mastered MLR:

  1. Try Artificial Neural Networks for non-linear patterns
  2. Experiment with more lag features
  3. Apply regularization techniques
  4. Compare with advanced models