Multiple Linear Regression (MLR)¶
📊 Overview¶
Multiple Linear Regression extends Simple Linear Regression by incorporating multiple independent variables to predict the dependent variable. In hydrology, this allows us to consider rainfall, temperature, and other factors simultaneously.
🎯 Mathematical Foundation¶
The MLR Equation¶
\[
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \varepsilon
\]
Where:

- \(y\) = Dependent variable (Discharge)
- \(x_1, x_2, ..., x_n\) = Independent variables (Rainfall, Tmax, Tmin, lag features)
- \(\beta_0\) = Intercept
- \(\beta_1, \beta_2, ..., \beta_n\) = Coefficients for each variable
- \(\varepsilon\) = Error term
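To make the equation concrete, here is a minimal numeric sketch with made-up (not fitted) coefficients for three predictors; the numbers are purely illustrative:

```python
import numpy as np

# Toy illustration of the MLR equation (coefficients are invented, not fitted)
beta_0 = 2.0                        # intercept
beta = np.array([0.8, -0.3, 0.1])   # coefficients for Rainfall, Tmax, Tmin
x = np.array([12.0, 30.0, 18.0])    # one day's predictor values

# Deterministic part of the model: y = beta_0 + beta · x
y_hat = beta_0 + beta @ x
print(y_hat)  # 0.8*12 - 0.3*30 + 0.1*18 + 2.0 = 4.4
```

The fitted model does exactly this arithmetic for every row of the feature matrix; the error term \(\varepsilon\) is whatever the linear combination cannot explain.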
Visual Representation¶
graph LR
A[Rainfall] --> E[MLR Model]
B[Tmax] --> E
C[Tmin] --> E
D[Lag Features] --> E
E --> F[Discharge Prediction]
G[Error ε] --> E
🔨 Implementation¶
Step 1: Import Libraries and Load Data¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Load data with Date parsed as the datetime index
df = pd.read_csv('Runoff_Data.csv',
                 parse_dates=['Date'],
                 index_col='Date')
df = df.sort_index()  # 'Date' is the index, so sort by index, not by column
Step 2: Feature Engineering¶
# Separate inputs and outputs
INPUTS = df.iloc[:, :-1].copy()  # All columns except Discharge (.copy() avoids SettingWithCopyWarning)
OUTPUT = df.iloc[:, -1]          # Discharge column

# Create lag features based on CCF analysis
rain_lags = [1, 2, 3]
tmax_lags = [1, 2, 3]
tmin_lags = [1, 2, 3]

# Add lag features
for lag in rain_lags:
    INPUTS[f'Rainfall_lag{lag}'] = INPUTS['Rainfall'].shift(lag)
for lag in tmax_lags:
    INPUTS[f'Tmax_lag{lag}'] = INPUTS['Tmax'].shift(lag)
for lag in tmin_lags:
    INPUTS[f'Tmin_lag{lag}'] = INPUTS['Tmin'].shift(lag)

# Remove rows with NaN values introduced by shifting
FEATURES = INPUTS.dropna()
TARGET = OUTPUT.loc[FEATURES.index]

print(f"Features shape: {FEATURES.shape}")
print(f"Target shape: {TARGET.shape}")
print("\nFeature columns:")
print(FEATURES.columns.tolist())
Step 3: Train-Test Split¶
# Split data: 80% training, 20% testing
X_mlr_train, X_mlr_test, y_mlr_train, y_mlr_test = train_test_split(
    FEATURES, TARGET,
    test_size=0.2,
    shuffle=False  # Maintain temporal order
)

print(f"Training samples: {len(X_mlr_train)}")
print(f"Testing samples: {len(X_mlr_test)}")
Step 4: Build and Train MLR Model¶
# Initialize and train the model
mlr_model = LinearRegression()
mlr_model.fit(X_mlr_train, y_mlr_train)
# Make predictions
y_mlr_pred = mlr_model.predict(X_mlr_test)
print("Model trained successfully!")
Step 5: Extract Model Equation¶
# Get coefficients and intercept
feature_names = FEATURES.columns
coefficients = mlr_model.coef_
intercept = mlr_model.intercept_
# Create equation string
equation = f"Discharge = {intercept:.3f}"
for name, coef in zip(feature_names, coefficients):
    if coef >= 0:
        equation += f" + {coef:.4f}×{name}"
    else:
        equation += f" - {abs(coef):.4f}×{name}"

print("Fitted Equation:")
print("="*50)
print(equation)
print("="*50)

# Create a DataFrame for better visualization,
# sorted by absolute coefficient value to match the caption
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values('Coefficient', ascending=False, key=np.abs)

print("\nFeature Coefficients (sorted by magnitude):")
print(coef_df)
📈 Model Evaluation¶
Performance Metrics¶
def evaluate_model(obs, sim):
    """Calculate R², NSE, and PBIAS"""
    obs = np.array(obs)
    sim = np.array(sim)

    # R²: squared Pearson correlation
    r = np.corrcoef(obs, sim)[0, 1]
    r2 = r ** 2

    # NSE: Nash-Sutcliffe efficiency
    nse = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

    # PBIAS: percent bias
    pbias = 100 * np.sum(obs - sim) / np.sum(obs)

    return {'R²': r2, 'NSE': nse, 'PBIAS': pbias}

# Evaluate the model
results = evaluate_model(y_mlr_test, y_mlr_pred)

print("Model Performance:")
print(f"  R²    = {results['R²']:.4f}")
print(f"  NSE   = {results['NSE']:.4f}")
print(f"  PBIAS = {results['PBIAS']:.2f}%")
Performance Comparison Table¶
# Compare with SLR (example values; substitute your own SLR results)
comparison_df = pd.DataFrame({
    'Model': ['SLR (Rainfall only)', 'MLR (All features)'],
    'R²': [0.65, results['R²']],      # Example SLR values
    'NSE': [0.60, results['NSE']],
    'PBIAS': [5.2, results['PBIAS']]
})
print("\nModel Comparison:")
print(comparison_df)
📊 Visualizations¶
1. Predicted vs Observed Scatter Plot¶
plt.figure(figsize=(10, 8))
# Scatter plot
plt.scatter(y_mlr_test, y_mlr_pred, alpha=0.5, s=20, label='Predictions')
# Perfect prediction line
min_val = min(y_mlr_test.min(), y_mlr_pred.min())
max_val = max(y_mlr_test.max(), y_mlr_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--',
         label='Perfect Prediction', linewidth=2)
# Add metrics to plot
textstr = f'R² = {results["R²"]:.3f}\nNSE = {results["NSE"]:.3f}\nPBIAS = {results["PBIAS"]:.2f}%'
plt.text(0.05, 0.95, textstr, transform=plt.gca().transAxes,
         fontsize=12, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.xlabel('Observed Discharge (m³/s)', fontsize=12)
plt.ylabel('Predicted Discharge (m³/s)', fontsize=12)
plt.title('MLR: Predicted vs Observed Discharge', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
2. Time Series Comparison¶
plt.figure(figsize=(15, 6))
# Get test dates
test_dates = y_mlr_test.index
# Plot observed and predicted
plt.plot(test_dates, y_mlr_test.values, 'b-', label='Observed',
         linewidth=1.5, alpha=0.7)
plt.plot(test_dates, y_mlr_pred, 'r-', label='Predicted',
         linewidth=1.5, alpha=0.7)

# Fill between for error visualization
plt.fill_between(test_dates, y_mlr_test.values, y_mlr_pred,
                 alpha=0.2, color='gray')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Discharge (m³/s)', fontsize=12)
plt.title('MLR: Time Series Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
3. Feature Importance Visualization¶
# Create feature importance plot
plt.figure(figsize=(10, 8))
# Sort coefficients by absolute value
coef_sorted = coef_df.iloc[np.argsort(np.abs(coef_df['Coefficient'].values))[::-1]]
# Create horizontal bar plot
colors = ['green' if x > 0 else 'red' for x in coef_sorted['Coefficient']]
plt.barh(range(len(coef_sorted)), coef_sorted['Coefficient'], color=colors)
plt.yticks(range(len(coef_sorted)), coef_sorted['Feature'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Feature Importance in MLR Model', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
# Add vertical line at x=0
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()
4. Residual Analysis¶
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Calculate residuals
residuals = y_mlr_test - y_mlr_pred
# 1. Residuals vs Fitted
axes[0, 0].scatter(y_mlr_pred, residuals, alpha=0.5, s=20)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)
# 2. Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)
# 3. Histogram of Residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)
# Add normal distribution overlay
mu, std = residuals.mean(), residuals.std()
x = np.linspace(residuals.min(), residuals.max(), 100)
axes[1, 0].plot(x, stats.norm.pdf(x, mu, std) * len(residuals) * (residuals.max() - residuals.min()) / 30,
                'r-', linewidth=2, label='Normal')
axes[1, 0].legend()
# 4. Residuals over Time
axes[1, 1].plot(test_dates, residuals, 'o-', alpha=0.5, markersize=4)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals over Time')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)
plt.suptitle('MLR Model Diagnostics', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
🔍 Model Interpretation¶
Understanding Coefficients¶
# Analyze coefficient signs and magnitudes
print("Model Interpretation:")
print("="*50)
print(f"Intercept: {intercept:.3f}")
print("  → Base discharge when all features = 0")
print()

for name, coef in zip(feature_names, coefficients):
    unit = 'mm' if 'Rainfall' in name else '°C'  # Tmax and Tmin features are in °C
    direction = 'increases' if coef > 0 else 'decreases'
    print(f"{name}: {coef:.4f}")
    print(f"  → 1 {unit} increase in {name} {direction} discharge by {abs(coef):.4f} m³/s")
🎯 Advantages and Limitations¶
✅ Advantages of MLR¶
- Multiple Factors: Considers various influences simultaneously
- Interpretable: Clear understanding of each variable's contribution
- Fast: Quick to train and predict
- No Hyperparameters: Simple to implement
- Statistical Inference: Can provide confidence intervals
❌ Limitations¶
- Linearity Assumption: Assumes linear relationships
- Multicollinearity: Correlated features can cause issues
- No Interaction Capture: Doesn't model feature interactions automatically
- Sensitive to Outliers: Can be skewed by extreme values
- Assumes Independence: Residuals should be independent
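Multicollinearity is a real concern here because lags of the same variable are often highly correlated. A standard diagnostic is the variance inflation factor, \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing feature \(j\) on all the others; values above roughly 10 flag trouble. A small numpy-only sketch on synthetic data (in practice, pass `FEATURES`):

```python
import numpy as np
import pandas as pd

def vif(df):
    """Variance inflation factor per column: 1 / (1 - R²_j),
    with R²_j from regressing column j on the remaining columns."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Synthetic example: x3 is nearly a copy of x1, so both get large VIFs
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.05, size=300)
print(vif(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})))
```

If several lag features show inflated VIFs, dropping some of them or moving to Ridge regression (below) usually stabilizes the coefficients.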
🚀 Improvements and Extensions¶
1. Feature Selection¶
from sklearn.feature_selection import SelectKBest, f_regression
# Select top K features
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X_mlr_train, y_mlr_train)
# Get selected feature names
selected_features = FEATURES.columns[selector.get_support()].tolist()
print(f"Selected features: {selected_features}")
2. Regularization (Ridge/Lasso)¶
from sklearn.linear_model import Ridge, Lasso
# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_mlr_train, y_mlr_train)
# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_mlr_train, y_mlr_train)
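On synthetic data the characteristic difference between the two penalties is easy to see: Lasso drives irrelevant coefficients exactly to zero (doing feature selection for free), while Ridge only shrinks them. The data below is invented for illustration; apply the same pattern to `X_mlr_train` and `y_mlr_train`:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeroes the 8 noise features; Ridge keeps all 10 non-zero
print("Lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("Ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
```

In both cases `alpha` controls the penalty strength and should be tuned (e.g. with `RidgeCV`/`LassoCV` or cross-validation) rather than fixed at these example values.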
3. Polynomial Features¶
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_mlr_train)
📚 Next Steps¶
Now that you've mastered MLR:

1. Try Artificial Neural Networks for non-linear patterns
2. Experiment with more lag features
3. Apply regularization techniques
4. Compare with advanced models