Simple Linear Regression (SLR)¶
📊 Overview¶
Simple Linear Regression is a statistical method used to model the relationship between a dependent variable (\(y\)) and a single independent variable (\(x\)). In hydrology, we'll use it to predict discharge based on rainfall.
🎯 Mathematical Foundation¶
The SLR Equation¶
Where: - \(y\) = Dependent variable (Discharge) - \(x\) = Independent variable (Rainfall) - \(\beta_0\) = Intercept (discharge when rainfall = 0) - \(\beta_1\) = Slope (change in discharge per unit rainfall) - \(\varepsilon\) = Random error term
Visual Representation¶
graph LR
A[Rainfall<br/>Input X] --> B[Linear Model<br/>y = β₀ + β₁x]
B --> C[Discharge<br/>Output Y]
D[Error ε] --> B
🔨 Implementation Steps¶
Step 1: Data Preparation¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('Runoff_Data.csv',
parse_dates=['Date'],
index_col='Date')
df = df.sort_values('Date')
Step 2: Define Inputs and Outputs¶
# For SLR, we use only Rainfall as input
X_slr = df[['Rainfall']] # DataFrame with shape (n, 1)
y_slr = df['Discharge'] # Series with shape (n,)
print(f"Input shape: {X_slr.shape}")
print(f"Output shape: {y_slr.shape}")
Why Double Brackets?
We use df[['Rainfall']] with double brackets to get a DataFrame (2D), not df['Rainfall'] which returns a Series (1D). Scikit-learn expects 2D input.
Step 3: Train-Test Split¶
# Split data: 80% training, 20% testing
X_slr_train, X_slr_test, y_slr_train, y_slr_test = train_test_split(
X_slr, y_slr,
test_size=0.2,
shuffle=False # Important for time series!
)
print(f"Training samples: {len(X_slr_train)}")
print(f"Testing samples: {len(X_slr_test)}")
Time Series Consideration
We set shuffle=False to maintain temporal order. Shuffling time series data can lead to data leakage and unrealistic performance estimates.
Step 4: Build and Train Model¶
# Initialize the model
slr_model = LinearRegression()
# Train the model
slr_model.fit(X_slr_train, y_slr_train)
# Make predictions
y_slr_pred = slr_model.predict(X_slr_test)
Step 5: Extract Model Parameters¶
# Get the fitted equation parameters
slope = slr_model.coef_[0]
intercept = slr_model.intercept_
print(f"Fitted Equation: Discharge = {intercept:.4f} + {slope:.4f} × Rainfall")
print(f"Interpretation:")
print(f" - Base discharge (no rain): {intercept:.4f} m³/s")
print(f" - Discharge increase per mm rain: {slope:.4f} m³/s")
📈 Model Evaluation¶
Performance Metrics Function¶
def evaluate_model(obs, sim):
"""
Calculate R², NSE, and PBIAS
"""
obs = np.array(obs)
sim = np.array(sim)
# R² (Coefficient of Determination)
r = np.corrcoef(obs, sim)[0, 1]
r2 = r ** 2
# NSE (Nash-Sutcliffe Efficiency)
nse = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)
# PBIAS (Percent Bias)
pbias = 100 * np.sum(obs - sim) / np.sum(obs)
return {'R²': r2, 'NSE': nse, 'PBIAS': pbias}
# Evaluate the model
results = evaluate_model(y_slr_test, y_slr_pred)
print(f"Model Performance:")
print(f" R² = {results['R²']:.3f}")
print(f" NSE = {results['NSE']:.3f}")
print(f" PBIAS = {results['PBIAS']:.2f}%")
Performance Interpretation¶
| Metric | Range | Good Performance | Your Model |
|---|---|---|---|
| R² | 0 to 1 | > 0.6 | Check results |
| NSE | -∞ to 1 | > 0.5 | Check results |
| PBIAS | -∞ to +∞ | -10% to +10% | Check results |
📊 Visualization¶
1. Scatter Plot with Regression Line¶
plt.figure(figsize=(10, 6))
# Training data
plt.scatter(X_slr_train, y_slr_train, alpha=0.5, s=10,
label='Training Data', color='blue')
# Test data
plt.scatter(X_slr_test, y_slr_test, alpha=0.5, s=10,
label='Test Data', color='green')
# Regression line
x_line = np.linspace(X_slr.min(), X_slr.max(), 100)
y_line = slr_model.predict(x_line)
plt.plot(x_line, y_line, 'r-', linewidth=2,
label=f'y = {intercept:.3f} + {slope:.3f}x')
plt.xlabel('Rainfall (mm)')
plt.ylabel('Discharge (m³/s)')
plt.title('Simple Linear Regression: Rainfall vs Discharge')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
2. Predicted vs Observed¶
plt.figure(figsize=(8, 8))
# Perfect prediction line
min_val = min(y_slr_test.min(), y_slr_pred.min())
max_val = max(y_slr_test.max(), y_slr_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--',
label='Perfect Prediction', alpha=0.7)
# Scatter plot
plt.scatter(y_slr_test, y_slr_pred, alpha=0.6, s=20)
# Add R² to plot
plt.text(0.05, 0.95, f'R² = {results["R²"]:.3f}',
transform=plt.gca().transAxes,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
plt.xlabel('Observed Discharge (m³/s)')
plt.ylabel('Predicted Discharge (m³/s)')
plt.title('Predicted vs Observed Discharge')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()
3. Time Series Comparison¶
plt.figure(figsize=(14, 6))
# Get test dates
test_dates = y_slr_test.index
plt.plot(test_dates, y_slr_test.values, 'b-', label='Observed', alpha=0.7)
plt.plot(test_dates, y_slr_pred, 'r-', label='Predicted', alpha=0.7)
plt.fill_between(test_dates, y_slr_test.values, y_slr_pred,
alpha=0.3, color='gray', label='Error')
plt.xlabel('Date')
plt.ylabel('Discharge (m³/s)')
plt.title('Time Series: Observed vs Predicted Discharge')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
4. Residual Analysis¶
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Calculate residuals
residuals = y_slr_test - y_slr_pred
# 1. Residuals vs Fitted
axes[0, 0].scatter(y_slr_pred, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
axes[0, 0].grid(True, alpha=0.3)
# 2. Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)
# 3. Histogram of Residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Histogram of Residuals')
axes[1, 0].grid(True, alpha=0.3)
# 4. Residuals over Time
axes[1, 1].plot(test_dates, residuals, 'o-', alpha=0.6)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals over Time')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
🎯 Model Limitations¶
SLR Limitations
- Linearity Assumption: Assumes a straight-line relationship
- Single Variable: Only considers rainfall, ignoring other factors
- No Lag Effects: Doesn't account for delayed response
- No Seasonality: Ignores seasonal patterns
- Homoscedasticity: Assumes constant variance of errors
💡 When to Use SLR¶
✅ Good for:¶
- Quick baseline models
- Understanding basic relationships
- When you have limited data
- Educational purposes
- Initial exploratory analysis
❌ Not ideal for:¶
- Complex hydrological systems
- Multiple influencing factors
- Non-linear relationships
- Capturing temporal dependencies
🚀 Improvements¶
To improve your SLR model, consider:
-
Log Transformation: If relationship is exponential
-
Polynomial Features: For curved relationships
-
Remove Outliers: Improve model stability
📚 Next Steps¶
Now that you understand SLR, proceed to: - Multiple Linear Regression - Add more variables - Feature Engineering - Create lag features - Neural Networks - Capture complex patterns