Feature Engineering for Hydrological Models¶
🎯 Overview¶
Feature engineering is the process of creating new input variables (features) from existing data to improve model performance. In hydrology, this often involves capturing temporal dependencies and seasonal patterns.
🌊 Why Feature Engineering Matters in Hydrology¶
Hydrological systems have memory: - Rainfall today doesn't immediately become runoff - Soil moisture affects future discharge - Groundwater responds slowly to precipitation - Temperature influences evapotranspiration with delays
By engineering features that capture these relationships, we can significantly improve model predictions.
📊 Cross-Correlation Analysis¶
Understanding Cross-Correlation Function (CCF)¶
The CCF measures how past values of one variable relate to current values of another.
Mathematical Formula¶
Where: - \(x_t\) = Independent variable (e.g., rainfall) at time \(t\) - \(y_t\) = Dependent variable (e.g., discharge) at time \(t\) - \(k\) = Lag value - \(\bar{x}\), \(\bar{y}\) = Means of respective variables
Significance Threshold¶
Where \(N\) is the number of observations.
Implementation¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_ccf
def analyze_cross_correlation(df, max_lags=12):
"""
Compute and plot cross-correlation between inputs and discharge
Parameters:
-----------
df : DataFrame
Data with columns for inputs and 'Discharge'
max_lags : int
Maximum number of lags to analyze
Returns:
--------
dict : Significant lags for each variable
"""
# Separate inputs and output
inputs = df.drop('Discharge', axis=1)
output = df['Discharge']
# Store significant lags
significant_lags = {}
# Create subplots
fig, axes = plt.subplots(len(inputs.columns), 1,
figsize=(10, 3*len(inputs.columns)))
if len(inputs.columns) == 1:
axes = [axes]
# Analyze each input variable
for i, col in enumerate(inputs.columns):
# Remove mean (required for CCF)
var1 = inputs[col] - inputs[col].mean()
var2 = output - output.mean()
# Plot CCF
plot_ccf(var1, var2, lags=max_lags, ax=axes[i])
axes[i].set_title(f'CCF: {col} vs Discharge')
axes[i].set_xlabel('Lag (days)')
axes[i].set_ylabel('Correlation')
# Find significant lags (simplified approach)
n = len(var1)
significance_level = 1.96 / np.sqrt(n)
# You can extract significant lags programmatically here
# For now, we'll note them visually from the plot
plt.tight_layout()
plt.show()
return significant_lags
# Example usage
df = pd.read_csv('Runoff_Data.csv', parse_dates=['Date'], index_col='Date')
analyze_cross_correlation(df)
🔄 Lag Features¶
What are Lag Features?¶
Lag features are past values of variables used as inputs. They help models "remember" previous conditions.
Example: Creating Lag Features¶
| Date | Rainfall | Rainfall_lag1 | Rainfall_lag2 | Rainfall_lag3 |
|---|---|---|---|---|
| Day 1 | 10.5 | NaN | NaN | NaN |
| Day 2 | 5.2 | 10.5 | NaN | NaN |
| Day 3 | 0.0 | 5.2 | 10.5 | NaN |
| Day 4 | 15.3 | 0.0 | 5.2 | 10.5 |
| Day 5 | 8.7 | 15.3 | 0.0 | 5.2 |
Implementation¶
def create_lag_features(df, variable_lags):
"""
Create lagged features for specified variables
Parameters:
-----------
df : DataFrame
Original data
variable_lags : dict
Dictionary mapping variable names to list of lag values
Example: {'Rainfall': [1, 2, 3], 'Tmax': [1, 2]}
Returns:
--------
DataFrame : Data with added lag features
"""
df_features = df.copy()
for variable, lags in variable_lags.items():
for lag in lags:
df_features[f'{variable}_lag{lag}'] = df[variable].shift(lag)
return df_features
# Example usage
df = pd.read_csv('Runoff_Data.csv', parse_dates=['Date'], index_col='Date')
# Define lags based on CCF analysis
lag_config = {
'Rainfall': [1, 2, 3],
'Tmax': [1, 2, 3],
'Tmin': [1, 2, 3]
}
# Create features
df_with_lags = create_lag_features(df, lag_config)
# Remove rows with NaN (from lagging)
df_clean = df_with_lags.dropna()
print(f"Original shape: {df.shape}")
print(f"After adding lags: {df_with_lags.shape}")
print(f"After removing NaN: {df_clean.shape}")
print("\nNew features created:")
print([col for col in df_with_lags.columns if 'lag' in col])
📈 Rolling Statistics¶
Moving Averages and Sums¶
Rolling statistics smooth out short-term fluctuations and highlight longer-term trends.
def add_rolling_features(df, windows=[3, 7, 14, 30]):
"""
Add rolling statistics as features
Parameters:
-----------
df : DataFrame
Original data
windows : list
Window sizes for rolling calculations
Returns:
--------
DataFrame : Data with rolling features
"""
df_features = df.copy()
for window in windows:
# Rolling mean
df_features[f'Rainfall_mean_{window}d'] = (
df['Rainfall'].rolling(window=window, min_periods=1).mean()
)
# Rolling sum (cumulative rainfall)
df_features[f'Rainfall_sum_{window}d'] = (
df['Rainfall'].rolling(window=window, min_periods=1).sum()
)
# Rolling max temperature
df_features[f'Tmax_max_{window}d'] = (
df['Tmax'].rolling(window=window, min_periods=1).max()
)
# Rolling standard deviation (variability)
df_features[f'Rainfall_std_{window}d'] = (
df['Rainfall'].rolling(window=window, min_periods=1).std()
)
return df_features
# Example usage
df_rolling = add_rolling_features(df)
🗓️ Temporal Features¶
Extracting Time-Based Information¶
def add_temporal_features(df):
"""
Add temporal features from date index
Parameters:
-----------
df : DataFrame
Data with DatetimeIndex
Returns:
--------
DataFrame : Data with temporal features
"""
df_features = df.copy()
# Basic temporal features
df_features['day_of_year'] = df.index.dayofyear
df_features['month'] = df.index.month
df_features['quarter'] = df.index.quarter
df_features['week_of_year'] = df.index.isocalendar().week
# Cyclical encoding for month (captures seasonality)
df_features['month_sin'] = np.sin(2 * np.pi * df.index.month / 12)
df_features['month_cos'] = np.cos(2 * np.pi * df.index.month / 12)
# Cyclical encoding for day of year
df_features['day_sin'] = np.sin(2 * np.pi * df.index.dayofyear / 365)
df_features['day_cos'] = np.cos(2 * np.pi * df.index.dayofyear / 365)
# Season indicator
seasons = {1: 'Winter', 2: 'Winter', 3: 'Spring',
4: 'Spring', 5: 'Spring', 6: 'Summer',
7: 'Summer', 8: 'Summer', 9: 'Fall',
10: 'Fall', 11: 'Fall', 12: 'Winter'}
df_features['season'] = df.index.month.map(seasons)
# One-hot encode season
season_dummies = pd.get_dummies(df_features['season'], prefix='season')
df_features = pd.concat([df_features, season_dummies], axis=1)
return df_features
# Example usage
df_temporal = add_temporal_features(df)
🧮 Interaction Features¶
Creating Feature Combinations¶
def add_interaction_features(df):
"""
Create interaction features between variables
"""
df_features = df.copy()
# Rainfall-Temperature interaction
df_features['Rain_Temp_interaction'] = df['Rainfall'] * df['Tmax']
# Temperature range
df_features['Temp_range'] = df['Tmax'] - df['Tmin']
# Antecedent Precipitation Index (API)
# Weighted sum of past rainfall
weights = np.array([0.5, 0.3, 0.2]) # Decreasing weights
for i, w in enumerate(weights, 1):
if f'Rainfall_lag{i}' in df_features.columns:
if i == 1:
df_features['API'] = w * df_features[f'Rainfall_lag{i}']
else:
df_features['API'] += w * df_features[f'Rainfall_lag{i}']
return df_features
🔬 Domain-Specific Features¶
Hydrological Indices¶
def add_hydrological_features(df):
"""
Add hydrology-specific features
"""
df_features = df.copy()
# Baseflow index (using simple filter)
# This is a simplified approach
alpha = 0.925 # Filter parameter
baseflow = [df['Discharge'].iloc[0]]
for i in range(1, len(df)):
bf = alpha * baseflow[-1] + (1 - alpha) * df['Discharge'].iloc[i]
baseflow.append(min(bf, df['Discharge'].iloc[i]))
df_features['Baseflow'] = baseflow
df_features['Quickflow'] = df['Discharge'] - df_features['Baseflow']
# Antecedent discharge (previous day's discharge)
df_features['Discharge_lag1'] = df['Discharge'].shift(1)
# Rate of change in discharge
df_features['Discharge_change'] = df['Discharge'].diff()
# Cumulative rainfall over season
df_features['Cumulative_rainfall'] = df.groupby(df.index.year)['Rainfall'].cumsum()
return df_features
📊 Complete Feature Engineering Pipeline¶
def complete_feature_engineering(df):
"""
Complete feature engineering pipeline for hydrological data
"""
print("Starting feature engineering...")
print(f"Original shape: {df.shape}")
# Step 1: Create lag features
lag_config = {
'Rainfall': [1, 2, 3],
'Tmax': [1, 2],
'Tmin': [1, 2],
'Discharge': [1, 2] # Auto-regressive component
}
df_features = create_lag_features(df, lag_config)
print(f"After lag features: {df_features.shape}")
# Step 2: Add rolling statistics
df_features = add_rolling_features(df_features, windows=[3, 7, 14])
print(f"After rolling features: {df_features.shape}")
# Step 3: Add temporal features
df_features = add_temporal_features(df_features)
print(f"After temporal features: {df_features.shape}")
# Step 4: Add interaction features
df_features = add_interaction_features(df_features)
print(f"After interaction features: {df_features.shape}")
# Step 5: Add hydrological features
df_features = add_hydrological_features(df_features)
print(f"After hydrological features: {df_features.shape}")
# Step 6: Remove rows with NaN
df_clean = df_features.dropna()
print(f"After removing NaN: {df_clean.shape}")
# Step 7: Separate features and target
target = df_clean['Discharge']
features = df_clean.drop(['Discharge', 'season'], axis=1) # Remove non-numeric
print(f"\nFinal features shape: {features.shape}")
print(f"Final target shape: {target.shape}")
print(f"\nTotal features created: {len(features.columns)}")
return features, target
# Run the complete pipeline
features, target = complete_feature_engineering(df)
✅ Feature Selection¶
Importance-Based Selection¶
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
def select_important_features(features, target, n_features=20):
"""
Select most important features using Random Forest
"""
# Split data
X_train, X_test, y_train, y_test = train_test_split(
features, target, test_size=0.2, shuffle=False
)
# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Get feature importances
importances = pd.DataFrame({
'feature': features.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top features
plt.figure(figsize=(10, 6))
plt.barh(importances['feature'][:n_features],
importances['importance'][:n_features])
plt.xlabel('Importance')
plt.title(f'Top {n_features} Most Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# Return top features
top_features = importances['feature'][:n_features].tolist()
return top_features, importances
# Select top features
top_features, importance_df = select_important_features(features, target)
🎯 Best Practices¶
- Start Simple: Begin with basic lag features before complex transformations
- Domain Knowledge: Use hydrological understanding to guide feature creation
- Avoid Leakage: Don't use future information in features
- Handle Missing Data: Decide strategy before creating features
- Scale Features: Normalize/standardize for neural networks
- Validate Impact: Test if new features actually improve performance
- Document Features: Keep track of what each feature represents
⚠️ Common Pitfalls¶
Avoid These Mistakes
- Data Leakage: Using future values to predict past (shuffle=False for time series!)
- Too Many Features: Can lead to overfitting (curse of dimensionality)
- Highly Correlated Features: Remove redundant features
- Not Checking Feature Distributions: Outliers can dominate models
- Ignoring Physical Constraints: Features should make hydrological sense
📊 Feature Visualization¶
def visualize_features(df_features, target_col='Discharge', n_features=6):
"""
Visualize relationships between features and target
"""
import seaborn as sns
# Select features to plot
feature_cols = [col for col in df_features.columns if col != target_col][:n_features]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for idx, col in enumerate(feature_cols):
axes[idx].scatter(df_features[col], df_features[target_col],
alpha=0.5, s=10)
axes[idx].set_xlabel(col)
axes[idx].set_ylabel(target_col)
axes[idx].set_title(f'{col} vs {target_col}')
# Add trend line
z = np.polyfit(df_features[col].dropna(),
df_features.loc[df_features[col].notna(), target_col], 1)
p = np.poly1d(z)
axes[idx].plot(df_features[col].sort_values(),
p(df_features[col].sort_values()),
"r--", alpha=0.8)
plt.tight_layout()
plt.show()
# Visualize feature relationships
visualize_features(df_with_lags)
🚀 Next Steps¶
Now that you've mastered feature engineering:
- Apply these features to Multiple Linear Regression
- Use them in Artificial Neural Networks
- Experiment with different lag configurations
- Try feature selection techniques
📚 Additional Resources¶
- Time Series Feature Engineering
- Feature Engineering for Machine Learning
- Hydrological Feature Engineering Papers