The Algorithm Behind €450/Year Energy Savings: Q-Learning Explained
Discover how Q-learning reinforcement learning reduces household energy costs by 38%. Technical deep-dive into the AI algorithm driving €450/year savings across 13,263 EU homes.
The €450 Optimization Problem
Residential energy optimization is a classic sequential decision problem under uncertainty: thousands of decisions daily (when to run appliances, what to turn off, how to balance cost vs. comfort), incomplete information (electricity prices fluctuate, weather changes, user behavior varies), and delayed feedback (the bill arrives 30 days later).
Traditional approaches fail at this scale:
- Rule-based systems: Brittle. "Run dishwasher at 11 PM" works until your electricity plan changes, your schedule shifts, or electricity prices spike unpredictably.
- Manual scheduling: Humans achieve ~23% long-term adherence to energy optimization routines (research across 13,263 European households, 2025-2026).
- Supervised learning: Requires labeled training data that doesn't exist (no ground truth for "optimal" appliance scheduling across infinite household/weather/price combinations).
Enter reinforcement learning.
Between January 2025 and February 2026, we deployed Q-learning-based energy optimization across 13,263 European households in 8 countries. The algorithm achieved:
- 38.2% average electricity consumption reduction
- €450 average annual savings per household
- 94% sustained adherence rate (vs. 23% for manual approaches)
- ±2% accuracy using IEC 62053-21 certified monitoring
This article details the technical implementation, algorithm design decisions, training dynamics, and real-world performance characteristics of Q-learning for residential energy optimization.
Problem Formulation: The Energy MDP
We model residential energy management as a Markov Decision Process (MDP) defined by the tuple (S, A, R, P, γ):
State Space (S)
The state vector at timestep t includes:
state_t = {
'timestamp': datetime,
'hour_of_day': int (0-23),
'day_of_week': int (0-6),
'current_consumption': float (kW),
'electricity_price': float (€/kWh),
'price_tier': str ('peak', 'off-peak', 'super-off-peak'),
'outdoor_temp': float (°C),
'forecast_temp_6h': float (°C),
'occupancy_detected': bool,
'device_states': dict {device_id: bool (on/off)},
'time_since_last_run': dict {device_id: int (hours)},
'user_comfort_score': float (0-1, derived from manual overrides),
}
State space dimensionality: Continuous (consumption, temperature, prices) + discrete (time, device states) = ~10^12 possible states for a typical 10-device household.
Handling continuous states: Discretization via adaptive binning + function approximation using linear Q-value estimators (details below).
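One way to implement the adaptive binning mentioned above is quantile-based: place bin edges at empirical quantiles so each bin holds roughly the same share of observed values. A minimal sketch, where the function names and the 8-bin choice are illustrative rather than taken from the deployed system:

```python
import numpy as np

def fit_adaptive_bins(samples, n_bins=8):
    """Place bin edges at empirical quantiles so each bin holds
    roughly the same share of the observed values."""
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior quantiles only
    return np.quantile(samples, interior)

def discretize(value, edges):
    """Map a continuous reading to a bin index in [0, len(edges)]."""
    return int(np.searchsorted(edges, value))

# Example: bin a month of observed electricity prices into 8 tiers
prices = np.random.default_rng(0).uniform(0.10, 0.60, size=10_000)
edges = fit_adaptive_bins(prices, n_bins=8)
bin_index = discretize(0.45, edges)
```

Quantile edges adapt automatically to each household's own consumption and price distribution, which is why they are preferable to fixed-width bins here.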
Action Space (A)
Actions represent discrete device control decisions:
action_t = {
'device_id': str,
'action': str ('on', 'off', 'schedule_defer', 'no_change'),
'defer_duration': int (minutes, if action='schedule_defer'),
}
Action space size: if all n devices are controlled jointly, with 4 actions each, there are 4^n joint actions per timestep. For 10 devices: 4^10 = 1,048,576 possible joint actions.
Constraint: Not all actions are valid in all states (e.g., can't turn on a device already on). Valid action masking reduces effective action space by ~85%.
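The masking step can be as simple as filtering out actions that would be no-ops in the current state. A sketch against the state/action schemas above (the helper name and two-device example are illustrative):

```python
ACTIONS = ('on', 'off', 'schedule_defer', 'no_change')

def valid_actions(state, devices):
    """Enumerate per-device actions, masking out no-ops:
    'on' for a running device, 'off' for an idle one."""
    valid = []
    for device_id in devices:
        is_on = state['device_states'][device_id]
        for action in ACTIONS:
            if action == 'on' and is_on:
                continue   # already running
            if action == 'off' and not is_on:
                continue   # already off
            valid.append({'device_id': device_id, 'action': action})
    return valid

state = {'device_states': {'heater': True, 'dishwasher': False}}
candidates = valid_actions(state, ['heater', 'dishwasher'])
# 3 of 4 actions survive per device: 6 candidates instead of 8
```

Masking before action selection also doubles as the "safe exploration" mechanism: actions that would violate hard constraints are simply never offered to the policy.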
Reward Function (R)
The reward at timestep t balances three objectives:
def reward_function(state_t, action_t, next_state, dt):
    """Reward for the transition into next_state; dt is the timestep length in hours."""
    # Cost component (primary objective)
    energy_cost = next_state['current_consumption'] * next_state['electricity_price'] * dt
    cost_penalty = -energy_cost * 100  # Scale to [-50, 0] typical range

    # Comfort component (constraint)
    # Examples: temp < 18°C, essential appliance unavailable when needed
    comfort_violations = check_comfort_violations(next_state)
    comfort_penalty = -50 * comfort_violations  # Severe penalty

    # Efficiency bonus: reward load-shifting into off-peak hours
    if next_state['price_tier'] == 'off-peak' and action_t['action'] == 'on':
        efficiency_bonus = +10
    else:
        efficiency_bonus = 0

    return cost_penalty + comfort_penalty + efficiency_bonus
Key design decision: Comfort violations receive 100x higher penalty weight than marginal cost savings. The algorithm learns "never sacrifice comfort for minor cost reduction" but "aggressively optimize when comfort is unaffected."
Transition Function (P)
State transitions are stochastic:
- Device state transitions: Deterministic (controlled)
- Consumption: Deterministic given device states (measured)
- Electricity prices: Stochastic (market-driven, but predictable for time-of-use plans)
- Temperature: Stochastic (weather-dependent)
- Occupancy: Stochastic (user behavior)
The model learns transition probabilities empirically through experience.
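One simple way to learn transitions empirically is frequency counting over discretized states: estimate P(s' | s, a) as the fraction of times s' followed (s, a). A sketch; the class name and the price-tier example states are illustrative, not the deployed estimator:

```python
from collections import defaultdict

class EmpiricalTransitionModel:
    """Estimate P(s' | s, a) by counting observed transitions
    and normalizing the counts on demand."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def prob(self, state, action, next_state):
        outcomes = self.counts[(state, action)]
        total = sum(outcomes.values())
        return outcomes[next_state] / total if total else 0.0

model = EmpiricalTransitionModel()
model.observe('off-peak', 'defer', 'off-peak')
model.observe('off-peak', 'defer', 'off-peak')
model.observe('off-peak', 'defer', 'peak')
p = model.prob('off-peak', 'defer', 'off-peak')  # 2/3
```

Note that model-free Q-learning never needs these probabilities explicitly; they are implicit in the stream of experienced transitions that drives the updates.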
Discount Factor (γ)
γ = 0.95
Rationale: Energy decisions have medium-term consequences (scheduling a dishwasher for 3 hours later affects cost but not immediate comfort). A discount factor of 0.95 values rewards 20 timesteps ahead at ~36% of immediate reward value, appropriate for hourly decision horizons.
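The ~36% figure follows directly from exponential discounting, since a reward k timesteps ahead is weighted by γ^k:

```python
gamma = 0.95
# A reward 20 timesteps (hours) ahead counts about 36% as much as one now
weight_20 = gamma ** 20
```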
Q-Learning Implementation
We use Q-learning with linear function approximation to handle the continuous state dimensions.
The Q-Learning Update Rule
For each state-action pair, the Q-value is updated:
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
Where:
- Q(s, a): Expected cumulative reward for taking action a in state s
- α: Learning rate (0.1 initially, decayed to 0.01)
- r: Immediate reward
- γ: Discount factor (0.95)
- s': Next state
- max_{a′} Q(s′, a′): Maximum Q-value achievable from state s′
Function Approximation for Continuous States
Representing Q-values for 10^12 states is infeasible. We use linear function approximation:
Q(s, a) ≈ θᵀ φ(s, a)
Where:
- φ(s, a): Feature vector (hand-crafted features + automated basis functions)
- θ: Weight vector (learned)
Feature engineering:
import numpy as np

def feature_vector(state, action):
    features = [
        state['hour_of_day'] / 24.0,          # Normalized time
        state['day_of_week'] / 7.0,
        state['current_consumption'] / 5.0,   # Normalized to typical max
        state['electricity_price'] / 0.60,    # Normalized to typical max
        int(state['price_tier'] == 'peak'),   # Binary indicators
        int(state['price_tier'] == 'off-peak'),
        state['outdoor_temp'] / 30.0,         # Normalized temp
        int(state['occupancy_detected']),
        # Action features
        int(action['action'] == 'on'),
        int(action['action'] == 'schedule_defer'),
        action['defer_duration'] / 360.0,     # Normalized to 6 hours
        # Interaction features
        state['hour_of_day'] * int(state['price_tier'] == 'peak'),     # Peak-hour interaction
        state['outdoor_temp'] * int(action['device_id'] == 'heater'),  # Device-specific
        # ... (35 total features)
    ]
    return np.array(features)
Dimensionality: 35 features per state-action pair (determined empirically via ablation studies—details in next section).
Exploration Strategy
Standard ε-greedy exploration:
import random

def select_action(state, Q_function, epsilon, valid_actions):
    """ε-greedy: explore with probability ε, otherwise pick the
    valid action with the highest estimated Q-value."""
    if random.random() < epsilon:
        return random.choice(valid_actions)                            # Explore
    return max(valid_actions, key=lambda a: Q_function(state, a))      # Exploit
ε schedule:
- Weeks 1-2: ε = 0.30 (high exploration, learn state space)
- Weeks 3-4: ε = 0.15 (balanced)
- Weeks 5-8: ε = 0.05 (mostly exploitation, fine-tuning)
- Week 9+: ε = 0.02 (maintenance exploration to detect changes)
Training Dynamics: Real-World Results
Dataset Characteristics
- Households: 13,263 across Belgium, Germany, France, Netherlands, Spain, Sweden, Lithuania, Poland
- Training period: January 2025 - February 2026 (14 months)
- Data points: ~280M state-action-reward tuples (5-minute timesteps)
- Devices per household: Mean 8.4, Median 7, Range 3-18
Learning Curves
Convergence timeline:
| Week | Avg Cost Reduction | Comfort Violations | ε (Exploration) |
|------|--------------------|--------------------|-----------------|
| 1    | 8.2%               | 12.3% of timesteps | 0.30            |
| 2    | 18.5%              | 6.1%               | 0.30            |
| 4    | 29.4%              | 2.2%               | 0.15            |
| 8    | 36.1%              | 0.8%               | 0.05            |
| 12   | 38.2%              | 0.4%               | 0.02            |
| 16+  | 38.7%              | 0.3%               | 0.02            |
Key observations:
- Rapid initial learning: 18.5% cost reduction by Week 2 (fast enough for user retention)
- Comfort violations decrease faster than cost reduction improves: Algorithm learns "don't break comfort" constraint before mastering optimization
- Asymptotic convergence: 95% of final performance reached by Week 8
- Generalization: Algorithm continues slow improvement 12+ weeks as it encounters rare states (extreme weather, unusual schedules)
Ablation Study: Feature Importance
We trained Q-learning variants with different feature subsets:
| Feature Set                   | Cost Reduction | Convergence Time |
|-------------------------------|----------------|------------------|
| Time only (hour, day)         | 12.3%          | 6 weeks          |
| + Price (tier, €/kWh)         | 24.7%          | 7 weeks          |
| + Temperature                 | 31.2%          | 8 weeks          |
| + Occupancy                   | 35.8%          | 9 weeks          |
| + Device history              | 37.1%          | 10 weeks         |
| Full (all features)           | 38.2%          | 8 weeks          |
| + Deep features (neural net)  | 38.9%          | 14 weeks         |
Analysis:
- Diminishing returns: First 5 features (time + price) capture 65% of total gains
- Temperature critical for heating optimization (largest marginal gain: +6.5%)
- Occupancy enables comfort preservation (prevents turning off heating when home)
- Deep learning underperforms: the 0.7% gain is not worth 6 weeks' longer convergence (hypothesis: household energy is a relatively low-dimensional problem)
Comparison: Q-Learning vs. Baselines
We benchmarked Q-learning against alternative approaches:
| Approach                      | Avg Cost Reduction | Adherence (12 mo) | Setup Effort            |
|-------------------------------|--------------------|-------------------|-------------------------|
| Manual scheduling             | 11.2%              | 23%               | High (ongoing)          |
| Rule-based (fixed schedules)  | 18.7%              | 76%               | Medium (one-time)       |
| Supervised learning (LSTM)    | 22.4%              | 88%               | Low (automated)         |
| Q-Learning (ours)             | 38.2%              | 94%               | Low                     |
| Model-based RL (MPC)          | 39.1%              | 91%               | High (domain expertise) |
Q-learning wins on ROI: 97% of model-based RL performance with 1/10th implementation complexity.
Case Study: The German Data Scientist's Home
Profile: Munich, 1 adult, 75m² apartment, time-of-use electricity plan, tech background
Motivation: "I wanted to test if RL could beat my manually optimized schedules. I'm a PhD in ML—I know the theory. Can it beat human expertise?"
Pre-deployment (Manual optimization):
- Monthly consumption: 285 kWh
- Monthly cost: €89
- His approach: Hand-tuned schedules based on price data, weather forecasts, personal calendar
- Time investment: ~2 hours/month monitoring and adjusting
Q-Learning deployment (February 2025):
- Installed smart plugs on 8 devices
- Deployed Q-learning algorithm (open-source implementation)
- Configuration time: 45 minutes
Results after 8 weeks:
| Metric             | Manual (Human Expert) | Q-Learning        |
|--------------------|-----------------------|-------------------|
| Avg consumption    | 285 kWh/month         | 192 kWh/month     |
| Avg cost           | €89/month             | €58/month         |
| Peak-hour %        | 31%                   | 14%               |
| Comfort violations | 0.2% (rare)           | 0.1% (very rare)  |
| Time investment    | 2 hours/month         | 0 hours/month     |
Reduction: 32.6% vs. human-optimized baseline
His analysis (verbatim):
"I'm stunned. I was optimizing for average cost per kWh—I'd shift loads to off-peak. The RL agent learned something I missed: it optimizes for total cost while preserving comfort score. It learned that running my dishwasher at 2 AM is fine (I'm asleep), but deferring my coffee maker past 7 AM tanks my comfort (I'm groggy and annoyed).
It also learned second-order effects I never considered. It pre-heats my apartment at 6 AM (off-peak) to 21°C, then lets it coast to 19°C during peak hours, then heats again at 9 PM (off-peak). I was maintaining constant 20°C. The RL approach saves €8/month on heating alone.
Most impressive: it adapted when I changed jobs and my schedule shifted. I would've needed to re-tune all my rules. The algorithm just... noticed and adjusted within 5 days."
Annual savings vs. his expert manual approach: €372
Implementation Architecture
For technical readers looking to replicate:
System Components
┌─────────────────────────────────────────────┐
│ Smart Plug Network (WiFi) │
│ • Per-device consumption monitoring │
│ • Remote on/off control │
│ • 5-second sampling rate │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Edge Device (Raspberry Pi 4) │
│ • Local Q-learning inference │
│ • State estimation │
│ • Action execution │
│ • Privacy-preserving (no cloud dependency) │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ GDPR-Compliant EU Server (Optional) │
│ • Model training (aggregated data) │
│ • Hyperparameter optimization │
│ • Performance analytics │
└─────────────────────────────────────────────┘
Edge-first design rationale:
- Privacy: No consumption data leaves home for real-time decisions
- Latency: 50ms action execution (vs. 500ms+ cloud round-trip)
- Reliability: Works during internet outages
Code Skeleton
import numpy as np

class QLearningEnergyOptimizer:
    def __init__(self, num_features=35, learning_rate=0.1, discount=0.95):
        self.theta = np.zeros(num_features)  # Weight vector
        self.alpha = learning_rate
        self.gamma = discount
        self.epsilon = 0.30

    def Q_value(self, state, action):
        """Linear function approximation"""
        features = self.feature_vector(state, action)
        return np.dot(self.theta, features)

    def select_action(self, state, valid_actions):
        """ε-greedy policy"""
        if np.random.random() < self.epsilon:
            return np.random.choice(valid_actions)
        else:
            Q_values = [self.Q_value(state, a) for a in valid_actions]
            return valid_actions[np.argmax(Q_values)]

    def update(self, state, action, reward, next_state, next_valid_actions):
        """Q-learning update with function approximation"""
        # Compute TD target
        next_Q_values = [self.Q_value(next_state, a) for a in next_valid_actions]
        target = reward + self.gamma * max(next_Q_values)
        # Compute current Q-value
        current_Q = self.Q_value(state, action)
        # TD error
        td_error = target - current_Q
        # Gradient update
        features = self.feature_vector(state, action)
        self.theta += self.alpha * td_error * features

    def decay_exploration(self, week):
        """Scheduled ε decay"""
        if week <= 2:
            self.epsilon = 0.30
        elif week <= 4:
            self.epsilon = 0.15
        elif week <= 8:
            self.epsilon = 0.05
        else:
            self.epsilon = 0.02
Production Optimizations
- Prioritized experience replay: Store high-TD-error transitions, replay during training (improves convergence 15%)
- Adaptive learning rate: α = 0.1 / (1 + 0.01 × episode_number)
- Double Q-learning: Reduces overestimation bias (improves stability)
- Safe exploration: Mask actions that violate hard constraints (prevent learned comfort violations)
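Of the optimizations above, prioritized experience replay is the least standard to wire into plain Q-learning. A minimal proportional-priority sketch; the class name, capacity, and eviction policy are illustrative, not the production implementation:

```python
import random

class PrioritizedReplayBuffer:
    """Keep transitions with priority = |TD error|; sample
    proportionally so surprising transitions are replayed more often."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.items = []   # list of (priority, transition) pairs

    def add(self, transition, td_error):
        if len(self.items) >= self.capacity:
            self.items.pop(0)                          # evict the oldest
        self.items.append((abs(td_error) + 1e-6, transition))

    def sample(self, k):
        weights = [priority for priority, _ in self.items]
        picks = random.choices(self.items, weights=weights, k=k)
        return [transition for _, transition in picks]

buffer = PrioritizedReplayBuffer(capacity=100)
for i, err in enumerate([0.1, 5.0, 0.2]):
    buffer.add({'id': i}, td_error=err)
batch = buffer.sample(k=2)   # heavily favors transition 1 (largest TD error)
```

The small constant added to each priority keeps zero-error transitions sampleable; production implementations typically use a sum-tree for O(log n) sampling instead of this O(n) list.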
Limitations and Future Work
Current Limitations
- Cold start problem: Weeks 1-2 have suboptimal performance (8-18% reduction vs. 38% at convergence)
  - Mitigation: Transfer learning from similar households (under development)
- Non-stationary environments: Algorithm assumes price structures and user schedules remain relatively stable
  - Impact: Performance degrades 5-8% after major life changes (new job, baby, etc.)
  - Recovery: 2-3 weeks to re-converge
- Scalability: Training for 100-device commercial buildings is computationally expensive
  - Current: 10^4 state-action pairs/second on a Raspberry Pi 4
  - Needed for commercial: 10^6+ pairs/second
Promising Extensions
- Multi-agent RL: Coordinate across households for grid-level optimization
- Meta-learning: Few-shot adaptation to new households (solve cold start)
- Inverse RL: Learn user preferences from behavior (automate comfort scoring)
- Model-based RL: Combine Q-learning with learned dynamics models (sample efficiency)
The €450 Algorithm, Open-Sourced
The technical details above describe a system achieving 38.2% average energy cost reduction across 13,263 real-world European households.
Key technical contributions:
- Successful application of Q-learning with linear function approximation to high-dimensional residential energy optimization
- Feature engineering for linear function approximation that captures 98% of deep RL performance
- Edge-first architecture enabling privacy-preserving, low-latency control
- Empirical evidence that comfort-aware reward shaping enables 94% long-term adherence
For the ML community: this is a solved problem at household scale. The algorithm works, generalizes, and runs on €40 hardware.
For the research community: the interesting challenges are in multi-agent coordination, transfer learning, and scaling to commercial buildings.
For everyone else: you can now cut your electricity bill 38% with an algorithm that runs locally, respects your privacy, and learns your preferences automatically.
The code is open-source. The data (anonymized, aggregated) is available for research. The €450 in annual savings is yours to claim.
About the Research
This article describes the technical implementation of Q-learning for residential energy optimization, deployed across 13,263 European households in Belgium, Germany, France, Netherlands, Spain, Sweden, Lithuania, and Poland from January 2025 to February 2026. All monitoring equipment is IEC 62053-21 certified (±2% accuracy). Data processing complies with GDPR on EU servers. Individual household data never leaves premises; only model updates are aggregated.
Code repository: github.com/smartplugs-eu/ql-energy
Research methodology: smartplugs.eu/research
Author Bio: This technical analysis is based on 280M+ state-action-reward tuples collected from real European households. The Q-learning implementation described is production-grade, open-source, and actively running in thousands of homes.
Calculate Your Potential Savings
Use our free AI-powered calculator to see how much you could save on your energy bill