Version 1.0.0 Changelog

Release Date: January 2026
Status: Stable Release
Theme: Complete MARL Framework

Major Features

MARL Environment System

# MARLEnv - Custom RL environment with CARLA integration
from opencda_marl.envs.marl_env import MARLEnv

env = MARLEnv(scenario_manager, config=marl_config)
# Training loop handled by coordinator

  • Observation System: Configurable feature extraction via ObservationExtractor
  • Reward Calculation: Multi-objective rewards (collision, success, progress, safety, speed)
  • Termination Logic: Episode ends on collision, completion, or timeout (sketched after this list)
  • Evaluation Metrics: Cross-agent performance comparison
  • CARLA Integration: Direct connection without Gym dependency
  • SUMO Mode: Lightweight traffic-only simulation via SumoMarlEnv
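
The termination rule above is simple enough to sketch directly; the flag names here are illustrative, not the shipped API:

# Sketch of the termination logic; argument names are assumptions
def is_episode_done(collided: bool, reached_destination: bool,
                    step: int, max_steps: int) -> bool:
    """Episode ends on collision, route completion, or timeout."""
    return collided or reached_destination or step >= max_steps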

Multi-Agent Framework

# Agent Factory - Centralized agent creation
from opencda_marl.core.agents.agent_factory import AgentFactory

# Five implemented agent types
behavior_agent = AgentFactory.create_agent("behavior", config)
vanilla_agent = AgentFactory.create_agent("vanilla", config)
rule_based_agent = AgentFactory.create_agent("rule_based", config)
marl_agent = AgentFactory.create_agent("marl", config)
basic_agent = AgentFactory.create_agent("basic", config)  # "basic" identifier assumed from the Basic Agent entry below

  • MARLAgent: RL controls speed while the local planner handles steering; returns a (speed, location) tuple
  • Behavior Agent: Simplified OpenCDA behavior cloning with route following
  • Vanilla Agent: Enhanced safety with multi-vehicle TTC tracking
  • Rule-based Agent: 3-stage intersection navigation (junction → following → cruising)
  • Basic Agent: Full autonomous driving with traffic light and obstacle detection
  • Vehicle Adapters: Bridge OpenCDA VehicleManager with MARL agent control (adapter idea sketched below)
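
A minimal sketch of the adapter idea, assuming hypothetical method names on both sides (the shipped adapter API is not spelled out in this changelog):

# Illustrative adapter; method names below are assumptions
class VehicleAdapter:
    """Bridges an OpenCDA VehicleManager with a MARL agent's (speed, location) output."""
    def __init__(self, vehicle_manager, agent):
        self.vm = vehicle_manager
        self.agent = agent

    def step(self, observation):
        # The MARL agent decides the target speed; the local planner handles steering
        target_speed, target_location = self.agent.run_step(observation)  # assumed signature
        return self.vm.run_step(target_speed, target_location)            # assumed signature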

RL Algorithm Suite

# TD3 - Continuous control with LSTM encoder
MARL:
  algorithm: "td3"
  state_dim: 8
  action_dim: 1
  td3:
    learning_rate_actor: 0.0001
    learning_rate_critic: 0.001
    exploration_noise: 0.3
    noise_decay: 0.998
    min_noise: 0.05
    warmup_steps: 1000
    lstm_hidden: 256

Key features: LSTM multi-agent context encoding, LayerNorm before tanh, delayed policy updates, prioritized experience replay (optional), smart replay buffer with recency bias.
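
The LayerNorm-before-tanh detail matters because tanh saturates for large pre-activations; a minimal PyTorch sketch of such an actor head (illustrative, not the shipped network):

# Sketch: normalize the encoder output before the tanh-bounded action head
import torch
import torch.nn as nn

class ActorHead(nn.Module):
    def __init__(self, lstm_hidden=256, action_dim=1):
        super().__init__()
        self.norm = nn.LayerNorm(lstm_hidden)         # keeps pre-activations well-scaled
        self.fc = nn.Linear(lstm_hidden, action_dim)

    def forward(self, h):
        # LayerNorm before tanh keeps the action head out of the saturated region
        return torch.tanh(self.fc(self.norm(h)))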

# DQN - Discrete speed actions
MARL:
  algorithm: "dqn"
  state_dim: 7
  dqn:
    speed_actions: [0, 5, 8, 12, 15]
    learning_rate: 0.001
    memory_size: 50000
    batch_size: 32
    epsilon: 0.1
    epsilon_decay: 0.995

# Q-Learning - Tabular with configurable state bins
MARL:
  algorithm: "q_learning"
  q_learning:
    speed_actions: [15, 35, 65]
    state_features:
      distance_to_intersection:
        bins: [0, 5, 15]
    epsilon: 0.1
    learning_rate: 0.2
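
The state bins above discretize continuous features into table indices; a small numpy sketch:

# Sketch of tabular state discretization using the configured bins
import numpy as np

bins = [0, 5, 15]  # distance_to_intersection bins from the config above
def discretize(distance: float) -> int:
    # np.digitize returns the bin index: x < 0 -> 0, [0, 5) -> 1, [5, 15) -> 2, x >= 15 -> 3
    return int(np.digitize(distance, bins))
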
# MAPPO - Multi-Agent PPO with GAE
MARL:
  algorithm: "mappo"

# SAC - Soft Actor-Critic with entropy regularization
MARL:
  algorithm: "sac"

Training Infrastructure

# MARLManager orchestrates the active algorithm
from opencda_marl.core.marl.marl_manager import MARLManager

manager = MARLManager(config)
action = manager.select_action(observations, ego_id, training=True)
manager.store_transition(obs, ego_id, action, reward, next_obs, done)
losses = manager.update()

# CheckpointManager - Structured model saving
from opencda_marl.core.marl.checkpoint import CheckpointManager

checkpoint_mgr = CheckpointManager(config)
checkpoint_mgr.save(algorithm, episode, reward)  # latest + best + per-episode
checkpoint_mgr.load(algorithm, mode="best")      # load best model

# TrainingMetrics - Episode statistics with CSV export
from opencda_marl.core.marl.metrics import TrainingMetrics

metrics = TrainingMetrics(config)
metrics.update(episode_data)
metrics.export_csv()  # Export to metrics_history/

  • SmartReplayBuffer: Pre-allocated numpy arrays, O(1) push/sample, recency bias (50% recent + 50% diverse; sampling split sketched below)
  • PrioritizedReplayBuffer: TD-error weighted sampling, importance sampling with beta annealing
  • RolloutBuffer: On-policy buffer for MAPPO with GAE computation
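
The recency bias is the interesting part of SmartReplayBuffer; a sketch of the 50/50 sampling split (the pre-allocated storage is omitted, and the window size is an assumption):

# Sketch of recency-biased index sampling: half recent, half uniform over the buffer
import numpy as np

def sample_indices(size, batch_size, recent_window=10_000, recent_fraction=0.5):
    n_recent = int(batch_size * recent_fraction)
    start = max(0, size - recent_window)
    recent = np.random.randint(start, size, n_recent)             # 50% recent transitions
    diverse = np.random.randint(0, size, batch_size - n_recent)   # 50% from the whole buffer
    return np.concatenate([recent, diverse])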

Automatic convergence detection is based on all of the following (combined in the sketch after this list):

  • Coefficient of variation (CV) < 15% over rolling window of 10 episodes
  • Success rate stability (CV < 20%)
  • Collision rate improving (second half ≤ first half × 1.1)
  • Minimum 20 episodes before checking
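
Taken together, the criteria amount to a check like this sketch (array handling is illustrative):

# Sketch of the convergence test, combining the criteria above
import numpy as np

def has_converged(rewards, successes, collision_rates, window=10, min_episodes=20):
    if len(rewards) < min_episodes:
        return False
    cv = lambda x: float(np.std(x) / (abs(np.mean(x)) + 1e-8))
    half = len(collision_rates) // 2
    return (cv(rewards[-window:]) < 0.15                   # reward CV < 15%
            and cv(successes[-window:]) < 0.20             # success-rate CV < 20%
            and np.mean(collision_rates[half:])
                <= np.mean(collision_rates[:half]) * 1.1)  # collisions not worsening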

GUI Dashboard System

# PySide6 Qt-based GUI for real-time visualization
from opencda_marl.gui.dashboard import Dashboard

dashboard = Dashboard(coordinator)
dashboard.show()  # Launch interactive GUI

  • Main Dashboard: Central control interface with PySide6 Qt widgets
  • Observation Viewer: Real-time agent state visualization
  • Step Controller: Manual simulation stepping and episode management
  • Widget Panels: Agent observation, environment, metrics, reward, system, traffic, weather

Traffic Management System

# MARLTrafficManager - Orchestrates traffic spawn events
from opencda_marl.core.traffic.traffic_manager import MARLTrafficManager

traffic_manager = MARLTrafficManager(world, traffic_config, state)
spawn_events = traffic_manager.update(current_step)

Three traffic modes are supported:

  • Record: Record actual simulation vehicle behavior to JSON/HDF5
  • Replay: Replay pre-recorded traffic patterns for reproducibility
  • Live: Generate traffic on-the-fly using flow configuration

# EventSerializer - Save and load traffic events
from opencda_marl.core.traffic.serializer import EventSerializer

# Save traffic events
EventSerializer.save_events_to_json(events, "recordings/traffic.json", config)

# Load for replay
events = EventSerializer.load_events_from_json("recordings/traffic.json")
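
Mode selection lives under scenario.traffic in the YAML config; a minimal snippet mirroring the schema shown later (only the replay keys appear in this changelog, so the extra keys for record and live modes are not shown):

# Selecting the traffic mode
scenario:
  traffic:
    mode: "replay"            # one of: record, replay, live
    replay_file: "recordings/traffic.json"
    base_speed: 45.0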

Technical Implementation

Observation System

The ObservationExtractor supports 9 configurable feature types (an assembly sketch follows the list):

  • rel_x, rel_y: Relative position to the ego vehicle
  • heading: Vehicle orientation (radians)
  • speed: Current velocity
  • distance_to_intersection: Remaining distance to the junction
  • distance_to_front: Distance to the nearest vehicle ahead
  • lane_position: Lateral position within the lane
  • waypoint_buffer: Distance to the next waypoint
  • min_ttc: Minimum time-to-collision
  • distance_to_destination: Remaining route distance
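
A sketch of how a state vector is assembled from these features, assuming a flat per-vehicle dict (the real extractor consumes OpenCDA vehicle data; this only shows the shape):

# Eight enabled features -> state_dim: 8 in the TD3 config above
import numpy as np

vehicle_data = {"rel_x": 3.2, "rel_y": -1.0, "heading": 0.1, "speed": 8.5,
                "distance_to_intersection": 24.0, "distance_to_front": 12.0,
                "lane_position": 0.2, "min_ttc": 4.5}   # illustrative values
obs = np.array([vehicle_data[f] for f in vehicle_data], dtype=np.float32)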

Reward System

Multi-objective rewards, configurable via YAML (combined in the sketch after this list):

  • Collision: -500 (terminal penalty on collision)
  • Success: +400 (terminal reward on reaching the destination)
  • Step penalty: -0.5 (per-step cost to encourage efficiency)
  • Speed bonus: +1.0 (reward for maintaining the target speed)
  • Progress: scaled by distance to the destination
  • Stop penalty: -3.0 (penalty for stopped vehicles)
  • Yielding bonus: +1.0 (reward for yielding to obstacles)
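
Putting the defaults together, the per-step reward behaves roughly like this sketch (flag names are illustrative):

# Sketch of the multi-objective reward using the default values above
def compute_reward(collided, succeeded, progress, at_target_speed, stopped, yielded):
    if collided:
        return -500.0                 # terminal collision penalty
    if succeeded:
        return 400.0                  # terminal success reward
    reward = -0.5 + progress          # step penalty plus scaled progress term
    if at_target_speed:
        reward += 1.0                 # speed bonus
    if stopped:
        reward -= 3.0                 # stop penalty
    if yielded:
        reward += 1.0                 # yielding bonus
    return reward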

TensorBoard Logging

Comprehensive training metrics logged to TensorBoard:

  • Losses: Critic loss, actor loss
  • Q-values: Q1 mean, Q2 mean
  • Gradients: Pre-clip norms for critic and actor
  • Exploration: Noise level over time
  • Learning: Reward moving average, coefficient of variation
  • Safety: Near-miss count, TTC violation rate
  • Traffic: Average speed, speed gap, throughput
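
Logging goes through torch's standard SummaryWriter; a minimal sketch (tag names are illustrative, not the exact tags the framework emits):

# Sketch of scalar logging with torch.utils.tensorboard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs")   # matches tensorboard.log_dir in the config below
step, critic_loss, actor_loss, noise = 0, 1.23, 0.45, 0.3  # placeholder values
writer.add_scalar("loss/critic", critic_loss, step)
writer.add_scalar("loss/actor", actor_loss, step)
writer.add_scalar("exploration/noise", noise, step)
writer.close()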

Dependencies

  • omegaconf 2.3+: Configuration management
  • loguru 0.7+: Enhanced logging
  • mkdocs-material 9.5+: Documentation theme
  • torch 2.0+: Deep learning framework
  • numpy 1.24+: Numerical computing
  • pyside6 6.0+: GUI framework
  • tensorboard 2.0+: Training visualization

Configuration Schema

# Base MARL configuration structure (configs/marl/default.yaml)
meta:
  simulator: "carla"        # or "sumo"

world:
  sync_mode: true
  client_port: 2000
  fixed_delta_seconds: 0.05

scenario:
  max_steps: 2400
  max_episodes: 500
  traffic:
    mode: "replay"
    replay_file: "recordings/lite_2minL.json"
    base_speed: 45.0

MARL:
  algorithm: "td3"          # td3, dqn, q_learning, mappo, sac
  state_dim: 8
  action_dim: 1
  training: true

agents:
  agent_type: "marl"        # marl, vanilla, behavior, rule_based, basic

tensorboard:
  enabled: true
  log_dir: "runs"

world_reset:
  enabled: true
  interval_episodes: 50
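
Since configuration is handled by omegaconf (see Dependencies), the file above can be loaded and overridden programmatically; a short sketch:

# Loading and overriding the base config with omegaconf
from omegaconf import OmegaConf

config = OmegaConf.load("configs/marl/default.yaml")
overrides = OmegaConf.from_dotlist(["MARL.algorithm=sac", "scenario.max_episodes=200"])
config = OmegaConf.merge(config, overrides)
print(config.MARL.algorithm)  # "sac"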

API Changes

New Classes

class MARLCoordinator:
    """Main MARL orchestrator"""
    def __init__(self, config: Dict)
    def initialize(self)
    def step(self) -> Dict
    def run(self)
    def reset_episode(self)
    def run_gui_mode(self)
    def get_metrics(self) -> Dict
    def close(self)

class MARLManager:
    """Algorithm orchestrator"""
    def select_action(self, observations, ego_id, training) -> float
    def store_transition(self, obs, ego_id, action, reward, next_obs, done)
    def update(self) -> Dict[str, float]
    def reset_episode(self)
    def get_training_info(self) -> Dict

class BaseAlgorithm(ABC):
    """Abstract base for all RL algorithms"""
    def select_action(self, state, training) -> action
    def store_transition(self, state, action, reward, next_state, done)
    def update(self) -> Dict[str, float]
    def reset_episode(self)
    def get_training_info(self) -> Dict
    def save(self, path)
    def load(self, path)

class CheckpointManager:
    def save(self, algorithm, episode, reward)
    def load(self, algorithm, mode="latest")

class TrainingMetrics:
    def update(self, episode_data)
    def export_csv(self)
    def get_summary(self) -> Dict

class ObservationExtractor:
    def extract(self, vehicle_data) -> np.ndarray
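
A minimal end-to-end sketch using only the methods listed above (the coordinator's import path is an assumption, not confirmed by this changelog):

# Wiring the pieces together; the module path below is assumed
from omegaconf import OmegaConf
from opencda_marl.core.coordinator import MARLCoordinator  # assumed module path

config = OmegaConf.load("configs/marl/default.yaml")
coordinator = MARLCoordinator(config)
coordinator.initialize()
coordinator.run()                    # or coordinator.run_gui_mode() for the dashboard
print(coordinator.get_metrics())
coordinator.close()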