In this tutorial, you'll implement a reinforcement learning agent using RLax, a research-oriented library developed by Google DeepMind for building reinforcement learning algorithms on top of JAX. We combine RLax with JAX, Haiku, and Optax to build a Deep Q-Learning (DQN) agent that learns to solve the CartPole environment. Instead of using a fully packaged RL framework, we assemble the training pipeline ourselves so that we gain a clear understanding of how the core components of reinforcement learning interact. We define a neural network, build a replay buffer, compute temporal-difference errors with RLax, and train the agent with gradient-based optimization. Along the way, we focus on how RLax provides reusable RL primitives that can be integrated into custom reinforcement learning pipelines. We use JAX for efficient numerical computation, Haiku for neural network modeling, and Optax for optimization.
!pip -q install "jax[cpu]" dm-haiku optax rlax gymnasium matplotlib numpy
import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
import random
import time
from dataclasses import dataclass
from collections import deque
import gymnasium as gym
import haiku as hk
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
import numpy as np
import optax
import rlax
seed = 42
random.seed(seed)
np.random.seed(seed)
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
num_actions = env.action_space.n
def q_network(x):
    mlp = hk.Sequential([
        hk.Linear(128), jax.nn.relu,
        hk.Linear(128), jax.nn.relu,
        hk.Linear(num_actions),
    ])
    return mlp(x)

q_net = hk.without_apply_rng(hk.transform(q_network))
dummy_obs = jnp.zeros((1, obs_dim), dtype=jnp.float32)
rng = jax.random.PRNGKey(seed)
params = q_net.init(rng, dummy_obs)
target_params = params
optimizer = optax.chain(
optax.clip_by_global_norm(10.0),
optax.adam(3e-4),
)
opt_state = optimizer.init(params)
We install the required libraries and import all the modules needed for our reinforcement learning pipeline. We initialize the environment, define a neural network architecture using Haiku, and set up a Q-network that predicts the value of each action. We also initialize the online and target network parameters and the optimizer used during training.
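As a quick sanity check (not part of the original notebook), you can confirm that the transformed network maps a batch of observations to one Q-value per action and inspect the shapes in the parameter tree:

# Illustrative check, not part of the training pipeline.
sample_q = q_net.apply(params, dummy_obs)
print("Q-value output shape:", sample_q.shape)  # expected: (1, num_actions)
print(jax.tree_util.tree_map(lambda p: p.shape, params))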
@dataclass
class Transition:
    obs: np.ndarray
    action: int
    reward: float
    discount: float
    next_obs: np.ndarray
    done: float
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        obs = np.stack([t.obs for t in batch]).astype(np.float32)
        action = np.array([t.action for t in batch], dtype=np.int32)
        reward = np.array([t.reward for t in batch], dtype=np.float32)
        discount = np.array([t.discount for t in batch], dtype=np.float32)
        next_obs = np.stack([t.next_obs for t in batch]).astype(np.float32)
        done = np.array([t.done for t in batch], dtype=np.float32)
        return {
            "obs": obs,
            "action": action,
            "reward": reward,
            "discount": discount,
            "next_obs": next_obs,
            "done": done,
        }

    def __len__(self):
        return len(self.buffer)
replay = ReplayBuffer(capacity=50000)
def epsilon_by_frame(frame_idx, eps_start=1.0, eps_end=0.05, decay_frames=20000):
    mix = min(frame_idx / decay_frames, 1.0)
    return eps_start + mix * (eps_end - eps_start)
def select_action(params, obs, epsilon):
    if random.random() < epsilon:
        return env.action_space.sample()
    q_values = q_net.apply(params, obs[None, :])
    return int(jnp.argmax(q_values[0]))
We define a transition structure and implement a replay buffer to store past experiences from the environment. The buffer exposes a method to add a transition and to sample a minibatch that will later be used to train the agent. We also implement the epsilon-greedy exploration strategy.
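To see how the exploration schedule behaves, here is a small, optional check (not in the original notebook) that prints epsilon at a few frame counts, assuming the default arguments above:

# Illustrative check of the epsilon-greedy schedule (assumes the defaults above).
for frame in (0, 5000, 10000, 20000, 30000):
    print(f"frame {frame:6d} -> epsilon {epsilon_by_frame(frame):.3f}")
# Epsilon decays linearly from 1.0 to 0.05 over the first 20,000 frames, then stays at 0.05.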
@jax.jit
def soft_update(target_params, online_params, tau):
    return jax.tree_util.tree_map(
        lambda t, s: (1.0 - tau) * t + tau * s, target_params, online_params
    )
def batch_td_errors(params, target_params, batch):
    q_tm1 = q_net.apply(params, batch["obs"])
    q_t = q_net.apply(target_params, batch["next_obs"])
    td_errors = jax.vmap(
        lambda q1, a, r, d, q2: rlax.q_learning(q1, a, r, d, q2)
    )(q_tm1, batch["action"], batch["reward"], batch["discount"], q_t)
    return td_errors
@jax.jit
def train_step(params, target_params, opt_state, batch):
    def loss_fn(p):
        td_errors = batch_td_errors(p, target_params, batch)
        loss = jnp.mean(rlax.huber_loss(td_errors, delta=1.0))
        metrics = {
            "loss": loss,
            "td_abs_mean": jnp.mean(jnp.abs(td_errors)),
            "q_mean": jnp.mean(q_net.apply(p, batch["obs"])),
        }
        return loss, metrics

    (loss, metrics), grads = jax.value_and_grad(loss_fn, has_aux=True)(params)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, metrics
This defines the core learning logic used during training. We compute the temporal-difference error with RLax's Q-learning primitive and turn it into a loss using the Huber loss function. We then implement a training step that computes gradients, applies the optimizer update, and returns training metrics.
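To make the RLax primitive concrete, here is a minimal, illustrative single-transition example of the temporal-difference error that rlax.q_learning returns; the numbers are made up for demonstration and are not part of the pipeline:

# rlax.q_learning returns the TD error  r_t + discount_t * max(q_t) - q_tm1[a_tm1].
q_tm1_example = jnp.array([1.0, 2.0])   # Q-values for the current state
q_t_example = jnp.array([0.5, 3.0])     # Q-values for the next state
td_example = rlax.q_learning(
    q_tm1_example,
    jnp.asarray(1),                     # action taken
    jnp.asarray(1.0),                   # reward
    jnp.asarray(0.99),                  # discount
    q_t_example,
)
print(float(td_example))                # 1.0 + 0.99 * 3.0 - 2.0 = 1.97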
def evaluate_agent(params, episodes=5):
    returns = []
    for ep in range(episodes):
        obs, _ = eval_env.reset(seed=seed + 1000 + ep)
        done = False
        truncated = False
        total_reward = 0.0
        while not (done or truncated):
            q_values = q_net.apply(params, obs[None, :])
            action = int(jnp.argmax(q_values[0]))
            next_obs, reward, done, truncated, _ = eval_env.step(action)
            total_reward += reward
            obs = next_obs
        returns.append(total_reward)
    return float(np.mean(returns))
num_frames = 40000
batch_size = 128
warmup_steps = 1000
train_every = 4
eval_every = 2000
gamma = 0.99
tau = 0.01
max_grad_updates_per_step = 1
obs, _ = env.reset(seed=seed)
episode_return = 0.0
episode_returns = []
eval_returns = []
losses = []
td_means = []
q_means = []
eval_steps = []
start_time = time.time()
We define an evaluation function to measure agent performance, and configure training hyperparameters such as the number of frames, batch size, discount factor, and target network update rate. We also initialize the variables that track training statistics such as episode returns, losses, and evaluation metrics.
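Optionally, you can evaluate the untrained parameters once as a rough baseline before training starts; this step is not in the original notebook:

# Optional baseline (illustrative): an untrained policy on CartPole-v1
# typically scores far below the 500-step maximum.
baseline_return = evaluate_agent(params, episodes=3)
print(f"Untrained baseline return: {baseline_return:.1f}")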
for frame_idx in range(1, num_frames + 1):
    epsilon = epsilon_by_frame(frame_idx)
    action = select_action(params, obs.astype(np.float32), epsilon)
    next_obs, reward, done, truncated, _ = env.step(action)
    terminal = done or truncated
    discount = 0.0 if terminal else gamma
    replay.add(
        obs.astype(np.float32),
        action,
        float(reward),
        float(discount),
        next_obs.astype(np.float32),
        float(terminal),
    )
    obs = next_obs
    episode_return += reward
    if terminal:
        episode_returns.append(episode_return)
        obs, _ = env.reset()
        episode_return = 0.0
    if len(replay) >= warmup_steps and frame_idx % train_every == 0:
        for _ in range(max_grad_updates_per_step):
            batch_np = replay.sample(batch_size)
            batch = {k: jnp.asarray(v) for k, v in batch_np.items()}
            params, opt_state, metrics = train_step(params, target_params, opt_state, batch)
            target_params = soft_update(target_params, params, tau)
            losses.append(float(metrics["loss"]))
            td_means.append(float(metrics["td_abs_mean"]))
            q_means.append(float(metrics["q_mean"]))
    if frame_idx % eval_every == 0:
        avg_eval_return = evaluate_agent(params, episodes=5)
        eval_returns.append(avg_eval_return)
        eval_steps.append(frame_idx)
        recent_train = np.mean(episode_returns[-10:]) if episode_returns else 0.0
        recent_loss = np.mean(losses[-100:]) if losses else 0.0
        print(
            f"step={frame_idx:6d} | epsilon={epsilon:.3f} | "
            f"recent_train_return={recent_train:7.2f} | "
            f"eval_return={avg_eval_return:7.2f} | "
            f"recent_loss={recent_loss:.5f} | buffer={len(replay)}"
        )
elapsed = time.time() - start_time
final_eval = evaluate_agent(params, episodes=10)
print("nTraining full")
print(f"Elapsed time: {elapsed:.1f} seconds")
print(f"Remaining 10-episode analysis return: {final_eval:.2f}")
plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.plot(episode_returns)
plt.title("Training Episode Returns")
plt.xlabel("Episode")
plt.ylabel("Return")
plt.subplot(1, 3, 2)
plt.plot(eval_steps, eval_returns)
plt.title("Evaluation Returns")
plt.xlabel("Environment Steps")
plt.ylabel("Avg Return")
plt.subplot(1, 3, 3)
plt.plot(losses, label="Loss")
plt.plot(td_means, label="|TD Error| Mean")
plt.title("Optimization Metrics")
plt.xlabel("Gradient Updates")
plt.legend()
plt.tight_layout()
plt.show()
obs, _ = eval_env.reset(seed=999)
frames = []
done = False
truncated = False
total_reward = 0.0
render_env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, _ = render_env.reset(seed=999)
while not (done or truncated):
    frame = render_env.render()
    frames.append(frame)
    q_values = q_net.apply(params, obs[None, :])
    action = int(jnp.argmax(q_values[0]))
    obs, reward, done, truncated, _ = render_env.step(action)
    total_reward += reward
render_env.close()
print(f"Demo episode return: {total_reward:.2f}")
try:
    import matplotlib.animation as animation
    from IPython.display import HTML, display

    fig = plt.figure(figsize=(6, 4))
    patch = plt.imshow(frames[0])
    plt.axis("off")

    def animate(i):
        patch.set_data(frames[i])
        return (patch,)

    anim = animation.FuncAnimation(fig, animate, frames=len(frames), interval=30, blit=True)
    display(HTML(anim.to_jshtml()))
    plt.close(fig)
except Exception as e:
    print("Animation display skipped:", e)
We run the complete reinforcement learning training loop: we regularly update the network parameters, evaluate agent performance, and record metrics for visualization. Finally, we plot the training results and render a demonstration episode to observe how the trained agent behaves.
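If you want to reuse the trained agent later, a minimal sketch for persisting the parameters is shown below; it assumes pickle is an acceptable format, and the filename is arbitrary:

# Optional sketch (assumptions: pickle is fine, filename is arbitrary):
# persist the trained parameters so the agent can be reloaded without retraining.
import pickle

with open("cartpole_dqn_params.pkl", "wb") as f:
    pickle.dump(jax.device_get(params), f)

with open("cartpole_dqn_params.pkl", "rb") as f:
    restored_params = pickle.load(f)

# The restored parameters plug straight back into q_net.apply.
print(q_net.apply(restored_params, dummy_obs))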
In conclusion, we've built a complete Deep Q-Learning agent by combining RLax with a modern JAX-based machine learning ecosystem. We designed a neural network that estimates action values, implemented experience replay to stabilize learning, and computed TD errors using RLax's Q-learning primitive. During training, we used gradient-based optimization to update the network parameters and periodically evaluated the agent to track performance improvements. We also saw how RLax enables a modular approach to reinforcement learning by providing reusable algorithm components rather than full algorithms. This flexibility lets you easily experiment with different architectures, learning rules, and optimization strategies. By extending this foundation, you can use the same RLax primitives to build more advanced agents such as Double DQN, distributional RL agents, and actor-critic methods.
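For example, moving from standard Q-learning to Double DQN is a local change: only the RLax primitive inside the TD-error computation needs to be swapped. The sketch below is illustrative, follows rlax.double_q_learning's signature, and is not part of the pipeline above:

# Illustrative Double DQN variant (not part of the pipeline above):
# the online network selects the next action, the target network evaluates it.
def batch_double_q_td_errors(params, target_params, batch):
    q_tm1 = q_net.apply(params, batch["obs"])
    q_t_value = q_net.apply(target_params, batch["next_obs"])    # evaluation
    q_t_selector = q_net.apply(params, batch["next_obs"])        # action selection
    return jax.vmap(rlax.double_q_learning)(
        q_tm1, batch["action"], batch["reward"], batch["discount"],
        q_t_value, q_t_selector,
    )
# Swapping this function into train_step's loss_fn (in place of batch_td_errors)
# turns the agent into a Double DQN.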