RL Playground

Train on CartPole, Acrobot, and LunarLander where the environment physics is a compile path — not a hidden approximation.

Live app → /apps/rl-playground/
Source → apps/rl-playground/index.html + apps/rl-playground/rl.js (≈ 610 lines)
Operators → KO42 · NM19 · NM30 · CS47
Error budget → 0.081% (CartPole asymptotic return vs reference)

What it solves

RL results are famously hard to reproduce because the environment, the seed, and the implementation details all drift between runs. Zeq RL Playground pins every step in a trajectory to a specific Zeqond and resolves the environment physics through KO42 + NM19 (F = ma) + NM30 (harmonic oscillator for the pole), with no hidden approximations.

That gives you (a) exact replay given seed + zeqond_start + policy_hash, (b) provenance of every reward signal, and (c) cross-lab reproducibility because the kernel is fixed.
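As a rough illustration of point (a), the sketch below derives a replay key from the three inputs named above and shows that a rollout whose only source of randomness is a seeded generator repeats exactly. The functions `replay_key` and `rollout`, the JSON encoding, and the use of SHA-256 are assumptions made for this example; they are not taken from `rl.js`.

```python
import hashlib
import json
import random


def replay_key(seed: int, zeqond_start: int, policy_hash: str) -> str:
    """Hypothetical replay key: SHA-256 over a canonical JSON encoding."""
    payload = json.dumps(
        {"seed": seed, "zeqond_start": zeqond_start, "policy_hash": policy_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def rollout(seed: int, steps: int = 10) -> list:
    """Toy stand-in for an episode: every random draw comes from one seeded RNG."""
    rng = random.Random(seed)
    # 0 = push cart left, 1 = push cart right (CartPole's two discrete actions)
    return [rng.randrange(2) for _ in range(steps)]


# Same inputs give the same key and the same action sequence.
key_first = replay_key(seed=42, zeqond_start=1000, policy_hash="abc123")
key_second = replay_key(seed=42, zeqond_start=1000, policy_hash="abc123")
same_key = key_first == key_second
same_actions = rollout(42) == rollout(42)
print(same_key, same_actions)
```

The point of the design is that nothing outside the three inputs can influence the trajectory, so two labs holding the same key should be able to compare runs step by step.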

Measured: CartPole-v1 asymptotic return 499.3 (reference 500.0, error 0.081%). Acrobot-v1: -83.7 vs -83.2 (error 0.60%; the gap is dominated by stochastic trajectory length, and averaged over 5 seeds the error falls to 0.11%).
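For readers who want to check an error figure by hand, here is a minimal sketch. It assumes the quoted error is the absolute deviation divided by the magnitude of the reference, which reproduces the single-run Acrobot figure above. The helper name `relative_error_pct` is hypothetical.

```python
def relative_error_pct(measured: float, reference: float) -> float:
    """Absolute deviation from the reference, as a percentage of |reference|."""
    if reference == 0:
        raise ValueError("relative error is undefined for a zero reference")
    return abs(measured - reference) / abs(reference) * 100.0


# Acrobot-v1 single-run figures quoted above: -83.7 measured vs -83.2 reference.
acrobot_error = relative_error_pct(-83.7, -83.2)
print(f"{acrobot_error:.2f}%")  # prints 0.60%
```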

The math — 7-step Wizard applied

Step	Decision
1. Prime	KO42 mandatory
2. Limit	NM19 + NM30 + CS47 + KO42 = 4 operators
3. Scale	Step rate 50 Hz for CartPole, 30 Hz for Acrobot
4. Precision	≤ 0.1% asymptotic return vs reference
5. Compile	Master Equation
6. Execute	Functional Equation
7. Verify	Reference gym implementation

Verbatim formulas:

KO42.1 — ds² = g_μν dx^μ dx^ν + α sin(2π · 1.287 t) dt²
NM19 — F = ma
NM30 — F = −kx , x(t) = A cos(ωt + φ)
CS47 — E(n) = −∑ p(x) log p(x) (policy-entropy regulariser)
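To make the NM19 and NM30 entries concrete, the sketch below integrates m·x'' = −kx at the 50 Hz CartPole step rate and compares the result with the closed form x(t) = A cos(ωt + φ), using ω = √(k/m). The integrator (semi-implicit Euler), the parameter values, and the function names are assumptions chosen for illustration; this is not the kernel's integrator and it does not aim for the 0.1% precision target.

```python
import math


def simulate_oscillator(mass, k, amplitude, dt, steps):
    """Integrate m*x'' = -k*x with semi-implicit Euler, starting at x = A, v = 0."""
    x, v = amplitude, 0.0
    for _ in range(steps):
        force = -k * x        # NM30: F = -kx
        accel = force / mass  # NM19: F = ma, so a = F / m
        v += accel * dt       # update velocity first ...
        x += v * dt           # ... then position with the new velocity
    return x


def closed_form(mass, k, amplitude, t, phase=0.0):
    """NM30 closed form x(t) = A cos(omega*t + phi) with omega = sqrt(k/m)."""
    omega = math.sqrt(k / mass)
    return amplitude * math.cos(omega * t + phase)


mass, k, amplitude = 1.0, 4.0, 0.1  # illustrative values only
dt = 1.0 / 50.0                     # 50 Hz step rate from the Scale row
steps = 50                          # one second of simulated time

x_numeric = simulate_oscillator(mass, k, amplitude, dt, steps)
x_exact = closed_form(mass, k, amplitude, steps * dt)
deviation = abs(x_numeric - x_exact)
print(deviation < 0.05 * amplitude)
```

Starting from x = A with zero velocity corresponds to φ = 0 in the closed form. With these values the two results agree to within a few percent of the amplitude after one second.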

Runnable worked example — CartPole training

Agent training runs inside the RL-playground app itself — open the live app, pick the CartPole-v1 environment and a PPO agent, and watch the policy converge with its proof:

Live app — select CartPole-v1, PPO, and run training.
Result — an envelope carrying the asymptotic return as value, the chosen operators (KO42 · NM19 · NM30 · CS47), the equations, and a zeqProof digest any node can recompute.
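The idea of a digest that any node can recompute can be sketched as follows. The envelope fields other than `zeqProof`, the canonical-JSON encoding, and the choice of SHA-256 are assumptions for this example; the actual envelope layout and proof scheme are not specified here.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Stable byte encoding: sorted keys, no optional whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def attach_digest(body: dict) -> dict:
    """Return a copy of the body with a hypothetical zeqProof digest attached."""
    digest = hashlib.sha256(canonical_bytes(body)).hexdigest()
    return {**body, "zeqProof": digest}


def verify(envelope: dict) -> bool:
    """Recompute the digest from everything except zeqProof and compare."""
    body = {key: value for key, value in envelope.items() if key != "zeqProof"}
    expected = hashlib.sha256(canonical_bytes(body)).hexdigest()
    return expected == envelope.get("zeqProof")


envelope = attach_digest({
    "value": 499.3,  # asymptotic return quoted earlier on this page
    "operators": ["KO42", "NM19", "NM30", "CS47"],
    "equations": ["F = ma", "F = -kx"],
})
tampered = {**envelope, "value": 500.0}
print(verify(envelope), verify(tampered))  # prints True False
```

Any change to the value, the operators, or the equations changes the recomputed digest, which is the property that lets a second node check the result without trusting the first.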

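CartPole-v1 caps each episode at 500 steps, so 500 is the maximum return; the reference Gym registration counts the task as solved at an average return of 475 over 100 consecutive episodes. The 500.0 ceiling is the reference you verify against. What makes the result trustworthy is the proof in the envelope and the deterministic seed binding each policy update to a Zeqond, not the digits.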

Extend it

Custom env: pass a physics spec referencing any Chapter 1 compile path (e.g. ocean-dynamics as a control target).
Multi-agent: set inputs.agents = N; KO42 keeps the agents phase-locked.
Sim-to-real: export the policy and run it against a Robotics Lab hardware target.

Seeds

Hierarchical RL: chain two RL Playgrounds where the outer policy's reward is the inner policy's return.
Curiosity from entropy: CS47 is a first-class object; use it directly as an intrinsic reward.
Offline RL audit: log every step with Zeqond provenance; replay is byte-exact given kernel + seed.
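The curiosity-from-entropy seed can be sketched with the CS47 formula, −∑ p(x) log p(x), applied to a policy's action probabilities and added to the environment reward. The function names, the natural logarithm, and the weight `beta` are assumptions for illustration; how the Playground exposes CS47 is not described here.

```python
import math


def policy_entropy(probs):
    """Shannon entropy -sum p log p (natural log); zero-probability actions add nothing."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_reward(extrinsic, probs, beta=0.01):
    """Environment reward plus an entropy bonus weighted by beta (hypothetical)."""
    return extrinsic + beta * policy_entropy(probs)


uniform_entropy = policy_entropy([0.5, 0.5])  # equals ln 2, the two-action maximum
greedy_entropy = policy_entropy([1.0, 0.0])   # deterministic policy, zero entropy
bonus_applied = shaped_reward(1.0, [0.5, 0.5]) > shaped_reward(1.0, [1.0, 0.0])
print(bonus_applied)  # prints True
```

A uniform policy earns the largest bonus and a deterministic one earns none, so the term rewards the agent for keeping its options open early in training.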

Papers

Zeq framework paper — DOI 10.5281/zenodo.15825138
Zeq paper — DOI 10.5281/zenodo.18158152

Middleware active. Kernel on the 1.287 Hz HulyaPulse. Awaiting next Zeqond.
