RL Infrastructure

Where Models
Come to Train.

Precision-engineered infrastructure for building, running, and benchmarking reinforcement learning environments, from tool-use agents to full coding environments.

Reach out at contact@datagym.in

DataGym Env Explorer

Task

Summarize top RLHF papers from 2024

ToolGym-v1 | Tools: web_search, read_url, write_file

01 → web_search("RLHF papers 2024") ✓ 10 results

02 → read_url(results[0].url) ✓ 4.1 KB

03 → read_url(results[2].url) ✓ 3.7 KB

04 → write_file("summary.md", ...) ✓ saved

Task

Fix failing test in utils/parser.py

CodeGym-v2 | Sandbox: Docker

01 → read_file("utils/parser.py") ✓ success

02 → edit_file(line=47, patch="...") ✓ success

03 → run_tests() ✗ 2 failing

04 → edit_file(line=52, patch="...") ✓ success

05 → run_tests() ✓ all passing

Task

Prove ∀ n ∈ ℕ, n² + n is even

MathGym-v1 | Prover: Lean4

01 → intro n · ⊢ Even (n ^ 2 + n)

02 → rw [sq, ← Nat.mul_succ] · ⊢ Even (n * (n + 1))

03 → exact Nat.even_mul_succ_self n ✓ goal closed
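For readers who want to check the statement themselves, here is a minimal Lean 4 sketch. It assumes Mathlib is available and, instead of the rewrite step shown in the trace, uses the `ring` tactic to establish the factorization before applying `Nat.even_mul_succ_self`. It is one possible proof, not necessarily the one MathGym produces.

```lean
import Mathlib

-- Sketch: n² + n is even for every natural number n.
example (n : ℕ) : Even (n ^ 2 + n) := by
  -- Factor the expression; `ring` proves the identity in ℕ.
  have h : n ^ 2 + n = n * (n + 1) := by ring
  rw [h]
  -- A product of two consecutive naturals is even.
  exact Nat.even_mul_succ_self n
```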

Task

Find iPhone 16 Pro price on amazon.in

WebGym-v1 | Obs: DOM + Screenshot

01 → navigate("https://amazon.in") ✓ 200 OK

02 → type(#search, "iPhone 16 Pro") ✓ results loaded

03 → click(results[0]) ✓ product page

04 → extract(".a-price-whole") ✓ → ₹119,900

Task

Which country won most gold in 2024 Olympics?

ReasonGym-v1 | Strategy: multi-hop retrieval

01 → retrieve("2024 Olympics medal table") ✓ USA: 40, China: 40

02 → compare([40, 40, 20, ...]) ✓ tied highest: 40 (USA, China)

03 → verify("United States, China 2024 Olympics gold") ✓ tie confirmed

Environments

Training Environments

High-fidelity observation spaces and deterministic reward functions for rigorous policy evaluation.

ToolGym

Tool-use environments for agents learning to use APIs, web browsers, CLI tools, and file systems.

12K+ Tasks | 400+ Tools

CodeGym

Coding environments with execution sandboxes, test suite verification, and multi-step debugging tasks.

50K+ Repos | Sandbox Ready

MathGym

Mathematical reasoning environments with step-by-step verifiers and formal proof checking.

Lean4, Coq, Metamath

WebGym

Web navigation and form-filling environments with DOM-based observation spaces.

Live Web Rendering

ReasonGym

Multi-hop reasoning and planning environments with structured trajectory validation.

Graph-based Validation

RL Data

Observable Trajectories

Every state, action, and reward is captured. Our verifiable RL data pipelines provide exact ground truth for multi-step reasoning models, moving beyond simple instruction tuning.

Deterministic environment execution
Syntactic action validation
Automated formal verification
Negative trajectory synthesis
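
As an illustration of what syntactic action validation could look like, the sketch below parses an action string with Python's standard `ast` module and accepts it only if it is a single call to a known action with literal arguments. The `ACTION_SPACE` set is copied from the CodeGym-v2 example on this page; the function name `parse_action` and the error messages are hypothetical, and DataGym's actual validator may work differently.

```python
import ast

# Assumed action space, taken from the CodeGym-v2 example on this page.
ACTION_SPACE = {"edit_file", "run_tests", "read_file"}


def parse_action(text: str) -> tuple[str, list, dict]:
    """Parse an action string such as 'read_file("utils/parser.py")'.

    Returns (name, positional_args, keyword_args).
    Raises ValueError if the text is not a single call to a known
    action whose arguments are all Python literals.
    """
    try:
        tree = ast.parse(text.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not valid call syntax: {text!r}") from exc

    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a plain function call: {text!r}")

    name = call.func.id
    if name not in ACTION_SPACE:
        raise ValueError(f"unknown action: {name}")

    try:
        # literal_eval rejects names, attribute access, and nested calls.
        args = [ast.literal_eval(arg) for arg in call.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except ValueError as exc:
        raise ValueError(f"arguments must be literals: {text!r}") from exc

    return name, args, kwargs


print(parse_action('edit_file(line=47, fix="...")'))
```

Parsing with `ast` rather than evaluating the string means a malformed or malicious action is rejected without ever being executed.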

# Task Configuration
Task: Fix the failing test in utils/parser.py
Environment: CodeGym-v2

# State Spaces
Observation Space: file_system, terminal, test_runner
Action Space: edit_file, run_tests, read_file

Step 1 → read_file("utils/parser.py") ✓ success

Step 2 → read_file("tests/test_parser.py") ✓ success

Step 3 → edit_file(line=47, fix="...") ✓ success

Step 4 → run_tests() ✗ 2 failing

Step 5 → edit_file(line=52, fix="...") ✓ success

Step 6 → run_tests() ✓ all passing

Reward: +1.0 | Steps: 6 | Status: SOLVED
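
The trajectory above can be sketched as a small Python record. This is a minimal illustration, not DataGym's schema: the `Step` and `Trajectory` classes, their field names, and the `UNSOLVED` label are hypothetical, and the reward rule (a sparse +1.0 when the final action is a passing `run_tests()`) is an assumption, since the page does not state how the reward is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str  # action text exactly as issued by the agent
    ok: bool     # True for a ✓ outcome, False for ✗
    result: str  # short observation summary, e.g. "2 failing"


@dataclass
class Trajectory:
    task: str
    environment: str
    steps: list[Step] = field(default_factory=list)

    def record(self, action: str, ok: bool, result: str) -> None:
        """Append one (action, outcome) pair to the trajectory."""
        self.steps.append(Step(action, ok, result))

    def reward(self) -> float:
        # Assumption: sparse reward, +1.0 only if the last action is a
        # passing run_tests() call; 0.0 otherwise.
        if not self.steps:
            return 0.0
        last = self.steps[-1]
        return 1.0 if last.action == "run_tests()" and last.ok else 0.0

    def status(self) -> str:
        return "SOLVED" if self.reward() == 1.0 else "UNSOLVED"


# Replay the six steps shown above.
traj = Trajectory("Fix the failing test in utils/parser.py", "CodeGym-v2")
traj.record('read_file("utils/parser.py")', True, "success")
traj.record('read_file("tests/test_parser.py")', True, "success")
traj.record('edit_file(line=47, fix="...")', True, "success")
traj.record("run_tests()", False, "2 failing")
traj.record('edit_file(line=52, fix="...")', True, "success")
traj.record("run_tests()", True, "all passing")

print(f"Reward: {traj.reward():+.1f} Steps: {len(traj.steps)} Status: {traj.status()}")
```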

Trajectory Verifier: CodeGym-Verifier-v2

✓ Task completed within step budget

✓ All 14 tests passing

✓ No hallucinated file paths

✓ Edit actions are syntactically valid

Score 0.94 / 1.00

Difficulty Hard
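
Three of the verifier checks listed above could be expressed roughly as follows. This is a sketch under stated assumptions: the step budget of 10 and the set of known files are hypothetical (the page gives neither), and only `read_file` paths are checked. The page does not say how the 0.94 score is derived, so the sketch reports pass/fail per check and does not attempt a score.

```python
import re

# Hypothetical inputs: the page does not give the step budget or repo layout.
STEP_BUDGET = 10
KNOWN_FILES = {"utils/parser.py", "tests/test_parser.py"}

# Matches actions of the form read_file("some/path").
READ_FILE = re.compile(r'read_file\("([^"]+)"\)')


def verify(actions: list[str], tests_passed: int, tests_total: int) -> dict[str, bool]:
    """Return a pass/fail result for each of three trajectory checks."""
    read_paths = [m.group(1) for a in actions if (m := READ_FILE.fullmatch(a))]
    return {
        "within_step_budget": len(actions) <= STEP_BUDGET,
        "all_tests_passing": tests_total > 0 and tests_passed == tests_total,
        "no_hallucinated_paths": all(p in KNOWN_FILES for p in read_paths),
    }


actions = [
    'read_file("utils/parser.py")',
    'read_file("tests/test_parser.py")',
    'edit_file(line=47, fix="...")',
    "run_tests()",
    'edit_file(line=52, fix="...")',
    "run_tests()",
]
report = verify(actions, tests_passed=14, tests_total=14)
for check, passed in report.items():
    print("✓" if passed else "✗", check)
```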

Benchmark Dominance

Quality scores across key RL data dimensions (0–100), DataGym vs. Industry Average

Environment Diversity: Range of task types and observation spaces

Trajectory Quality: Correctness and coherence of action sequences

Verifier Coverage: Automated correctness checks per trajectory step (DataGym: 100)

Data Scale: Volume of verified training examples available


Train Better Agents.
