Prompt-Driven Exploration

Bootstrapping RL from zero-reward starts by exploring in prompt space
¹Massachusetts Institute of Technology  ·  ²MIT-IBM Computing Research Lab  ·  ³Improbable AI Lab

Stop tuning the algorithms. Change the prompts.

Original: “put the green container on the top rack”
PDE-discovered: “put the green container on the top rack completely”
PDE bootstraps RL from low initial success
Original prompt: fail
PDE-discovered prompt: success

Abstract

Exploration is essential to RL since a policy cannot improve by repeatedly sampling the behaviors it already prefers. Standard methods inject stochasticity in the action space, but such jitter only yields rollouts close to the original. Escaping a weak policy often requires global perturbations that action noise cannot produce. LLMs and vision-language-action (VLA) models condition the policy on a natural language prompt, and since the rollout follows from it, modifying the prompt induces global changes. The challenge is finding prompts that induce useful global changes. With a weak policy that rarely succeeds, reward is too sparse to select on. Our idea is to refine prompts from the rollouts themselves: a vision-language model (VLM) reasons over the rollout video, diagnoses how the policy responded, and rewrites the prompt to elicit better behavior next time. This realizes posterior sampling at the level of prompts: the VLM maintains an implicit distribution over useful prompts and updates it from observed rollouts. We call this strategy Prompt-Driven Exploration (PDE). Across manipulation and reasoning tasks, PDE enables RL to learn successful policies even from zero-reward starts, and improves sample efficiency more broadly.

Bad prompt lets the robot down, good prompt lets the robot fly
Same policy weights, same scene: only the wording changes.

Refining prompts explores globally: qualitatively different strategies, not local jitter around the same wrong motor program.
Bowl: “put the green container in the bowl”
Bowl: “put the green container in the green bowl”
Drawer: “open the top drawer and put the tape above inside”
Drawer: “open the top drawer and put the tape inside”
Original: success rate ≈ 5%
Optimized: success rate ≈ 35%
Original: success rate ≈ 15%
Optimized: success rate ≈ 45%

Same arm, same plan, same regret
Three rollouts of an SFT VLA: action noise is trying its best, but the cabinet wins every time.

The SFT checkpoint has learned a visual-scene-to-trajectory mapping and gets stuck in a single wrong motor program. Action-noise exploration jitters around it — every rollout fails in the same way, so PPO sees only zero reward and never improves.
Insight. What's needed isn't more noise but different behavior. Because the VLA is prompt-conditioned, changing the prompt changes the entire rollout — a global perturbation that action-space noise cannot produce.
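The contrast between local jitter and a global change can be illustrated with a toy model. This is only a sketch under loud assumptions: the "policy" is a one-dimensional controller, the `MOTOR_PROGRAMS` table, `GOAL`, and `TOLERANCE` are invented for illustration, and nothing here models a real VLA. Each prompt selects a target position, and action noise only perturbs the path toward that target.

```python
import random

# Hypothetical toy: each prompt selects a "motor program", here just a target
# position on a line. The numbers are illustrative, not measured.
MOTOR_PROGRAMS = {
    "put the green container on the top rack": 0.2,             # stops short
    "put the green container on the top rack completely": 1.0,  # reaches the goal
}
GOAL, TOLERANCE = 1.0, 0.1


def rollout(prompt, noise_std, rng, steps=20):
    """Move toward the prompt's target with Gaussian action noise; return the final position."""
    target, pos = MOTOR_PROGRAMS[prompt], 0.0
    for _ in range(steps):
        action = 0.5 * (target - pos) + rng.gauss(0.0, noise_std)
        pos += action
    return pos


def success_rate(prompt, noise_std, n=200, seed=0):
    """Fraction of rollouts that end within TOLERANCE of GOAL."""
    rng = random.Random(seed)
    hits = sum(abs(rollout(prompt, noise_std, rng) - GOAL) <= TOLERANCE for _ in range(n))
    return hits / n
```

In this toy, small action noise leaves every rollout of the first prompt clustered around the wrong target, so the reward stays at zero, while switching the prompt moves the whole trajectory to a different target.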

Prompt-Driven Exploration (PDE)

Your VLA is stuck. Wiggling its joints harder won't help. So we hand the mic to a VLM: it watches the rollout, writes a brutally honest one-liner about what went wrong, and proposes a better prompt. Same policy weights, very different behavior.


Inner loop: prompt refinement (frozen policy weights)
  1. Initialize. Start with the user's task prompt \(p\) (e.g. “put the green container on the top rack”).
  2. Roll out. The frozen VLA executes \(p\), producing a rollout video \(\tau\).
  3. Evaluate. The VLM watches \(\tau\) and labels it success or failure. On success, exit the inner loop.
  4. Revise. On failure, the VLM analyzes \(\tau\) and proposes a refined prompt \(p'\) (e.g. “put the green box completely on the top rack”). This is the Update Prompt Posterior arrow.
  5. Repeat with \(p \leftarrow p'\) until success or until the budget is exhausted. No gradient updates are applied to the policy.
Outer loop: RL update (policy weights updated)
  1. RL update. Use the prompt pool discovered by the inner loop as a curriculum; run PPO on the original task prompt, with the policy gradient flowing through the RL Update arrow back into \(\pi_\theta\).
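The inner and outer loops above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `rollout`, `judge`, `revise`, and `rl_update` are hypothetical callables standing in for the frozen VLA, the VLM's success label, the VLM's prompt rewrite, and the PPO step. How the prompt pool is turned into a curriculum is not specified here, so it is left inside the opaque `rl_update`.

```python
from typing import Any, Callable, List, Tuple

Attempt = Tuple[str, Any, bool]  # (prompt, rollout, success label)


def refine_prompt(
    task_prompt: str,
    rollout: Callable[[str], Any],               # hypothetical: frozen policy executes a prompt
    judge: Callable[[Any], bool],                # hypothetical: VLM labels the rollout
    revise: Callable[[str, List[Attempt]], str], # hypothetical: VLM proposes a new prompt
    budget: int = 5,
) -> List[Attempt]:
    """Inner loop: search over prompts while the policy weights stay frozen."""
    prompt, history = task_prompt, []
    for _ in range(budget):
        tau = rollout(prompt)             # step 2: roll out
        ok = judge(tau)                   # step 3: evaluate
        history.append((prompt, tau, ok))
        if ok:
            break                         # success: exit the inner loop
        prompt = revise(prompt, history)  # step 4: revise on failure
    return history


def pde_iteration(task_prompt, rollout, judge, revise, rl_update, budget=5) -> List[str]:
    """One outer-loop pass: refine prompts, then hand the successful ones to the RL update."""
    history = refine_prompt(task_prompt, rollout, judge, revise, budget)
    prompt_pool = [p for p, _, ok in history if ok]
    rl_update(task_prompt, prompt_pool)   # assumed interface for the PPO step
    return prompt_pool
```

Passing the components in as callables keeps the sketch independent of any particular VLA or VLM API; stubs such as `rollout=lambda p: p` are enough to exercise the control flow.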
PDE pipeline diagram.

Connecting to Posterior-Sampling RL
PDE is PSRL, lifted from the VLA's parameter space to the VLM's prompt space.

Posterior-sampling RL (PSRL) maintains a distribution over policies, samples one per episode, executes it, and updates from the observed trajectory and reward. It's a clean, temporally coherent exploration recipe — but for modern VLAs, a Bayesian posterior over billions of parameters is intractable to maintain or sample from.
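For intuition, the sample-execute-update recipe can be written out in its simplest form. The sketch below is Thompson sampling on a Bernoulli bandit, which is what posterior sampling reduces to when each arm plays the role of one policy hypothesis. It is an illustrative analogy, not the method used in PDE; the function name and the Beta(1, 1) prior are assumptions.

```python
import random


def thompson_sampling(true_success_probs, episodes=2000, seed=0):
    """Posterior sampling over arms with a Beta posterior per arm; returns pull counts."""
    rng = random.Random(seed)
    k = len(true_success_probs)
    alpha = [1.0] * k  # Beta(1, 1) prior: one pseudo-success per arm
    beta = [1.0] * k   # and one pseudo-failure per arm
    pulls = [0] * k
    for _ in range(episodes):
        # Sample one hypothesis per arm from the current posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Act greedily with respect to the sampled hypothesis.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_success_probs[arm] else 0
        # Conjugate Bayesian update from the observed reward.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Here the posterior is two numbers per arm, which is exactly what becomes impractical when the hypothesis is a full set of VLA weights.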

PDE recovers PSRL through the language interface. For a frozen VLA with parameters \(\theta\), each prompt \(p\) induces a policy \(\pi_p(\cdot \mid o) := \pi_\theta(\cdot \mid o, p)\). So a distribution over prompts \(\rho(\cdot \mid g, H)\) induces a distribution over policies — restricted to the family reachable by the VLA's language conditioning. Drawing a policy hypothesis becomes choosing a prompt.
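The definition \(\pi_p(\cdot \mid o) := \pi_\theta(\cdot \mid o, p)\) amounts to fixing one argument of the policy. A small sketch, assuming a toy `frozen_vla` function in place of a real model and an explicit weight table in place of \(\rho\) (both hypothetical):

```python
import random
from functools import partial
from typing import Callable, Dict, Tuple


def frozen_vla(observation: float, prompt: str) -> float:
    """Hypothetical stand-in for pi_theta(. | o, p): a controller whose set point depends on the prompt."""
    target = 1.0 if "completely" in prompt else 0.2
    return 0.5 * (target - observation)


def induce_policy(prompt: str) -> Callable[[float], float]:
    """pi_p(o) := pi_theta(o, p): fix the prompt, leave the observation free."""
    return partial(frozen_vla, prompt=prompt)


def sample_policy(prompt_weights: Dict[str, float], rng: random.Random) -> Tuple[str, Callable[[float], float]]:
    """Drawing a prompt from a distribution over prompts draws a policy."""
    prompts = list(prompt_weights)
    weights = [prompt_weights[p] for p in prompts]
    prompt = rng.choices(prompts, weights=weights, k=1)[0]
    return prompt, induce_policy(prompt)
```

The weights never change in this sketch; only the prompt argument differs between the induced policies, which is why the reachable family is limited to what the language conditioning can express.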

The VLM is an implicit, amortized prompt posterior. Rather than maintaining an explicit density over natural language, PDE conditions the VLM on the rollout history \(H_i = \{(g_j, p_j, \tau_j, r_j)\}_{j < i}\) and queries it for the next prompt. The update is in-context (no gradient steps, no explicit Bayes rule), leveraging evidence that LLM in-context learning behaves as approximate Bayesian inference.
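One way to picture the in-context update is as string construction: the history \(H_i\) is serialized into the VLM's context, and the model's reply is treated as the next sampled prompt. The sketch below is an assumption-heavy illustration. The `Attempt` record, `build_query`, and `sample_next_prompt` are hypothetical names, \(g\) is read as the task goal, the rollout is represented by a text summary rather than video, and `vlm` is any callable from text to text rather than a real API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    goal: str      # g_j: the task goal (assumed reading of g)
    prompt: str    # p_j: the prompt given to the policy
    rollout: str   # tau_j: a text summary here; the real input would be rollout video
    reward: float  # r_j


def build_query(goal: str, history: List[Attempt]) -> str:
    """Serialize H_i into a context string for the VLM."""
    lines = [f"Task: {goal}", "Previous attempts:"]
    for j, a in enumerate(history, start=1):
        lines.append(f"{j}. prompt={a.prompt!r} | outcome={a.rollout} | reward={a.reward}")
    lines.append("Propose one revised prompt likely to succeed.")
    return "\n".join(lines)


def sample_next_prompt(vlm: Callable[[str], str], goal: str, history: List[Attempt]) -> str:
    """The 'posterior update' is just a longer context; the 'sample' is the VLM's reply."""
    return vlm(build_query(goal, history)).strip()
```

Appending an `Attempt` to `history` after each rollout is the only state that changes between queries, which matches the claim that no gradient step is involved.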

PDE outperforms dense-reward and action-noise baselines
Across 200+ tasks in LIBERO-PRO and ManiSkill

\(\pi_{0.5}\)
LIBERO Pi0.5 — success rate vs. environment steps for PDE and baselines.
\(\pi_0\)
LIBERO Pi0 — PDE vs. action-noise baselines on the Pi0 backbone.
GR00T
GR00T — PDE bootstraps where action-space exploration fails.

Poster
ICRA'26 Workshop on VLA Pipelines for Real Robots & Workshop on Manipulation Robustness
