Prompt Driven Exploration for VLA Policies

Abstract

Exploration is essential to RL since a policy cannot improve by repeatedly sampling the behaviors it already prefers. Standard methods inject stochasticity in the action space, but such jitter only yields rollouts close to the original. Escaping a weak policy often requires global perturbations that action noise cannot produce. LLMs and vision-language-action (VLA) models condition the policy on a natural language prompt, and since the rollout follows from it, modifying the prompt induces global changes. The challenge is finding prompts that induce useful global changes. With a weak policy that rarely succeeds, reward is too sparse to select on. Our idea is to refine prompts from the rollouts themselves: a vision-language model (VLM) reasons over the rollout video, diagnoses how the policy responded, and rewrites the prompt to elicit better behavior next time. This realizes posterior sampling at the level of prompts: the VLM maintains an implicit distribution over useful prompts and updates it from observed rollouts. We call this strategy Prompt Driven Exploration (PDE). Across manipulation and reasoning tasks, PDE enables RL to learn successful policies even from zero-reward starts, and improves sample efficiency more broadly.

Bad prompt lets the robot down, good prompt lets the robot fly

Same policy weights, same scene: only the wording changes.

Refining prompts explores globally: qualitatively different strategies, not local jitter around the same wrong motor program.

Bowl
Original (success rate ≈ 5%): “put the green container in the bowl”
Optimized (success rate ≈ 35%): “put the green container in the green bowl”

Drawer
Original (success rate ≈ 15%): “open the top drawer and put the tape inside”
Optimized (success rate ≈ 45%): “open the top drawer and put the tape above inside”

Same arm, same plan, same regret

Three rollouts of an SFT VLA: action noise is trying its best, but the cabinet wins every time.

The SFT checkpoint has learned a visual-scene-to-trajectory mapping and gets stuck in a single wrong motor program. Action-noise exploration jitters around it — every rollout fails in the same way, so PPO sees only zero reward and never improves.

Insight. What's needed isn't more noise but different behavior. Because the VLA is prompt-conditioned, changing the prompt changes the entire rollout — a global perturbation that action-space noise cannot produce.

Prompt Driven Exploration (PDE)

Your VLA is stuck. Wiggling its joints harder won't help. So we hand the mic to a VLM: it watches the rollout, writes a brutally honest one-liner about what went wrong, and proposes a better prompt. Same policy weights, very different behavior.

Outer loop — RL update (policy weights updated)

Inner loop — prompt refinement (frozen policy weights)

Initialize. Start with the user's task prompt \(p\) (e.g. “put the green container on the bottom rack”).
Roll out. Frozen VLA executes \(p\), producing a rollout video \(\tau\).
Evaluate. VLM watches \(\tau\) and labels it success or failure. If success, exit the inner loop.
Revise. On failure, VLM analyzes \(\tau\) and proposes a refined prompt \(p'\) (e.g. “put the green box completely on the bottom rack”) — the Update Prompt Posterior arrow.
Repeat with \(p \leftarrow p'\) until success or budget exhausted. No gradient updates on the policy.

RL update. Use the prompt pool discovered by the inner loop as a curriculum; run PPO on the original task prompt with the policy gradient flowing through the RL Update arrow back into \(\pi_\theta\).
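
The inner loop above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: `refine_prompt`, `Attempt`, and the three callables (`rollout_fn` for the frozen VLA, `judge_fn` and `revise_fn` for the VLM) are hypothetical names introduced here, and the toy stand-ins at the bottom replace real models with string checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One inner-loop iteration (hypothetical record type)."""
    prompt: str
    rollout: object  # stands in for the rollout video tau
    success: bool


def refine_prompt(
    task_prompt: str,
    rollout_fn: Callable[[str], object],             # frozen VLA: prompt -> rollout
    judge_fn: Callable[[object], bool],              # VLM: rollout -> success?
    revise_fn: Callable[[str, List[Attempt]], str],  # VLM: history -> new prompt
    budget: int = 5,
) -> Tuple[Optional[str], List[Attempt]]:
    """Search over prompts while the policy weights stay fixed."""
    history: List[Attempt] = []
    prompt = task_prompt                              # Initialize
    for _ in range(budget):
        rollout = rollout_fn(prompt)                  # Roll out
        success = judge_fn(rollout)                   # Evaluate
        history.append(Attempt(prompt, rollout, success))
        if success:
            return prompt, history                    # exit on success
        prompt = revise_fn(task_prompt, history)      # Revise: p <- p'
    return None, history                              # budget exhausted


# Toy stand-ins (assumptions for illustration only): the "policy" succeeds
# only when the prompt contains the word "completely".
def toy_rollout(prompt: str) -> dict:
    return {"prompt": prompt, "placed": "completely" in prompt}


def toy_judge(rollout: dict) -> bool:
    return rollout["placed"]


def toy_revise(task_prompt: str, history: List[Attempt]) -> str:
    return task_prompt.replace("on the", "completely on the")


best, hist = refine_prompt(
    "put the green container on the bottom rack",
    toy_rollout, toy_judge, toy_revise,
)
print(best, len(hist))
```

In this sketch the successful prompts collected across tasks would form the prompt pool handed to the outer RL update; how that pool is used as a curriculum is not specified here and is left out.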

Connecting to Posterior-Sampling RL

PDE is PSRL, lifted from the VLA's parameter space to the VLM's prompt space.

Posterior-sampling RL (PSRL) maintains a distribution over policies, samples one per episode, executes it, and updates from the observed trajectory and reward. It's a clean, temporally coherent exploration recipe — but for modern VLAs, a Bayesian posterior over billions of parameters is intractable to maintain or sample from.
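
For intuition, the simplest instance of the sample, act, update recipe is Thompson sampling on a Bernoulli bandit, where each "policy" is just the choice of one arm. The sketch below is a toy illustration of posterior sampling in general, not of PDE or of any VLA setup; the function name `thompson_bernoulli` and the arm probabilities are assumptions made for the example.

```python
import random


def thompson_bernoulli(true_probs, episodes=2000, seed=0):
    """Thompson sampling with an independent Beta posterior per arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta(1, 1) prior: one pseudo-success per arm
    beta = [1.0] * k   # and one pseudo-failure per arm
    pulls = [0] * k
    for _ in range(episodes):
        # Sample one hypothesis about every arm from the current posterior.
        sampled = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Act greedily with respect to the sampled hypothesis.
        arm = max(range(k), key=lambda i: sampled[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Update the posterior of the arm that was actually pulled.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli([0.1, 0.5, 0.9])
print(pulls)  # pull counts per arm; most pulls should go to the last arm
```

With only a handful of arms the posterior is a few Beta parameters. The difficulty described above is that the analogous object for a VLA would be a posterior over billions of weights.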

PDE recovers PSRL through the language interface. For a frozen VLA with parameters \(\theta\), each prompt \(p\) induces a policy \(\pi_p(\cdot \mid o) := \pi_\theta(\cdot \mid o, p)\). So a distribution over prompts \(\rho(\cdot \mid g, H)\) induces a distribution over policies — restricted to the family reachable by the VLA's language conditioning. Drawing a policy hypothesis becomes choosing a prompt.
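
The construction can be made concrete with a small sketch. It assumes, purely for illustration, that the prompt distribution \(\rho(\cdot \mid g, H)\) is available as an explicit table of weights. `toy_vla`, `induced_policy`, and `sample_policy` are hypothetical names, and the toy policy returns a string instead of an action.

```python
import random
from typing import Callable, Dict, Tuple


def toy_vla(obs: str, prompt: str) -> str:
    """Stand-in for the frozen pi_theta(. | o, p); returns a fake 'action'."""
    return f"act[{prompt}]({obs})"


def induced_policy(policy: Callable[[str, str], str], prompt: str) -> Callable[[str], str]:
    """pi_p(o) := pi_theta(o, p): fix the prompt, leave the weights untouched."""
    return lambda obs: policy(obs, prompt)


def sample_policy(
    policy: Callable[[str, str], str],
    prompt_dist: Dict[str, float],  # explicit rho(p | g, H), assumed for the sketch
    rng: random.Random,
) -> Tuple[str, Callable[[str], str]]:
    """Drawing a policy hypothesis reduces to drawing a prompt."""
    prompts = list(prompt_dist)
    weights = [prompt_dist[p] for p in prompts]
    prompt = rng.choices(prompts, weights=weights, k=1)[0]
    return prompt, induced_policy(policy, prompt)


rho = {
    "put the green container in the bowl": 0.3,
    "put the green container in the green bowl": 0.7,
}
prompt, pi_p = sample_policy(toy_vla, rho, random.Random(0))
print(prompt)
print(pi_p("frame_0"))  # the sampled prompt is held fixed for the whole episode
```

Holding the sampled prompt fixed for the full episode is what makes the exploration temporally coherent, in contrast to noise resampled at every action.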

The VLM is an implicit, amortized prompt posterior. Rather than maintaining an explicit density over natural language, PDE conditions the VLM on the rollout history \(H_i = \{(g_j, p_j, \tau_j, r_j)\}_{j \lt i}\) and queries it for the next prompt. The update is in-context — no gradient steps, no explicit Bayes rule — leveraging evidence that LLM in-context learning behaves as approximate Bayesian inference.

PDE outperforms dense reward and action noise baselines

Across 200+ tasks in LIBERO-PRO and ManiSkill

Poster: ICRA'26 Workshop on VLA Pipelines for Real Robots & Workshop on Manipulation Robustness