<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sample-Efficient Exploration | Zak Mhammedi</title><link>https://www.zakmhammedi.com/tag/sample-efficient-exploration/</link><atom:link href="https://www.zakmhammedi.com/tag/sample-efficient-exploration/index.xml" rel="self" type="application/rss+xml"/><description>Sample-Efficient Exploration</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 31 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.zakmhammedi.com/images/icon_hu059ea760e479e03fc73162b2a6fdb304_1110447_512x512_fill_lanczos_center_2.png</url><title>Sample-Efficient Exploration</title><link>https://www.zakmhammedi.com/tag/sample-efficient-exploration/</link></image><item><title>GowU: Uncertainty-Guided Tree Search for Hard Exploration</title><link>https://www.zakmhammedi.com/post/gowu-exploration/</link><pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/post/gowu-exploration/</guid><description>&lt;style>
.article-header figcaption, .featured-image figcaption, .article-header .caption, .featured-image .caption { text-align: center !important; }
#floating-toc { position: fixed; top: 100px; left: max(10px, calc((100vw - 820px)/2 - 260px)); width: 220px; max-height: calc(100vh - 140px); overflow-y: auto; padding: 16px; background: #f8fafc; border-left: 3px solid #0ea5e9; border-radius: 0 8px 8px 0; font-size: 0.82em; line-height: 1.6; z-index: 100; }
#floating-toc a { color: #64748b; text-decoration: none; display: block; padding: 4px 0; transition: all 0.2s; border-left: 2px solid transparent; padding-left: 8px; }
#floating-toc a:hover { color: #0ea5e9; }
#floating-toc a.active { color: #0ea5e9; font-weight: 600; border-left-color: #0ea5e9; }
#floating-toc .toc-title { font-weight: 700; font-size: 0.95em; margin-bottom: 8px; color: #0ea5e9; }
@media (max-width: 1200px) { #floating-toc { display: none; } }
&lt;/style>
&lt;div id="floating-toc">
&lt;div class="toc-title">Contents&lt;/div>
&lt;/div>
&lt;script>
document.addEventListener('DOMContentLoaded', function() {
var toc = document.getElementById('floating-toc');
if (!toc) return;
var headings = document.querySelectorAll('.article-container h2');
headings.forEach(function(h) {
if (!h.id) h.id = h.textContent.trim().toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/-+$/,'');
var a = document.createElement('a');
a.href = '#' + h.id;
a.textContent = h.textContent;
toc.appendChild(a);
});
var links = toc.querySelectorAll('a');
function updateActive() {
var scrollPos = window.scrollY + 120;
var current = null;
headings.forEach(function(h) { if (h.offsetTop &lt;= scrollPos) current = h; });
links.forEach(function(a) {
a.classList.toggle('active', current &amp;&amp; a.getAttribute('href') === '#' + current.id);
});
}
window.addEventListener('scroll', updateActive);
updateActive();
});
&lt;/script>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;div style="background:linear-gradient(135deg, #f0f9ff 0%, #e0f2fe 100%); border-left:4px solid #0ea5e9; padding:1.2em 1.5em; border-radius:0 8px 8px 0; margin:1.5em 0; color:#1e293b;">
&lt;ul>
&lt;li>🔍 &lt;strong>Exploration ≠ Policy Optimization&lt;/strong>&amp;mdash;Treating exploration as search, rather than optimizing an exploration objective via RL, yields an order-of-magnitude efficiency gain&lt;/li>
&lt;li>🌳 &lt;strong>Particle-Based Search + Uncertainty&lt;/strong>&amp;mdash;A population of particles explores in parallel, guided by an uncertainty signal that concentrates computation on the most promising frontiers&lt;/li>
&lt;li>🎮 &lt;strong>State-of-the-Art on Hard Atari&lt;/strong>&amp;mdash;Significantly beating the previous state of the art on Montezuma&amp;rsquo;s Revenge, Pitfall!, and Venture&lt;/li>
&lt;li>🤖 &lt;strong>First Pixel-Based Solutions for Sparse MuJoCo&lt;/strong>&amp;mdash;Adroit dexterous manipulation (Door, Hammer, Relocate) and AntMaze solved from images with sparse rewards and no demonstrations&lt;/li>
&lt;/ul>
&lt;/div>
&lt;h2 id="the-exploration-problem">The Exploration Problem&lt;/h2>
&lt;p>Consider being placed in an unfamiliar building without a map and asked to find a particular room. To succeed, you would need to move through the building, try different doors, and keep track of where you have already been. This captures the basic challenge of &lt;strong>exploration&lt;/strong>: gathering new information in a structured way in order to achieve a goal. Despite decades of research, efficient exploration remains a central bottleneck in reinforcement learning. When environmental feedback is extremely &lt;strong>sparse&lt;/strong>, existing methods often struggle to explore effectively.&lt;/p>
&lt;p>Existing approaches to exploration typically use reinforcement learning (RL) to train a &lt;strong>policy&lt;/strong> to maximize an &lt;strong>exploration objective&lt;/strong>&amp;mdash;one designed so that optimizing it encourages the agent to visit new regions of the state space. Despite significant progress, these methods still fail in the hardest settings. We suggest that this is, in part, because &lt;strong>policy optimization&lt;/strong> itself may not be the right tool for exploration.&lt;/p>
&lt;hr>
&lt;h2 id="a-new-paradigm-search-instead-of-policy-optimization">A New Paradigm: Search Instead of Policy Optimization&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Policy optimization is necessary for precise task execution, but employing such machinery solely to expand state coverage may be inefficient.&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;p>In this work, we propose a fundamentally new approach to exploration. Rather than making exploration depend on optimizing a non-stationary exploration objective, we &lt;strong>decouple&lt;/strong> the two. Our method, &lt;strong>Go-With-Uncertainty (GowU)&lt;/strong>, treats exploration as a &lt;strong>search problem&lt;/strong> carried out by a population of particles, while remaining agnostic to the local policy used by each particle. Inspired by the &lt;a href="https://people.eecs.berkeley.edu/~vazirani/pubs/winners.pdf" target="_blank" rel="noopener">Go-With-The-Winner&lt;/a> strategy, GowU maintains a population of particles that explore the environment in parallel, guided by a measure of &lt;strong>uncertainty&lt;/strong>. The idea is simple: states that the particles have visited often have low uncertainty, while rarely visited states have high uncertainty. GowU uses this signal to steer the search toward under-explored regions of the environment.&lt;/p>
&lt;p>We evaluate GowU on some of the hardest exploration benchmarks, including Atari games like &lt;strong>Montezuma&amp;rsquo;s Revenge&lt;/strong>, &lt;strong>Pitfall!&lt;/strong>, and &lt;strong>Venture&lt;/strong>, as well as challenging &lt;strong>robotic manipulation&lt;/strong> and navigation tasks&amp;mdash;all with extremely sparse reward signals.&lt;/p>
&lt;div style="display:flex; gap:12px; justify-content:center; align-items:flex-end; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="pitfall.jpeg" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Pitfall!">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Pitfall!&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="montezuma_revenge.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Montezuma's Revenge">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Montezuma's Revenge&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="venture.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Venture">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Venture&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;div style="display:flex; gap:12px; justify-content:center; align-items:flex-end; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="door-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Door">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Door&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="hammer-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Hammer">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Hammer&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="relocate-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Relocate">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Relocate&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="antmaze-top.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="AntMaze">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">AntMaze&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;hr>
&lt;h2 id="how-gowu-works">How GowU Works&lt;/h2>
&lt;p>At its core, GowU is an exploration algorithm whose goal is to generate high-scoring or task-completing trajectories. It does not produce a deployable policy directly&amp;mdash;instead, it systematically searches for demonstrations that can later be distilled into executable policies (more on this below). To do this, GowU maintains a population of &lt;strong>particles&lt;/strong>&amp;mdash;lightweight agents that explore the environment in parallel. The algorithm operates in the following repeated phases:&lt;/p>
&lt;h3 id="1-parallel-rollouts">1. Parallel Rollouts&lt;/h3>
&lt;p>Each particle advances through the environment for a random number of steps using a simple local policy (no reward optimization needed). Particles that collect a reward immediately stop to secure their progress.&lt;/p>
&lt;h3 id="2-winner-selection--pruning">2. Winner Selection &amp;amp; Pruning&lt;/h3>
&lt;p>After rollouts, the algorithm identifies &lt;strong>winners&lt;/strong>&amp;mdash;the particles in states with the highest accumulated reward, using the &lt;strong>uncertainty estimate&lt;/strong> as a tiebreaker. Failed particles (those that entered &amp;ldquo;dead&amp;rdquo; states, e.g. by losing a life) are &lt;strong>reset&lt;/strong> to the winner&amp;rsquo;s state using an environment checkpoint. The uncertainty estimate, computed via &lt;strong>Random Network Distillation (RND)&lt;/strong>, is updated continuously as states are visited. RND measures novelty by training a predictor network to match the output of a fixed, random target network; because the predictor trains only on observed states, its error is large on novel states and shrinks in frequently visited regions, making the signal adaptive. Importantly, GowU is agnostic to the choice of uncertainty measure&amp;mdash;RND can be replaced by any other estimator without changing the overall framework.&lt;/p>
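&lt;p>As a concrete illustration, here is a minimal NumPy stand-in for the RND uncertainty signal. This is a sketch under simplifying assumptions (linear maps in place of deep networks; the class and method names are illustrative), not the implementation used in the paper:&lt;/p>

```python
import numpy as np

class RNDUncertainty:
    """Minimal RND-style novelty estimator (illustrative sketch).

    A fixed random 'target' map embeds states; a 'predictor' of the same
    shape is trained to match it. The prediction error serves as the
    uncertainty U(s): large on unseen states, small on familiar ones.
    """

    def __init__(self, state_dim, embed_dim=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(state_dim, embed_dim))  # frozen
        self.W_pred = 0.1 * rng.normal(size=(state_dim, embed_dim))
        self.lr = lr

    def value(self, s):
        # Uncertainty = mean squared prediction error at state s.
        err = s @ self.W_pred - s @ self.W_target
        return float(np.mean(err ** 2))

    def update(self, s):
        # One gradient step so the predictor matches the target at s;
        # repeated visits to s drive the error, and hence U(s), down.
        err = s @ self.W_pred - s @ self.W_target
        self.W_pred -= self.lr * 2.0 * np.outer(s, err) / err.size
```

&lt;p>Calling &lt;code>update&lt;/code> repeatedly on a visited state drives its error toward zero, which is exactly the behavior the search relies on: familiar states acquire low uncertainty while novel ones keep a large error.&lt;/p>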
&lt;h3 id="3-group-consolidation">3. Group Consolidation&lt;/h3>
&lt;p>Periodically, &lt;em>all&lt;/em> particles in a group are synchronized to the single best particle&amp;rsquo;s state. This collapses the population to the frontier of discovery, focusing all computational resources on the most promising region.&lt;/p>
&lt;p>Formally, winners (in both steps 2 and 3) are selected via a &lt;strong>lexicographic criterion&lt;/strong>:&lt;/p>
&lt;p>$$p_{\text{winner}} = \underset{p \in P,\, p.\text{alive}}{\text{lex-argmax}} \,(p.R,\, U(p))$$&lt;/p>
&lt;p>where $p.R$ is the cumulative reward and $U(p)$ is the uncertainty estimate (RND prediction error) at particle $p$&amp;rsquo;s state. This prioritizes reward progress, with uncertainty breaking ties&amp;mdash;especially important in sparse-reward settings where all particles typically have the same (zero) reward.&lt;/p>
&lt;h3 id="4-parallel-groups">4. Parallel Groups&lt;/h3>
&lt;p>To diversify the search and accelerate discovery, GowU runs multiple groups of particles in parallel, each conducting its own search. While these groups may execute rollouts independently, they share a single centralized uncertainty estimator (RND), which enables implicit coordination: as one group explores a region, the estimator’s uncertainty for that region decreases, discouraging other groups from revisiting it and steering the overall population toward less explored parts of the state space. To further reduce stagnation, GowU also performs a global synchronization step at the start of each iteration. It identifies a single global winner across all groups using the same reward-uncertainty criterion, and any group whose best reward is strictly lower is reset to that winner, while groups tied with the winner continue searching independently. This mechanism balances diversity with efficiency: competitive groups can continue exploring distinct trajectories, while important discoveries are propagated across the broader population.&lt;/p>
&lt;div style="display:flex; justify-content:center; margin:1em 0 0.5em;">
&lt;figure style="text-align:center; margin:0; max-width:700px;">
&lt;img src="gowu_schematic.svg" style="width:100%; border-radius:6px;" alt="GowU schematic overview">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Schematic overview of GowU with a single group of particles. The algorithm maintains a population of particles (\(p_1, p_2, p_3\)) that explore the state space via multi-step rollouts. If a particle reaches a dead state (e.g., \(p_2\) on level one), it is pruned and reset to the winner—the alive node maximizing accumulated reward, with uncertainty as a tie-breaker. After \(K\) outer steps (3 in the diagram), a Group Consolidation Reset syncs all particles to the current winner.&lt;/figcaption>
&lt;/figure>
&lt;/div>
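&lt;p>Putting phases 1&amp;ndash;3 together, a single group&amp;rsquo;s loop can be sketched as follows. The &lt;code>env&lt;/code> and uncertainty interfaces here are assumptions made for illustration (not the paper&amp;rsquo;s code); as in the paper&amp;rsquo;s setting, the simulator is assumed to allow resets to checkpointed states:&lt;/p>

```python
import random
from dataclasses import dataclass

@dataclass
class Particle:
    state: object
    R: float = 0.0
    alive: bool = True

def run_group(env, particles, uncertainty, K=3, n_outer=21, max_rollout=8):
    """One GowU group (sketch). Assumes at least one particle survives
    each round; env.step(p) advances p.state with a simple local policy,
    returns the reward, and may mark p dead."""
    for t in range(n_outer):
        # 1. Parallel rollouts for a random number of steps; a particle
        #    stops as soon as it collects a reward, securing progress.
        for p in particles:
            for _ in range(random.randint(1, max_rollout)):
                if not p.alive:
                    break
                r = env.step(p)
                uncertainty.update(p.state)
                if r > 0:
                    p.R += r
                    break
        # 2. Winner selection (reward, then uncertainty) and pruning:
        #    dead particles are reset to the winner's checkpointed state.
        winner = max((p for p in particles if p.alive),
                     key=lambda p: (p.R, uncertainty.value(p.state)))
        for p in particles:
            if not p.alive:
                p.state, p.R, p.alive = winner.state, winner.R, True
        # 3. Group consolidation every K outer steps: collapse the whole
        #    population onto the current frontier of discovery.
        if (t + 1) % K == 0:
            for p in particles:
                p.state, p.R = winner.state, winner.R
    return particles
```

&lt;p>With multiple groups, the same loop runs in parallel per group against one shared uncertainty estimator, plus the global synchronization step described above.&lt;/p>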
&lt;hr>
&lt;h2 id="from-exploration-to-deployable-policies-the-backward-algorithm">From Exploration to Deployable Policies: The Backward Algorithm&lt;/h2>
&lt;p>As noted above, GowU is a &lt;strong>training-time exploration algorithm&lt;/strong> whose output is a set of high-scoring trajectories&amp;mdash;self-generated demonstrations&amp;mdash;not a deployable policy. These discovered trajectories can then be distilled into robust, executable policies using an existing &lt;strong>backward learning&lt;/strong> algorithm.&lt;/p>
&lt;p>The backward algorithm works by initializing the agent near the &lt;em>end&lt;/em> of a discovered demonstration and training it (via PPO) to maximize reward from that point forward. As the agent masters the task suffix, the starting point is progressively moved &lt;em>backward&lt;/em> along the demonstration, creating a natural curriculum. This breaks a long-horizon problem into manageable sub-tasks, and the resulting policy operates directly from observations without any access to environment checkpoints or resets at test time.&lt;/p>
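&lt;p>The curriculum itself is a short loop. In this sketch, &lt;code>train_from&lt;/code> and &lt;code>success_rate&lt;/code> stand in for PPO training and policy evaluation from a given start state; both names, and the fixed step size, are illustrative:&lt;/p>

```python
def backward_curriculum(demo, train_from, success_rate,
                        threshold=0.8, step_back=5):
    """Sketch of the backward-learning curriculum.

    demo: list of environment checkpoints along a discovered trajectory,
    with demo[0] the environment's true start state. The start point is
    moved backward once the policy masters the current task suffix.
    """
    start = max(0, len(demo) - step_back)    # begin near the demo's end
    while start > 0:
        train_from(demo[start])
        if success_rate() >= threshold:
            # Suffix mastered: move the start point backward along
            # the demonstration, creating a natural curriculum.
            start = max(0, start - step_back)
    train_from(demo[0])   # finally, train from the true start state
```

&lt;p>Each stage is a short-horizon RL problem whose goal region the policy already knows how to reach from slightly later in the trajectory, which is what makes the long-horizon task tractable.&lt;/p>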
&lt;p>This two-phase structure&amp;mdash;&lt;strong>Phase I&lt;/strong> (exploration with GowU) followed by &lt;strong>Phase II&lt;/strong> (policy distillation via backward learning)&amp;mdash;cleanly separates the problem of &lt;em>finding&lt;/em> rewarding behavior from the problem of &lt;em>reproducing&lt;/em> it reliably.&lt;/p>
&lt;hr>
&lt;h2 id="results-an-order-of-magnitude-more-efficient">Results: An Order of Magnitude More Efficient&lt;/h2>
&lt;h3 id="hard-exploration-atari-games">Hard-Exploration Atari Games&lt;/h3>
&lt;p>GowU discovers high-scoring trajectories &lt;strong>substantially faster&lt;/strong> than prior methods, and the distilled policies achieve state-of-the-art scores by a wide margin.&lt;/p>
&lt;div style="overflow-x:auto; margin:1.2em 0;">
&lt;table style="width:100%; border-collapse:collapse; font-size:0.92em; white-space:nowrap;">
&lt;tr style="border-bottom:2px solid #e2e8f0;">
&lt;th style="text-align:left; padding:8px 12px;">Game&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">GowU (Explor.)&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">Distilled Policy&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">Go-Explore&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">RND&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">MEME&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">BYOL-Hind.&lt;/th>
&lt;/tr>
&lt;tr style="border-bottom:1px solid #e2e8f0;">
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Montezuma&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">98,753&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>182,672&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">43,791&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">8,152&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">9,429&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~14,517&lt;/td>
&lt;/tr>
&lt;tr style="border-bottom:1px solid #e2e8f0;">
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Pitfall!&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">60,600&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>97,980&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">6,945&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">−3&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">7,821&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~16,211&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Venture&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">3,330&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>5,190&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">2,281&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">1,859&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">2,583&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~2,328&lt;/td>
&lt;/tr>
&lt;/table>
&lt;/div>
&lt;p>The exploration curves tell a dramatic story&amp;mdash;GowU reaches higher rewards within a fraction of the frames required by &lt;strong>Go-Explore&lt;/strong>, the previous &lt;strong>state of the art&lt;/strong> for hard Atari exploration.&lt;/p>
&lt;div style="display:flex; gap:12px; justify-content:center; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="mont_comp.png" style="width:100%; border-radius:6px;" alt="Montezuma exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Montezuma's Revenge&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="pitfall_comp.png" style="width:100%; border-radius:6px;" alt="Pitfall exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Pitfall!&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="vent_comp.png" style="width:100%; border-radius:6px;" alt="Venture exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Venture&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;h3 id="mujoco-continuous-control-from-pixels">MuJoCo: Continuous Control from Pixels&lt;/h3>
&lt;p>We demonstrate GowU&amp;rsquo;s generality on challenging continuous-control tasks&amp;mdash;&lt;strong>directly from image observations&lt;/strong>, with &lt;strong>sparse rewards&lt;/strong>, and &lt;strong>without expert demonstrations or offline data&lt;/strong>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task&lt;/th>
&lt;th style="text-align:right">Mean Success Rate&lt;/th>
&lt;th style="text-align:right">Median&lt;/th>
&lt;th style="text-align:right">Best Run&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Hammer&lt;/strong>&lt;/td>
&lt;td style="text-align:right">99.9%&lt;/td>
&lt;td style="text-align:right">100%&lt;/td>
&lt;td style="text-align:right">100%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Door&lt;/strong>&lt;/td>
&lt;td style="text-align:right">96.4%&lt;/td>
&lt;td style="text-align:right">98.6%&lt;/td>
&lt;td style="text-align:right">99.3%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Relocate&lt;/strong>&lt;/td>
&lt;td style="text-align:right">93.9%&lt;/td>
&lt;td style="text-align:right">96.5%&lt;/td>
&lt;td style="text-align:right">97.7%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>AntMaze&lt;/strong>&lt;/td>
&lt;td style="text-align:right">86.3%&lt;/td>
&lt;td style="text-align:right">89.3%&lt;/td>
&lt;td style="text-align:right">95.1%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To the best of our knowledge, &lt;strong>no prior method has solved the Adroit tasks from pixel observations in a sparse-reward setting without expert demonstrations&lt;/strong>. We note that for the Adroit tasks, Phase I (exploration) uses a single intermediate reward based on contact with the target object (e.g., the door handle, hammer, or ball); if contact is subsequently lost, the involved particle is marked as dead.&lt;/p>
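&lt;p>This contact rule amounts to a small gate applied to each particle. A sketch of the rule described above, where the &lt;code>in_contact&lt;/code> signal and &lt;code>touched&lt;/code; flag are illustrative names:&lt;/p>

```python
def contact_gate(p, in_contact):
    """Phase I reward/death rule for the Adroit tasks (sketch).

    Grants a single intermediate reward on first contact with the target
    object, and marks the particle dead if contact is subsequently lost.
    """
    if in_contact and not getattr(p, 'touched', False):
        p.touched = True
        return 1.0                    # one-time contact reward
    if getattr(p, 'touched', False) and not in_contact:
        p.alive = False               # contact lost: prune this particle
    return 0.0
```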
&lt;hr>
&lt;h2 id="trained-agents-in-action">Trained Agents in Action&lt;/h2>
&lt;p>Below are videos of the final distilled policies operating directly from image observations:&lt;/p>
&lt;div style="display:grid; grid-template-columns:1fr 1fr; gap:16px; max-width:800px; margin:1.5em auto;">
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="door.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Door&lt;/strong>---Opening a door by its handle&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="hammer.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Hammer&lt;/strong>---Picking up a hammer and driving a nail&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="relocate.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Relocate&lt;/strong>---Grasping a ball and moving to a target&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="antMaze.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>AntMaze&lt;/strong>---Navigating a large maze&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;p>All policies use the 24-degree-of-freedom ShadowHand (Adroit) or a quadruped ant (AntMaze), operating from stacked grayscale frames&amp;mdash;no privileged state information at test time.&lt;/p>
&lt;hr>
&lt;h2 id="discussion-and-future-work">Discussion and Future Work&lt;/h2>
&lt;p>GowU shows that &lt;strong>particle-based search&lt;/strong>, guided by &lt;strong>uncertainty&lt;/strong>, can be a powerful alternative to RL-based exploration. We believe this idea has the potential to extend well beyond games and robotics.&lt;/p>
&lt;p>One natural direction is &lt;strong>reasoning in language models&lt;/strong>. In this setting, each &amp;ldquo;particle&amp;rdquo; could correspond to a language model instance exploring a distinct &lt;strong>chain of thought&lt;/strong>, with the population pruned and expanded according to a suitable measure of quality. This aligns naturally with the GowU framework: the &amp;ldquo;environment&amp;rdquo; is simply a sequence of tokens, while &lt;strong>resets&lt;/strong> correspond to rolling back to an earlier point in the generation process, which is readily feasible.&lt;/p>
&lt;p>Another promising direction is &lt;strong>scaling robotic learning in simulation&lt;/strong>. Since resets are readily available in simulated environments, GowU could support autonomous discovery across a broad range of manipulation and locomotion tasks without relying on shaped rewards or expert demonstrations. More generally, even for real-world deployment, policies are increasingly &lt;strong>pre-trained in simulation&lt;/strong>, which makes resets a practical yet still underexploited tool.&lt;/p>
&lt;hr>
&lt;h2 id="links">Links&lt;/h2>
&lt;p>📄 &lt;strong>Paper&lt;/strong>: &lt;a href="https://arxiv.org/abs/2603.22273" target="_blank" rel="noopener">arXiv:2603.22273&lt;/a>&lt;/p>
&lt;div style="font-size:0.8em; color:#94a3b8; text-align:center; margin-top:2em; padding:1em; border-top:1px solid #e2e8f0;">
&lt;em>Disclaimer: The opinions stated here are my own, not those of my company.&lt;/em>
&lt;/div></description></item><item><title>End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions</title><link>https://www.zakmhammedi.com/publication/mhammedi-bellman-complete-2026/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-bellman-complete-2026/</guid><description/></item><item><title>The Power of Resets in Online Reinforcement Learning</title><link>https://www.zakmhammedi.com/publication/mhammedi-resets-2024/</link><pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-resets-2024/</guid><description/></item><item><title>Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions</title><link>https://www.zakmhammedi.com/publication/mhammedi-linear-rl-2024/</link><pubDate>Sat, 07 Sep 2024 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-linear-rl-2024/</guid><description/></item><item><title>Efficient Model-Free Exploration in Low-Rank MDPs</title><link>https://www.zakmhammedi.com/publication/mhammedi-lrmdp-2023/</link><pubDate>Wed, 05 Jul 2023 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-lrmdp-2023/</guid><description/></item><item><title>Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL</title><link>https://www.zakmhammedi.com/publication/mhammedi-bmdp-2023/</link><pubDate>Thu, 13 Apr 2023 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-bmdp-2023/</guid><description/></item></channel></rss>