<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sample-Efficient Exploration | Zak Mhammedi</title><link>https://www.zakmhammedi.com/tag/sample-efficient-exploration/</link><atom:link href="https://www.zakmhammedi.com/tag/sample-efficient-exploration/index.xml" rel="self" type="application/rss+xml"/><description>Sample-Efficient Exploration</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 31 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.zakmhammedi.com/images/icon_hu059ea760e479e03fc73162b2a6fdb304_1110447_512x512_fill_lanczos_center_2.png</url><title>Sample-Efficient Exploration</title><link>https://www.zakmhammedi.com/tag/sample-efficient-exploration/</link></image><item><title>GowU: Uncertainty-Guided Tree Search for Hard Exploration</title><link>https://www.zakmhammedi.com/post/gowu-exploration/</link><pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/post/gowu-exploration/</guid><description>&lt;style>
.article-header figcaption, .featured-image figcaption, .article-header .caption, .featured-image .caption { text-align: center !important; }
#floating-toc { position: fixed; top: 100px; left: max(10px, calc((100vw - 820px)/2 - 260px)); width: 220px; max-height: calc(100vh - 140px); overflow-y: auto; padding: 16px; background: #f8fafc; border-left: 3px solid #0ea5e9; border-radius: 0 8px 8px 0; font-size: 0.82em; line-height: 1.6; z-index: 100; }
#floating-toc a { color: #64748b; text-decoration: none; display: block; padding: 4px 0; transition: all 0.2s; border-left: 2px solid transparent; padding-left: 8px; }
#floating-toc a:hover { color: #0ea5e9; }
#floating-toc a.active { color: #0ea5e9; font-weight: 600; border-left-color: #0ea5e9; }
#floating-toc .toc-title { font-weight: 700; font-size: 0.95em; margin-bottom: 8px; color: #0ea5e9; }
@media (max-width: 1200px) { #floating-toc { display: none; } }
&lt;/style>
&lt;div id="floating-toc">
&lt;div class="toc-title">Contents&lt;/div>
&lt;/div>
&lt;script>
document.addEventListener('DOMContentLoaded', function() {
var toc = document.getElementById('floating-toc');
if (!toc) return;
var headings = document.querySelectorAll('.article-container h2');
headings.forEach(function(h) {
if (!h.id) h.id = h.textContent.trim().toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/-+$/,'');
var a = document.createElement('a');
a.href = '#' + h.id;
a.textContent = h.textContent;
toc.appendChild(a);
});
var links = toc.querySelectorAll('a');
function updateActive() {
var scrollPos = window.scrollY + 120;
var current = null;
headings.forEach(function(h) { if (h.offsetTop &lt;= scrollPos) current = h; });
links.forEach(function(a) {
a.classList.toggle('active', current &amp;&amp; a.getAttribute('href') === '#' + current.id);
});
}
window.addEventListener('scroll', updateActive);
updateActive();
});
&lt;/script>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;div style="background:linear-gradient(135deg, #f0f9ff 0%, #e0f2fe 100%); border-left:4px solid #0ea5e9; padding:1.2em 1.5em; border-radius:0 8px 8px 0; margin:1.5em 0; color:#1e293b;">
&lt;ul>
&lt;li>🔍 &lt;strong>Exploration ≠ Policy Optimization&lt;/strong>&amp;mdash;Treating exploration as search, rather than optimizing an exploration objective via RL, yields an order-of-magnitude efficiency gain&lt;/li>
&lt;li>🌳 &lt;strong>Particle-Based Search + Uncertainty&lt;/strong>&amp;mdash;A population of particles explores in parallel, guided by an uncertainty signal that concentrates computation on the most promising frontiers&lt;/li>
&lt;li>🎮 &lt;strong>State-of-the-Art on Hard Atari&lt;/strong>&amp;mdash;Significantly beating the previous state of the art on Montezuma&amp;rsquo;s Revenge, Pitfall!, and Venture&lt;/li>
&lt;li>🤖 &lt;strong>First Pixel-Based Solutions for Sparse MuJoCo&lt;/strong>&amp;mdash;Adroit dexterous manipulation (Door, Hammer, Relocate) and AntMaze solved from images with sparse rewards and no demonstrations&lt;/li>
&lt;/ul>
&lt;/div>
&lt;h2 id="the-exploration-problem">The Exploration Problem&lt;/h2>
&lt;p>Consider being placed in an unfamiliar building without a map and asked to find a particular room. To succeed, you would need to move through the building, try different doors, and keep track of where you have already been. This captures the basic challenge of &lt;strong>exploration&lt;/strong>: gathering new information in a structured way in order to achieve a goal. Despite decades of research, efficient exploration remains a central bottleneck in reinforcement learning. When environmental feedback is extremely &lt;strong>sparse&lt;/strong>, existing methods often struggle to explore effectively.&lt;/p>
&lt;p>Existing approaches to exploration typically use reinforcement learning (RL) to train a &lt;strong>policy&lt;/strong> to maximize an &lt;strong>exploration objective&lt;/strong>&amp;mdash;one designed so that optimizing it encourages the agent to visit new regions of the state space. Despite significant progress, these methods still fail in the hardest settings. We suggest that this is, in part, because &lt;strong>policy optimization&lt;/strong> itself may not be the right tool for exploration.&lt;/p>
&lt;hr>
&lt;h2 id="a-new-paradigm-search-instead-of-policy-optimization">A New Paradigm: Search Instead of Policy Optimization&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Policy optimization is necessary for precise task execution, but employing such machinery solely to expand state coverage may be inefficient.&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;p>In this work, we propose a fundamentally new approach to exploration. Rather than making exploration depend on optimizing a non-stationary exploration objective, we &lt;strong>decouple&lt;/strong> the two. Our method, &lt;strong>Go-With-Uncertainty (GowU)&lt;/strong>, treats exploration as a &lt;strong>search problem&lt;/strong> carried out by a population of particles, while remaining agnostic to the local policy used by each particle. Inspired by the &lt;a href="https://people.eecs.berkeley.edu/~vazirani/pubs/winners.pdf" target="_blank" rel="noopener">Go-With-The-Winner&lt;/a> strategy, GowU maintains a population of particles that explore the environment in parallel, guided by a measure of &lt;strong>uncertainty&lt;/strong>. The idea is simple: states that the particles have visited often have low uncertainty, while rarely visited states have high uncertainty. GowU uses this signal to steer the search toward under-explored regions of the environment.&lt;/p>
&lt;p>We evaluate GowU on some of the hardest exploration benchmarks, including Atari games like &lt;strong>Montezuma&amp;rsquo;s Revenge&lt;/strong>, &lt;strong>Pitfall!&lt;/strong>, and &lt;strong>Venture&lt;/strong>, as well as challenging &lt;strong>robotic manipulation&lt;/strong> and navigation tasks&amp;mdash;all with extremely sparse reward signals.&lt;/p>
&lt;div style="display:flex; gap:12px; justify-content:center; align-items:flex-end; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="pitfall.jpeg" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Pitfall!">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Pitfall!&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="montezuma_revenge.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Montezuma's Revenge">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Montezuma's Revenge&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="venture.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Venture">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Venture&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;div style="display:flex; gap:12px; justify-content:center; align-items:flex-end; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="door-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Door">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Door&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="hammer-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Hammer">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Hammer&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="relocate-task.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="Relocate">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Relocate&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="antmaze-top.png" style="width:100%; border-radius:8px; border:1px solid #333;" alt="AntMaze">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">AntMaze&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;hr>
&lt;h2 id="how-gowu-works">How GowU Works&lt;/h2>
&lt;p>At its core, GowU is an exploration algorithm whose goal is to generate high-scoring or task-completing trajectories. It does not produce a deployable policy directly&amp;mdash;instead, it systematically searches for demonstrations that can later be distilled into executable policies (more on this below). To do this, GowU maintains a population of &lt;strong>particles&lt;/strong>&amp;mdash;lightweight agents that explore the environment in parallel. The algorithm operates in the following repeated phases:&lt;/p>
&lt;h3 id="1-parallel-rollouts">1. Parallel Rollouts&lt;/h3>
&lt;p>Each particle advances through the environment for a random number of steps using a simple local policy (no reward optimization needed). Particles that collect a reward immediately stop to secure their progress.&lt;/p>
&lt;h3 id="2-winner-selection--pruning">2. Winner Selection &amp;amp; Pruning&lt;/h3>
&lt;p>After rollouts, the algorithm identifies &lt;strong>winners&lt;/strong>&amp;mdash;the particles in states with the highest accumulated reward, using the &lt;strong>uncertainty estimate&lt;/strong> as a tiebreaker. Failed particles (those that entered &amp;ldquo;dead&amp;rdquo; states, e.g. by losing a life) are &lt;strong>reset&lt;/strong> to the winner&amp;rsquo;s state using an environment checkpoint. The uncertainty estimate, computed via &lt;strong>Random Network Distillation (RND)&lt;/strong>, is updated continuously as states are visited. RND measures novelty by training a predictor network to match the output of a fixed, random target network; because the predictor trains only on observed states, its error is large on novel states and shrinks in frequently visited regions, making the signal adaptive. Importantly, GowU is agnostic to the choice of uncertainty measure&amp;mdash;RND can be replaced by any other estimator without changing the overall framework.&lt;/p>
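&lt;p>As a concrete illustration, here is a minimal NumPy stand-in for the RND uncertainty signal. This is a sketch under simplifying assumptions (linear maps in place of deep networks; the class and method names are illustrative), not the implementation used in the paper:&lt;/p>

```python
import numpy as np

class RNDUncertainty:
    """Minimal RND-style novelty estimator (illustrative sketch).

    A fixed random 'target' map embeds states; a 'predictor' of the same
    shape is trained to match it. The prediction error serves as the
    uncertainty U(s): large on unseen states, small on familiar ones.
    """

    def __init__(self, state_dim, embed_dim=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(state_dim, embed_dim))  # frozen
        self.W_pred = 0.1 * rng.normal(size=(state_dim, embed_dim))
        self.lr = lr

    def value(self, s):
        # Uncertainty = mean squared prediction error at state s.
        err = s @ self.W_pred - s @ self.W_target
        return float(np.mean(err ** 2))

    def update(self, s):
        # One gradient step so the predictor matches the target at s;
        # repeated visits to s drive the error, and hence U(s), down.
        err = s @ self.W_pred - s @ self.W_target
        self.W_pred -= self.lr * 2.0 * np.outer(s, err) / err.size
```

&lt;p>Calling &lt;code>update&lt;/code> repeatedly on a visited state drives its error toward zero, which is exactly the behavior the search relies on: familiar states acquire low uncertainty while novel ones keep a large error.&lt;/p>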
&lt;h3 id="3-group-consolidation">3. Group Consolidation&lt;/h3>
&lt;p>Periodically, &lt;em>all&lt;/em> particles in a group are synchronized to the single best particle&amp;rsquo;s state. This collapses the population to the frontier of discovery, focusing all computational resources on the most promising region.&lt;/p>
&lt;p>Formally, winners (in both steps 2 and 3) are selected via a &lt;strong>lexicographic criterion&lt;/strong>:&lt;/p>
&lt;p>$$p_{\text{winner}} = \underset{p \in P,\, p.\text{alive}}{\text{lex-argmax}} \,(p.R,\, U(p))$$&lt;/p>
&lt;p>where $p.R$ is the cumulative reward and $U(p)$ is the uncertainty estimate (RND prediction error) at particle $p$&amp;rsquo;s state. This prioritizes reward progress, with uncertainty breaking ties&amp;mdash;especially important in sparse-reward settings where all particles typically have the same (zero) reward.&lt;/p>
&lt;h3 id="4-parallel-groups">4. Parallel Groups&lt;/h3>
&lt;p>To diversify the search and accelerate discovery, GowU runs multiple groups of particles in parallel, each conducting its own search. While these groups may execute rollouts independently, they share a single centralized uncertainty estimator (RND), which enables implicit coordination: as one group explores a region, the estimator’s uncertainty for that region decreases, discouraging other groups from revisiting it and steering the overall population toward less explored parts of the state space. To further reduce stagnation, GowU also performs a global synchronization step at the start of each iteration. It identifies a single global winner across all groups using the same reward-uncertainty criterion, and any group whose best reward is strictly lower is reset to that winner, while groups tied with the winner continue searching independently. This mechanism balances diversity with efficiency: competitive groups can continue exploring distinct trajectories, while important discoveries are propagated across the broader population.&lt;/p>
&lt;div style="display:flex; justify-content:center; margin:1em 0 0.5em;">
&lt;figure style="text-align:center; margin:0; max-width:700px;">
&lt;img src="gowu_schematic.svg" style="width:100%; border-radius:6px;" alt="GowU schematic overview">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Schematic overview of GowU with a single group of particles. The algorithm maintains a population of particles (\(p_1, p_2, p_3\)) that explore the state space via multi-step rollouts. If a particle reaches a dead state (e.g., \(p_2\) on level one), it is pruned and reset to the winner—the alive node maximizing accumulated reward, with uncertainty as a tie-breaker. After \(K\) outer steps (3 in the diagram), a Group Consolidation Reset syncs all particles to the current winner.&lt;/figcaption>
&lt;/figure>
&lt;/div>
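&lt;p>Putting phases 1&amp;ndash;3 together, a single group&amp;rsquo;s loop can be sketched as follows. The &lt;code>env&lt;/code> and uncertainty interfaces here are assumptions made for illustration (not the paper&amp;rsquo;s code); as in the paper&amp;rsquo;s setting, the simulator is assumed to allow resets to checkpointed states:&lt;/p>

```python
import random
from dataclasses import dataclass

@dataclass
class Particle:
    state: object
    R: float = 0.0
    alive: bool = True

def run_group(env, particles, uncertainty, K=3, n_outer=21, max_rollout=8):
    """One GowU group (sketch). Assumes at least one particle survives
    each round; env.step(p) advances p.state with a simple local policy,
    returns the reward, and may mark p dead."""
    for t in range(n_outer):
        # 1. Parallel rollouts for a random number of steps; a particle
        #    stops as soon as it collects a reward, securing progress.
        for p in particles:
            for _ in range(random.randint(1, max_rollout)):
                if not p.alive:
                    break
                r = env.step(p)
                uncertainty.update(p.state)
                if r > 0:
                    p.R += r
                    break
        # 2. Winner selection (reward, then uncertainty) and pruning:
        #    dead particles are reset to the winner's checkpointed state.
        winner = max((p for p in particles if p.alive),
                     key=lambda p: (p.R, uncertainty.value(p.state)))
        for p in particles:
            if not p.alive:
                p.state, p.R, p.alive = winner.state, winner.R, True
        # 3. Group consolidation every K outer steps: collapse the whole
        #    population onto the current frontier of discovery.
        if (t + 1) % K == 0:
            for p in particles:
                p.state, p.R = winner.state, winner.R
    return particles
```

&lt;p>With multiple groups, the same loop runs in parallel per group against one shared uncertainty estimator, plus the global synchronization step described above.&lt;/p>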
&lt;hr>
&lt;h2 id="from-exploration-to-deployable-policies-the-backward-algorithm">From Exploration to Deployable Policies: The Backward Algorithm&lt;/h2>
&lt;p>As noted above, GowU is a &lt;strong>training-time exploration algorithm&lt;/strong> whose output is a set of high-scoring trajectories&amp;mdash;self-generated demonstrations&amp;mdash;not a deployable policy. These discovered trajectories can then be distilled into robust, executable policies using an existing &lt;strong>backward learning&lt;/strong> algorithm.&lt;/p>
&lt;p>The backward algorithm works by initializing the agent near the &lt;em>end&lt;/em> of a discovered demonstration and training it (via PPO) to maximize reward from that point forward. As the agent masters the task suffix, the starting point is progressively moved &lt;em>backward&lt;/em> along the demonstration, creating a natural curriculum. This breaks a long-horizon problem into manageable sub-tasks, and the resulting policy operates directly from observations without any access to environment checkpoints or resets at test time.&lt;/p>
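&lt;p>The curriculum itself is a short loop. In this sketch, &lt;code>train_from&lt;/code> and &lt;code>success_rate&lt;/code> stand in for PPO training and policy evaluation from a given start state; both names, and the fixed step size, are illustrative:&lt;/p>

```python
def backward_curriculum(demo, train_from, success_rate,
                        threshold=0.8, step_back=5):
    """Sketch of the backward-learning curriculum.

    demo: list of environment checkpoints along a discovered trajectory,
    with demo[0] the environment's true start state. The start point is
    moved backward once the policy masters the current task suffix.
    """
    start = max(0, len(demo) - step_back)    # begin near the demo's end
    while start > 0:
        train_from(demo[start])
        if success_rate() >= threshold:
            # Suffix mastered: move the start point backward along
            # the demonstration, creating a natural curriculum.
            start = max(0, start - step_back)
    train_from(demo[0])   # finally, train from the true start state
```

&lt;p>Each stage is a short-horizon RL problem whose goal region the policy already knows how to reach from slightly later in the trajectory, which is what makes the long-horizon task tractable.&lt;/p>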
&lt;p>This two-phase structure&amp;mdash;&lt;strong>Phase I&lt;/strong> (exploration with GowU) followed by &lt;strong>Phase II&lt;/strong> (policy distillation via backward learning)&amp;mdash;cleanly separates the problem of &lt;em>finding&lt;/em> rewarding behavior from the problem of &lt;em>reproducing&lt;/em> it reliably.&lt;/p>
&lt;hr>
&lt;h2 id="results-an-order-of-magnitude-more-efficient">Results: An Order of Magnitude More Efficient&lt;/h2>
&lt;h3 id="hard-exploration-atari-games">Hard-Exploration Atari Games&lt;/h3>
&lt;p>GowU discovers high-scoring trajectories &lt;strong>substantially faster&lt;/strong> than prior methods, and the distilled policies achieve state-of-the-art scores by a wide margin.&lt;/p>
&lt;div style="overflow-x:auto; margin:1.2em 0;">
&lt;table style="width:100%; border-collapse:collapse; font-size:0.92em; white-space:nowrap;">
&lt;tr style="border-bottom:2px solid #e2e8f0;">
&lt;th style="text-align:left; padding:8px 12px;">Game&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">GowU (Explor.)&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">Distilled Policy&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">Go-Explore&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">RND&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">MEME&lt;/th>
&lt;th style="text-align:right; padding:8px 12px;">BYOL-Hind.&lt;/th>
&lt;/tr>
&lt;tr style="border-bottom:1px solid #e2e8f0;">
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Montezuma&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">98,753&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>182,672&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">43,791&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">8,152&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">9,429&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~14,517&lt;/td>
&lt;/tr>
&lt;tr style="border-bottom:1px solid #e2e8f0;">
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Pitfall!&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">60,600&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>97,980&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">6,945&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">−3&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">7,821&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~16,211&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left; padding:8px 12px;">&lt;strong>Venture&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">3,330&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">&lt;strong>5,190&lt;/strong>&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">2,281&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">1,859&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">2,583&lt;/td>
&lt;td style="text-align:right; padding:8px 12px;">~2,328&lt;/td>
&lt;/tr>
&lt;/table>
&lt;/div>
&lt;p>The exploration curves tell a dramatic story&amp;mdash;GowU reaches higher rewards within a fraction of the frames required by &lt;strong>Go-Explore&lt;/strong>, the previous &lt;strong>state of the art&lt;/strong> for hard Atari exploration.&lt;/p>
&lt;div style="display:flex; gap:12px; justify-content:center; margin:1.5em 0;">
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="mont_comp.png" style="width:100%; border-radius:6px;" alt="Montezuma exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Montezuma's Revenge&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="pitfall_comp.png" style="width:100%; border-radius:6px;" alt="Pitfall exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Pitfall!&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0; flex:1;">
&lt;img src="vent_comp.png" style="width:100%; border-radius:6px;" alt="Venture exploration">
&lt;figcaption style="font-size:0.85em; color:#888; margin-top:4px;">Venture&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;h3 id="mujoco-continuous-control-from-pixels">MuJoCo: Continuous Control from Pixels&lt;/h3>
&lt;p>We demonstrate GowU&amp;rsquo;s generality on challenging continuous-control tasks&amp;mdash;&lt;strong>directly from image observations&lt;/strong>, with &lt;strong>sparse rewards&lt;/strong>, and &lt;strong>without expert demonstrations or offline data&lt;/strong>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task&lt;/th>
&lt;th style="text-align:right">Mean Success Rate&lt;/th>
&lt;th style="text-align:right">Median&lt;/th>
&lt;th style="text-align:right">Best Run&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Hammer&lt;/strong>&lt;/td>
&lt;td style="text-align:right">99.9%&lt;/td>
&lt;td style="text-align:right">100%&lt;/td>
&lt;td style="text-align:right">100%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Door&lt;/strong>&lt;/td>
&lt;td style="text-align:right">96.4%&lt;/td>
&lt;td style="text-align:right">98.6%&lt;/td>
&lt;td style="text-align:right">99.3%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Relocate&lt;/strong>&lt;/td>
&lt;td style="text-align:right">93.9%&lt;/td>
&lt;td style="text-align:right">96.5%&lt;/td>
&lt;td style="text-align:right">97.7%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>AntMaze&lt;/strong>&lt;/td>
&lt;td style="text-align:right">86.3%&lt;/td>
&lt;td style="text-align:right">89.3%&lt;/td>
&lt;td style="text-align:right">95.1%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To the best of our knowledge, &lt;strong>no prior method has solved the Adroit tasks from pixel observations in a sparse-reward setting without expert demonstrations&lt;/strong>. We note that for the Adroit tasks, Phase I (exploration) uses a single intermediate reward based on contact with the target object (e.g., the door handle, hammer, or ball); if contact is subsequently lost, the involved particle is marked as dead.&lt;/p>
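&lt;p>This contact rule amounts to a small gate applied to each particle. A sketch of the rule described above, where the &lt;code>in_contact&lt;/code> signal and &lt;code>touched&lt;/code; flag are illustrative names:&lt;/p>

```python
def contact_gate(p, in_contact):
    """Phase I reward/death rule for the Adroit tasks (sketch).

    Grants a single intermediate reward on first contact with the target
    object, and marks the particle dead if contact is subsequently lost.
    """
    if in_contact and not getattr(p, 'touched', False):
        p.touched = True
        return 1.0                    # one-time contact reward
    if getattr(p, 'touched', False) and not in_contact:
        p.alive = False               # contact lost: prune this particle
    return 0.0
```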
&lt;hr>
&lt;h2 id="trained-agents-in-action">Trained Agents in Action&lt;/h2>
&lt;p>Below are videos of the final distilled policies operating directly from image observations:&lt;/p>
&lt;div style="display:grid; grid-template-columns:1fr 1fr; gap:16px; max-width:800px; margin:1.5em auto;">
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="door.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Door&lt;/strong>---Opening a door by its handle&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="hammer.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Hammer&lt;/strong>---Picking up a hammer and driving a nail&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="relocate.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>Relocate&lt;/strong>---Grasping a ball and moving to a target&lt;/figcaption>
&lt;/figure>
&lt;figure style="text-align:center; margin:0;">
&lt;video autoplay loop muted playsinline style="width:100%; border-radius:10px; border:2px solid #2a2a3a; box-shadow: 0 4px 12px rgba(0,0,0,0.3);">
&lt;source src="antMaze.mp4" type="video/mp4">
&lt;/video>
&lt;figcaption style="font-size:0.9em; color:#888; margin-top:6px;">&lt;strong>AntMaze&lt;/strong>---Navigating a large maze&lt;/figcaption>
&lt;/figure>
&lt;/div>
&lt;p>All policies use the 24-degree-of-freedom ShadowHand (Adroit) or a quadruped ant (AntMaze), operating from stacked grayscale frames&amp;mdash;no privileged state information at test time.&lt;/p>
&lt;hr>
&lt;h2 id="discussion-and-future-work">Discussion and Future Work&lt;/h2>
&lt;p>GowU shows that &lt;strong>particle-based search&lt;/strong>, guided by &lt;strong>uncertainty&lt;/strong>, can be a powerful alternative to RL-based exploration. We believe this idea has the potential to extend well beyond games and robotics.&lt;/p>
&lt;p>One natural direction is &lt;strong>reasoning in language models&lt;/strong>. In this setting, each &amp;ldquo;particle&amp;rdquo; could correspond to a language model instance exploring a distinct &lt;strong>chain of thought&lt;/strong>, with the population pruned and expanded according to a suitable measure of quality. This aligns naturally with the GowU framework: the &amp;ldquo;environment&amp;rdquo; is simply a sequence of tokens, while &lt;strong>resets&lt;/strong> correspond to rolling back to an earlier point in the generation process, which is readily feasible.&lt;/p>
&lt;p>Another promising direction is &lt;strong>scaling robotic learning in simulation&lt;/strong>. Since resets are readily available in simulated environments, GowU could support autonomous discovery across a broad range of manipulation and locomotion tasks without relying on shaped rewards or expert demonstrations. More generally, even for real-world deployment, policies are increasingly &lt;strong>pre-trained in simulation&lt;/strong>, which makes resets a practical yet still underexploited tool.&lt;/p>
&lt;hr>
&lt;h2 id="links">Links&lt;/h2>
&lt;p>📄 &lt;strong>Paper&lt;/strong>: &lt;a href="https://arxiv.org/abs/2603.22273" target="_blank" rel="noopener">arXiv:2603.22273&lt;/a>&lt;/p>
&lt;div style="font-size:0.8em; color:#94a3b8; text-align:center; margin-top:2em; padding:1em; border-top:1px solid #e2e8f0;">
&lt;em>Disclaimer: The opinions stated here are my own, not those of my company.&lt;/em>
&lt;/div></description></item><item><title>End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions</title><link>https://www.zakmhammedi.com/publication/mhammedi-bellman-complete-2026/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-bellman-complete-2026/</guid><description/></item><item><title>The Power of Resets in Online Reinforcement Learning</title><link>https://www.zakmhammedi.com/publication/mhammedi-resets-2024/</link><pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-resets-2024/</guid><description/></item><item><title>Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions</title><link>https://www.zakmhammedi.com/publication/mhammedi-linear-rl-2024/</link><pubDate>Sat, 07 Sep 2024 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-linear-rl-2024/</guid><description/></item><item><title>Efficient Model-Free Exploration in Low-Rank MDPs</title><link>https://www.zakmhammedi.com/publication/mhammedi-lrmdp-2023/</link><pubDate>Wed, 05 Jul 2023 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-lrmdp-2023/</guid><description/></item><item><title>Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL</title><link>https://www.zakmhammedi.com/publication/mhammedi-bmdp-2023/</link><pubDate>Thu, 13 Apr 2023 00:00:00 +0000</pubDate><guid>https://www.zakmhammedi.com/publication/mhammedi-bmdp-2023/</guid><description/></item></channel></rss>