Anthea Y. Li Songlin Yang Yulu Gan Kai Zhang Hanwen Jiang Hao Tan Sai Bi
First published on Nov 23, 2025 | (Work in Progress)
<aside>
Post-training of language models, especially with Reinforcement Learning (RL), relies heavily on the base model’s distribution and rollout diversity. Recently, evolutionary strategies have been shown to be effective at improving the reasoning ability of language models as a post-training strategy, since they enjoy many benefits in sampling diversity and training stability. However, they are sample-inefficient compared to gradient-based approaches (e.g., traditional RL and SFT). We explore ways to make evolutionary RL more efficient, focusing on the sample efficiency of low-rank parameter updates. We zoom in on the parameters that determine the effectiveness and stability of low-rank ES: population size, noise scale, and rank. Additionally, we also explore quantization, which pairs naturally with inference-only update schemes since there is no training-inference mismatch.
</aside>
Traditional RL and SFT methods work through gradient-based optimization, using derivatives with respect to the network parameters $\nabla_\theta J(\theta)$ to directly adjust $\theta$. Evolutionary Strategies (ES) instead update a distribution over the parameter space using rewards from a population sampled from that distribution [1]. A popular choice of distribution is Gaussian:
<aside>
Require: Pre-trained LLM with initial parameters $\theta_0$, reward function R(·), total iterations T, population size N, noise scale σ, learning rate α.
For t = 1 to T do
  For n = 1 to N do
    Sample noise $\varepsilon_n \sim \mathcal{N}(0, \mathbb{I})$
    Compute reward for perturbed parameters $R_n = R(\theta_{t-1} + \sigma \cdot \varepsilon_n)$
  Normalize $R_n \leftarrow$ group-relative normalization over the population
  Update model parameters as $\theta_t \leftarrow \theta_{t-1} + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \varepsilon_n$
</aside>
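To make the loop concrete, here is a minimal single-step sketch in PyTorch. It assumes a black-box `reward_fn` that scores a flat parameter vector; the function name and hyperparameter defaults are illustrative placeholders, not values from this work.

```python
import torch

def es_step(theta, reward_fn, pop_size=32, sigma=1e-3, lr=1e-2):
    """One vanilla ES update: perturb, score, normalize, aggregate."""
    eps = torch.randn(pop_size, theta.numel())                     # eps_n ~ N(0, I)
    rewards = torch.tensor([reward_fn(theta + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative normalization
    step = (rewards[:, None] * eps).mean(dim=0)                    # (1/N) sum_n R_n * eps_n
    return theta + lr * step
```

In a distributed setting, each worker can regenerate its $\varepsilon_n$ from a shared seed, so only scalar rewards need to be exchanged between workers [1].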
From a gradient perspective, ES uses a zeroth-order estimate of the gradient of the objective $J(\theta)$:
$$ \hat{g} = \frac{1}{N \sigma} \sum_{i=1}^{N} R_i \varepsilon_i, \quad \quad J(\theta) = \mathbb{E}_\varepsilon [f(\theta + \sigma \varepsilon)] $$
$\hat{g}$ denotes the gradient estimate used for the update step, and $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$. We use $\phi$ to denote the perturbed parameters $\theta + \sigma \varepsilon$.
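The unbiasedness claim follows from the standard Gaussian-smoothing identity: writing the expectation against the Gaussian density and differentiating under the integral gives

$$ \nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \mathbb{I})}\big[ f(\theta + \sigma \varepsilon) \big] = \frac{1}{\sigma} \, \mathbb{E}_{\varepsilon}\big[ f(\theta + \sigma \varepsilon) \, \varepsilon \big], $$

so the Monte Carlo average $\hat{g}$ over $N$ sampled perturbations is an unbiased estimator of $\nabla_\theta J(\theta)$, the gradient of the smoothed objective (not of $f$ itself).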
Perturbing all parameters is expensive. LoRA [3] was introduced to approximate updates to high-dimensional weight matrices of large models with skinny low-rank factors. The same technique can be adopted for ES by sampling low-rank noise matrices. LoRA-style noise samples are attractive because many perturbed population members can be batch-served, as sketched below; we examine several aspects of using low-rank noise samples for efficient ES.
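As a sketch of the batch-serving argument (shapes follow the algorithm below; the function and variable names are illustrative): all $N$ population members share the frozen weight $W$ and differ only in a small batched rank-$r$ adapter, so a single batched forward pass serves the whole population.

```python
import torch

def batched_lora_forward(x, W, A, B, sigma):
    """Serve N perturbed population members in one batched forward pass.

    x: (N, batch, m) inputs, one slice per population member
    W: (n, m) frozen base weight, shared by all members
    A: (N, m, r), B: (N, n, r) low-rank Gaussian noise samples
    """
    r = A.shape[-1]
    base = torch.einsum('nm,pbm->pbn', W, x)               # shared-weight W x term
    noise = torch.einsum('pnr,pmr,pbm->pbn', B, A, x)      # per-member B_i A_i^T x
    return base + (sigma / r ** 0.5) * noise
```

The frozen weight $W$ is stored once and shared across the population; each member only adds its own rank-$r$ perturbation, so $N$ perturbed models fit in memory together and can be served in one batch.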
The LoRA-style ES algorithm is straightforward: instead of full-rank noise, sample low-rank Gaussian noise factors.
<aside>
Require: Pre-trained LLM with initial parameters $\theta_0$, reward function R(·), total iterations T, population size N, noise scale σ, learning rate α, LoRA rank r.
For t = 1 to T do
  For i = 1 to N do
    Sample low-rank Gaussian noise $A_i \sim \mathcal{N}_{m \times r}(0, \mathbb{I})$, $B_i \sim \mathcal{N}_{n \times r}(0, \mathbb{I})$
    Compute reward for perturbed parameters $R_i = R\!\left(\theta_{t-1} + \sigma \cdot \tfrac{1}{\sqrt{r}} B_i A_i^{\top}\right)$ (applied in the forward pass as $\theta_{t-1} x + \sigma \cdot \tfrac{1}{\sqrt{r}} B_i A_i^{\top} x$)
  Normalize $R_i \leftarrow$ group-relative normalization over the population
  Update model parameters as $\theta_t \leftarrow \theta_{t-1} + \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} R_i B_i A_i^{\top}$
</aside>
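Putting the pieces together, a single-step sketch for one weight matrix. The same caveats apply: `reward_fn` and the hyperparameter defaults are illustrative, and the $1/\sqrt{r}$ scaling is folded into the sampled perturbation.

```python
import torch

def lora_es_step(W, reward_fn, pop_size=32, rank=4, sigma=1e-3, lr=1e-2):
    """One low-rank ES update for a single weight matrix W of shape (n, m)."""
    n, m = W.shape
    A = torch.randn(pop_size, m, rank)                              # A_i ~ N_{m x r}(0, I)
    B = torch.randn(pop_size, n, rank)                              # B_i ~ N_{n x r}(0, I)
    deltas = torch.einsum('pnr,pmr->pnm', B, A) / rank ** 0.5       # (1/sqrt(r)) B_i A_i^T
    rewards = torch.tensor([reward_fn(W + sigma * d) for d in deltas])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative normalization
    step = torch.einsum('p,pnm->nm', rewards, deltas) / pop_size    # (1/N) sum_i R_i (1/sqrt(r)) B_i A_i^T
    return W + lr * step
```

For clarity this sketch materializes the full $n \times m$ perturbations; in practice only the factors $A_i, B_i$ (or their seeds) and the scalar rewards need to be kept per member, with the dense update formed once at aggregation time.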