SPACeR: Self-Play Anchoring with Centralized Reference Models

1Applied Intuition, 2University of California, Berkeley, 3New York University, 4Stanford University
*Work done during internship at Applied Intuition.

Abstract

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose human-like self-play, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and a KL-divergence regularizer, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10× faster at inference and 50× smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.

Overview of SPACeR

Existing approaches face a fundamental trade-off:

  • Imitation learning (IL): Realistic and human-like, but tokenized or diffusion models are large (GPU memory heavy), slow, and difficult to scale.
  • Self-play reinforcement learning (RL): Efficient and scalable, but requires reward shaping and often diverges from human norms.

SPACeR is an RL-first approach designed to bridge these gaps—combining the scalability of self-play with the realism of IL. Our policies are ~50× smaller than tokenized models and run 10× faster (or more!), enabling lightweight, human-like multi-agent simulation.

We propose to anchor self-play reinforcement learning to a pretrained tokenized reference model, which provides a human-likeness distributional signal. The resulting SPACeR policy is decentralized and conditioned only on local observations, while the reference model is centralized and conditioned on the full scene context—allowing scalable training without sacrificing realism.
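As a rough sketch of this asymmetry (not the authors' implementation; module names, observation encodings, and dimensions are illustrative assumptions), the learned policy consumes only each agent's local observation, while the reference model consumes the full scene context:

```python
import torch
import torch.nn as nn

class DecentralizedPolicy(nn.Module):
    """pi_theta(a^i | o^i): a small per-agent network over local observations."""
    def __init__(self, obs_dim=64, num_tokens=200, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_tokens),
        )

    def forward(self, local_obs):                 # (num_agents, obs_dim)
        return self.net(local_obs)                # per-agent action logits

class CentralizedReference(nn.Module):
    """pi_ref(a^i | s): stand-in for a pretrained tokenized scene model."""
    def __init__(self, scene_dim=512, num_agents=32, num_tokens=200):
        super().__init__()
        self.head = nn.Linear(scene_dim, num_agents * num_tokens)
        self.num_agents, self.num_tokens = num_agents, num_tokens

    def forward(self, scene_context):             # (scene_dim,) full-scene encoding
        logits = self.head(scene_context)
        return logits.view(self.num_agents, self.num_tokens)
```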


Human-like Self-Play

Formally, we augment self-play reinforcement learning with a pretrained reference policy \(\pi_{\text{ref}}\) that captures the human driving distribution and provides a realism signal during training. The overall objective is

\[ r_t = r_t^{\text{task}} + \alpha \, r_{\text{humanlike}}(s_t, a_t), \quad L(\theta) = L_{\text{PPO}}(\theta; A[r]) - \beta \, D_{\text{KL}}\!\left(\pi_{\text{ref}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid o_t)\right), \]

where (1) \(L_{\text{PPO}}\) optimizes task performance (goal-reaching, collision avoidance, off-road avoidance), (2) the human-likeness reward \(r_{\text{humanlike}}(s_t, a_t) = \log \pi_{\text{ref}}(a_t \mid s_t)\) provides dense per-timestep likelihood feedback, and (3) the distributional alignment term enforces KL-regularization between the reference and learned policies. This formulation ensures agents learn from experience in a closed-loop manner while also remaining closely aligned with realistic human driving behaviors.
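A minimal sketch of how the reward augmentation and KL term above might be computed, assuming both policies output categorical logits over a shared motion-token vocabulary (the function names, `alpha`, `beta`, and tensor shapes are placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def anchored_reward(task_reward, ref_logits, action, alpha=0.1):
    """r_t = r_t^task + alpha * log pi_ref(a_t | s_t), for a single agent."""
    ref_logp = F.log_softmax(ref_logits, dim=-1)      # log pi_ref(. | s_t)
    humanlike = ref_logp[action]                      # log-likelihood of the taken action
    return task_reward + alpha * humanlike

def kl_regularizer(ref_logits, policy_logits, beta=0.01):
    """The beta-weighted KL(pi_ref(.|s_t) || pi_theta(.|o_t)) term in L(theta)."""
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(policy_logits, dim=-1)
    kl = torch.sum(ref_logp.exp() * (ref_logp - pol_logp), dim=-1)
    return beta * kl
```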

Pretrained Reference Tokenized Model

To incorporate human-likeness into self-play, we introduce a pretrained reference tokenized model (\(\pi_{\text{ref}}\)), trained on real-world driving trajectories as a proxy for the human driving distribution. Tokenized models (e.g., SMART) factorize the joint action distribution under a conditional independence assumption:

$$ p(a_t \mid a_{<t}, c) = \prod_{i=1}^{N} p(a_t^i \mid a_{<t}, c). $$

This yields per-agent \(i\) distributions at each timestep \(t\). Unlike full autoregressive generation, our approach requires only a single forward pass per rollout, keeping training tractable. By aligning the action spaces of \(\pi_\theta\) and \(\pi_{\text{ref}}\), we obtain a direct distributional signal that guides scalable self-play toward human-like behavior.
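For illustration, a hedged sketch of how per-agent reference log-likelihoods could be read out in one forward pass, assuming a SMART-style tokenized model that returns per-agent logits over the motion-token vocabulary (the `reference_model` interface is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reference_log_likelihoods(reference_model, scene_context, actions):
    """
    scene_context: encoding of the history a_{<t} and map context c (full scene).
    actions:       (num_agents,) motion-token indices chosen by the self-play policy.
    Returns log pi_ref(a_t^i | a_{<t}, c) for every agent i from a single forward pass.
    """
    logits = reference_model(scene_context)                     # (num_agents, vocab_size)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (num_agents,)
```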

Experimental Results on Waymo Sim Agent Challenge

We measure whether SPACeR self-play policies are human-like by adopting the Waymo Sim Agents Challenge (WOSAC). We compare SPACeR against two self-play RL baselines and two imitation-learning models:

  • PPO: Self-play using only the task reward (goal reaching, collision avoidance, off-road avoidance)
  • HR-PPO: A prior self-play approach that uses a behavior-cloning policy for KL-regularization
  • SMART*: A state-of-the-art tokenized traffic model
  • CAT-K*: Supervised fine-tuning on top of SMART

SPACeR significantly outperforms the other self-play approaches across all realism metrics. In addition, compared to the imitation-learning methods, SPACeR is lightweight (65K parameters, ~50× smaller than SMART's 3M) and achieves ~10× higher throughput while maintaining competitive performance.

*Note: We retrain SMART and CAT-K with the same token size of 200 at 5Hz, where CAT-K is used as a reference model.

Method          Composite ↑  Kinematic ↑  Interactive ↑  Map ↑   minADE ↓  Collision ↓  Off-road ↓  Throughput ↑
PPO             0.693        0.277        0.750          0.860   15.450    0.010        0.043       211.8
HR-PPO          0.707        0.333        0.750          0.860   6.700     0.043        0.070       211.8
SPACeR (Ours)   0.740        0.390        0.783          0.880   4.733     0.020        0.050       211.8
SMART* (IL)     0.720        0.450        0.725          0.870   1.840     0.170        0.130       22.5
CAT-K* (IL)     0.766        0.490        0.792          0.890   1.470     0.060        0.090       22.5

Throughput = scenarios/sec at 5 Hz on a single A100 GPU; each scenario is 8 seconds long. IL = imitation-learning methods.

Qualitative Comparison with Self-Play Baseline

Qualitatively, PPO and HR-PPO often exhibit stop-and-go behavior, while SPACeR maintains smooth driving. The colored agents are the targeted agents from the Waymo Sim Agents Challenge, and all agents are controlled by the self-play model.

PPO exhibits jerky stop-and-go behavior, while SPACeR demonstrates smoother, more natural motion.

(Videos: PPO | HR-PPO | SPACeR; legend: targeted agent, other agents)


PPO and HR-PPO agents go in the wrong direction and stop immediately, while SPACeR agents produce traffic-compliant behavior.

(Videos: PPO | HR-PPO | SPACeR)


In highway scenarios, all agents drive at higher speeds with lane-changing behavior; SPACeR agents remain smooth and natural, without stop-and-go behavior.

(Videos: PPO | HR-PPO | SPACeR)


In intersection yielding (A2 → A3), HR-PPO yields abruptly, nearly colliding, while SPACeR yields smoothly and in advance.

(Videos: PPO | HR-PPO | SPACeR)

Closed-loop Planner Evaluation

We evaluate SPACeR by comparing how planners perform under different sim agent policies. Specifically, we test 22 self-play–trained policies, 10 sampling-based Frenet planners, and 10 IDM-based planners. For each planner, we compute PDM scores across diverse scenes under three simulation modes: ground-truth log replay, CAT-K rollouts, and SPACeR agent policies. We then measure the correlation of PDM scores across sim agent strategies to understand how similarly different agents assess planner behaviors.

Correlation Analysis

The correlation of collision scores between SPACeR and both Log-Replay and CAT-K is low. Qualitatively, Log-Replay and CAT-K both yield more false-positive collisions, giving a misleading estimate of the planner's safety performance. In contrast, SPACeR agents provide more reactive, realistic behavior that better reflects real-world driving scenarios.
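As an illustrative sketch of this correlation analysis (array names, shapes, and the per-planner aggregation are assumptions, not the paper's exact protocol), one can correlate per-planner PDM scores obtained under two different sim-agent modes:

```python
import numpy as np
from scipy.stats import pearsonr

def planner_scores(pdm_per_scene):
    """Average PDM score per planner: (num_planners, num_scenes) -> (num_planners,)."""
    return np.asarray(pdm_per_scene).mean(axis=1)

def mode_correlation(scores_a, scores_b):
    """Pearson correlation between planner scores under two sim-agent modes."""
    r, p_value = pearsonr(scores_a, scores_b)
    return r, p_value

# Hypothetical usage: correlate SPACeR-based with log-replay-based PDM scores.
# spacer_pdm, replay_pdm: arrays of shape (num_planners, num_scenes)
# r, p = mode_correlation(planner_scores(spacer_pdm), planner_scores(replay_pdm))
```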

When the ego changes lanes, Log-Replay and CAT-K agents ignore the ego and collide (collision score = 0), while SPACeR slows down to avoid collision, leading to a more accurate PDM score.

(Videos; legend: ego policy, sim agent)

Sim Agent     PDM Score   Collision Score
Log-Replay    0.38        0.0
CAT-K         0.38        0.0
SPACeR        0.7         1.0


CAT-K agents drive faster than the ego and cause a rear-end collision, while SPACeR maintains a safe distance with smoother motion.

(Videos: Log-Replay | CAT-K | SPACeR)


CAT-K agents merge aggressively into the ego's lane, causing a collision.

(Videos: Log-Replay | CAT-K | SPACeR)


Negotiation scenario: All agents collide, but SPACeR shows reactive behavior, slowing down when the ego enters the wrong lane, which gives the planner a more accurate signal of how and when to avoid collisions.

(Videos: Log-Replay | CAT-K | SPACeR)


T-intersection: Both Log-Replay and CAT-K agents fail to react to the ego and cause a rear-end collision.

(Videos: Log-Replay | CAT-K | SPACeR)


At a 4-way intersection, CAT-K moves too early and causes a front-end collision, while SPACeR yields safely.

(Videos: Log-Replay | CAT-K | SPACeR)

In summary, although CAT-K agents achieve higher distributional realism on the Sim Agents Challenge, we observe that in closed-loop planner evaluation, self-play–trained agents exhibit more reactive and adaptive behaviors, especially in collision and negotiation scenarios. This likely explains why CAT-K and the non-reactive Log-Replay agents show a higher correlation with each other in collision scores.