AgentSpec

AgentSpec: A Modular Specification Framework for Controlled Analysis of Embodied Agents

A modular framework for controlled embodied-agent analysis

Code, benchmark assets, and docs are continuously updated on this site.

Demo Video

If the demo video does not play in your browser, download it and view it directly: agentspec.mov.

Abstract

LLM-based embodied agents are increasingly built from modules such as reasoning, memory, reflection, action execution, and learning. However, these modules are often embedded in tightly coupled pipelines, making it hard to isolate component contributions or study interaction effects. AgentSpec introduces a modular specification framework that represents embodied agents as explicit compositions of reusable policy components with standardized interfaces.

We instantiate AgentSpec across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and run controlled studies over reasoning, memory, reflection, and reinforcement learning modules. Our experiments show that performance is jointly determined by component quality and representation compatibility. Structured multi-granularity memory improves long-horizon state tracking, reasoning and memory exhibit strong complementarity, and lightweight modular compositions can achieve stronger performance-cost trade-offs than heavier misaligned pipelines.

Overview

AgentSpec turns tightly coupled embodied-agent pipelines into a controlled modular design space with fixed interfaces.

Framework Design

AgentSpec organizes policy execution into a modular Perception-Memory-Reasoning-Reflection-Action loop, with optional reinforcement learning for optimization.

  • Perception maps raw observations to standardized state.
  • Memory retrieves task-relevant history under typed interfaces.
  • Reasoning proposes decisions compatible with downstream action modules.
  • Reflection critiques and revises candidate decisions.
  • Action executes the selected decision in the environment.
  • Learning (optional reinforcement learning) can be attached to optimize policy behavior.

This decomposition enables controlled ablations and recombination across environments without rebuilding the full agent pipeline.
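The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AgentSpec API: the interface names (`Perception`, `Memory`, `Reasoning`, `Reflection`), their method signatures, the `State` schema, and the toy modules are all hypothetical, and the optional learning module is left out.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class State:
    """Standardized state emitted by perception (hypothetical schema)."""
    description: str


# Hypothetical interfaces: one small protocol per module role.
class Perception(Protocol):
    def perceive(self, observation: str) -> State: ...


class Memory(Protocol):
    def retrieve(self, state: State) -> list[str]: ...
    def store(self, state: State, action: str) -> None: ...


class Reasoning(Protocol):
    def propose(self, state: State, context: list[str]) -> str: ...


class Reflection(Protocol):
    def revise(self, state: State, candidate: str) -> str: ...


@dataclass
class Agent:
    perception: Perception
    memory: Memory
    reasoning: Reasoning
    reflection: Optional[Reflection] = None  # reflection can be ablated

    def step(self, observation: str) -> str:
        state = self.perception.perceive(observation)
        context = self.memory.retrieve(state)
        action = self.reasoning.propose(state, context)
        if self.reflection is not None:
            action = self.reflection.revise(state, action)
        self.memory.store(state, action)
        return action  # would be handed to the environment's action executor


# Toy modules, for illustration only.
class LowercasePerception:
    def perceive(self, observation: str) -> State:
        return State(description=observation.strip().lower())


class RecentMemory:
    def __init__(self) -> None:
        self.history: list[str] = []

    def retrieve(self, state: State) -> list[str]:
        return self.history[-3:]  # last three entries

    def store(self, state: State, action: str) -> None:
        self.history.append(f"{state.description} -> {action}")


class KeywordReasoning:
    def propose(self, state: State, context: list[str]) -> str:
        return "open door" if "door" in state.description else "move forward"


class AllowedActionReflection:
    def __init__(self, allowed: set[str]) -> None:
        self.allowed = allowed

    def revise(self, state: State, candidate: str) -> str:
        # Replace any action outside the allowed set with a safe default.
        return candidate if candidate in self.allowed else "wait"
```

Because `Agent` depends only on the protocols, a module can be swapped or dropped (for example, constructing `Agent` without `reflection`) without touching the rest of the loop, which is the property the ablations rely on.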

AgentSpec modular loop and typed interfaces.
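Controlled ablation and recombination over a fixed set of module slots can be pictured as enumerating a grid of compositions. The sketch below is an assumption-laden illustration: the slot names and option labels in `MODULE_OPTIONS` are placeholders, not the variants actually studied, and both helper functions are hypothetical.

```python
from itertools import product

# Placeholder option labels per module slot (not the studied variants).
MODULE_OPTIONS = {
    "reasoning": ["none", "per_step"],
    "memory": ["none", "flat", "structured"],
    "reflection": ["off", "on"],
}


def enumerate_compositions(options: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return every combination of one option per module slot."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


def single_factor_variants(
    baseline: dict[str, str], options: dict[str, list[str]], factor: str
) -> list[dict[str, str]]:
    """Vary one slot while holding the rest of the baseline fixed."""
    return [{**baseline, factor: value} for value in options[factor]]
```

The full grid supports interaction analysis (for example, reasoning crossed with memory), while `single_factor_variants` gives the classic one-at-a-time ablation around a chosen baseline.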

Main Findings

Compatibility Matters

Module strength alone is not sufficient; memory representation must match the reasoning strategy to produce gains.
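One simple way to surface representation mismatches before running an experiment is to have each module declare the representation it emits or accepts. The page does not describe how compatibility is determined, so everything here is an assumption: `MemorySpec`, `ReasoningSpec`, the representation labels, and the module names are hypothetical, and a declared match is only a necessary condition, not evidence of an empirical gain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySpec:
    name: str
    emits: str  # label of the representation the memory returns


@dataclass(frozen=True)
class ReasoningSpec:
    name: str
    accepts: frozenset[str]  # representation labels the reasoner can consume


def is_compatible(memory: MemorySpec, reasoning: ReasoningSpec) -> bool:
    """True when the memory's output format is one the reasoner declares it accepts."""
    return memory.emits in reasoning.accepts


def compatible_pairs(
    memories: list[MemorySpec], reasoners: list[ReasoningSpec]
) -> list[tuple[str, str]]:
    """List (memory, reasoner) name pairs whose declared formats line up."""
    return [(m.name, r.name) for m in memories for r in reasoners if is_compatible(m, r)]
```

A check like this only filters out declared mismatches; whether a compatible pair actually improves performance still has to be measured.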

Environment Dependence

Short symbolic tasks benefit more from per-step reasoning, while long-horizon embodied tasks are bottlenecked by state tracking.

Efficiency-Aware Composition

Performance improvements do not come from higher token usage alone; lightweight but well-aligned modules often offer better performance-cost trade-offs than heavier, misaligned pipelines.

Main Experimental Results

Main benchmark performance across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR under controlled module compositions.

Analysis

Pareto frontier showing performance-cost trade-offs under modular compositions (Qwen3.5 9B experiments).

Error analysis of module combinations, highlighting representation mismatch and failure loops.
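For readers unfamiliar with the Pareto frontier used in the analysis, the following sketch shows one standard way to extract it from a set of (cost, score) measurements. It assumes lower cost and higher score are better; the function name and tuple layout are hypothetical, and no values from the experiments are reproduced here.

```python
def pareto_frontier(
    points: list[tuple[str, float, float]]
) -> list[tuple[str, float, float]]:
    """Return the non-dominated (name, cost, score) points.

    A point is kept only if its score is strictly higher than that of
    every point with lower or equal cost.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; on equal cost, put the higher score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier
```

A configuration that costs more than another without scoring higher is dropped, which is how a heavier pipeline can fall below the frontier traced by lighter compositions.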

Case Studies

DeliveryBench trajectory case study.
ALFRED task-level case study.

Benchmark Scope

AgentSpec is evaluated on four embodied benchmarks with complementary challenges: DeliveryBench (long-horizon planning under constraints), ALFRED (compositional household manipulation), MiniGrid (symbolic navigation under partial observability), and RoboTHOR (realistic first-person navigation).

BibTeX

@article{agentspec2026,
  title={AgentSpec: A Modular Specification Framework for Controlled Analysis of Embodied Agents},
  author={AgentSpec Team},
  year={2026}
}