AgentSpec Documentation

Framework Overview

AgentSpec models embodied agents as explicit modular compositions of Perception, Memory, Reasoning, Reflection, and Action, with optional reinforcement learning for policy optimization.

Typed interfaces allow modules to be swapped without rewriting environment wrappers.
Module-level and interaction-level effects can be measured under consistent settings.
The same design space is reused across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR.

Every agent is assembled from the same set of interchangeable building blocks. Each module exposes a small typed interface, so changing a reasoning strategy or a memory backend is a one-line swap rather than a rewrite:

Module	Responsibility	Built-in implementations
Perception	Converts raw environment observations into a `UnifiedAgentInput` shared by all downstream modules.	`MiniGridPerceptionModule`, `DeliveryBenchPerceptionModule`, THOR adapters
Reasoning	Plans and selects the next action from the unified input plus retrieved memory.	Simple (rule-based), Direct LLM, CoT, ReAct, ToT, RAP, Buffer-of-Thought
Memory	Stores experiences and retrieves the most relevant ones for the current step.	Sliding-window, graph memory, A-MEM, Mem0, MemoryBank, LightMem, Letta, and more
Reflection	Critiques past trajectories and feeds lessons back into reasoning.	Reflexion, Self-Refine, Retroformer
Action	Maps reasoning output to the environment's action schema and executes it.	Benchmark adapters via the unified `get_adapter` registry
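
As a rough illustration of what a "small typed interface" can look like, the sketch below uses `typing.Protocol`. The names `ReasoningLike`, `RuleBasedReasoning`, `MemoryAwareReasoning`, `TinyAgent`, and the `decide` method are hypothetical and are not AgentSpec's actual classes. The point is only that the agent depends on the interface, so changing the strategy is a one-line swap.

```python
from dataclasses import dataclass
from typing import List, Protocol


class ReasoningLike(Protocol):
    """Hypothetical interface: any object with a matching `decide` qualifies."""

    def decide(self, observation: str, memories: List[str]) -> str: ...


class RuleBasedReasoning:
    """Picks an action from a fixed rule; no LLM involved."""

    def decide(self, observation: str, memories: List[str]) -> str:
        return "pickup" if "key" in observation else "forward"


class MemoryAwareReasoning:
    """Turns left if the most recently remembered action was 'forward'."""

    def decide(self, observation: str, memories: List[str]) -> str:
        last = memories[-1] if memories else None
        return "left" if last == "forward" else "forward"


@dataclass
class TinyAgent:
    reasoning: ReasoningLike  # the agent only knows the interface

    def step(self, observation: str, memories: List[str]) -> str:
        return self.reasoning.decide(observation, memories)


agent_a = TinyAgent(reasoning=RuleBasedReasoning())
agent_b = TinyAgent(reasoning=MemoryAwareReasoning())  # the one-line swap
```

Neither strategy class inherits from `ReasoningLike`; structural typing lets a static checker verify the swap without a shared base class.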

Installation

git clone https://github.com/chenjix/AgentSpec.git
cd AgentSpec
pip install -e .

# Optional extras
pip install -e ".[gym]"
pip install -e ".[thor]"

Quick Start

export OPENAI_API_KEY="sk-..."

# Example runs
python examples/minigrid_example.py --env MiniGrid-DoorKey-5x5-v0 --no-llm
python examples/minigrid_react_example.py --env MiniGrid-DoorKey-5x5-v0
python examples/deliverybench_example.py --reasoning-method react

Each example script accepts --model, --api-key, and --base-url flags. OpenRouter-hosted models (e.g. anthropic/claude-3.5-sonnet) are auto-detected from the model name, so the same script runs against OpenAI, Anthropic, or Gemini backends without code changes.
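
One way such name-based detection could work is to treat an `organization/model` identifier as an OpenRouter model. The helper `guess_provider` and its rule are assumptions for illustration, not the logic the example scripts actually use.

```python
def guess_provider(model: str) -> str:
    """Hypothetical heuristic: 'org/model' names route to OpenRouter."""
    org, sep, name = model.partition("/")
    if sep and org and name:
        return "openrouter"
    return "openai"  # assumed default for bare model names


print(guess_provider("anthropic/claude-3.5-sonnet"))  # openrouter
print(guess_provider("gpt-4o"))                       # openai
```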

Example 1: Assemble a ReAct Agent on MiniGrid

This walkthrough builds a complete embodied agent for MiniGrid-DoorKey-5x5-v0 from individual modules, mirroring examples/minigrid_react_example.py. The agent must find a key, unlock a door, and reach the goal — a partially observable, multi-stage task.

Step 1 — Create the environment and modules

Environments are obtained through the adapter registry, so the same agent code works for any registered benchmark. Perception, reasoning, and memory are constructed independently:

import os

import benchmarks.minigrid  # registers the MiniGrid adapter
from agents import EmbodiedAgent
from modules.adapters import get_adapter
from modules.perception import MiniGridPerceptionModule
from modules.reasoning import ReActReasoning
from modules.llm import OpenAIClient

env = get_adapter("minigrid", env_name="MiniGrid-DoorKey-5x5-v0")

perception = MiniGridPerceptionModule()

llm_client = OpenAIClient(
    api_key=os.environ["OPENAI_API_KEY"],   # exported in Quick Start; avoids hard-coding the key
    model="gpt-4o",
    temperature=0.0,   # deterministic decoding stabilizes partially observable tasks
)

reasoning = ReActReasoning(
    llm_client=llm_client,
    max_iterations=10,
    enable_history=True,   # keeps Thought/Action/Observation history across steps
    strict_format=True,
)
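
The registry pattern behind Step 1 can be sketched as follows. This is a generic illustration of a decorator-based registry, which would explain why importing `benchmarks.minigrid` is enough to make the adapter available. `register_adapter`, `get_adapter_sketch`, and `ToyAdapter` are hypothetical names, not AgentSpec's implementation.

```python
from typing import Callable, Dict

_ADAPTERS: Dict[str, Callable[..., object]] = {}


def register_adapter(name: str):
    """Decorator that records a factory under `name` when its module is imported."""
    def decorator(factory):
        _ADAPTERS[name] = factory
        return factory
    return decorator


def get_adapter_sketch(name: str, **kwargs):
    """Look up a registered factory and build the adapter with `kwargs`."""
    try:
        factory = _ADAPTERS[name]
    except KeyError:
        known = sorted(_ADAPTERS)
        raise KeyError(f"No adapter registered under {name!r}; known: {known}") from None
    return factory(**kwargs)


@register_adapter("toy")
class ToyAdapter:
    """Hypothetical adapter used only to exercise the registry."""

    def __init__(self, env_name: str):
        self.env_name = env_name


env = get_adapter_sketch("toy", env_name="Toy-v0")
```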

Step 2 — Compose the agent

EmbodiedAgent wires the modules together. The memory backend used here is a minimal sliding window that subclasses MemoryModule and is defined inline. Swapping ReActReasoning for COTReasoning or ToTReasoning, or replacing the memory backend, requires no other changes:

from modules.memory import MemoryModule

class SlidingWindowMemory(MemoryModule):
    """Keep the most recent experiences (FIFO)."""
    def __init__(self, max_memories=50):
        self._memories, self.max_memories = [], max_memories

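    def retrieve(self, query, top_k=5, **kwargs):
        # `query` is ignored: a sliding window returns the newest items regardless of content.
        # Guard top_k <= 0, because `self._memories[-0:]` would return the whole list.
        return self._memories[-top_k:] if top_k > 0 else []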

    def store(self, experience, **kwargs):
        self._memories.append(experience)
        if len(self._memories) > self.max_memories:
            self._memories.pop(0)

    def reset(self):
        self._memories = []

agent = EmbodiedAgent(
    perception=perception,
    reasoning=reasoning,
    memory=SlidingWindowMemory(max_memories=50),
)
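
To make the "replace the memory backend" claim concrete, here is a second backend with the same three methods (`retrieve`, `store`, `reset`) that ranks experiences by word overlap with the query. It is a sketch under two assumptions: experiences are plain strings, and `MemoryBase` is a stand-in for `modules.memory.MemoryModule` so the block runs on its own. `KeywordMemory` is a hypothetical name.

```python
class MemoryBase:
    """Stand-in for `modules.memory.MemoryModule` (assumption, for self-containment)."""


class KeywordMemory(MemoryBase):
    """Rank stored experiences by word overlap with the query."""

    def __init__(self, max_memories=50):
        self._memories, self.max_memories = [], max_memories

    def retrieve(self, query, top_k=5, **kwargs):
        query_words = set(str(query).lower().split())
        scored = [
            (len(query_words & set(str(m).lower().split())), i, m)
            for i, m in enumerate(self._memories)
        ]
        # Highest overlap first; ties go to the more recent experience.
        scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
        return [m for score, _, m in scored[:top_k] if score > 0]

    def store(self, experience, **kwargs):
        self._memories.append(experience)
        if len(self._memories) > self.max_memories:
            self._memories.pop(0)

    def reset(self):
        self._memories = []


memory = KeywordMemory()
memory.store("picked up the yellow key")
memory.store("door is locked")
memory.store("moved forward two cells")
print(memory.retrieve("where is the key", top_k=1))  # ['picked up the yellow key']
```

Passing `memory=KeywordMemory()` to the agent in place of the sliding window would be the only change required, provided the real base class needs nothing beyond these three methods.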

Step 3 — Run the episode loop

On every step the agent perceives, retrieves memory, reasons, and returns an action; the result is fed back so memory and the ReAct history stay grounded:

obs, info = env.reset()
agent.reset()

for step in range(200):
    action = agent.step(
        obs,
        info=info,
        meta=env.meta,
        task=env.task_spec,
        action_space=env.action_schema,
    )

    obs, reward, terminated, truncated, info = env.step(action)

    reasoning.add_observation(f"Reward: {reward:.2f}", float(reward))
    agent.observe_result(
        reward=reward, next_obs=obs,
        done=terminated, truncated=truncated, info=info,
    )

    if terminated or truncated:
        break

Or run the bundled script directly — it also saves per-step frames and logs under artifacts/runs/react/ for inspection:

python examples/minigrid_react_example.py \
  --env MiniGrid-DoorKey-5x5-v0 \
  --model gpt-4o \
  --max-steps 200
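
The control flow of the Step 3 loop can be checked without MiniGrid or an LLM by substituting stubs. `StubEnv`, `StubAgent`, and `run_episode` below are hypothetical helpers that only mimic the method names used above; the extra `meta`, `task`, and `action_space` arguments and the `reasoning.add_observation` call are omitted because the stubs do not need them.

```python
class StubEnv:
    """Hypothetical stand-in: terminates with reward 1.0 on the third step."""

    def reset(self):
        self.t = 0
        return {"t": 0}, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= 3
        reward = 1.0 if terminated else 0.0
        return {"t": self.t}, reward, terminated, False, {}


class StubAgent:
    """Hypothetical stand-in that always moves forward and logs rewards."""

    def __init__(self):
        self.feedback = []

    def reset(self):
        self.feedback = []

    def step(self, obs, **context):
        return "forward"

    def observe_result(self, **result):
        self.feedback.append(result["reward"])


def run_episode(env, agent, max_steps=200):
    """Same perceive/act/feedback loop as Step 3, returning (return, steps)."""
    obs, info = env.reset()
    agent.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = agent.step(obs, info=info)
        obs, reward, terminated, truncated, info = env.step(action)
        agent.observe_result(reward=reward, next_obs=obs,
                             done=terminated, truncated=truncated, info=info)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward, steps


total_reward, steps = run_episode(StubEnv(), StubAgent())
print(total_reward, steps)  # 1.0 3
```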

Example 2: Module Ablation on DeliveryBench

Because modules share typed interfaces, a controlled ablation needs only a different command-line flag: the environment wrapper, perception, and evaluation pipeline stay identical. This is how the module-level results in the paper are produced.

Compare five reasoning strategies on the same DeliveryBench task:

# Rule-based baseline (no LLM, useful as a sanity check)
python examples/deliverybench_example.py --reasoning-method simple

# Direct LLM reasoning
python examples/deliverybench_example.py --reasoning-method llm \
  --model gpt-4o-mini

# Chain-of-Thought
python examples/deliverybench_example.py --reasoning-method cot \
  --model anthropic/claude-3.5-sonnet --llm-provider openrouter

# ReAct (interleaved Thought / Action / Observation)
python examples/deliverybench_example.py --reasoning-method react \
  --model anthropic/claude-3.5-sonnet --llm-provider openrouter

# Tree-of-Thoughts with breadth-first search
python examples/deliverybench_example.py --reasoning-method tot \
  --model anthropic/claude-3.5-sonnet --llm-provider openrouter \
  --tot-search-strategy BFS
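
The five commands above can also be driven from a small script. The sketch below uses only the flags shown in those commands; the `SWEEP` table, `build_command`, and `run_sweep` are hypothetical helpers, and `dry_run=True` prints the commands instead of executing them.

```python
import subprocess
import sys

SCRIPT = "examples/deliverybench_example.py"
OPENROUTER = ["--model", "anthropic/claude-3.5-sonnet", "--llm-provider", "openrouter"]

# One entry per reasoning method, mirroring the commands listed above.
SWEEP = {
    "simple": [],
    "llm": ["--model", "gpt-4o-mini"],
    "cot": OPENROUTER,
    "react": OPENROUTER,
    "tot": OPENROUTER + ["--tot-search-strategy", "BFS"],
}


def build_command(method, extra):
    """Return the argv list for one run of the DeliveryBench example."""
    return [sys.executable, SCRIPT, "--reasoning-method", method, *extra]


def run_sweep(dry_run=True):
    for method, extra in SWEEP.items():
        cmd = build_command(method, extra)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run


if __name__ == "__main__":
    run_sweep(dry_run=True)
```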

The same pattern applies to the other axes of the design space. For instance, to ablate the memory module while holding reasoning fixed at ReAct, run the corresponding MiniGrid example scripts:

python examples/minigrid_react_example.py        # sliding-window memory
python examples/minigrid_amem_example.py         # A-MEM structured memory
python examples/minigrid_graph_memory_example.py # graph-based memory
python examples/minigrid_mem0_example.py         # Mem0 backend

Each run writes its trajectory, per-step reasoning traces, and metrics to artifacts/runs/, so configurations can be compared step by step under identical seeds and settings.
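
A comparison across runs could then be scripted along these lines. The on-disk layout is not specified here, so the sketch assumes, purely for illustration, that each run directory holds a `metrics.json` file with flat keys such as `success` and `steps`. The demo writes placeholder values to a temporary directory; they are not real results.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(runs_root):
    """Collect {run_name: metrics} from <runs_root>/<run>/metrics.json (assumed layout)."""
    results = {}
    for path in sorted(Path(runs_root).glob("*/metrics.json")):
        with path.open() as handle:
            results[path.parent.name] = json.load(handle)
    return results


def format_table(results, keys):
    """Render a fixed-width comparison table, one row per run."""
    lines = ["run".ljust(12) + "".join(k.rjust(10) for k in keys)]
    for run, metrics in results.items():
        cells = "".join(str(metrics.get(k, "-")).rjust(10) for k in keys)
        lines.append(run.ljust(12) + cells)
    return "\n".join(lines)


# Demo with placeholder numbers in a temporary directory.
placeholder = {"react": {"success": 1, "steps": 42},
               "cot": {"success": 0, "steps": 200}}
with tempfile.TemporaryDirectory() as root:
    for run, metrics in placeholder.items():
        run_dir = Path(root) / run
        run_dir.mkdir()
        (run_dir / "metrics.json").write_text(json.dumps(metrics))
    results = load_metrics(root)
    print(format_table(results, ["success", "steps"]))
```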

Module Configuration Principles

Reasoning

Use structured reasoning strategies (e.g. CoT, ReAct, ToT) for symbolic or shorter-horizon tasks.

Memory

For long-horizon stability, prioritize representation alignment (memory stored in a form the reasoning module can use directly) over the sheer volume of context retrieved.

Reflection

Add reflection (e.g. Reflexion, Self-Refine, Retroformer) to stabilize weak module pairings and reduce compounding action errors.