Modular embodied-agent composition, reproducible evaluation, and controlled ablations.
AgentSpec models embodied agents as explicit modular compositions of Perception, Memory, Reasoning, Reflection, and Action, with optional reinforcement learning for policy optimization.
Every agent is assembled from the same set of interchangeable building blocks. Each module exposes a small typed interface, so changing a reasoning strategy or a memory backend is a one-line swap rather than a rewrite:
| Module | Responsibility | Built-in implementations |
|---|---|---|
| Perception | Converts raw environment observations into a UnifiedAgentInput shared by all downstream modules. | MiniGridPerceptionModule, DeliveryBenchPerceptionModule, THOR adapters |
| Reasoning | Plans and selects the next action from the unified input plus retrieved memory. | Simple (rule-based), Direct LLM, CoT, ReAct, ToT, RAP, Buffer-of-Thought |
| Memory | Stores experiences and retrieves the most relevant ones for the current step. | Sliding-window, graph memory, A-MEM, Mem0, MemoryBank, LightMem, Letta, and more |
| Reflection | Critiques past trajectories and feeds lessons back into reasoning. | Reflexion, Self-Refine, Retroformer |
| Action | Maps reasoning output to the environment's action schema and executes it. | Benchmark adapters via the unified get_adapter registry |
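To make the "one-line swap" idea concrete, here is a minimal, self-contained sketch of how small typed interfaces allow it. It does not use the project's real classes: `UnifiedInput`, `Reasoner`, `AlwaysForward`, `KeyFirst`, and `Agent` are hypothetical names invented for illustration, and the actual module interfaces may look different.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class UnifiedInput:
    """Hypothetical stand-in for the shared input passed between modules."""
    text: str
    extras: dict[str, Any] = field(default_factory=dict)


class Reasoner(Protocol):
    """Assumed interface: anything with a matching `decide` method qualifies."""

    def decide(self, unified: UnifiedInput, memories: list[str]) -> str: ...


class AlwaysForward:
    """Trivial baseline that ignores both input and memory."""

    def decide(self, unified: UnifiedInput, memories: list[str]) -> str:
        return "forward"


class KeyFirst:
    """Picks up a visible key unless memory says it was already collected."""

    def decide(self, unified: UnifiedInput, memories: list[str]) -> str:
        if "key" in unified.text and "picked up key" not in memories:
            return "pickup"
        return "forward"


@dataclass
class Agent:
    reasoner: Reasoner

    def act(self, observation: str, memories: list[str]) -> str:
        # The agent only depends on the interface, never on a concrete class.
        return self.reasoner.decide(UnifiedInput(text=observation), memories)


baseline = Agent(reasoner=AlwaysForward())
variant = Agent(reasoner=KeyFirst())  # the only line that changes
```

Because `Agent` depends on the `Reasoner` protocol rather than a concrete class, the two configurations differ in exactly one constructor argument, which is what keeps an ablation controlled.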
git clone https://github.com/chenjix/agentfractory.git
cd agentfractory
pip install -e .
# Optional extras
pip install -e ".[gym]"
pip install -e ".[thor]"
export OPENAI_API_KEY="sk-..."
# Example runs
python examples/minigrid_example.py --env MiniGrid-DoorKey-5x5-v0 --no-llm
python examples/minigrid_react_example.py --env MiniGrid-DoorKey-5x5-v0
python examples/deliverybench_example.py --reasoning-method react
Each example script accepts --model, --api-key, and
--base-url flags. OpenRouter-hosted models (e.g.
anthropic/claude-3.5-sonnet) are auto-detected from the model name,
so the same script runs against OpenAI, Anthropic, or Gemini backends without code changes.
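The exact detection rule is not documented in this section. As a rough illustration only, a name-based heuristic could look like the sketch below; `infer_provider` is a hypothetical helper, and the assumption that a `vendor/model` name implies OpenRouter is ours, not taken from the project's code.

```python
def infer_provider(model: str) -> str:
    """Guess a backend from a model name (illustrative heuristic only)."""
    # Assumption: OpenRouter model IDs are namespaced as "vendor/model".
    if "/" in model:
        return "openrouter"
    # Assumption: un-namespaced "gpt-" names go to the OpenAI backend.
    if model.startswith("gpt-"):
        return "openai"
    # Anything else is left for the caller to resolve explicitly.
    return "unknown"


for name in ("gpt-4o", "anthropic/claude-3.5-sonnet", "my-local-model"):
    print(f"{name} -> {infer_provider(name)}")
```

When a heuristic like this is ambiguous, an explicit provider setting is the safer choice.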
This walkthrough builds a complete embodied agent for
MiniGrid-DoorKey-5x5-v0 from individual modules, mirroring
examples/minigrid_react_example.py. The agent must find a key,
unlock a door, and reach the goal — a partially observable, multi-stage task.
Environments are obtained through the adapter registry, so the same agent code works for any registered benchmark. Perception, reasoning, and memory are constructed independently:
import benchmarks.minigrid # registers the MiniGrid adapter
from agents import EmbodiedAgent
from modules.adapters import get_adapter
from modules.perception import MiniGridPerceptionModule
from modules.reasoning import ReActReasoning
from modules.llm import OpenAIClient
env = get_adapter("minigrid", env_name="MiniGrid-DoorKey-5x5-v0")
perception = MiniGridPerceptionModule()
llm_client = OpenAIClient(
    api_key="sk-...",
    model="gpt-4o",
    temperature=0.0,  # deterministic decoding stabilizes partially observable tasks
)
reasoning = ReActReasoning(
    llm_client=llm_client,
    max_iterations=10,
    enable_history=True,  # keeps Thought/Action/Observation history across steps
    strict_format=True,
)
The EmbodiedAgent simply wires the modules together. Swapping
ReActReasoning for COTReasoning or ToTReasoning,
or replacing the memory backend, requires no other changes:
from modules.memory import MemoryModule
class SlidingWindowMemory(MemoryModule):
    """Keep the most recent experiences (FIFO)."""

    def __init__(self, max_memories=50):
        self._memories, self.max_memories = [], max_memories

    def retrieve(self, query, top_k=5, **kwargs):
        # Guard top_k <= 0: a bare [-0:] slice would return the whole list.
        return self._memories[-top_k:] if top_k > 0 else []

    def store(self, experience, **kwargs):
        self._memories.append(experience)
        if len(self._memories) > self.max_memories:
            self._memories.pop(0)  # evict the oldest entry

    def reset(self):
        self._memories = []
agent = EmbodiedAgent(
    perception=perception,
    reasoning=reasoning,
    memory=SlidingWindowMemory(max_memories=50),
)
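As an illustration of replacing the memory backend, the sketch below implements the same three methods (`store`, `retrieve`, `reset`) with relevance-based retrieval instead of recency. `KeywordMemory` is a hypothetical class written for this example. It is kept standalone so it runs without the project installed; in the real codebase it would presumably subclass `MemoryModule`, assuming the base class requires only these three methods.

```python
class KeywordMemory:
    """Retrieve stored experiences by word overlap with the query."""

    def __init__(self, max_memories=50):
        self._memories = []
        self.max_memories = max_memories

    def store(self, experience, **kwargs):
        self._memories.append(str(experience))
        if len(self._memories) > self.max_memories:
            self._memories.pop(0)  # evict the oldest entry

    def retrieve(self, query, top_k=5, **kwargs):
        query_words = set(str(query).lower().split())
        # Score each memory by shared words; keep the index to break ties.
        scored = [
            (len(query_words & set(text.lower().split())), index, text)
            for index, text in enumerate(self._memories)
        ]
        relevant = [entry for entry in scored if entry[0] > 0]
        # Highest overlap first; among equal scores, most recent first.
        relevant.sort(key=lambda entry: (-entry[0], -entry[1]))
        return [text for _, _, text in relevant[:top_k]]

    def reset(self):
        self._memories = []


memory = KeywordMemory()
memory.store("picked up the yellow key")
memory.store("bumped into a wall")
memory.store("the door needs a key")
hits = memory.retrieve("where is the key", top_k=2)
print(hits)
```

Passing such an object as the `memory` argument would be the only change to the agent construction, provided it satisfies whatever else the base class expects.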
On every step the agent perceives, retrieves memory, reasons, and returns an action; the result is fed back so memory and the ReAct history stay grounded:
obs, info = env.reset()
agent.reset()
for step in range(200):
    action = agent.step(
        obs,
        info=info,
        meta=env.meta,
        task=env.task_spec,
        action_space=env.action_schema,
    )
    obs, reward, terminated, truncated, info = env.step(action)
    reasoning.add_observation(f"Reward: {reward:.2f}", float(reward))
    agent.observe_result(
        reward=reward, next_obs=obs,
        done=terminated, truncated=truncated, info=info,
    )
    if terminated or truncated:
        break
Or run the bundled script directly — it also saves per-step frames and logs under
artifacts/runs/react/ for inspection:
python examples/minigrid_react_example.py \
--env MiniGrid-DoorKey-5x5-v0 \
--model gpt-4o \
--max-steps 200
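The perceive, retrieve, reason, act cycle described in this walkthrough can be pictured with the toy sketch below. It is not the `EmbodiedAgent` implementation: `ToyAgent`, `ListMemory`, `perceive`, and `reason` are hypothetical stand-ins, and the rule-based reasoner replaces the LLM so the example runs offline.

```python
class ListMemory:
    """Minimal recency-based memory."""

    def __init__(self):
        self.items = []

    def retrieve(self, query, top_k=5):
        return self.items[-top_k:] if top_k > 0 else []

    def store(self, experience):
        self.items.append(experience)


def perceive(obs):
    """Normalize a raw observation into the shared input format."""
    return obs.strip().lower()


def reason(unified, recalled):
    """Pick the first candidate action that has not already failed."""
    failed = {m["action"] for m in recalled if m["reward"] <= 0}
    for candidate in ("forward", "toggle", "pickup"):
        if candidate not in failed:
            return candidate
    return "forward"


class ToyAgent:
    def __init__(self, perceive_fn, reason_fn, memory):
        self.perceive_fn, self.reason_fn, self.memory = perceive_fn, reason_fn, memory
        self._last = None

    def step(self, obs):
        unified = self.perceive_fn(obs)                      # 1. perceive
        recalled = self.memory.retrieve(unified, top_k=3)    # 2. retrieve
        action = self.reason_fn(unified, recalled)           # 3. reason
        self._last = (unified, action)
        return action                                        # 4. act

    def observe_result(self, reward):
        # Feed the outcome back so later steps can condition on it.
        unified, action = self._last
        self.memory.store({"input": unified, "action": action, "reward": reward})


toy_agent = ToyAgent(perceive, reason, ListMemory())
first = toy_agent.step("  Door AHEAD ")
toy_agent.observe_result(reward=0.0)
second = toy_agent.step("door ahead")
print(first, second)
```

The second call returns a different action only because the first outcome was written back to memory, which is the role `observe_result` plays in the loop.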
Because modules share typed interfaces, a controlled ablation amounts to changing a single command-line flag; the environment wrapper, perception, and evaluation pipeline stay identical. This is how the module-level results in the paper are produced.
Compare five reasoning strategies on the same DeliveryBench task:
# Rule-based baseline (no LLM, useful as a sanity check)
python examples/deliverybench_example.py --reasoning-method simple
# Direct LLM reasoning
python examples/deliverybench_example.py --reasoning-method llm \
--model gpt-4o-mini
# Chain-of-Thought
python examples/deliverybench_example.py --reasoning-method cot \
--model anthropic/claude-3.5-sonnet --llm-provider openrouter
# ReAct (interleaved Thought / Action / Observation)
python examples/deliverybench_example.py --reasoning-method react \
--model anthropic/claude-3.5-sonnet --llm-provider openrouter
# Tree-of-Thoughts with breadth-first search
python examples/deliverybench_example.py --reasoning-method tot \
--model anthropic/claude-3.5-sonnet --llm-provider openrouter \
--tot-search-strategy BFS
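When running the whole comparison repeatedly, it can help to generate the commands from one table so that only the intended flags differ between runs. The sketch below only builds and prints the argument lists, using the flags shown above; the `RUNS` table and the `build_commands` helper are illustrative names, not part of the repository.

```python
import shlex

BASE = ["python", "examples/deliverybench_example.py"]
OPENROUTER = ["--model", "anthropic/claude-3.5-sonnet", "--llm-provider", "openrouter"]

# Reasoning method -> extra flags, mirroring the five commands above.
RUNS = {
    "simple": [],
    "llm": ["--model", "gpt-4o-mini"],
    "cot": OPENROUTER,
    "react": OPENROUTER,
    "tot": OPENROUTER + ["--tot-search-strategy", "BFS"],
}


def build_commands(runs=RUNS):
    """Return one argv list per reasoning method; everything else is shared."""
    return {
        method: BASE + ["--reasoning-method", method] + extra
        for method, extra in runs.items()
    }


commands = build_commands()
for method, argv in commands.items():
    print(f"{method:>6}: {shlex.join(argv)}")
```

Each argument list can then be passed to `subprocess.run(argv, check=True)` to execute the runs in sequence.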
The same pattern applies to the other axes of the design space. For instance, to ablate the memory module while holding reasoning fixed at ReAct, run the corresponding MiniGrid example scripts:
python examples/minigrid_react_example.py # sliding-window memory
python examples/minigrid_amem_example.py # A-MEM structured memory
python examples/minigrid_graph_memory_example.py # graph-based memory
python examples/minigrid_mem0_example.py # Mem0 backend
Each run writes its trajectory, per-step reasoning traces, and metrics to
artifacts/runs/, so configurations can be compared step by step
under identical seeds and settings.
Use structured reasoning strategies for symbolic or shorter-horizon tasks.
Prioritize representation alignment over context volume for long-horizon stability.
Add reflection to stabilize weak pairings and reduce compounding action errors.