LoCoMo Benchmark Report
Executive summary
Genesys scored 89.9% on the LoCoMo benchmark — the academic standard for evaluating long-term conversational memory in AI agents, published by Snap Research. The benchmark tests an agent’s ability to maintain and reason over multi-session conversations across four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge integration.
These results place Genesys among the top-performing memory systems in the published landscape, above Mem0, Zep, MAGMA, and Memobase, and competitive with the highest-reported scores from any system.
Per-category breakdown
| Category | Correct | Total | J-Score |
|---|---|---|---|
| Single-hop | 266 | 282 | 94.3% |
| Open-domain | 771 | 841 | 91.7% |
| Temporal | 281 | 321 | 87.5% |
| Multi-hop | 67 | 96 | 69.8% |
| Overall | 1,385 | 1,540 | 89.9% |
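The overall figure is a micro-average (total correct over total questions), not a mean of the four category percentages. The snippet below, using the counts from the table, makes the distinction concrete:

```python
# Per-category (correct, total) counts from the table above.
categories = {
    "single_hop":  (266, 282),
    "open_domain": (771, 841),
    "temporal":    (281, 321),
    "multi_hop":   (67, 96),
}

correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())

micro = 100 * correct / total  # weights each question equally
macro = sum(100 * c / t for c, t in categories.values()) / len(categories)

print(f"micro-average: {micro:.1f}%")  # 89.9% -- the reported overall J-Score
print(f"macro-average: {macro:.1f}%")  # 85.8% -- dragged down by multi-hop
```

Because multi-hop is the smallest category (96 questions), its weakness costs less in the micro-average than a per-category mean would suggest.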
Per-conversation breakdown
Standard deviation across conversations: 4.0 points — consistent performance with no outlier failures.
| Conversation | Correct | Total | J-Score |
|---|---|---|---|
| conv-30 | 80 | 81 | 98.8% |
| conv-26 | 143 | 152 | 94.1% |
| conv-44 | 114 | 123 | 92.7% |
| conv-42 | 180 | 199 | 90.5% |
| conv-48 | 172 | 191 | 90.1% |
| conv-50 | 142 | 158 | 89.9% |
| conv-47 | 134 | 150 | 89.3% |
| conv-49 | 138 | 156 | 88.5% |
| conv-41 | 130 | 152 | 85.5% |
| conv-43 | 152 | 178 | 85.4% |
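The 4.0-point spread quoted above is the sample standard deviation of the ten per-conversation J-Scores, and can be checked directly:

```python
import statistics

# Per-conversation J-Scores from the table above (percent).
scores = [98.8, 94.1, 92.7, 90.5, 90.1, 89.9, 89.3, 88.5, 85.5, 85.4]

spread = statistics.stdev(scores)  # sample standard deviation
print(f"std dev: {spread:.1f} points")  # 4.0
```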
Competitive landscape
Scores drawn from each system’s own publications or from Mem0’s ECAI 2025 paper. Methodological differences (answering model, judge model, embedder, inclusion of category 5) make exact apples-to-apples comparison difficult.
| System | J-Score | Answer LLM | Judge LLM | License |
|---|---|---|---|---|
| MemMachine v0.2 | 91.7% | gpt-4.1-mini | gpt-4.1-mini | Closed |
| Genesys | 89.9% | gpt-4o-mini | gpt-4o-mini | Open |
| SuperLocalMemory C | 87.7% | Cloud LLM | LLM judge | Open |
| Zep (self-reported) | 75.1–80% | gpt-4o-mini | gpt-4o-mini | Cloud |
| MemOS | 75.8% | gpt-4o-mini | gpt-4o-mini | Open |
| Full Context | 73% | gpt-4o-mini | gpt-4o-mini | N/A |
| MAGMA | 70.0% | N/R | N/R | Research |
| Mem0g (graph) | 68.4% | gpt-4o-mini | gpt-4o-mini | Freemium |
| Mem0 | 67.1% | gpt-4o-mini | gpt-4o-mini | Freemium |
| Zep (per Mem0) | 58.4% | gpt-4o-mini | gpt-4o-mini | Cloud |
Category analysis
Single-hop (94.3%)
Basic factual recall: retrieving a specific fact stated in a conversation. 266 of 282 correct. The 16 remaining failures are primarily entity confusion (retrieving facts about the wrong character) and cases where the conversation contained multiple valid answers but the gold label expected only one.
Open-domain (91.7%)
Integrating conversational memory with general world knowledge. 771 of 841 correct. The answering model’s general knowledge complements Genesys’s retrieval — the system provides the right conversational context, and gpt-4o-mini fills in the world-knowledge reasoning.
Temporal (87.5%)
Reasoning about when events occurred, their chronological ordering, or their relationship to other events in time. 281 of 321 correct. Failures cluster around precise date arithmetic and events mentioned in passing without explicit date markers. This category is a natural strength for the causal graph architecture, which encodes temporal relationships as first-class edges.
Multi-hop (69.8%)
Synthesizing information from multiple, non-adjacent parts of a conversation. 67 of 96 correct. This category is the primary drag keeping the overall score below 90%. Many failures are counterfactual or inferential rather than retrieval-based: the answering model retrieves the right context but fails to make the inferential leap.
This is the clearest path to breaking 90% overall: improving multi-hop from 69.8% to ~80% would push the overall score above the threshold.
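The arithmetic behind that claim: at ~80% multi-hop accuracy (about 77 of 96 correct), the ten extra correct answers lift the overall micro-average past 90%.

```python
correct, total = 1385, 1540        # current overall counts
mh_correct, mh_total = 67, 96      # current multi-hop counts

target_mh = round(0.80 * mh_total)  # ~77 correct at 80%
gain = target_mh - mh_correct       # 10 additional correct answers

new_overall = 100 * (correct + gain) / total
print(f"{new_overall:.1f}%")  # 90.6%
```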
Methodology
The evaluation follows a three-stage pipeline (ingestion, evaluation, judging), each implemented as an independent script for reproducibility.
| Parameter | Value |
|---|---|
| Dataset | LoCoMo (Snap Research), 10 conversations |
| Questions evaluated | 1,540 (categories 1–4) |
| Questions excluded | Category 5 (adversarial, no ground truth) |
| Retrieval depth (k) | 20 memories per query |
| Answering model | gpt-4o-mini (OpenAI) |
| Judge model | gpt-4o-mini (OpenAI) |
| Memory architecture | Genesys causal graph, per-user isolation |
| Avg. memories retrieved | 19.8 per question |
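The judging stage's output aggregation can be sketched as follows (the function and record fields below are illustrative, not the actual Genesys scripts): each judged entry carries a category label and a binary verdict, and the J-Scores fall out of a simple group-by.

```python
from collections import defaultdict

def j_scores(judged):
    """Aggregate binary judge verdicts into per-category and overall J-Scores.

    `judged` is a list of dicts with illustrative fields, e.g.
    {"category": "temporal", "correct": True}.
    """
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for entry in judged:
        bucket = tally[entry["category"]]
        bucket[0] += entry["correct"]
        bucket[1] += 1
    scores = {cat: 100 * c / t for cat, (c, t) in tally.items()}
    c_all = sum(c for c, _ in tally.values())
    t_all = sum(t for _, t in tally.values())
    scores["overall"] = 100 * c_all / t_all
    return scores

# Tiny worked example (not real benchmark data):
sample = [
    {"category": "single_hop", "correct": True},
    {"category": "single_hop", "correct": True},
    {"category": "multi_hop", "correct": False},
    {"category": "multi_hop", "correct": True},
]
print(j_scores(sample))  # {'single_hop': 100.0, 'multi_hop': 50.0, 'overall': 75.0}
```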
Reproduction
All evaluation code, ingestion scripts, and raw results (1,540 judged entries) are available for independent reproduction.