LoCoMo Benchmark Report
Executive summary
Genesys scored 89.9% on the LoCoMo benchmark — the academic standard for evaluating long-term conversational memory in AI agents, published by Snap Research. The benchmark tests an agent’s ability to maintain and reason over multi-session conversations across four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge integration.
These results place Genesys among the top-performing memory systems in the published landscape, above Mem0, Zep, MAGMA, and Memobase, and competitive with the highest-reported scores from any system.
Per-category breakdown
| Category | Correct | Total | J-Score |
|---|---|---|---|
| Single-hop | 266 | 282 | 94.3% |
| Open-domain | 771 | 841 | 91.7% |
| Temporal | 281 | 321 | 87.5% |
| Multi-hop | 67 | 96 | 69.8% |
| Overall | 1,385 | 1,540 | 89.9% |
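The overall figure is a micro-average (total correct over total questions), not a mean of the four category percentages. The snippet below, using the counts from the table, makes the distinction concrete:

```python
# Per-category (correct, total) counts from the table above.
categories = {
    "single_hop":  (266, 282),
    "open_domain": (771, 841),
    "temporal":    (281, 321),
    "multi_hop":   (67, 96),
}

correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())

micro = 100 * correct / total  # weights each question equally
macro = sum(100 * c / t for c, t in categories.values()) / len(categories)

print(f"micro-average: {micro:.1f}%")  # 89.9% -- the reported overall J-Score
print(f"macro-average: {macro:.1f}%")  # 85.8% -- dragged down by multi-hop
```

Because multi-hop is the smallest category (96 questions), its weakness costs less in the micro-average than a per-category mean would suggest.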
Per-conversation breakdown
Standard deviation across conversations: 4.0 points — consistent performance with no outlier failures.
| Conversation | Correct | Total | J-Score |
|---|---|---|---|
| conv-30 | 80 | 81 | 98.8% |
| conv-26 | 143 | 152 | 94.1% |
| conv-44 | 114 | 123 | 92.7% |
| conv-42 | 180 | 199 | 90.5% |
| conv-48 | 172 | 191 | 90.1% |
| conv-50 | 142 | 158 | 89.9% |
| conv-47 | 134 | 150 | 89.3% |
| conv-49 | 138 | 156 | 88.5% |
| conv-41 | 130 | 152 | 85.5% |
| conv-43 | 152 | 178 | 85.4% |
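The 4.0-point spread quoted above is the sample standard deviation of the ten per-conversation J-Scores, and can be checked directly:

```python
import statistics

# Per-conversation J-Scores from the table above (percent).
scores = [98.8, 94.1, 92.7, 90.5, 90.1, 89.9, 89.3, 88.5, 85.5, 85.4]

spread = statistics.stdev(scores)  # sample standard deviation
print(f"std dev: {spread:.1f} points")  # 4.0
```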
Competitive landscape
Scores drawn from each system’s own publications or from Mem0’s ECAI 2025 paper. Methodological differences (answering model, judge model, embedder, inclusion of category 5) make exact apples-to-apples comparison difficult.
| System | J-Score | Answer LLM | Judge LLM | License |
|---|---|---|---|---|
| MemMachine v0.2 | 91.7% | gpt-4.1-mini | gpt-4.1-mini | Closed |
| Genesys | 89.9% | gpt-4o-mini | gpt-4o-mini | Open |
| SuperLocalMemory C | 87.7% | Cloud LLM | LLM judge | Open |
| Zep (self-reported) | 75.1–80% | gpt-4o-mini | gpt-4o-mini | Cloud |
| MemOS | 75.8% | gpt-4o-mini | gpt-4o-mini | Open |
| Full Context | 73% | gpt-4o-mini | gpt-4o-mini | N/A |
| MAGMA | 70.0% | N/R | N/R | Research |
| Mem0g (graph) | 68.4% | gpt-4o-mini | gpt-4o-mini | Freemium |
| Mem0 | 67.1% | gpt-4o-mini | gpt-4o-mini | Freemium |
| Zep (per Mem0) | 58.4% | gpt-4o-mini | gpt-4o-mini | Cloud |
Category analysis
Single-hop (94.3%)
Basic factual recall: retrieving a specific fact stated in a conversation. 266 of 282 correct. The 16 remaining failures are primarily entity confusion (retrieving facts about the wrong character) and cases where the conversation contained multiple valid answers but the gold label expected only one.
Open-domain (91.7%)
Integrating conversational memory with general world knowledge. 771 of 841 correct. The answering model’s general knowledge complements Genesys’s retrieval — the system provides the right conversational context, and gpt-4o-mini fills in the world-knowledge reasoning.
Temporal (87.5%)
Reasoning about when events occurred, their chronological ordering, or their relationship to other events in time. 281 of 321 correct. Failures cluster around precise date arithmetic and events mentioned in passing without explicit date markers. This category is a natural strength for the causal graph architecture, which encodes temporal relationships as first-class edges.
Multi-hop (69.8%)
Synthesizing information from multiple, non-adjacent parts of a conversation. 67 of 96 correct. This category is the primary drag keeping the overall score below 90%. Many failures are counterfactual or inferential rather than retrieval-based: the answering model retrieves the right context but fails to make the inferential leap.
This is the clearest path to breaking 90% overall: improving multi-hop from 69.8% to ~80% would push the overall score above the threshold.
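The arithmetic behind that claim: at ~80% multi-hop accuracy (about 77 of 96 correct), the ten extra correct answers lift the overall micro-average past 90%.

```python
correct, total = 1385, 1540        # current overall counts
mh_correct, mh_total = 67, 96      # current multi-hop counts

target_mh = round(0.80 * mh_total)  # ~77 correct at 80%
gain = target_mh - mh_correct       # 10 additional correct answers

new_overall = 100 * (correct + gain) / total
print(f"{new_overall:.1f}%")  # 90.6%
```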
Methodology
The evaluation follows a three-stage pipeline (ingestion, evaluation, judging), each implemented as an independent script for reproducibility.
| Parameter | Value |
|---|---|
| Dataset | LoCoMo (Snap Research), 10 conversations |
| Questions evaluated | 1,540 (categories 1–4) |
| Questions excluded | Category 5 (adversarial, no ground truth) |
| Retrieval depth (k) | 20 memories per query |
| Answering model | gpt-4o-mini (OpenAI) |
| Judge model | gpt-4o-mini (OpenAI) |
| Memory architecture | Genesys causal graph, per-user isolation |
| Avg. memories retrieved | 19.8 per question |
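The judging stage's output aggregation can be sketched as follows (the function and record fields below are illustrative, not the actual Genesys scripts): each judged entry carries a category label and a binary verdict, and the J-Scores fall out of a simple group-by.

```python
from collections import defaultdict

def j_scores(judged):
    """Aggregate binary judge verdicts into per-category and overall J-Scores.

    `judged` is a list of dicts with illustrative fields, e.g.
    {"category": "temporal", "correct": True}.
    """
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for entry in judged:
        bucket = tally[entry["category"]]
        bucket[0] += entry["correct"]
        bucket[1] += 1
    scores = {cat: 100 * c / t for cat, (c, t) in tally.items()}
    c_all = sum(c for c, _ in tally.values())
    t_all = sum(t for _, t in tally.values())
    scores["overall"] = 100 * c_all / t_all
    return scores

# Tiny worked example (not real benchmark data):
sample = [
    {"category": "single_hop", "correct": True},
    {"category": "single_hop", "correct": True},
    {"category": "multi_hop", "correct": False},
    {"category": "multi_hop", "correct": True},
]
print(j_scores(sample))  # {'single_hop': 100.0, 'multi_hop': 50.0, 'overall': 75.0}
```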
Reproduction
All evaluation code, ingestion scripts, and raw results (1,540 judged entries) are available for independent reproduction.