Genesys by Astrix Labs

LoCoMo Benchmark Report

89.9% Overall J-Score · 1,540 Questions · 10 Conversations

Executive summary

Genesys scored 89.9% on the LoCoMo benchmark — the academic standard for evaluating long-term conversational memory in AI agents, published by Snap Research. The benchmark tests an agent’s ability to maintain and reason over multi-session conversations across four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge integration.

These results place Genesys among the top-performing memory systems in the published landscape, above Mem0, Zep, MAGMA, and Memobase, and competitive with the highest-reported scores from any system.

Per-category breakdown

LoCoMo scores by question category
Category        Correct   Total   J-Score
Single-hop          266     282     94.3%
Open-domain         771     841     91.7%
Temporal            281     321     87.5%
Multi-hop            67      96     69.8%
Overall           1,385   1,540     89.9%
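The overall J-score is simply the pooled ratio of correct answers across the four categories; a quick check against the table:

```python
# Per-category (correct, total) counts from the table above.
categories = {
    "single_hop":  (266, 282),
    "open_domain": (771, 841),
    "temporal":    (281, 321),
    "multi_hop":   (67, 96),
}

correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())
print(f"{correct}/{total} = {100 * correct / total:.1f}%")  # 1385/1540 = 89.9%
```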

Per-conversation breakdown

Standard deviation (sample) across the ten conversations: 4.0 points, indicating consistent performance with no outlier failures.

LoCoMo scores by conversation
Conversation   Correct   Total   J-Score
conv-30             80      81     98.8%
conv-26            143     152     94.1%
conv-44            114     123     92.7%
conv-42            180     199     90.5%
conv-48            172     191     90.1%
conv-50            142     158     89.9%
conv-47            134     150     89.3%
conv-49            138     156     88.5%
conv-41            130     152     85.5%
conv-43            152     178     85.4%
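The 4.0-point figure is the sample standard deviation of the ten per-conversation scores, which can be verified directly:

```python
import statistics

# Per-conversation J-scores (percent) from the table above.
scores = [98.8, 94.1, 92.7, 90.5, 90.1, 89.9, 89.3, 88.5, 85.5, 85.4]

print(f"mean  = {statistics.mean(scores):.1f}")   # 90.5
print(f"stdev = {statistics.stdev(scores):.1f}")  # 4.0 (sample standard deviation)
```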

Competitive landscape

Scores drawn from each system’s own publications or from Mem0’s ECAI 2025 paper. Methodological differences (answering model, judge model, embedder, inclusion of category 5) make exact apples-to-apples comparison difficult.

Competitive landscape — LoCoMo scores across memory systems
System                J-Score    Answer LLM     Judge LLM      License
MemMachine v0.2       91.7%      gpt-4.1-mini   gpt-4.1-mini   Closed
Genesys               89.9%      gpt-4o-mini    gpt-4o-mini    Open
SuperLocalMemory C    87.7%      Cloud LLM      LLM judge      Open
Zep (self-reported)   75.1–80%   gpt-4o-mini    gpt-4o-mini    Cloud
MemOS                 75.8%      gpt-4o-mini    gpt-4o-mini    Open
Full Context          73%        gpt-4o-mini    gpt-4o-mini    N/A
MAGMA                 70.0%      N/R            N/R            Research
Mem0g (graph)         68.4%      gpt-4o-mini    gpt-4o-mini    Freemium
Mem0                  67.1%      gpt-4o-mini    gpt-4o-mini    Freemium
Zep (per Mem0)        58.4%      gpt-4o-mini    gpt-4o-mini    Cloud

Category analysis

Single-hop (94.3%)

Basic factual recall: retrieving a specific fact stated in a conversation. 266 of 282 correct. The 16 remaining failures are primarily entity confusion (retrieving facts about the wrong character) and cases where the conversation contained multiple valid answers but the gold label expected only one.

Open-domain (91.7%)

Integrating conversational memory with general world knowledge. 771 of 841 correct. The answering model’s general knowledge complements Genesys’s retrieval — the system provides the right conversational context, and gpt-4o-mini fills in the world-knowledge reasoning.

Temporal (87.5%)

Reasoning about when events occurred, their chronological ordering, or their relationship to other events in time. 281 of 321 correct. Failures cluster around precise date arithmetic and events mentioned in passing without explicit date markers. This category is a natural strength for the causal graph architecture, which encodes temporal relationships as first-class edges.

Multi-hop (69.8%)

Synthesizing information from multiple, non-adjacent parts of a conversation. 67 of 96 correct. This category is the primary drag keeping the overall score below 90%. Many failures are counterfactual or inferential rather than retrieval-based: the answering model retrieves the right context but fails to make the inferential leap.

This is the clearest path to breaking 90% overall: improving multi-hop from 69.8% to ~80% would push the overall score above the threshold.
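That projection is simple arithmetic: lifting multi-hop to 80% adds roughly ten correct answers, which is enough to cross the 90% line.

```python
# Overall totals from the per-category table.
current_correct, total = 1385, 1540
multi_hop_correct, multi_hop_total = 67, 96

# Raise multi-hop from 69.8% to ~80%.
new_multi_hop = round(0.80 * multi_hop_total)  # 77
delta = new_multi_hop - multi_hop_correct      # +10

new_overall = 100 * (current_correct + delta) / total
print(f"{new_overall:.1f}%")  # 90.6%
```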

Methodology

The evaluation follows a three-stage pipeline (ingestion, evaluation, judging), each implemented as an independent script for reproducibility.
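The actual scripts live in the linked repository; purely to illustrate the stage boundaries, here is a toy sketch with naive stand-ins (keyword-overlap retrieval in place of Genesys, substring match in place of the LLM judge). All function names and the sample data are hypothetical.

```python
def ingest(conversation):
    """Stage 1: store each turn as a memory.
    (The real pipeline writes turns into the Genesys causal graph.)"""
    return [turn for session in conversation for turn in session]

def evaluate(memories, question, k=20):
    """Stage 2: retrieve top-k memories and answer.
    (Stand-in: keyword-overlap ranking; the real pipeline sends the
    retrieved context plus the question to gpt-4o-mini.)"""
    q_words = set(question.lower().split())
    ranked = sorted(memories, key=lambda m: -len(q_words & set(m.lower().split())))
    context = ranked[:k]
    return context[0] if context else ""

def judge(answer, gold):
    """Stage 3: grade the answer.
    (Stand-in: substring match; the real pipeline uses an LLM judge.)"""
    return gold.lower() in answer.lower()

# Toy multi-session conversation.
conversation = [["Maya said she adopted a cat named Biscuit in June."],
                ["Later, Maya mentioned Biscuit loves the balcony."]]
memories = ingest(conversation)
answer = evaluate(memories, "What cat did Maya say she adopted?")
print(judge(answer, "Biscuit"))  # True
```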

Dataset                   LoCoMo (Snap Research), 10 conversations
Questions evaluated       1,540 (categories 1–4)
Questions excluded        Category 5 (adversarial, no ground truth)
Retrieval depth (k)       20 memories per query
Answering model           gpt-4o-mini (OpenAI)
Judge model               gpt-4o-mini (OpenAI)
Memory architecture       Genesys causal graph, per-user isolation
Avg. memories retrieved   19.8 per question

Reproduction

All evaluation code, ingestion scripts, and raw results (1,540 judged entries) are available for independent reproduction.

View benchmark code on GitHub →

Prepared by Astrix Labs · April 2026