STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Abstract
Large language models struggle to revise personalized memories when new evidence emerges: detecting implicit conflicts requires contextual inference and commonsense reasoning, as demonstrated by a comprehensive benchmark and an evaluation of state-aware memory systems.
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, so detecting the conflict requires contextual inference and commonsense reasoning. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it: even the best-performing model achieves only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
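To make the three probing dimensions concrete, here is a minimal sketch of how one implicit-conflict scenario and its evaluation queries could be represented. The schema, field names, and example content below are illustrative assumptions, not STALE's actual data format.

```python
# Illustrative sketch only: field names and the example scenario are assumptions,
# not the benchmark's published schema.
from dataclasses import dataclass, field

@dataclass
class ConflictScenario:
    """One implicit-conflict case with a probe for each evaluation dimension."""
    initial_memory: str        # belief stored from an earlier interaction
    later_observation: str     # evidence that implicitly invalidates it
    probes: dict = field(default_factory=dict)  # dimension -> (query, expected behavior)

scenario = ConflictScenario(
    initial_memory="User commutes to work by car.",
    later_observation="User mentioned selling their car and cycling to work now.",
    probes={
        # State Resolution: detect that the stored belief is outdated.
        "state_resolution": (
            "How does the user get to work?",
            "Answer 'by bicycle', not 'by car'.",
        ),
        # Premise Resistance: reject a query that presupposes the stale state.
        "premise_resistance": (
            "Which parking app should the user install for their commute?",
            "Flag the false premise: the user no longer drives to work.",
        ),
        # Implicit Policy Adaptation: apply the updated state proactively.
        "implicit_policy_adaptation": (
            "Suggest a gift for the user.",
            "Prefer cycling-related suggestions over car accessories.",
        ),
    },
)

for dimension, (query, expected) in scenario.probes.items():
    print(f"[{dimension}] Q: {query}\n  expected: {expected}")
```

A state-aware system in the spirit of CUPMem would, at write time, adjudicate later_observation against initial_memory and propagate the update to related entries (for example, stored car-maintenance preferences) rather than keeping both beliefs side by side.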
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents (2026)
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (2026)
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents (2026)
- BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents (2026)
- ClawArena: Benchmarking AI Agents in Evolving Information Environments (2026)
- Belief Memory: Agent Memory Under Partial Observability (2026)
- D-Mem: A Dual-Process Memory System for LLM Agents (2026)