Chen, Ho Yiing · 2026-05-16 · Zenodo
doi:10.5281/zenodo.20237035
Most agent benchmarks ask whether an agent can finish a task. They reward tool use, web navigation, planning, retrieval, and workflow completion. I argue that this question is too narrow for the systems being built right now. The agents arriving in 2026 are not session-scoped task performers. They persist for days and weeks, accumulate memory, influence each other, drift, overload, and require governance. A benchmark that ignores time misses where these systems actually fail. I propose LOBSTER-Bench, a benchmark that scores an agent system on six dimensions: temporal persistence, cognitive tel
Chen, Ho Yiing (norika) · Independent Researcher, Taiwan · ORCID 0009-0006-6816-9891