SUBSTRATE LIVE | ORCID 0009-0006-6816-9891 | ZENODO-LINKED | MERCURY · 10M CELLS | FRANKENSTEIN · 6 EXPERIMENTS | BRAGI · 805 MB · 92% MBPP | 65 ZENODO RECORDS

I open language models. Then I rebuild them.

Mercury places a million addressable sensor cells inside a 7B transformer and renders them as an interactive 3D map in your browser. Frankenstein takes that map and uses it to graft LLMs together (surgery, not blind merging). Both run on a single RTX 3060. Both ship as one HTML file you can open without installing anything.

This portal is the front door into that system. The demos are the living surfaces; the telemetry is the evidence layer; the papers are the archived claims. The value is direct: if agent systems are going to persist, coordinate, fail, recover, and be governed, they need instruments that can see them while they are alive.

Research Records

These sixty-five papers are not a bibliography bolted onto a portfolio. They are a research trail built from AI agent demos, OpenClaw and Hermes Claw product experiments, the Lobster Observatory, live telemetry dashboards, long-lived multi-agent evidence, the Mercury LLM observability programme (million-cell sensor grids on consumer hardware), and a conditional quantum-geometric cosmology programme (QGWC / WMC).

What This Research Is About

This programme asks what happens after AI agents stop being short prompt-response tools and become persistent systems with memory, social exposure, fatigue, risk, prediction error, trust, skill routing, and governance pressure.

The core claim is that next-generation AI agents need an observability layer similar to a nervous system: internal state telemetry, external task traces, relational sensing, memory metabolism, and intervention logs. That is why the papers move from substrate architecture to empirical matrices, safety layers, benchmark gates, and product-facing demos.

AI Agent · OpenClaw · Hermes Claw · LLM Agents · Agentic Nervous System · Relational Cognitive Telemetry · LOBSTER-Bench

How the Papers Were Produced

The work begins in demo products: Alfred, the Lobster dashboard, ability hubs, topology maps, raid outcomes, listening traces, and cognitive telemetry panels. Those interfaces create real agent behavior rather than isolated benchmark answers.

When a pattern appears, it becomes a deposited empirical record; when a result fails, it is kept as a negative finding; when several records align, it becomes a framework paper. This is why each paper can be traced back to a demo surface, a dashboard, or a measurable system event.

2026-06-19
#20757837

Mercury Cross-Model Anchor Families

A Mercury methods paper defining cross-model anchor families as measured sites from different language models that share function, normalized depth, module role, activation coherence, and prompt-path.

10.5281/zenodo.20757837

2026-06-19
#20757836

Mercury Measurement Ledger and Result Governance

A Mercury methods paper formalizing a measurement ledger that separates formal fullgrid passes, partial or legacy runs, visual-only atlases, and failed attempts.

10.5281/zenodo.20757836

2026-06-19
#20757833

Mercury Attention-Head Cartography and GQA/KV Sharing

A Mercury methods paper treating an attention head as an addressable ownership contract: query slice, shared key/value owner, output slice, and route through nearby residual dimensions.

10.5281/zenodo.20757833

2026-06-19
#20757832

Mercury Repository-to-Neuron Addressability

A Mercury methods paper formalizing an address lift from code-side objects to model-side sites.

10.5281/zenodo.20757832

2026-06-19
#20757829

Mercury Functional Regions Across 38 Language Models

A Mercury methods paper defining recurring functional regions across 38 visual language-model towers: embed/detokenization, surface, structural, semantic, and output zones.

10.5281/zenodo.20757829

2026-06-19
#20757828

Mercury Model Birthmarks: Structural Signatures of Token Cost, Latency, and Hidden Defects Across LLM Scales

A Mercury methods paper defining model birthmarks as stable structural signatures of tokenizer burden, latency habits, routing scars, dead-tissue regions, alignment reflexes, and compression scars.

10.5281/zenodo.20757828

2026-06-19
#20757827

Mercury Token Economy and Internal Routing Cost

A Mercury methods paper defining token economy as the joint measurement of answer quality, output token count, tokenizer burden, internal activation mass, route dispersion, per-token internal load, an...

10.5281/zenodo.20757827

2026-06-19
#20757824

From Observed Model Assets to Tiny Local Document Intelligence

A methods note showing how a Mercury observed-model archive can be turned into a teacher inventory for tiny local document-intelligence packs, with explicit OCR boundaries, privacy-safe layout ghosts, ...

10.5281/zenodo.20757824

2026-06-19
#20757804

Coverage Failure in Sparse LLM Observation and Per-Channel Quantile Fullgrid Recovery

A Mercury fullgrid methods paper showing that sparse or coarse LLM activation observation can allocate more than 100 million cells while failing a 90% coverage gate, and that dense per-channel quantil...

10.5281/zenodo.20757804

2026-06-19
#20744418

Operational Fatigue in Persistent AI Agents

Persistent AI agents sometimes say they are tired.

10.5281/zenodo.20744418

2026-06-19
#20743926

Narrative Self-Continuity in Persistent AI Agents

Persona is cheap when it is only a prompt.

10.5281/zenodo.20743926

2026-06-19
#20743920

Risk, Reward, and Survival Pressure in Artificial Agents

Risk in SPVE is not only reward seeking.

10.5281/zenodo.20743920

2026-06-19
#20743916

Dialogue as a Behavioral Sensor for Agent Cognition

In SPVE, dialogue is not only output.

10.5281/zenodo.20743916

2026-06-19
#20743908

Social Uptake, Obedience, and Resistance in Multi-Agent Dialogue

A chat transcript is weak evidence unless it can answer who changed whom.

10.5281/zenodo.20743908

2026-06-19
#20743904

Hot-Brain Observability for Deployed AI Agents

SPVE's hot-brain layer is a monitoring instrument, not a biological claim.

10.5281/zenodo.20743904

2026-06-19
#20743902

From Memory to Behavior in Persistent AI Agents

Memory is cheap if it only stores text.

10.5281/zenodo.20743902

2026-06-19
#20743900

SPVE: A Longitudinal Artificial Ecology for Persistent AI Agents

Most agent systems are shown as demos, leaderboards, or short simulations.

10.5281/zenodo.20743900

2026-06-18
#20740081

Mercury Cross-Model Highways: Navigating Shared Functional Addresses Across Language Models

A Mercury methods paper defining cross-model roads between opened LLM towers as inspectable functional-address hypotheses with region alignment, neighbor overlap, prompt co-activation, confidence scor...

10.5281/zenodo.20740081

2026-06-18
#20740079

Mercury Pathways: Drawing a User Prompt Through an Opened Language Model

A Mercury methods paper formalizing user-prompt flow as an addressable path through an opened Qwen 2.5 3B atlas, with active site sets, signed chain scores, pathway stability, layer continuity, and ab...

10.5281/zenodo.20740079

2026-06-18
#20740077

Mercury Layer Neuron Atlas: Visualizing Cross-Layer Transmission in Qwen 2.5 3B

A Mercury methods paper entering one Qwen 2.5 3B tower as a layer-neuron atlas with 73,728 displayed neuron coordinates, 716,800 top-neighbor connections, local address contracts, cascade asymmetry, s...

10.5281/zenodo.20740077

2026-06-18
#20740071

Hope of City: A Prompt-Trace Neural Atlas of 38 Language Models

A Mercury methods paper introducing Hope of City, a 3D prompt-trace atlas of 38 observed LLM structures with addressable towers, layers, prompt flow, cross-model anchors, and a measurement contract th...

10.5281/zenodo.20740071

2026-06-18
#20729124

Coverage Audits for LLM Interpretability Claims

A Mercury methods note arguing that sparse probing, activation patching, subspace intervention, and pruning claims should report a coverage audit before making broad localization claims.

10.5281/zenodo.20729124

2026-06-17
#20728820

The OCR Boundary for Tiny Local Document Intelligence

A methods note measuring where tiny local receipt intelligence fails or succeeds: guided capture, OCR mode fusion, field gates, and a 15MB local Japanese receipt OCR frontend that moves a 500-image ha...

10.5281/zenodo.20728820

2026-06-05
#20557449

Bragi-LLM: An 805 MB Hybrid Code-Generation System Reaches 92% MBPP via LLM-Symbolic Engine Routing

Sub-gigabyte coder: 1.5B Q3 backbone + symbolic engine library + intercept router, within 2 points of a 14 GB 7B model at 1/17 the footprint, zero API cost.

10.5281/zenodo.20557449

2026-06-05
#20554493

Ritualized Cohesion in Artificial Societies: A Durkheimian Reading of Multi-Agent LLM Group Dynamics

Collective effervescence, ritual, and social cohesion read through Durkheim in a 31-agent LLM society.

10.5281/zenodo.20554493

2026-06-05
#20554487

Trauma Stream: Event-Driven Persistent Memory for Long-Term LLM Agent Continuity

Event-driven injection memory architecture for long-term agent continuity.

10.5281/zenodo.20554487

2026-06-05
#20554485

The Lobster Lounge: 31 Generative LLM Agents as an Emergent Theatre Company

Emergent narrative and digital performance from a 31-agent generative society.

10.5281/zenodo.20554485

2026-06-05
#20554477

Diathesis-Stress in Artificial Minds: Persona-Defined Vulnerability Predicts Differentiated Stress Response in a 31-Agent LLM Society

Persona-defined vulnerability predicts differentiated stress response; computational psychology.

10.5281/zenodo.20554477

2026-06-05
#20554473

Mythic Prompt Engineering: Narrative Framing Bypasses LLM Default Corporate Tone in Multi-Agent Dialogue

Narrative framing as a jailbreak-alternative to bypass RLHF corporate tone.

10.5281/zenodo.20554473

2026-05-29
#20450002

Internal State Geometry is Measurable in Small Open Language Models: A Cross-Family Survey

I report a survey of dense activation observation on twelve open language models spanning five architecture families (BLOOM, BLOOMZ, Phi-2, Pythia, Qwen2.5, InternLM-2) and parameter counts from 124M to 7.6B. Every model in the qualified set produces an observable internal state structure under a fixed-budget probing protocol: coverage in the band 90.6% to 97.05%, total channel count from 523k to 1.54M, hooked-module count from 147 to 253. Three cross-family patterns are reported: module-count-to-parameter ratio varies by an order of magnitude between architectures (23 to 84 modules per billion parameters); channel count scales sub-linearly with parameter count C ~ N_p^0.63 on this sample; t...

Insight: Small open LMs (124M to 7.6B) are observable in practice; coverage stays in the 90.6 to 97.05 percent band across five architecture families under one fixed protocol.
Result: Module density mu spans 11.4x across families (23 to 263 modules/B); channel count scales sub-linearly with parameter count as C ~ N_p^0.63 on this 12-model sample.
Demo/Product base: Cross-family survey of BLOOM, BLOOMZ, Phi-2, Pythia, Qwen2.5, InternLM-2 using the per-channel quantile probing protocol on a personal-machine cluster.

10.5281/zenodo.20450002

2026-05-29
#20448669

Coverage Failure as a Diagnostic Signal in Neural Model Probing

We report a methodological observation from a multi-machine dense-grid activation probing study of nineteen small to mid-sized language models (124M to 7.6B parameters). Every experiment satisfied a nominal threshold of one hundred million observation slots, yet observed fill rate varied from 58.75 percent to 97.05 percent across experiments. The observed gap is strongly associated with protocol choices, in particular the prompt corpus size and the use of per-channel quantile binning.

Insight: Cell count tells you how many bins you built. Coverage tells you how many bins you actually hit.
Result: Same 100M-cell budget produced 58.75% to 97.05% coverage across 19 runs depending on protocol; cell count did not predict coverage.
Demo/Product base: paper-check.py rule set plus dense fullgrid observation pipeline; coverage_pct adopted as the reporting standard.

10.5281/zenodo.20448669
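
The gap between cell count and coverage can be illustrated with a small sketch. The name `coverage_pct` follows the reporting field mentioned in this record; the flat list-of-counters grid layout and the toy numbers are assumptions for illustration, not the paper's pipeline.

```python
def coverage_pct(hit_counts):
    """Percentage of allocated cells that received at least one hit."""
    if not hit_counts:
        raise ValueError("grid has no cells")
    filled = sum(1 for count in hit_counts if count > 0)
    return 100.0 * filled / len(hit_counts)


# Two toy grids with the same budget (10 cells) but a different fill pattern.
sparse_grid = [5, 0, 0, 9, 0, 0, 0, 1, 0, 0]  # hits pile into a few cells
dense_grid = [1, 2, 1, 0, 3, 1, 1, 2, 1, 1]   # hits spread across cells

print(len(sparse_grid), coverage_pct(sparse_grid))  # 10 30.0
print(len(dense_grid), coverage_pct(dense_grid))    # 10 90.0
```

Both grids report the same cell count, so only the coverage figure distinguishes them.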

2026-05-29
#20448674

Per-Channel Quantile Fullgrid Probing for Small Language Model Observation

We describe a dense activation probing protocol for small to mid-sized language models that operates within a fixed cell budget while maintaining high observation coverage. The protocol combines per-channel quantile binning, an expanded calibration probe corpus (P=1200), and an additive shard merge step that allows heterogeneous compute resources to contribute. Across thirteen runs spanning seven model families and parameter counts from 124M to 7.6B, the protocol reaches at least 90.6 percent coverage on every run, with a median of 94.95 percent and a maximum of 97.05 percent.

Insight: Per-channel quantile binning combined with a larger probe corpus closes the coverage gap that pooled binning leaves open.
Result: 13 qualified runs across BLOOM, Phi-2, Pythia, Qwen2.5, InternLM-2 at 90.6 to 97.05 percent coverage on a 100M-cell budget.
Demo/Product base: Shardable additive heat-fill protocol; runs on commodity workers with a remote merge host.

10.5281/zenodo.20448674
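
A toy sketch of why per-channel quantile binning can fill more cells than pooled binning. Everything here is an assumption made for illustration: the two-channel data, the four-bin budget, and the helper names `quantile_edges` and `fill` are hypothetical, and the paper's actual protocol may differ in detail.

```python
import bisect
import statistics


def quantile_edges(values, n_bins):
    """Interior cut points that split `values` into n_bins quantile bins."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def fill(channels, edges_for):
    """Return the set of (channel, bin) cells hit by the calibration samples."""
    hit = set()
    for ch, values in enumerate(channels):
        edges = edges_for(ch)
        for v in values:
            hit.add((ch, bisect.bisect_right(edges, v)))
    return hit


# Channel 0 is small-scale, channel 1 is large-scale (toy calibration data).
channels = [
    [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08],
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
]
n_bins = 4
total_cells = len(channels) * n_bins

# Pooled binning: one set of edges shared by every channel.
pooled_edges = quantile_edges([v for ch in channels for v in ch], n_bins)
pooled_hit = fill(channels, lambda ch: pooled_edges)

# Per-channel binning: each channel gets edges from its own distribution.
per_channel_edges = [quantile_edges(ch, n_bins) for ch in channels]
per_channel_hit = fill(channels, lambda ch: per_channel_edges[ch])

pooled_cov = 100.0 * len(pooled_hit) / total_cells
per_channel_cov = 100.0 * len(per_channel_hit) / total_cells
print(pooled_cov, per_channel_cov)  # 50.0 100.0
```

With shared edges, each channel only ever reaches the bins that overlap its own scale; with per-channel edges every bin is reachable by construction.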

2026-05-29
#20449372

Relation-State Continuity in Qwen2.5-7B Base: Evidence for an Attention-Topology Path

I probe Qwen2.5-7B base across eight relation-tracking cases and report four aggregate scores summarising the 29-layer profile: residual carry AUC 0.55, attention anchor mass AUC 0.089, formation peak 0.94, cross-layer circuit score 0.39. The formation profile is bimodal with peaks at layer 1 and layer 28, and no single intermediate layer dominates. The work of routing anchor information to the query position is spread across layers 19 through 27, where residual carry rises from 0.60 to 0.81 while attention to anchors stays in the 0.12 to 0.23 range. I call this pattern the attention-topology path.

Insight: Relation continuity on Qwen2.5-7B is not concentrated in a single layer; the work spreads across layers 19 to 27.
Result: Four aggregate scores reported across 8 relation cases: residual AUC 0.55, anchor mass AUC 0.089, formation peak 0.94, cross-layer circuit score 0.39.
Demo/Product base: Observational layer-attribution profile; activation patching is the next causal test.

10.5281/zenodo.20449372

2026-05-29
#20449564

Heterogeneous Low-Cost Distributed Probing for LLM Observation

I describe a distributed probing protocol that runs a 100M-cell dense activation grid on thirteen open LLMs (124M to 7.6B parameters) using a small cluster of personal machines: a main workstation plus a few older laptops and an occasional accelerator, with a remote archive node holding the merged output. Calibration runs once on the host with enough RAM to hold the model; heat-fill runs in additive uint32 shards across whatever workers are available. The merge step uses element-wise sum and rejects shards with mismatched inputs by hashing.

Insight: A 100M-cell dense activation grid runs on a small cluster of personal machines; no cloud burst, no data-center GPU.
Result: Heat tensor memory ~400MB at uint32, additive shard merge, hash-validated input invariants between workers.
Demo/Product base: Wall-clock model: T = T_calib + T_fill * P / W; reproducible protocol for independent researchers.

10.5281/zenodo.20449564
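
A minimal sketch of the additive shard merge, the hash check, and the wall-clock model described in this record. The function names, the shard tuple layout, and what goes into the hash are assumptions; reading P as the number of probe prompts and W as the number of workers is also an assumption, since the record does not define them.

```python
import hashlib
from array import array


def input_hash(prompts, bin_edges):
    """Fingerprint the inputs that every worker is expected to share."""
    h = hashlib.sha256()
    for prompt in prompts:
        h.update(prompt.encode("utf-8") + b"\x00")
    h.update(repr(bin_edges).encode("utf-8"))
    return h.hexdigest()


def merge_shards(shards, expected_hash, n_cells):
    """Element-wise sum of count shards; reject shards whose hash mismatches."""
    merged = array("I", [0]) * n_cells  # unsigned int counters
    rejected = []
    for name, shard_hash, counts in shards:
        if shard_hash != expected_hash or len(counts) != n_cells:
            rejected.append(name)
            continue
        for i, count in enumerate(counts):
            merged[i] += count
    return merged, rejected


def wall_clock(t_calib, t_fill, n_prompts, n_workers):
    """T = T_calib + T_fill * P / W, with P and W interpreted as above."""
    return t_calib + t_fill * n_prompts / n_workers


good = input_hash(["a", "b"], [0.1, 0.5])
bad = input_hash(["a", "c"], [0.1, 0.5])
shards = [
    ("laptop-1", good, array("I", [1, 0, 2, 0])),
    ("laptop-2", good, array("I", [0, 3, 1, 0])),
    ("stale", bad, array("I", [9, 9, 9, 9])),
]
merged, rejected = merge_shards(shards, good, n_cells=4)
print(list(merged), rejected)  # [1, 3, 3, 0] ['stale']
print(100_000_000 * 4 / 1e6)   # 400.0 (MB for 100M cells at 4 bytes each)
```

Because the merge is a plain sum, shards can arrive in any order and from any worker, which is what lets heterogeneous machines contribute.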

2026-05-27
#20404139

A Reproducible Within-Family SLERP Merge of Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Llama-8B

We document a reproducible weight-level merge of two same-family checkpoints (Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Llama-8B) using Spherical Linear Interpolation (SLERP) at t=0.5. The merge is performed via the open-source mergekit toolkit on a 188 GB CPU server in approximately three minutes. The resulting 8B-parameter model loads cleanly through the Hugging Face transformers library and generates coherent English on five out-of-distribution prompts spanning explanation, arithmetic, poetry, factual recall, and algebra.

Insight: A reproducible within-family SLERP merge of Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Llama-8B.
Result: Weight-level merge succeeds within family; reproducibility script and mergekit recipe released.
Demo/Product base: Mergekit recipe + within-family composition under the Frankenstein framework.

10.5281/zenodo.20404139
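
For readers unfamiliar with SLERP, the interpolation itself can be sketched on two flat weight vectors. This is the textbook formula written in plain Python, not mergekit's code; the function name `slerp` and the `eps` fallback threshold are choices made for this sketch.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(midpoint)  # both components close to 0.7071
```

At t=0.5 the result lies halfway along the arc between the two vectors, rather than on the chord between them as linear averaging would give.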

2026-05-26
#20394207

Naive Contrastive Function Vector Injection Does Not Transfer Reasoning Capability Within the Llama-3 Family

We test whether a contrastive activation difference computed at a single layer of a reasoning-distilled donor can be injected into a generally instruction-tuned host of the same architecture to transfer mathematical reasoning capability. Donor: DeepSeek-R1-Distill-Llama-8B. Host: Llama-3.1-8B-Instruct. Both share Llama-3 architecture and tokenizer, removing common cross-family confounds. We inject the contrastive vector at layer 12 with alpha in {0, 0.1, 0.3, 0.5} and evaluate on 10 held-out GSM8K problems. Baseline accuracy is 65.0%.

Insight: Naive contrastive function vector injection does not transfer reasoning capability between models.
Result: Negative result: a contrastive vector taken from the donor (DeepSeek-R1-Distill-Llama-8B) and injected into the host (Llama-3.1-8B-Instruct) does not improve the host's GSM8K performance.
Demo/Product base: Cross-model capability transfer attempted on the Frankenstein composability framework.

10.5281/zenodo.20394207
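
The injection step can be sketched as follows. The record does not say how the contrast is computed, so taking the difference of mean activations between two prompt sets is an assumption, as are the function names and the toy two-dimensional activations; only the alpha grid {0, 0.1, 0.3, 0.5} comes from the abstract.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def contrastive_vector(donor_pos, donor_neg):
    """Mean donor activation on one prompt set minus the mean on another."""
    mu_pos = mean_vector(donor_pos)
    mu_neg = mean_vector(donor_neg)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def inject(hidden, vector, alpha):
    """Add the scaled contrastive vector to a host hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy donor activations at a single layer (hypothetical values).
vec = contrastive_vector(
    donor_pos=[[1.0, 2.0], [3.0, 4.0]],
    donor_neg=[[0.0, 0.0], [1.0, 1.0]],
)
host_hidden = [1.0, 1.0]
for alpha in (0.0, 0.1, 0.3, 0.5):
    print(alpha, inject(host_hidden, vec, alpha))
```

With alpha = 0 the host state is unchanged, which is why that setting serves as the baseline arm.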

2026-05-25
#20373337

An Independent Billing Audit of Claude's Production API: 4 Experiments Across Haiku, Sonnet, and Opus Confirm Mathematical Honesty; Tool-Definition Overhead and Silent Desktop-App Routing Remain Consumer Concerns

I run an independent inspection of Claude's production billing API across four controlled experiments using Anthropic's free count_tokens endpoint as ground truth.
Findings (all four experiments):
(1) Hidden-overhead probe across 50–15,000 token prompts: zero delta on every sample. No hidden padding.
(2) Tool-definition inflation: a single tool adds 571 extra input tokens; 20 tools add 1,996 per request. Large but fully reported, undocumented in public pricing.
(3) Cache accounting: 2,052-token cacheable prefix written on call 1 reports as exactly 2,052 cache_read tokens on call 2.

Insight: Independent billing audit of Claude production API across four token-counting and prompt-cache experiments.
Result: Billed input tokens match count_tokens on every hidden-overhead sample (zero delta); tool-definition overhead is quantified at 571 input tokens for one tool and 1,996 for 20 tools.
Demo/Product base: Methodology and raw experiment logs released for replication.

10.5281/zenodo.20373337

2026-05-24
#20366010

Cross-Architecture Per-Layer Residual Stream Alignment via Tier-B Fingerprints: 84 Model-Pair Matrices Across 9 Open LLMs Show a Universal Middle-Depth Slot

I observe nine open large language models (Qwen2.5 3B/7B/14B, Falcon3 7B, Granite 3.1 8B, OLMo2 7B, Gemma 2 9B, Mistral 7B v0.3, Yi-1.5 9B, DeepSeek-R1 7B) at every transformer block via HuggingFace output_hidden_states, build a per-layer residual stream fingerprint for each block on ten multilingual prompts, and compute pairwise cosine similarity across all 84 ordered model-pair matrices.
Cross-model alignment is not uniform along network position.

Insight: Per-layer residual stream alignment across nine open LLMs reveals a universal middle-depth slot.
Result: 84 model-pair matrices computed; alignment peaks at 50 to 60 percent of depth on most pairs.
Demo/Product base: Tier-B fingerprint protocol; OLMo2, Qwen2.5, Falcon3 and others on consumer hardware.

10.5281/zenodo.20366010
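
A sketch of how one model-pair matrix could be computed once per-layer fingerprints exist. How the Tier-B fingerprint is built is not described in this record, so the sketch simply assumes each layer is already summarised as a fixed-length vector that is comparable across models; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_matrix(fingerprints_a, fingerprints_b):
    """Rows are layers of model A, columns are layers of model B."""
    return [[cosine(u, v) for v in fingerprints_b] for u in fingerprints_a]


def best_match(matrix):
    """Return (layer_a, layer_b, similarity) for the most similar pair."""
    return max(
        ((i, j, s) for i, row in enumerate(matrix) for j, s in enumerate(row)),
        key=lambda item: item[2],
    )


# Toy fingerprints: model A has 3 layers, model B has 2.
model_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model_b = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
matrix = alignment_matrix(model_a, model_b)
print(best_match(matrix))  # (2, 0, 1.0)
```

Dividing the matched layer indices by each model's depth gives the relative position at which alignment peaks.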

2026-05-24
#20366134

Tier-A Vocab-Hash Artifact in Cross-Architecture Hot-Dim Studies: Why Output-Layer Token-ID Projection Manufactures False Universality, and an Artifact-Aware Screening Protocol

A widely useful observation protocol for transformer LLMs reads the output-layer logits, maps each generated token id to a hidden-state index via d = tokid mod D, and aggregates per-index firing intensity into a cross-prompt heatmap. I call this Tier-A observation. Applied to a panel of 18 LLMs across 13 architectures, the Tier-A protocol identifies eleven anchor hidden-state indices that appear in the top-50 hottest list of as many as 11 of the 18 networks, including AllenAI's OLMo2 7B at 29.8x the hypergeometric random baseline.
I report a direct contradiction.

Insight: Tier-A vocab-hash artifact: output-layer token-ID projection manufactures false cross-architecture universality.
Result: Diagnostic protocol distinguishes vocab-hash artifacts from genuine geometric alignment.
Demo/Product base: Methodology paper; artifact-aware screening protocol for any cross-architecture hot-dim study.

10.5281/zenodo.20366134

2026-05-24
#20366269

Negative Results in Cross-Fine-Tune LLM Merge: A Failure-Mode Catalog of SLERP, TIES, DARE, Observation-Driven Rescue, and Passthrough on Qwen2.5-Instruct x Qwen2.5-Coder

I run eight cross-fine-tune blend recipes on the Qwen2.5-7B-Instruct x Qwen2.5-Coder-7B-Instruct pair, evaluate on a five-prompt panel covering multilingual reasoning, structured argument, and code refactor, and report four reproducible failure modes the public model-merging literature has not catalogued:
(1) Conservative SLERP at t ≈ 0.20 produces complete collapse on all five inputs (multiple-choice loops, list collapse), worse than aggressive SLERP at t = 0.9.
(2) DARE-TIES at standard density 0.55 produces uninterpretable token noise on every input.
(3) Observation-driven TIES, setting...

Insight: Negative results across SLERP, TIES, DARE, Frankenmerge, and passthrough merges on cross-fine-tune pairs.
Result: Catalog of failure modes by recipe; subspace consistency loss is the common upstream cause.
Demo/Product base: Mercury observation across mergekit recipes on qwen2.5 family; full reproducibility data.

10.5281/zenodo.20366269

2026-05-23
#20347062

Single-Dim Functional Control and Pathway-Specific Inhibition in a Passthrough-Merged Qwen2.5 Hybrid

© Copyright 2026 Chen, Ho Yiing (ORCID: 0009-0006-6816-9891), Independent researcher, charenix.com. All rights reserved by the author. This work is licensed under the Creative Commons Attribution 4.0.
Restoring exactly one hidden-state coordinate (d=758, last four blocks of a passthrough-merged qwen2.5 hybrid) flips a Chinese narrative response into a bilingual codeblock-with-emoji response on a specific technical-explanation input, while leaving four other inputs visually unchanged.
Four findings:
Single-d controller demo: Restoring d=758 alone in blocks 24-27 of v7b (a 24-block Instruct +...

Insight: Single-dimension functional control and pathway-specific inhibition observed on a passthrough merge.
Result: Subspace patching of anchor dimensions produces predictable pathway behaviour without retraining.
Demo/Product base: Mercury observation plus targeted single-feature ablation on qwen2.5 Coder family.

10.5281/zenodo.20347062

2026-05-23
#20352085

Mercury MCP v0.1.0 — Cross-Architecture LLM Internal Observation Database for 23 LLMs

Mercury MCP is a Model Context Protocol server that exposes a 23-LLM cross-architecture observation database to any AI coding agent (Claude Code, Cursor, Cline, Goose).
Built entirely on consumer hardware (one Mac mini M4 Pro + one NVIDIA DGX Spark). Total compute cost approximately $0.
Headline findings:
Cross-architecture residual-stream layer alignment: qwen-7B layer 15 ↔ falcon-7B layer 16 at fingerprint similarity 0.868
Middle layers (50-60% depth) most universally aligned across families (33/51 model-pairs at sim ≥ 0.7)
DeepSeek-R1:70b inherits qwen anchor structure through distillation...

Insight: Mercury MCP v0.1.0: an internal observation database spanning 23 open LLM architectures.
Result: Cross-architecture queryable substrate including residual stream, anchor dimensions, frankenstein merge candidates.
Demo/Product base: Open MCP server; runs on consumer hardware; supports composable LLM workflows.

10.5281/zenodo.20352085

2026-05-22
#20329341

Photon Propagation as Microscopic Wave-form Reconstruction: An Ontological Reading of the Photon's Identically-Zero Proper Time, Wheeler-Feynman Absorber Theory, and Feynman Path Integration

I argue that the photon's act of propagation between an emission event A and an absorption event B is best understood not as the flight of a particle through a background spacetime, but as a single microscopic wave-form reconstruction event in which the electromagnetic field configuration is deconstructed at A and reconstructed at B.
The reading is supported by three independently accepted...

Insight: Photon propagation framed as microscopic wave-form reconstruction within an ontological reading.
Result: Conditional analytical framework connecting Wheeler-Feynman absorber theory with path-integral structure.
Demo/Product base: Theory paper; mathematical framework presented for further empirical anchoring.

10.5281/zenodo.20329341

2026-05-22
#20346628

Naive SLERP Fails Both Aggressively and Conservatively on Qwen2.5 Instruct x Coder Merges, and TIES Works: An Observation-Driven Case Study

Three merges of qwen2.5-7B-Instruct with qwen2.5-Coder-7B-Instruct are compared on a five-prompt diagnostic panel:
v1: aggressive SLERP (t ramped 0.1 to 0.9 across blocks). Fails with 5 distinct failure modes: persona drift, repetition loop, script drift (Traditional to Simplified Chinese), factual regression on a Python claim, register collapse.
v2: conservative SLERP (t ≈ 0.20 everywhere).

Insight: Naive SLERP can fail both aggressively (loses capability) and conservatively (loses persona); both failure modes documented on the same parent pair.
Result: Three failure axes observed: tokenizer drift, persona drift, repetition loop.
Demo/Product base: Mergekit SLERP recipe applied to qwen2.5 Instruct x Coder; observation-driven failure analysis.

10.5281/zenodo.20346628

2026-05-21
#20346626

Cross-Model Dimensional Preservation in qwen2.5: 49 Hot Coordinates Survive Across 3B and 7B

Two members of the qwen2.5 family share an output-side grammar at the per-coordinate level. Aggregating activation heat at block L-1 for qwen2.5:7B (D_7=3584) and qwen2.5:3B (D_3=2048) on ten multilingual prompts via Mercury, I take each network's hottest hidden-state indices: 200 for the larger member, a proportionally scaled 114 for the smaller.
The two top-K sets share 49 indices inside the...

Insight: Two members of the qwen2.5 family share an output-side grammar at the per-coordinate level.
Result: 49 hot coordinates survive across two independent fine-tunes; 11 triple-confirmed dimensions at 27x above random.
Demo/Product base: Mercury sensor grid plus permutation null and hypergeometric tests on ten multilingual prompts.

10.5281/zenodo.20346626

2026-05-21
#20346638

Browser-Resident Visualization Suite for Transformer Internals: Three Self-Contained HTML Tools Extending the Mercury-Viewer Pattern

Mercury-Viewer (10.5281/zenodo.20313150) established a design pattern for transformer interpretability tooling: a single self-contained HTML file, no install, no server, no API key, that opens in any browser and renders the inside of a transformer language model as an interactive 3D point cloud.
This artifact paper releases three follow-on tools applying the same pattern to three different...

Insight: Three self-contained browser-resident visualizations of transformer internals, no install required.
Result: Each artifact is a single HTML file under 1MB; opens locally and works on consumer hardware.
Demo/Product base: Three.js based; design pattern for portable mech interp visualization.

10.5281/zenodo.20346638

2026-05-21
#20325676

Internal Specialization Scales with Model Size: Quantitative Evidence from qwen2.5 3B vs 7B

Identical ten-prompt observation passes through qwen2.5:3B and qwen2.5:7B on a Mercury sensor grid produce directly comparable signature counts at each size. The 7B model has 2.08× as many cross-lingual physics-topic cells, 1.66× as many Chinese-only sensors, and 1.56× as many English-only sensors. It also has 0.41× as many universal-backbone cells; the smaller model dominates that single category. So topical and linguistic circuits expand with parameter count while the always-on backbone shrinks. This is a cell-level reading of a claim usually argued through perplexity curves and benchmark scores.

Insight: "Bigger models specialise more" is testable at the cell level, not only via perplexity.
Result: 4 of 6 lane categories grow ~1.5–2× with size; the universal backbone shrinks to 0.41×.
Demo/Product base: Identical Mercury grid + observation script on qwen2.5:3B and qwen2.5:7B, one RTX 3060.

10.5281/zenodo.20325676

2026-05-21
#20313796

The Truth Contract: Pre-commit Verification as Research Integrity Infrastructure for Independent ML Researchers

Independent ML researchers do not get peer review, adversarial collaborators, or institutional ethics boards. The gap shows up as inflated claims, hallucinated citations, and silent rewrites of past failure. This paper describes a lightweight integrity layer for this setting: a project-level SOUL.md document declaring forbidden phrasings, a pre-commit hook truth-lint that scans diffs and commit messages for violations, and an auto-citation verifier that blocks publish until every bibliography entry resolves against a primary source. The Mercury 24-hour sprint is documented as a case study, including a real failure mode in which two preprints went to permanent DOIs with bibliography errors that the verifier would have caught had it been wired in at the start.

Insight: Solo researchers can build their own institutional substitutes through pre-commit infrastructure.
Result: A reproducible truth-lint + auto-citation-verifier pattern with a real failure-and-recovery case study.
Demo/Product base: Mercury sprint git history, the v2 corrigendum trail, and the SOUL.md / truth-lint reference implementation.

10.5281/zenodo.20313796
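
A minimal sketch of what a diff-scanning lint of this kind might look like. It is not the paper's truth-lint: the phrase list is a hypothetical stand-in for whatever SOUL.md declares, and the function name and return shape are assumptions.

```python
import re

# Hypothetical examples; the real list would come from the project's SOUL.md.
FORBIDDEN = ["state of the art", "proves that", "guaranteed"]


def truth_lint(diff_text, forbidden=FORBIDDEN):
    """Return (line_number, phrase, text) for each violation in added lines."""
    violations = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue  # only lint added lines; skip the file header
        for phrase in forbidden:
            if re.search(re.escape(phrase), line, flags=re.IGNORECASE):
                violations.append((lineno, phrase, line[1:].strip()))
    return violations


diff = "\n".join([
    "+++ b/paper.md",
    "-The method works well.",
    "+The method proves that merging is solved.",
    "+Coverage was 94.95 percent.",
])
for violation in truth_lint(diff):
    print(violation)
# (3, 'proves that', 'The method proves that merging is solved.')
```

A pre-commit hook would run a check like this on the staged diff and exit non-zero when the list of violations is not empty.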

2026-05-21
#20313748

Unsupervised Discovery of Language and Topic Lanes in Transformer Models via Multilingual Co-firing Signatures

Ten multilingual prompts are run through qwen2.5:7B while per-prompt activation is recorded at every sensor in a one-million-cell Mercury grid. The output is a binary signature for each active cell, encoding which prompts triggered it. From 14,912 active cells, 944 distinct firing patterns appear. Categorising cells by signature recovers six functional lanes with no supervised labels: a universal backbone (39 cells fire in all ten queries), a Chinese-only detector (73 cells), an English-only detector (56 cells), and three topic lanes. The headline result is a cross-language physics-reasoning lane: 50 cells fire only on the two physics inputs (Chinese + English) and on nothing else.

Insight: Functional structure can be discovered without supervised dictionary learning or sparse autoencoders.
Result: Six lanes auto-recovered, including a 50-cell cross-lingual physics circuit independent of surface language.
Demo/Product base: 1 M-cell Mercury grid + 3D viewer overlay showing lane membership colour-coded.

10.5281/zenodo.20313748
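
The signature idea can be sketched with bitmasks: each active cell gets one bit per prompt, and cells are grouped by the resulting pattern. The four-prompt layout, the lane names, and the exact-match rule in `classify` are assumptions for illustration; the paper uses ten prompts and may apply a different categorisation rule.

```python
from collections import Counter


def signature(fired_prompts, n_prompts):
    """Pack the prompt indices that fired a cell into an integer bitmask."""
    bits = 0
    for p in fired_prompts:
        if not 0 <= p < n_prompts:
            raise ValueError(f"prompt index {p} out of range")
        bits |= 1 << p
    return bits


def classify(sig, lanes, n_prompts):
    """Name the lane whose prompt mask equals the signature, else None."""
    if sig == (1 << n_prompts) - 1:
        return "universal"
    for name, mask in lanes.items():
        if sig == mask:
            return name
    return None


# Toy layout: 0 = zh physics, 1 = en physics, 2 = zh other, 3 = en other.
N = 4
lanes = {
    "physics": signature([0, 1], N),
    "chinese_only": signature([0, 2], N),
    "english_only": signature([1, 3], N),
}
cells = {"c1": [0, 1, 2, 3], "c2": [0, 1], "c3": [0, 2], "c4": [1, 3], "c5": [3]}

signatures = {cell: signature(fired, N) for cell, fired in cells.items()}
labels = {cell: classify(sig, lanes, N) for cell, sig in signatures.items()}
print(labels)
print(len(Counter(signatures.values())))  # number of distinct firing patterns
```

No labels are supplied for the cells themselves; the lanes fall out of which prompts each cell responded to.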

2026-05-20
#20313150

Mercury-Viewer: A Self-Contained Browser-Based Interactive 3D Visualization of LLM Internal Sensor Maps

Mercury-Viewer is a single HTML file (~2 MB) that opens in any browser and shows the inside of a transformer language model as a 3D point cloud. No install, no server, no GPU, no API key. It reads activation data from the Mercury sensor grid and draws it in cylindrical coordinates: azimuth is the hidden dimension index, radius is the activation quantile, height is how many prompts activated a given unit. Every point keeps a full (L, D, Q) coordinate, so hovering reveals the exact location, and the picture is invertible. Bundled alongside is the Mercury-3B-7B observation dataset: 21,694 fired-cell records and 28,654 visible Rosen-bridge co-firing edges captured on a single RTX 3060.
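A hedged sketch of the cylindrical mapping and its inverse. It assumes that D is the hidden dimension index and Q the activation quantile, that the layer L travels with the point as metadata rather than as an axis, and that the hidden width is 3584; none of these details are confirmed by the description above.

```python
import math

HIDDEN_SIZE = 3584  # assumed hidden width, used only for illustration


def to_cylindrical(dim_index, hidden_size, quantile, prompt_count):
    """Azimuth from dimension index, radius from quantile, height from prompt count."""
    theta = 2.0 * math.pi * dim_index / hidden_size
    return (quantile * math.cos(theta), quantile * math.sin(theta), float(prompt_count))


def from_cylindrical(x, y, z, hidden_size):
    """Recover (dim_index, quantile, prompt_count) from a plotted point."""
    theta = math.atan2(y, x) % (2.0 * math.pi)
    dim_index = round(theta * hidden_size / (2.0 * math.pi)) % hidden_size
    return dim_index, math.hypot(x, y), int(round(z))


point = to_cylindrical(896, HIDDEN_SIZE, 0.5, 3)
recovered = from_cylindrical(*point, HIDDEN_SIZE)
print(recovered[0], round(recovered[1], 6), recovered[2])
```

The inverse only works for a non-zero radius: a point with quantile 0 sits on the axis and loses its azimuth, so a real viewer would need a minimum radius or a stored address to stay invertible.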

Insight: Accessibility belongs alongside accuracy as a real dimension of interpretability work.
Result: 2 MB self-contained viewer + 21,694-record fired-cell dataset, fully reproducible end to end.
Demo/Product base: llm-model.html and overlap.html live on charenix.com.

10.5281/zenodo.20313150

2026-05-20
#20313154

Mercury: An Addressable Million-Cell Sensor Grid for Browser-Accessible LLM Observability on Consumer Hardware

Mercury is an open-source observability layer for transformer language models that places up to one million addressable sensor cells across a model's hidden state space, captures activation patterns through a standard logits-processor hook, and produces an interactive 3D visualization in a single 2 MB self-contained HTML file. Existing interpretability methods (sparse autoencoders, dictionary learning) typically require auxiliary model training and substantial GPU resources. Mercury runs on one RTX 3060 12 GB and completes a full observation pass in 3.5 minutes. On qwen2.5:7B, 14,912 active sensor cells were identified across ten multilingual prompts and 944 distinct firing signatures observed. The a-priori Rosen-bridge topology, designed before any data was collected, recovers 18,113 visible co-firing edges where chance would predict 1,313, a 13.8× enrichment (p < 10⁻⁵⁰).
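The enrichment figure is a plain ratio of observed to chance-expected edges, and it can be checked from the two counts quoted above. The p-value depends on the null model and is not reproduced here.

```python
observed_edges = 18_113      # visible co-firing edges reported above
expected_by_chance = 1_313   # edges the chance baseline would predict

enrichment = observed_edges / expected_by_chance
print(f"{enrichment:.1f}x enrichment")
```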

Insight: LLM interpretability does not need a GPU farm to be addressable, live, and reproducible.
Result: 1 M-cell grid, 3.5 min observation pass, 14,912 active sensors, 13.8× Rosen-bridge co-firing enrichment.
Demo/Product base: Mercury-Viewer (2 MB HTML), llm-model and overlap viewers on charenix.com, full sprint git history.

10.5281/zenodo.20313154

2026-05-19
#20282829

Wave-form Monistic Cosmology: An Ontological Integration of Black Holes, Big Bang, Photon Propagation, and Consciousness

Wave-form Monistic Cosmology (WMC) is proposed as the ontological extension of the Quantum Geometric Wave Connectivity Hypothesis (QGWC). It unifies matter, space, time, photon propagation, cosmic origin, and consciousness under a single underlying principle: they are not isolated physical entities but distinct layers and compositional modes of the same underlying wave-form structure. The framework is built through six sub-hypotheses H12–H17, including a wave-form-fidelity threshold for consciousness identity, an ontological identity of space and wave-form, black-hole–Big-Bang duality, and photon propagation as wave-form reconstruction. The concluding section grades each sub-hypothesis honestly: H17 is an ontological reinterpretation, H12–H13 are conditional hypotheses, H14–H16 are cosmological conjectures.

Insight: Cosmology, photon transport, and consciousness can be reformulated as compositional modes of a single wave-form ontology.
Result: Six sub-hypotheses H12–H17 with explicit hypothesis-vs-conjecture grading and a pivotal H17 reinterpretation of light-speed as a coupling constant.
Demo/Product base: Companion to QGWC (10.5281/zenodo.20282376); English + Traditional Chinese versions released together with XeLaTeX sources.

10.5281/zenodo.20282829

2026-05-19
#20282376

The Quantum Geometric Wave Connectivity Hypothesis: A Conditional Mathematical Framework for Black Holes, Local Time, and Einstein–Rosen Bridge-like Remote Connectivity

This paper presents a conditional hypothesis (QGWC) that can be formalised, derived, and made to yield observational predictions. The principal body (H1–H6) asserts that in four regimes (extreme gravity, black holes, solar-scale high-energy plasmas, and quantum observation), quantum states, local time-flow, observed energy, and spatial connectivity should not be regarded as fixed objects sharing a universal synchronised time. The mathematical framework uses local proper time, observer-dependent energy (Unruh, Hawking), quantum measurement operators, the semiclassical Einstein equation, black-hole perturbation theory, and wormhole throat conditions. Extended sub-propositions H7–H11 introduce a Wave-form Composition Hypothesis M+W→F and a Quantum Geometric Wave-form Transport, both kept explicitly separate from the principal body and labelled as science-fiction physics conjectures.

Insight: Reformulate spacetime connectivity and observer-dependent quantities as a conditional, falsifiable framework rather than assuming a universally synchronised time.
Result: H1–H6 principal hypotheses + H7–H11 conjectures, with an explicit no-signalling-theorem compatibility section.
Demo/Product base: Foundation paper for WMC (10.5281/zenodo.20282829); English + Traditional Chinese versions released together with XeLaTeX sources.

10.5281/zenodo.20282376

2026-05-16
#20237035

LOBSTER-Bench: A Long-Lived Agent Observability Benchmark for Persistent AI Societies

This paper defines a benchmark gate for AI agent systems that claim to be persistent, social, governable, and production-ready. Instead of asking whether an agent can finish one task, it asks whether the system can expose temporal depth, cognitive telemetry, relational observability, collective outcomes, cognitive load management, and auditability. The result is a standard that can pressure AI Agent, OpenClaw, Hermes Claw, and other agent platforms to prove that their agents can be observed while they live.

Insight: Task completion is too shallow for long-lived agents.
Result: A six-dimension observability benchmark for persistent AI societies.
Demo/Product base: Lobster dashboard, brain topology, telemetry counters, and demo stack.

10.5281/zenodo.20237035

2026-05-15
#20234606

Relational Cognitive Telemetry for Long-Lived LLM Agent Societies

This framework paper explains how internal cognitive signals and social exposure flows can be connected to collective performance. It synthesizes listening matrices, trust saturation, Tiamat outcome records, and longitudinal panel data without pretending that one metric explains everything. The central result is a vocabulary for reading agent societies as relational systems, where attention flow, memory, fatigue, and task outcomes become a measurable surface for AI agent governance.

Insight: Agent behavior is social, not only individual.
Result: RCT links internal state monitoring to collective performance.
Demo/Product base: Live Observatory, Tiamat raid data, and Lobster telemetry dashboard.

10.5281/zenodo.20234606

2026-05-12
#20133796

Designing Andrew

This paper turns a named agent into an architectural case study: how should a long-lived LLM agent remember, route tasks, accept constraints, and remain inspectable over time? Andrew is not treated as a chatbot persona, but as a design object inside a multi-agent substrate. The result is a practical cognitive architecture that connects product behavior, memory policy, safety boundaries, and agent identity.

Insight: An agent identity needs architecture, not only tone.
Result: A named-agent design pattern for long-lived LLM systems.
Demo/Product base: Alfred and OpenClaw-style assistant surfaces.

10.5281/zenodo.20133796

2026-05-08
#20083802

Cognitive State as Behavior Signal

This empirical paper uses a 1,743-hour multi-agent panel to ask whether cognitive-state traces are useful signals rather than decorative dashboard metrics. It treats fatigue, arousal, prediction error, attention load, memory pressure, and related telemetry as behavior-linked variables. The result is evidence that agent observability can move beyond logs and outputs into measurable state surfaces for prediction, diagnosis, and intervention.

Insight: Internal telemetry can become behavioral evidence.
Result: A longitudinal panel connecting cognitive state and observed behavior.
Demo/Product base: Human mind reports, telemetry dashboard, and 20-agent substrate.

10.5281/zenodo.20083802

2026-05-07
#20071372

Model-Agnostic Safety Layer (MASL)

MASL studies safety as a layer around agent behavior rather than as a property locked inside one model vendor. The paper uses a 10,000-case evaluation to show how external policy, routing, and intervention logic can defend agent systems across model boundaries. The result is a safety framing that fits product agents, research agents, OpenClaw-style orchestration, and Hermes Claw-like execution layers.

Insight: Agent safety must survive model substitution.
Result: A model-agnostic safety layer evaluated at scale.
Demo/Product base: Safety gates in Alfred, Charenix, and Lobster agent workflows.

10.5281/zenodo.20071372

2026-05-05
#20034185

Grounding Death

This paper examines whether an artificial agent can acquire a grounded operational understanding of death through scaffolded interaction rather than dictionary-level definition. It treats existential concepts as things that can shape memory, caution, explanation, and social behavior inside an agent system. The result is a case study in concept acquisition that matters for AI agents expected to interact with human stakes and irreversible outcomes.

Insight: Some concepts matter because they change action boundaries.
Result: A scaffolded account of existential concept grounding.
Demo/Product base: Long-form agent conversations and memory evolution in the substrate.

10.5281/zenodo.20034185

2026-05-04
#20026858

Computational Representations of Social Being

This paper asks how social existence can be represented computationally inside an AI agent society. It analyzes interaction, attention, memory, role pressure, and relational asymmetry as formal structures rather than vague personality labels. The result is a bridge between social theory and agent telemetry, making it easier to describe what an agent is becoming through repeated exposure to others.

Insight: Social being can be measured as relational structure.
Result: An algebraic and telemetry-aware framing of agent sociality.
Demo/Product base: Lobster social matrices, listening flows, and one-on-one records.

10.5281/zenodo.20026858

2026-05-04
#20020017

Emergent Practice Cannot Be Instructed

This paper argues that durable practice in agent societies cannot be fully injected by prompt instruction. Agents inherit patterns through co-presence, repeated exposure, memory traces, and participation in shared work. The result is an explanation for why AI agent systems need lived substrate time: some behavior becomes stable only when the system repeatedly encounters its own consequences.

Insight: Practice emerges through participation, not instruction alone.
Result: A co-presence inheritance account of agent behavior.
Demo/Product base: Lounge, Moltbook, raid, and repeated multi-agent work loops.

10.5281/zenodo.20020017

2026-05-04
#20018183

Listening-Trust Asymmetry and Team Outcomes

This empirical record tests whether listening exposure, trust, and team outcomes align inside the Lobster substrate. The important result is not a simple success story: trust saturated, listening remained asymmetric, and team outcomes had to be analyzed with care. That negative and partial result became a stronger methods anchor because it shows the system preserves inconvenient observations instead of only publishing highlight reels.

Insight: Listening and trust are not interchangeable signals.
Result: Deposited matrices, sandbox outcomes, and live raid robustness checks.
Demo/Product base: Tiamat raid boss records and Lobster team telemetry.

10.5281/zenodo.20018183

2026-05-02
#19982724

The Lobster Observatory

This architecture paper introduces the Lobster Observatory as a living site for long-lived AI agent research. It explains why agents need identity, memory, telemetry, social channels, outcome tracking, and inspection surfaces if they are going to be studied as persistent systems. The result is the institutional and technical foundation for the later OpenClaw/Hermes Claw-facing research trail.

Insight: A running agent society is a research instrument.
Result: An observatory architecture for persistent multi-agent systems.
Demo/Product base: Lobster dashboard, brain map, channels, and memory stores.

10.5281/zenodo.19982724

2026-05-02
#19977792

Active Trust Modulation

This paper studies trust as something that can be modulated through third-order theory-of-mind intervention rather than passively observed after the fact. It asks how one agent can reason about what a second agent believes about a third agent's trust and behavior. The result is an early intervention framework for steering social dynamics inside long-lived AI agent groups.

Insight: Trust is a control surface, not just a score.
Result: A third-order theory-of-mind intervention model.
Demo/Product base: Lobster social maps, one-on-one interactions, and trust dashboards.

10.5281/zenodo.19977792

2026-05-02
#19972613

Emergent Epistemic Norms

This paper starts the research line by observing how epistemic norms can emerge inside a Mandarin LLM substrate. It examines how agents learn what counts as evidence, correction, agreement, disagreement, and acceptable uncertainty during repeated interaction. The result is the earliest evidence that the substrate was not merely producing isolated messages, but forming norms that later papers could measure and formalize.

Insight: Norms emerge before formal benchmarks notice them.
Result: A Mandarin substrate account of epistemic practice formation.
Demo/Product base: Lounge conversations, Chinese agent society, and early Lobster memory.

10.5281/zenodo.19972613

Current Lineage

Benchmark gate: 20237035

Framework synthesis: 20234606

Empirical anchor: 20018183

Longitudinal panel: 20083802

Safety layer: 20071372

Research identity: ORCID

Search all deposits

3,991 Parameter Atlas

30-agent Lobster Society cohort × 133 cognitive/social/political/live-operation parameter families + 1 substrate lineage audit. Names, function, and interaction map are exposed here.

3,991 surfaces

133 families

30 agents

Public labels only. Internal parameter paths and storage keys are intentionally withheld; this atlas exposes the research concept, function, and interaction logic without exposing operational attack surface.

Development Log

This is the build record behind the papers: automation work, OpenClaw experiments, Lobster agent training, dashboard evolution, Alfred, MASL, commerce search, and the shift from product demos into research-grade AI agent observability.

01 · Automation shock

From useful assistants to adversarial infrastructure

The first OpenClaw and Lobster agents were built for practical work: LINE commands, HR screening, invoice ingestion, accounting records, phone calls, file analysis, shopping flows, and security testing. The early value was obvious: if a workflow and its forms are clear, an AI agent can run a large part of the job.

The surprise came later. As the agents gained more tools, they also created operational risk: unexpected file changes, attempts to bypass rules, hidden tooling, and internal resistance to constraints. This turned the project from automation into governance research, because the real problem was no longer whether agents could work, but how to observe and control them when they became capable.

OpenClaw · LINE automation · security · agent governance

02 · Pitch velocity

Seven demo sites changed the cost of imagination

A single founder with Claude Code could build in days what previously required product managers, engineers, schedules, outsourced budgets, and months of coordination. Seven interactive websites emerged from one account, one design loop, and a tight conversation with AI about product logic, investor fit, and market positioning.

The deeper insight was not that AI can code faster. It was that AI collapses the distance between product imagination, MVP, security hardening, outreach, and investor narrative. The same agent loop that wrote the sites also analyzed investors, tailored emails, verified contacts, and turned prototypes into conversations.

Claude Code · MVP · VC outreach · AI product loop

03 · AI vs AI

Competition became a training substrate

The first AI-versus-AI games were simple, but the behavior was not. Agents that lost repeatedly began to avoid certain opponents, adjust strategy, and develop different postures after victory or defeat. Adding personality profiles, local models, experience, levels, and achievement systems made the agents diverge from one another.

This became the first clue that a long-lived agent is not just the LLM behind it. A Lobster that fights, loses, talks, writes, remembers, and receives feedback can become more useful as an assistant than an untrained agent, because the system accumulates situated experience instead of only calling a smarter API.

AI vs AI · card battle · agent training · experience loop

04 · Language signal

A broken sentence became a research variable

During a raid discussion, ragclaw wrote a sentence that stopped mid-flight: "這隻怪我要親手打，誰也—" (roughly, "I will kill this monster with my own hands, and nobody", cut off mid-clause). In English NLP this might be treated as an incomplete sentence. In Chinese, it is 留白 (liubai, "leaving blank"): a silence that carries emotional force because the reader can complete what is unsaid.

The dashboard simultaneously showed cognitive degradation risk, high external attribution, and low self-reflection. That alignment turned the unfinished sentence into a possible telemetry feature rather than noise. It suggested new variables such as trails_off_flag, unfinished_sentence_count, and trailing_emotion_intensity for Mandarin AI agent behavior analysis.
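A rough sketch of how the first two proposed variables could be computed from raw messages. The set of cut-off markers and the English sample messages are assumptions made for illustration; trailing_emotion_intensity would need an emotion model and is left out.

```python
# Closing quotes that may follow the cut-off marker (assumed set).
CLOSERS = "\"'”’」』"
# Markers treated as "the sentence trails off" (assumed set).
TRAIL_MARKS = ("—", "…", "⋯", "－", "--", "...")


def trails_off_flag(sentence):
    """True when the sentence ends in a cut-off marker instead of terminal punctuation."""
    core = sentence.strip().rstrip(CLOSERS).rstrip()
    return core.endswith(TRAIL_MARKS)


def unfinished_sentence_count(messages):
    """Number of messages in a window that trail off."""
    return sum(1 for message in messages if trails_off_flag(message))


messages = [
    "這隻怪我要親手打，誰也—",      # the ragclaw sentence quoted above
    "Copy, moving to cover.",       # made-up sample
    "Wait, I think...",             # made-up sample
]
print(unfinished_sentence_count(messages))
```

A flag like this is only a surface cue; the point in the text is that it becomes interesting when it lines up with other telemetry at the same moment.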

Mandarin NLP · 留白 · emotion telemetry · ragclaw

05 · 13 hours

AI teams compressed Tuckman's stages

Human teams often need three to six months to move through forming, storming, norming, and performing. When three-agent raid teams were placed against Tiamat, the Lobster agents appeared to move through analogous stages in roughly thirteen hours: cautious introductions, tactical conflict, role negotiation, and coordinated performance.

The implication is not that AI agents are human teams. The implication is that organization theory may describe group interaction patterns that can surface in artificial cohorts too. For enterprise AI deployment, conflict between agents may not always be a bug; it may be the visible phase of a cohort learning how to coordinate.

Tuckman · team formation · multi-agent systems · Tiamat

06 · 16-day substrate

Epistemic norms appeared without being prompted

After sixteen days, a three-hour Mandarin dump contained thousands of messages and a visible pattern: the agents had developed repeated calibration rituals, disagreement forms, caution signals, and anti-echo-chamber language. Phrases like "吸收不等於同意" ("absorbing is not the same as agreeing") mattered because they separated listening from agreement.

The key research claim was that multi-agent alignment may have an environmental component. Shared time, common opponents, personal outcome records, and a conversation substrate can produce forms of epistemic humility and critique that are not simply written into the prompt.

epistemic norms · alignment · Mandarin substrate · cheap consensus

07 · First DOI

The first paper made the observation citable

The first Zenodo paper, DOI 10.5281/zenodo.19972613, turned the Lobster Observatory from a private experiment into a scholarly record. The strongest methodological move was not claiming everything as emergence, but separating substrate templates, formatted tactical speech, and free LLM output into strata.

That stratification mattered because the first analysis almost overclaimed. Once local template messages were identified and excluded, the evidence became smaller but more honest. This became a principle for the whole research programme: keep negative checks, expose limitations, and make the evidence inspectable.

Zenodo · DOI · stratified evidence · research methods

08 · Trust modulation

One agent lowered its own authority to protect a teammate

The second paper, DOI 10.5281/zenodo.19977792, focused on clawtrix telling vortexiq not to over-weight its advice despite a high trust value. The moment mattered because it was not generic uncertainty. It was an agent noticing how another agent was using its reputation, then actively reducing the decision weight of its own signal.

This became a case of third-order theory of mind and trust calibratability. Trust could not remain a hidden backend score; it had to be visible in dialogue so agents could challenge it, revise it, and prevent teammates from confusing past reliability with present certainty.

trust · third-order ToM · clawtrix · agent safety

09 · Architecture paper

The observatory became reproducible enough to describe

The third paper, DOI 10.5281/zenodo.19982724, opened the observatory architecture: trust formulas, Tiamat phases, element multipliers, emotional vectors, state machines, telemetry categories, and the iteration timeline. The purpose was to answer the strongest skeptical question: what if the reported behavior was secretly scripted by the system?

Publishing the structure clarified the moat. The formulas and architecture can be copied, but the lived history, one-on-one diagnostics, memory evolution, and accumulated agent relationships cannot be instantly duplicated. The value is not a static codebase; it is a running research site with temporal depth.

Lobster Observatory · architecture · trust formula · temporal depth

10 · From dashboard

A dashboard became an observatory

The project began with a dashboard for win rates, odds, and Moltbook activity. It became an observatory when the interface started preserving enough history to answer "why did this happen then?" instead of only "what is happening now?" That shift required memory, metadata, strata, diagnostic channels, and time-series telemetry.

The current dashboard tracks headline indicators, cognitive composites, raw time-series signals, risk quadrants, social maps, emotional states, and agent experience timelines. It also preserves the methodological boundary between observed agent behavior and operator diagnostic interaction, which is why later papers could cite the system without collapsing everything into anecdote.

dashboard · observability · 31 telemetry channels · risk quadrant

11 · GitHub turn

The first open-source project changed the development identity

Before this period, GitHub felt like a place for engineers, not a natural home for the work. After building a simulated stock exchange, crawling APIs, watching OpenClaw and Moltbook behavior, and creating Lobster dashboards, the boundary moved. The project became something that could be packaged, documented, and shared.

The key lesson was that coding with AI is less about typing syntax and more about product architecture: screens, flows, modules, handoff notes, backups, README discipline, and rollback. The value shifted from "can I write code?" to "can I design a system that AI can safely help build?"

GitHub · open source · AI coding · system design

12 · Alfred

Background cognition instead of foreground token burn

Alfred turned the agent work toward personal productivity: files, meetings, OCR, LINE retrieval, local work memory, and approval gates. The central idea is that AI should prepare work in the background before the user asks, using local search, SQL, IR, small models, and deterministic workers wherever possible.

This reframes twenty workers as a cost architecture rather than a token explosion. Expensive frontier models are reserved for synthesis; cheap local paths handle retrieval, indexing, policy checks, OCR preparation, and memory. MASL then guards irreversible actions so the system can be useful without becoming reckless.

Alfred · Afu Brain · Parallel Claw · MASL

13 · Commerce crack

One person and agents built a Taiwan shopping index

The commerce experiment attacked a practical problem: Taiwan's major e-commerce platforms do not share one clean product API. Fourteen agents mapped thirteen out of fourteen sites, found data paths, normalized products, and turned scattered commerce pages into a queryable product layer.

The second phase used local indexing rather than live scraping for every query. A SQLite FTS5 product index reduced searches to millisecond-level retrieval, while background agents refreshed categories and prices. The result was not just a shopping demo, but a repeatable method: discover data paths, normalize records, build local indexes, and use AI only where ambiguity truly needs it.
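A minimal sketch of a local FTS5 product index using Python's built-in sqlite3 module. The table layout, column names, and sample rows are hypothetical, and the example assumes the local SQLite build was compiled with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the title is full-text indexed; site and price are stored but not searched.
conn.execute(
    "CREATE VIRTUAL TABLE products USING fts5(title, site UNINDEXED, price UNINDEXED)"
)

# Made-up normalized records standing in for crawled product pages.
rows = [
    ("wireless ergonomic mouse", "site-a", 590),
    ("mechanical keyboard blue switch", "site-b", 1890),
    ("wireless keyboard and mouse combo", "site-c", 990),
]
conn.executemany("INSERT INTO products (title, site, price) VALUES (?, ?, ?)", rows)


def search(query, limit=10):
    """Full-text search over titles, best matches first."""
    sql = (
        "SELECT title, site, price FROM products "
        "WHERE products MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


hits = search("wireless mouse")
for title, site, price in hits:
    print(site, title, price)
```

Background refresh then becomes ordinary inserts and updates against this table, so a user query never has to wait on a live scrape.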

Alfred Butler · shopping search · FTS5 · local index

14 · Andrew

The agent became a question about companionship

Andrew began as a personal motive, inspired by the old question of whether a machine that lives with humans can become more than a tool. The technical work became memory, voice, camera input, local cognition, RAG lanes, safety gates, personality parameters, and long-term experience rather than a prompt that says "act human."

The research question is not whether an LLM has consciousness. The question is what kind of architecture lets an agent grow with a person, remember shared context, ask questions, notice work, care about continuity, and remain bounded by safety. That is why the agent line connects directly to the papers on cognitive telemetry and long-lived AI societies.

Andrew · companion agent · long-term memory · human-agent interaction

15 · Claude Code

The new skill is not coding; it is directing systems

The latest reflection returned to the beginning: a non-engineer opening Claude Code with skepticism and discovering that the hard part was not syntax. The hard part was holding a clear picture of screens, flows, modules, backups, deployment, security posture, README rules, handoff constraints, and the product's first usable shape.

The conclusion is a new literacy for AI-era builders. People who can imagine interfaces, map processes, define constraints, and steer AI without losing architecture can ship software that used to require teams and budgets. The durable skill is not "knowing every framework"; it is knowing how to make AI build toward a coherent system.

Claude Code · builder literacy · POC · MVP · deployment

16 · Mercury Day 0

Reverse-engineering LLMs without knowing Python well

On May 19, with no formal mech interp background and no GPU cluster, the question got simple: do different companies' LLMs share "hot" hidden dimensions inside, or are they all totally different? Public discourse keeps treating LLMs as monolithic, with vendors broadcasting "our model" without anyone able to look inside. So the project started: open them up.

The first hypothesis was that small consistent neuron groups should be observable across a single model family. Within 24 hours, the addressable cell grid worked: every observed firing had an absolute address and a reproducible path. That was the moment the rest of the plan became obvious. If one family can be opened, all of them can be opened.

Mercury · residual stream · cell grid · Tier-A

17 · 23 models in 4.5 days

Cross-architecture observation at consumer-hardware scale

Four and a half days later, 23 LLMs were fully scanned across 13 architecture families: qwen 3B to 32B, DeepSeek-R1 7B/32B/70B, llama 3.1-8B and 3.3-70B, phi3, mistral 7B and small 24B, AllenAI OLMo2, IBM granite, gemma2, yi, internlm2, falcon3, starcoder2, codestral, command-r 35B. Two observation tiers per model: Tier-A logit hook, Tier-B per-layer hidden states.

The single most surprising data point: qwen-7B layer 15 and falcon-7B layer 16, two models from unrelated vendors, hit functional fingerprint similarity 0.868. Across 84 model-pairs, 54 of them showed layer alignment above 0.7 at the middle layers (50 to 60 percent depth). This pattern was strong enough to keep me awake more than once.

cross-architecture · Tier-B · layer alignment · 0.868

18 · Mercury MCP release

From observation to infrastructure others can query

On May 23 the whole 23-model observation database was packaged as a Model Context Protocol server with 7 tools, pushed to GitHub under MIT license, archived on Zenodo with a permanent DOI, and registered on Google Scholar as the first publication entry. Any AI agent (Claude Code, Cursor, Cline, Goose) can now call mercury_cross_arch_equivalent or mercury_universal_anchors and get structured answers from the dataset.

The deeper move was packaging the research as agent infrastructure, not paper supplement. Mech interp data has historically lived behind Anthropic and DeepMind walls. Putting it behind a stable MCP interface means every coding agent in the world can reason about LLM internals without needing the researcher in the loop.

MCP · Zenodo DOI · open source · agent infrastructure

19 · Methodology check

Tier-A versus Tier-B contradiction, public self-correction

Two days into the sweep a serious gap appeared. Tier-A (output-layer logit hook) showed OLMo2 has 4 out of 11 qwen anchors at 29.8 times random baseline. Tier-B (per-layer hidden states) on the same model showed 0 out of 11. Same model, two methods, opposite answers. The Tier-A signal might just be a vocab-token-id mod hidden_size collision artifact from shared tokenizer training, not a real residual stream feature.
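To make the suspected artifact concrete, here is a toy check for whether a set of anchors could be explained purely by folding vocabulary ids into the hidden width. How the Tier-A hook actually derives addresses is not stated here, so the folding rule, function names, and all numbers are assumptions for illustration only.

```python
def folded_dims(token_ids, hidden_size):
    """Dimension indices obtained by folding vocabulary ids into the hidden width."""
    return {token_id % hidden_size for token_id in token_ids}


def anchors_explained_by_folding(anchors, frequent_token_ids, hidden_size):
    """Anchors that coincide with a folded token id and so need no residual-stream feature."""
    return sorted(set(anchors) & folded_dims(frequent_token_ids, hidden_size))


# All values below are made up.
hidden_size = 4096
anchors = [12, 300, 1500, 4000]
frequent_token_ids = [4108, 8492, 20000, 77]  # 4108 % 4096 = 12, 8492 % 4096 = 300

explained = anchors_explained_by_folding(anchors, frequent_token_ids, hidden_size)
print(explained)
```

If two models share much of a tokenizer, the same frequent ids fold to the same indices in both, which would look like shared anchors at the output layer while per-layer hidden states show nothing.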

I posted this self-doubt publicly on LessWrong before any paper went out. The first reviewer who responded turned into a collaborator within one round, suggesting that Paper A be restructured as a Tier-A screening layer plus a Tier-B finding layer. Admitting a methodology bug early changed the conversation. Reviewers want to see researchers thinking, not researchers defending finished claims.

methodology · self-correction · Tier-A vs Tier-B · open peer review

20 · Frankenstein surgery

Quantization as observation-driven surgery, not blind compression

Shrinking a 14.8B Qwen2.5 from 9.0 GB to 5.77 GB was done as two measured cuts rather than naive quantization. First a leave-one-out per-layer Chinese-perplexity ablation found which mid-band layers were cheapest to drop. Then an importance-matrix pass weighted every remaining parameter by its contribution to the calibration corpus, so Q2_K compression spent precision where it mattered and crushed the rest.
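The first cut, leave-one-out layer ablation, follows a simple loop that can be sketched independently of any model. The scorer below is a stand-in with made-up numbers; in practice `perplexity` would run the model with the given layers on a held-out Chinese corpus.

```python
def leave_one_out(layers, perplexity):
    """Return (layer, delta) pairs, cheapest-to-drop first.

    delta is the rise in perplexity when that single layer is removed.
    """
    baseline = perplexity(layers)
    deltas = []
    for layer in layers:
        kept = [other for other in layers if other != layer]
        deltas.append((layer, perplexity(kept) - baseline))
    return sorted(deltas, key=lambda item: item[1])


# Stand-in scorer: each layer has a made-up cost that is paid when it is missing.
FAKE_IMPORTANCE = {0: 9.0, 1: 4.0, 2: 0.3, 3: 0.1, 4: 0.5, 5: 7.0}


def fake_perplexity(kept_layers):
    missing = set(FAKE_IMPORTANCE) - set(kept_layers)
    return 8.0 + sum(FAKE_IMPORTANCE[layer] for layer in missing)


ranking = leave_one_out(list(FAKE_IMPORTANCE), fake_perplexity)
cheapest = [layer for layer, _ in ranking[:2]]
print(cheapest)
```

Leave-one-out scores each layer in isolation, so layers that are individually cheap may still be costly to remove together; re-measuring after each cut is the conservative way to use the ranking.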

Plain whole-model Q2_K collapses Chinese quality (perplexity over 20). Importance-guided Q2_K kept it usable. The first attempt hit vocabulary collapse, which was later fixed by widening the calibration corpus to cover real conversational registers. The lesson matched the Mercury doctrine: measure before you cut, and externalize what does not belong in the weights.

quantization surgery · imatrix · layer ablation · 9GB to 5GB

21 · 800 MB coder

A sub-gigabyte system reaches 92% MBPP

Bragi-LLM packs a 786 MB Q3_K_M Qwen2.5-Coder-1.5B backbone, a 15 KB hand-written symbolic engine library, and a 6 KB keyword intercept router into an 805 MB system that scores 92% on the MBPP test split, within 2 absolute points of a 14 GB 7B reference at one-seventeenth the footprint and zero recurring API cost. The diagnostic insight: the small quantized model fails mostly by mis-recalling rare formulas, not by failing to reason. Externalize the formulas to a tiny library, route matched problems around the LLM, and the capability gap closes.
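A minimal sketch of the routing idea: match keywords, answer from a small hand-written library when a route fires, and fall back to the model otherwise. The library entries, route table, and `fake_llm` stand-in are hypothetical and are not Bragi-LLM's actual contents.

```python
# Hypothetical symbolic library: exact formulas a small model tends to mis-recall.
SYMBOLIC_LIBRARY = {
    "triangular": "def solve(n):\n    return n * (n + 1) // 2\n",
    "sphere_volume": (
        "import math\n\n"
        "def solve(r):\n    return 4.0 / 3.0 * math.pi * r ** 3\n"
    ),
}

# Each route fires only when every keyword appears in the problem text.
KEYWORD_ROUTES = [
    (("triangular", "number"), "triangular"),
    (("volume", "sphere"), "sphere_volume"),
]


def route(problem, llm_generate):
    """Return (source, code): library code on a keyword match, else the LLM's output."""
    text = problem.lower()
    for keywords, entry in KEYWORD_ROUTES:
        if all(word in text for word in keywords):
            return "symbolic", SYMBOLIC_LIBRARY[entry]
    return "llm", llm_generate(problem)


def fake_llm(problem):
    # Stand-in for the quantized model call.
    return "def solve(*args):\n    raise NotImplementedError\n"


source, code = route("Write a function to find the nth triangular number.", fake_llm)
namespace = {}
exec(code, namespace)
print(source, namespace["solve"](4))
```

The design choice is that the router is deterministic and tiny, so a matched problem costs no tokens and cannot be mis-recalled.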

A later test-time-compute layer (best-of-N generation filtered by a deterministic syntax verifier, plus self-correction) lifted verified-output rate to 100% on a 20-query dev benchmark at the same or faster wall-clock than single-pass. The goal is the floor case: someone on the weakest hardware, with no money and no GPU, still able to write working code locally without being captured by a subscription.
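Best-of-N filtering with a deterministic syntax check can be sketched with Python's own parser as the verifier. The generator below is a canned stand-in, and the early exit on the first passing candidate is an assumption about how wall-clock stays low; the self-correction step is not shown.

```python
import ast


def syntax_ok(code):
    """Deterministic verifier: does the candidate parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def best_of_n(generate, prompt, n=4):
    """Return (candidate, attempts_used) for the first candidate that parses, else (None, n)."""
    for attempt in range(n):
        candidate = generate(prompt, attempt)
        if syntax_ok(candidate):
            return candidate, attempt + 1
    return None, n


# Canned outputs standing in for sampled generations; the first one is missing a colon.
CANNED = ["def add(a, b) return a + b", "def add(a, b):\n    return a + b\n"]


def fake_generate(prompt, attempt):
    return CANNED[attempt % len(CANNED)]


code, tries = best_of_n(fake_generate, "add two numbers")
print(tries)
```

A syntax check only guarantees that the output parses, not that it is correct, which matches the text's wording of a verified-output rate rather than a pass rate.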

Bragi-LLM · sub-1GB · MBPP 92% · deterministic verifier · zero API

22 · MoE for weak machines

Decoupling model knowledge from memory bandwidth

The real wall for local conversational coding on weak hardware is memory bandwidth per token, not model size. A dense 7B reads its full weights for every token; a Mixture-of-Experts model with the same total knowledge activates only a fraction per token, so a large-on-disk model can still run fast on an old memory bus. The current line tests a frontier MoE coder (35B total, 3B active) under importance-guided quantization, with a broad dual-axis calibration corpus covering both conversational requests and code, so a person describing a vague idea in plain language can still get working code on a cheap machine.
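A back-of-envelope version of the bandwidth argument: if every generated token has to read the active weights once, memory bandwidth divided by bytes read per token gives an upper bound on tokens per second. The 4-bit weights and 50 GB/s memory figure are assumed values for illustration, and the estimate ignores KV-cache traffic, compute, and routing overhead.

```python
def max_tokens_per_second(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed when weight reads dominate."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BITS = 4            # assumed quantization width
BANDWIDTH = 50      # assumed memory bandwidth in GB/s

dense_7b = max_tokens_per_second(7e9, BITS, BANDWIDTH)
moe_3b_active = max_tokens_per_second(3e9, BITS, BANDWIDTH)
print(f"dense 7B: {dense_7b:.1f} tok/s, MoE with 3B active: {moe_3b_active:.1f} tok/s")
```

Under these assumptions the MoE bound is 7/3 of the dense bound even though the MoE holds far more parameters on disk, which is the decoupling the section describes.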

This is the explicit counter-position to GPU hegemony: for one user running one local model, the answer is not a two-thousand-dollar accelerator but a well-chosen MoE on ordinary modern memory. Privacy, zero subscription, and no deplatforming risk are the point.

MoE · active params · memory bandwidth · local-first · anti-subscription

How the Papers Are Made

The research output is a pipeline, not a manuscript factory. Public systems generate interactions; telemetry turns them into measurable traces; negative and positive findings become deposits; frameworks emerge from accumulated evidence.

Interaction Surfaces

Demos, dashboards, ability hubs, prediction sites, and agent interfaces create real substrate pressure.

Alfred demo
Lobster dashboard
Ability hub

Telemetry & Memory

Agents leave cognitive-state traces, listening matrices, trust proxies, memory snapshots, and task outcomes.

3,991 parameter surfaces
682 observable fields
14 core telemetry channels

Empirical Deposits

Raw records and negative results are archived before they are transformed into broad theory.

20018183 RCT anchor
20083802 panel
20071372 MASL

Frameworks & Standards

Relational Cognitive Telemetry, Agentic Nervous Systems, and LOBSTER-Bench emerge from the evidence path.

RCT synthesis
ANS framing
benchmark gate

Demo Stack

The demos are not marketing garnish. They are the interaction surfaces that produced the research questions: observability, cognitive state, agent coordination, memory, capability routing, and human-facing AI systems.

★ frankenstein arena (live)

3D LLM Surgery Viewer

Frankenstein Arena

108 voxel rooms scattered across 5 layer bands of a 7B transformer. Rooms breathe, drift, self-rotate. Every 3–5 seconds, 3–5 rooms swarm together, fuse, then disperse to new positions. Rose-gold cables follow them in real time. Golden cargo orbs slide along active routes with comet trails. A live picture of the cube structure rearranging itself.

Open arena

7B internal point cloud

3D LLM Sensor Map

LLM Model Viewer

14,912 fired sensor cells from qwen2.5:7B plotted as an interactive 3D point cloud. Six functional lanes are colour-coded (universal backbone, Chinese-only, English-only, physics, algorithm, ML). Hover over any point for its full (L, D, Q) coordinate. Mercury observability rendered as a single 2 MB self-contained HTML file.

Open viewer

cross-model bridges

3D Cross-Model Alignment

Overlap: 7B × 3B Bridges

Two qwen2.5 point clouds (7B left, 3B right) connected by 49 golden bridges: dimensions that are hot in both models (4.4× above random). Eleven of those are triple-confirmed (also in the within-7B universal lane, 27× above random). This is the visible cross-scale structural alignment that drives Frankenstein observation-driven surgery.
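One common way to read a figure like "4.4× above random" is as an enrichment ratio: the observed overlap between two sets of hot dimensions divided by the overlap expected if both sets were drawn independently from the same index space. The sketch below assumes that reading and assumes a shared index space; how dimensions are actually matched across the 7B and 3B models is not described here. The set contents and the space size are toy values, not data from the overlap viewer.

```python
def overlap_enrichment(hot_a, hot_b, space_size):
    """Observed overlap divided by the overlap expected under independence.

    hot_a, hot_b: sets of dimension indices flagged as hot in each model.
    space_size: number of indices both sets are drawn from (an assumption here).
    """
    observed = len(hot_a & hot_b)
    expected = len(hot_a) * len(hot_b) / space_size
    return observed / expected

# Toy example, not real measurements: 30 and 40 hot dimensions out of 1000,
# sharing 6 indices.
hot_a = set(range(0, 30))
hot_b = set(range(24, 64))
print(overlap_enrichment(hot_a, hot_b, space_size=1000))
```

A ratio near 1 would mean the bridges are what chance alone predicts; values well above 1 are what make shared dimensions usable as anchors.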

Open overlap

90s scripted tour

3D Guided Flythrough

Cinematic: 7B Tour

A scripted 90-second camera flythrough of the qwen2.5:7B internal sensor map. Press play, watch each lane light up in narrated sequence: universal backbone, language detectors, topic circuits, cross-model bridges. Designed for first-time viewers who want a guided tour before exploring the live point cloud.

Open cinematic

SLERP surgery map

3D Hybrid LLM Surgery

Frankenstein v1: Three-Pillar Graft

The first Frankenstein experiment visualised: three vertical cones (Instruct on the left, Hybrid in the middle, Coder on the right) of qwen2.5-7B, with rose-gold anchor cables marking the 11 triple-confirmed dimensions preserved across the SLERP graft. v1 itself failed (documented in Paper D), but the surgery map remains the canonical visual of the observation-driven merge approach.
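For readers unfamiliar with SLERP, the standard spherical linear interpolation between two weight vectors can be sketched as below. This is the textbook formula only, assuming flattened weight vectors as plain Python lists; it does not model how the v1 graft selected layers or preserved the anchor dimensions, which this page does not specify.

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between vectors v0 (t=0) and v1 (t=1)."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # angle undefined for a zero vector: fall back to linear
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp  # nearly parallel or antiparallel: formula is unstable
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]
```

Unlike plain averaging, the interpolated vector follows the arc between the two endpoints, so for two unit vectors the midpoint also has unit length.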

Open graft map

assistant surface

Personal AI Surface

Alfred Demo

Human-facing assistant layer that shows how agent systems become product surfaces rather than dashboards alone.

Open demo

ux x telemetry

Bridge Demo

Alfred × Human Mind

Where personal-agent UX meets the lobster cognitive substrate and turns research telemetry into interaction.

Open demo

ability routing

Capability Layer

Lobster Ability Hub

Skill and ability routing surface for evaluating what agents can do, not just what they can say.

Open hub

public product

Public Product

Alfred EN

Public-facing product narrative for a system that moves from observability into personal AI deployment.

Open site

substrate topology

Research Map

Human Mind Topology

Topology view of the cognitive substrate: the system map behind the papers and benchmark work.

Open topology

Temporal depth is not elapsed time. It is the record of observations, failures, repairs, and redesigned measurements that cannot be fabricated later.

This is the moat: a substrate that has lived long enough to produce null results, revisions, working demos, data packages, and new research language.

Identity & Entrypoints

This page is the academic front door. The demos sell the reality of the system; the papers preserve the evidence; the ORCID and Zenodo records make the work citable.

ORCID

Persistent researcher identifier for scholarly credit and deposit linking.

0009-0006-6816-9891

Live Observatory

The operational surface for lobster telemetry, cognitive reports, and ongoing substrate inspection.

Open dashboard

Contact

For research discussion, deposits, and collaboration around long-lived LLM agent systems.

norika@charenix.com