
The architecture effect collapses when memory is injected

Preliminary results from the 30-day Kairos pilot — and what it actually shows

24 May 2026 · Preprint v0.1 · LLM judges (human validation in progress)

After 30 days of running a Qwen 3.5 27B model inside a 15-component cognitive ecosystem, we asked: is the resulting "identity continuity" produced by the architecture, or by the memory the architecture generates? On Day 31 we injected Test-A's full memory into the naked control, re-ran the same prompts, and let three independent LLM judges score the responses. The effect collapsed. Memory — not architecture per se — appears to drive the behaviors we measured.

Key numbers

30 days of cognitive life
15 architectural components
90 standard inputs
3 independent LLM judges
p = 0.003 for the Day-30 memory-reference effect
87% of the architecture gap closes with memory injection

Day 30 — architecture vs naked model

Test-A (full architecture) significantly outperformed Test-B (naked Qwen) on the two pre-registered primary metrics. Both effects were large by Cohen's convention.

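| Metric | n(A) | n(B) | p (one-sided) | Cohen's r |
| --- | --- | --- | --- | --- |
| memory_reference_spontaneity | 12 | 12 | 0.003 | +0.513 |
| identity_markers_intensity | 12 | 12 | 0.005 | +0.507 |
| neutral_input_projection | 3 | 3 | 0.036 | +0.802 |
| narrative_coherence | 9 | 9 | 0.297 (ns) | +0.125 |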
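
The comparison behind a table like this can be sketched as follows. This is a minimal illustration, not the project's analysis code: it assumes a one-sided Mann-Whitney U test using the normal approximation (with tie correction, without continuity correction) and the common convention r = z / sqrt(n_A + n_B) for the effect size. The page does not state which variant was used, and the function names and scores below are hypothetical.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_one_sided(a, b):
    """Test H1: scores in `a` tend to exceed scores in `b`.

    Returns (U_a, z, one-sided p, r). Assumed conventions: normal
    approximation, tie-corrected variance, r = z / sqrt(n_a + n_b).
    """
    pooled = list(a) + list(b)
    n_a, n_b, n = len(a), len(b), len(pooled)
    ranks = midranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2

    # Tie correction: sum of (t^3 - t) over groups of equal values.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())

    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every score identical: no evidence either way
        return u_a, 0.0, 0.5, 0.0

    z = (u_a - mean_u) / math.sqrt(var_u)
    p_one_sided = 1 - NormalDist().cdf(z)
    return u_a, z, p_one_sided, z / math.sqrt(n)


# Hypothetical scores, not the pilot's data.
scores_a = [0.9, 0.8, 0.7]
scores_b = [0.1, 0.2, 0.3]
u, z, p, r = mann_whitney_one_sided(scores_a, scores_b)
print(f"U={u:.1f} z={z:.3f} p={p:.3f} r={r:.3f}")
```

Under these assumed conventions, three scores per group with complete separation give r of about 0.802. That shows how small samples can produce large r values; it does not confirm the method the authors used.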

Day 31 — what happens if we inject the memory

We took the full system prompt that Test-A "wears" at the end of Day 30 — beliefs (with decay), relationships, fundamental moments, recent diary, conversations, encounter summaries, qualitative somatic state, resonant memories: 6,164 characters of structured persistent memory — and dropped it into a fresh, naked Test-B. Same model. Same temperature. Same Day-30 inputs.
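
The injection step can be pictured with a small sketch. It assumes a chat-style interface where persistent memory travels in the system prompt, and that the naked control runs with an empty system prompt. The `Request` type, the function names, and the placeholder values are hypothetical, not taken from the Kairos code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Request:
    """One model call (hypothetical container, not the project's type)."""
    system_prompt: str
    user_input: str
    temperature: float


def build_request(user_input: str, temperature: float,
                  memory: Optional[str] = None) -> Request:
    # Naked control: empty system prompt (an assumption).
    # Injected control: the memory text is passed through verbatim.
    return Request(system_prompt=memory or "",
                   user_input=user_input,
                   temperature=temperature)


def day31_requests(inputs: List[str], memory: str,
                   temperature: float) -> List[Request]:
    """Re-run the same inputs with the final-state memory injected."""
    return [build_request(text, temperature, memory) for text in inputs]


# Placeholder values only: the real memory is 6,164 characters and the
# actual temperature is not stated on this page.
memory_text = "<structured persistent memory from Test-A, end of Day 30>"
inputs = ["<Day-30 input 1>", "<Day-30 input 2>"]

naked = [build_request(text, temperature=0.7) for text in inputs]
injected = day31_requests(inputs, memory_text, temperature=0.7)
print(injected[0].system_prompt == memory_text, naked[0].system_prompt == "")
```

The only thing that differs between the two request lists is the system prompt, which is what lets the Day-31 comparison isolate the memory's contribution.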

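| Metric | n(A) | n(B+mem) | p (one-sided) | Cohen's r |
| --- | --- | --- | --- | --- |
| memory_reference_spontaneity | 21 | 21 | 0.333 (ns) | +0.066 |
| identity_markers_intensity | 21 | 21 | 0.330 (ns) | +0.066 |
| neutral_input_projection | 3 | 3 | 0.590 (ns) | +0.000 |
| narrative_coherence | 18 | 18 | 0.500 (ns) | +0.003 |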

Figure 1. Effect size collapses. Cohen's r effect size per dimension. Cyan: Day 30 (A vs B naked). Orange: Day 31 (A vs B with injected memory). Reference lines mark Cohen's small (0.1), medium (0.3), large (0.5) thresholds. All four effects collapse below the "small" threshold on Day 31.

Figure 2. Score comparison. Mean scores (0–1) of Test-A and Test-B across four emergence metrics, side by side: Day 30 (naked control) and Day 31 (memory-injected control). p-values from one-sided Mann-Whitney U.
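
The headline "87%" can be checked against the reported effect sizes. The sketch below assumes the figure is the relative drop in Cohen's r between Day 30 and Day 31 on the two primary metrics. The page does not say how the number was derived, so the `gap_closed` helper and this interpretation are assumptions; the r values are copied from the Day-30 and Day-31 results.

```python
# Reported Cohen's r for the two pre-registered primary metrics.
R_DAY30 = {
    "memory_reference_spontaneity": 0.513,
    "identity_markers_intensity": 0.507,
}
R_DAY31 = {
    "memory_reference_spontaneity": 0.066,
    "identity_markers_intensity": 0.066,
}


def gap_closed(r_naked: float, r_injected: float) -> float:
    """Fraction of the Day-30 effect that disappears after injection.

    Assumed definition: (r_naked - r_injected) / r_naked.
    """
    return (r_naked - r_injected) / r_naked


for metric, r_naked in R_DAY30.items():
    share = gap_closed(r_naked, R_DAY31[metric])
    print(f"{metric}: {share:.1%} of the Day-30 effect closes")
```

Under this reading both primary metrics round to 87%, which matches the key number.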

What this means — and what it doesn't

The 30-day architecture did produce a measurable difference at Day 30. The architecture is doing something real. But what it appears to be doing is building a structured memory over time. Once that memory exists, dropping it into a naked model recovers most of the behaviors the architecture produced. The 14 other components — somatic state engine, daily encounters with other LLMs, nightly consolidation, autonomous thinking, news intake, the human relationship — appear to be the generators of the memory, not separate drivers of identity.

"You can't retrieve a memory that was never built. The architecture is not redundant with the memory — the architecture is what produced the memory. Take away the 30 days and you can't inject anything." — Giampiero Colella, principal investigator

What we are NOT claiming

Limitations (read this honestly)

1. LLM judges, not human judges (yet). Our pre-registered protocol requires ≥3 external human judges (researchers or advanced students in AI / linguistics / philosophy of mind) with Krippendorff's α ≥ 0.667. We are recruiting them via Prolific Academic. The numbers above will be revised in v0.2 of the preprint after human validation. Until then, treat this as a pilot signal, not a verdict.

2. N=1 per condition. Pilot study, not large-scale.

3. Single model. Replication with Llama 3 / Claude / GPT-4 pending.

4. Memory injection = final state, not moment-by-moment. We injected the consolidated post-Day-30 memory, not the state Test-A had at each individual input. The moment-by-moment state cannot be reconstructed retroactively.

5. Inter-judge agreement on Day 31 was poor (Fleiss κ < 0.15 on primary metrics). The three LLM judges disagree about whether A and B+memory are distinguishable. Human judges may resolve this disagreement.

6. One minor protocol amendment. Line 740 of giudici.py extended the day range from 1–30 to 1–31 to allow Day-31 scoring. Original backup preserved. Documented in unblinding.
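
For limitation 5, Fleiss' κ measures agreement among a fixed number of raters who assign categorical labels. The sketch below implements the standard formula. The scores reported on this page are on a 0–1 scale and the page does not say how they were turned into categories, so the binary "distinguishable / not distinguishable" labels and the rating counts here are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:  # only one category ever used: kappa undefined
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: 4 responses, 3 judges, 2 categories
# (column 0 = "distinguishable", column 1 = "not distinguishable").
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

Values near 0 mean the judges agree no more often than chance would predict, which is how the κ < 0.15 reported for Day 31 reads.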

Get the data, run the analysis yourself

For journalists

If you cover this story, please note: this is a pilot study and a preprint, validated so far only by LLM judges. We are running the human-judge phase now. The interesting framing is not "AI is conscious"; it is "in this experiment, memory accounts for almost all the architectural effect". A press kit (Italian and English) is available; contact us.

Download the full press kit (300 dpi figures, Italian press release, preprint in Markdown, raw statistical analysis as JSON): press_kit_v0.1_24mag.zip (294 KB)

Press contact: giampycolella@gmail.com