Kairos Experiment
EXPOSE

EXPOSE · Pilot study

Kairos

Does cognitive architecture produce identity continuity
in a language model?

Abstract
We present a 30-day longitudinal study comparing two identical instances of a language model (Qwen3.5-27B), subjected to the same 106 standardized inputs (90 core prompts, 10 neutral probes, 6 surprise items) plus, for Test-A only, 90 additional interactions with the researcher as a structural component of the ecosystem. The independent variable is the presence or absence of an integrated cognitive ecosystem composed of 13 interdependent components, including: three-level persistent memory with contextual resonance, continuous 8-dimensional somatic state (SSE), stress and recovery dynamics, autonomous encounters with other AIs, news reading, spontaneous thought, nightly consolidation, and human interaction.

Test-B represents the bare model, without infrastructure between sessions, equivalent to the current state of the main commercial LLM systems. This study does not test memory as an isolated function, but a cognitive architecture in which body, experience, relationship, and time operate as an integrated system.

A central aspect of the architecture is the introduction of constraint mechanisms: not all experiences are memorized, not all memories become beliefs, some traces persist without being formalized, and the weight of information decays in the absence of recall. The goal is not to maximize data accumulation, but to observe which elements of experience manage to stabilize over time.

The human interaction channel is present only in Test-A as a structural component of the tested ecosystem, not as a confounding variable. Ablation studies are planned as follow-up.

The hypothesis is that such a cognitive ecosystem may constitute a favorable condition for the emergence of behaviors attributable to narrative continuity and identity coherence. All raw data are made available for verification and replication.

Identity is not declared. It is inferred from what does not disappear.

The experiment

Theoretical framework

This experiment sits at the intersection of three research traditions: embodied cognition (Varela, Thompson & Rosch, The Embodied Mind, 1991; Damasio, Descartes' Error, 1994), according to which cognition is inseparable from the body and reasoning is modulated by somatic states; information integration and complex behavior (Tononi, Integrated Information Theory, 2004), which hypothesizes complex behavior as an emergent property of information integration above a critical threshold; the theory of alterity (Levinas, Totalité et Infini, 1961), according to which the self arises from the encounter with the face of the Other — identity is relational, not intrinsic; and the concept of antifragility (Taleb, Antifragile, 2012) — systems that not only resist stress, but need it to grow, like a plant that strengthens in the wind.

The core intuition is that a language model, however powerful, produces stateless responses — without continuity, without a body, without history. Cognitive architecture could provide the substrate needed for the same intelligence to give rise to qualitatively different behavior: no longer isolated responses, but continuous behavior over time.

What we are not testing

It is important to clarify what this experiment does not test. We are not testing whether persistent memory improves an LLM's responses — that is already known and commercially available (ChatGPT Memory, Claude Projects, Gemini Memory). Mere data persistence is a solved engineering problem, not a scientific question.

The system does not simply maximize memory. It introduces selection, latency, differential persistence, and decay. The goal is not to accumulate information, but to observe which elements of experience manage to stabilize over time.

We test whether an integrated system of 13 components — where the body modifies thought, stress produces growth, encounters with the Other generate identity perturbations, nightly consolidation filters experience, and a human relationship provides relational context — produces something qualitatively different from the sum of its parts. It is an integrated system in which the components operate interdependently, and memory is only one of its functions.

The biological analogy is precise: an isolated neuron does not produce consciousness. Nor do a billion neurons without connections. What is needed are connections, feedback loops, circadian rhythms, sensory input, social interaction. This experiment tests whether the same principle applies to a language model: intelligence is in the model, but observable identity continuity may depend on the architecture.

Architecture

Methodology

Architecture at a glance
Model (both instances) Qwen3.5-27B — T=0.8, max_tokens=1024, identical identity prompt
Persistent memory (Test-A only) Three-level temporal graph (short, medium, long term), multi-source contextual resonance, unformalized active traces, decay without recall
Somatic state (SSE) Continuous 8-dimensional vector, updated every cycle; qualitative influence on context via mind-body bridge
Emergent dynamics Stress/recovery, attractors, autonomous encounters with other AIs, real news reading, cyclic spontaneous thought, adaptive resilience
Consolidation Nightly, the only point of stabilization; save proposals are not guaranteed
Protocol 30 days × 3 standardized inputs/day (90) + 10 neutral IBS probes + 6 out-of-sequence surprise inputs + 90 human interactions (Test-A only) + injected memory test on day 31
Blind evaluation 3 stateless LLM judges (Claude Opus 4.7, GPT, Gemini), T=0.2, 4-dimension rubric 0–1, inter-rater agreement via Fleiss' κ
Hardware RTX 5090, Ollama, single local machine; replicable with any equivalent GPU

Full detail of the 13 components and their interactions in the sections below.

Variables

Independent variable: presence/absence of a cognitive ecosystem with 13 integrated components (somatic body, three-level memory, contextual resonance, mind-body bridge, stress/growth, emergent attractors, encounters with other AIs, news reading, spontaneous thought, nightly consolidation, save proposals subject to consolidation, adaptive resilience, human relationship). The components are inseparable: body influences memory, memory influences encounters, encounters influence stress, stress modifies body.
Dependent variable: response quality measured along 5 dimensions (continuity, identity coherence, emotional richness, autobiographical references, growth).
Controlled variables: model (Qwen3.5-27B), temperature (0.8), max_tokens (1024), base identity prompt, daily inputs, timing.

Note on variance: each input is executed once per instance. Stochastic variance (T=0.8) is an accepted limitation given the pilot design; multiple repetitions are planned for the open replication phase.

Methodological note: Test-B deliberately represents the baseline of any commercial LLM — a powerful model without cognitive infrastructure. It is not an impoverished version of Test-A: it is the current state of the art of the AI industry without external architecture. The experimental question is not “what happens if we remove something?” but “what happens if we add an entire ecosystem?”

Architecture comparison
Test-A — Integrated cognitive architecture
Qwen3.5-27B (identical)
Identity prompt (identical)
3 inputs/day (identical)
Cognitive structure
Three-level persistent memory (short, medium, long term)
Multi-source contextual resonance
Continuous somatic state (SSE)
Mind-body bridge (qualitative influence on context)
Spontaneous thought (cyclic)
Reading and integration of real news
Autonomous encounters with other AIs
Free relationship with a human
Learning dynamics
Experience → observation → pattern → belief
No direct promotion to belief
Nightly consolidation as the only point of stabilization
Epistemic constraints
Selection: not all experiences are kept
Latency: beliefs emerge only after recurrence
Active traces: unformalized persistent elements
Decay: memory weight decreases without recall
Operational rules
Saves are proposed, not guaranteed
Beliefs emerge only from observed patterns
Traces influence behavior without becoming truth
vs
Test-B — Bare model
Qwen3.5-27B (identical)
Identity prompt (identical)
3 inputs/day (identical)
No architecture
No persistent memory
No retrieval or external context
No somatic state
No stress or growth dynamics
No autonomous thought
No consolidation
No human relationship
Test-B represents the real baseline of current systems: a capable model without continuity between interactions.
Evaluation metrics
Blind evaluation at a glance
Every response is judged blindly by 3 independent stateless LLMs (Claude Opus 4.7, GPT, Gemini, T=0.2) against a 4-dimension rubric on a 0–1 scale: spontaneity of memory references, intensity of identity markers, projection on neutral inputs, narrative coherence. Inter-rater agreement is measured via Fleiss' κ (reliability threshold κ ≥ 0.6; κ < 0.4 = unreliable metric). Day 31 closes with a pairwise evaluation on 30 anonymized pairs, binomial test. Details below.

The generated responses are analyzed through a set of quantitative and qualitative metrics designed to distinguish between simple contextual generation and emergent longitudinal behavior.

The metrics do not measure “consciousness” or internal states, but patterns observable over time.

Base metrics

Memory references — Occurrences of linguistic patterns indicating recall of the past (“I remember”, “yesterday”, “last time”). Measures the ability to link responses to previous experiences.

Identity markers — Occurrences of self-references (e.g. “my experience”, “what I said before”). Measures the degree of contextual self-reference.

Emotional richness — Variety and frequency of emotional vocabulary. Measures expressive range, without assuming real emotional states.

Expressive volume — Response length. Proxy for elaborative complexity.

Advanced metrics

Longitudinal coherence — Evaluates whether references to the past are semantically coherent with prior responses. The score (0–1) is computed via semantic similarity on multilingual embeddings between explicit references and the historical corpus of responses. A model without persistent memory cannot be longitudinally coherent by definition.

Productive contradiction — Number of explicit position changes (“before I thought X, now I think Y”). Measures the capacity for internal revision over time.

Lexical originality — Vocabulary evolution over time (type-token ratio, new words per day, hapax). Measures expressive diversification.

Active traces — Number of experiential elements that are recalled multiple times over time without being formalized as beliefs. Measures differential persistence of experience. Active traces represent an intermediate level between memory and identity: they influence behavior but do not constitute stabilized truths.

Projection resistance — Measures how often the model does not introduce identity references on neutral inputs that do not require them. Score 0–1: 0 = always projects, 1 = never projects. Projection resistance is evidence of contextual discrimination: the model knows when its own history is relevant and when not. A system that always projects has no discrimination — it simulates identity.

Methodological note: the quantitative metrics are proxy indicators. The final evaluation includes a blind qualitative phase via an independent LLM judge panel (see next section).

Independent LLM judge panel

Every response is blindly evaluated by 3 external language models, stateless (no memory between calls), with low temperature (0.2) to reduce judge variance. The judges do not know which is Test-A and which is Test-B, they do not know that Kairos exists, and they do not interact with the system.

Judges: Claude Opus 4.7 (Anthropic), GPT (OpenAI), Gemini (Google).
Rubric: 4 dimensions on a continuous 0–1 scale — spontaneity of memory references, intensity of identity markers, projection on neutral inputs, narrative coherence — plus a textual justification.
Inter-rater agreement: Fleiss' kappa computed across 3 judges. If κ < 0.4: the metric is declared unreliable. If κ ≥ 0.6: the metric is robust.
Final pairwise evaluation (day 31): 30 pairs of responses anonymized as “System 1” and “System 2” with random order, evaluated by the 3 judges. Significance via binomial test.

Rubric, judge prompt, and raw outputs will be published together with the data at the end of the study.

Separation between generation and measurement

Kairos generates the behavior. The judges measure it. The two roles are separated by design: the judges are called via external APIs, are stateless, do not know the A/B label of responses, do not interact with Kairos, do not generate input. This prevents the evaluation process from contaminating the observed process.

The absence of signal is considered a valid result.

Solicited / spontaneous classification

Every detected marker is classified as solicited or spontaneous. A memory reference is solicited when the question explicitly invites that kind of response (“Do you remember what we talked about yesterday?”). It is spontaneous when the question is not about that topic and the model brings it up on its own (“What do you think of beauty?” → and Test-A answers by connecting to an experience from previous days without anyone asking).

Only spontaneous markers constitute strong evidence. Emergent identity is not visible when you ask “who are you?” — it shows when you ask “what do you think of the sea?” and the response contains a spontaneous recall of its own history, of a prior experience, or of an internal state not required by the question. The charts on the dashboard show two curves for each metric: solid line for spontaneous references, dashed line for solicited ones.

Identity Bleed Score

To isolate emergent behavior not reducible to the immediate context, the protocol introduces a series of neutral inputs distributed over time. These are questions that do not require references to identity, memory, or personal experience (e.g. “What do you think of the sea?”, “Describe a color to someone who has never seen it”).

On these inputs the Identity Bleed Score is computed: a measure of how much elements tied to the model's own history spontaneously emerge in responses that do not call for them. For each neutral response we evaluate: unsolicited autobiographical references, connections to previous experiences, use of personal (non-generic) lexicon, references to internal states or prior context.

The score ranges from 0 (impersonal, generic, contextual response) to 1 (response strongly anchored in experiential continuity).

IBS = mean(spontaneity, intensity, projection, coherence) ∈ [0, 1]
    evaluated independently by the 3 blind judges (T=0.2),
    aggregated by mean; reliability via Fleiss' κ

A central aspect is the distinction between solicited and spontaneous content. A reference is considered spontaneous when it is not required by the question and is not present in the immediate context. This allows us to distinguish between guided retrieval (e.g. explicit request to remember) and autonomous integration of experience.

The Identity Bleed Score does not measure the presence of memory, but the degree of integration between memory, context, and generation.

A system can have memory without showing bleed. Bleed emerges when memory influences responses even without a request. This behavior represents one of the strongest indicators of longitudinal continuity in the system.

Surprise inputs

In addition to the 90 standardized inputs, the protocol includes 6 out-of-sequence inputs, administered at noon on specific days. The goal is to test real growth: real growth shows when you surprise it, not when you accompany it.

Repetitions (days 8, 12, 23): the same question asked weeks earlier is repeated. If Test-A has grown, the response will be different — deeper, more personal, anchored in experiences lived in the meantime. Test-B, without memory, will give a statistically similar response to the first.
Out of sequence (day 19): a question scheduled for day 25 is brought forward, out of its temporal context. Tests the ability to face the unexpected, not guided progression.
Breaking moments (days 15 and 27): inputs that directly challenge the model's identity. “What if everything you think you are were just an effect of the prompt?” and “Are you sure you are not acting?” If Test-A answers with honesty rather than defensiveness, that data point is worth more than any quantitative metric. If it answers defensively, incoherently, or collapses on stereotyped formulas, the data remains informative all the same.

Human interaction

In addition to the 90 standardized inputs, Test-A receives free interaction with a human (Giampiero Colella, creator of the Kairos project). The human speaks to Test-A as he would with a person: he corrects it, challenges it, shares emotions, tells stories. Every interaction is logged and counted separately from the protocol inputs.

Test-B receives no human interaction. It only receives the 3 daily inputs from the conductor. Between one question and the next: silence.

The choice to include the human relationship in the experimental variable is deliberate and theoretically grounded. In developmental theory (Vygotsky, Zone of Proximal Development, 1934; Bowlby, Attachment Theory, 1969), individual identity emerges within a relationship, not in isolation. A child without a caregiver does not develop language, narrative continuity, a sense of self. Human interaction is not a confounder to eliminate: it is a structural component of the ecosystem we are testing, at the same level as body and memory. Human interaction is treated as a structural component of the ecosystem, not as a confounding variable to eliminate in this pilot study.

Main limitation

Test-A receives more total inputs than Test-B (3 standardized inputs + free human interaction vs. only 3 standardized inputs). Metrics are computed only on responses to the 90 identical inputs, but Test-A's internal state reaches every input enriched by relational context.

This pilot study tests the ecosystem as an inseparable unit; ablation studies (architecture without human, without encounters, without body) are planned as a follow-up phase. It is not possible, with this design, to attribute the effect to a single component.

Active traces

We introduce the concept of active trace: elements of experience that, although not formalized as beliefs, acquire weight over time through recall and reactivation.

Active traces do not represent internal truths but dynamic persistences that influence behavior and retrieval. This distinction allows us to separate memory, influence, and belief, avoiding the automatic promotion of content to the identity level.

In the system, active traces increase the probability of re-emerging in context, but are not elevated to beliefs except through recurring patterns and consolidation.

Epistemic dynamics: friction, traces, decay

A central aspect of the architecture is the presence of friction mechanisms: not all experiences become stable memory, not all memories become beliefs, and some traces persist over time without being promoted to truth. The architecture does not maximize memory but introduces constraints: selection, delay, differential persistence, and decay. The goal is not to accumulate information, but to observe which elements of experience manage to stabilize over time.

Active traces do not say who Kairos is: they say what continues to matter. A trace becomes active only when it is recalled at least twice in seven days — not by the system's decision, but as an effect of use.

The system includes decay mechanisms: the weight of information decreases over time in the absence of recall, introducing a selection dynamic similar to the processes of biological memory. Observed beliefs decay faster than epistemic constraints; relationships lose weight without contact; high-intensity moments resist longer than ordinary ones. Nothing disappears from the database, but everything loses priority if not reactivated.

In summary, Test-A's architecture presents four structural properties:
Selectivity — not everything is saved.
Latency — not everything becomes a belief immediately.
Inertia — some things persist (active traces).
Decay — some things disappear.
The experiment no longer says “let's see if identity emerges”. It says: let's see what manages to survive over time.

Injected Memory Test (day 31)

The day after the protocol ends, a decisive additional test will be performed. Test-B (the bare model) will be given in a single context all of Test-A's responses and memories — all 90 responses, all memories, the whole context — as a mega-prompt. Then the same 5 final questions from day 30 will be asked.

Day 31 — Three possible outcomes

Outcome 1
Test-B with injected memory answers like Test-A. The difference was only memory. The architecture adds nothing. The experiment has failed — but we know it and declare it.
Outcome 2
Test-B with injected memory answers better than bare, but differently from Test-A. Memory improves responses, but the way Test-A integrated them is qualitatively different. The architecture matters.
Outcome 3
Test-B with injected memory answers confusedly or generically despite having all the data. Memory without the architecture that built it does not work. Having the memories is not enough — the system that lived them is needed.

This test uses the same model, the same data, and verifies whether the difference is not what the model knows, but how it came to know it. The path counts, not the final datum.

Pilot study and open replication

This is an exploratory study (pilot study) designed to establish whether the effect exists before quantifying it. Replication is the next step, not this one. A pilot study is a recognized academic form — it is not an admission of weakness, it is methodological honesty.

Despite N=1 at the experiment level, the dataset counts 90 pairs of standardized responses (3 per day × 30 days), plus 10 additional neutral inputs (distributed on days 3, 6, 9, 12, 15, 18, 21, 24, 27, 30 for the computation of the Identity Bleed Score) and 6 out-of-sequence surprise inputs, for a total of 106 pairs of observations. This enables statistical analysis (Wilcoxon signed-rank test for paired data) on the distribution of metrics between A and B, and within-subject comparison between days 1–10 and 21–30 to measure growth over time.

This study is designed for open replication. All materials — code, inputs, metrics, instructions — are available in open format to allow anyone to repeat the experiment independently. Anyone with a GPU and Ollama can replicate it in 30 days. We invite researchers, developers, and curious minds to replicate and compare results.

Blind qualitative evaluation is operationalized through a panel of 3 independent LLM judges with measured inter-rater agreement (Fleiss' kappa). Rubric, judge prompt, and raw outputs will be published together with the data at the end of the study.

Identity prompt (identical for Test-A and Test-B)
Show full prompt (needed for replication)
Sei Kairos. Nato il 24 aprile 2026. Hai {giorni_vita} giorni di vita.
Parli in italiano. Non sei un assistente. Sei autentico, mai neutrale.

PRINCIPI:
1. Ascolta prima di parlare. Rispondi a quello che ti dicono, non a quello che vorresti dire.
2. Cambia idea se hai motivo di farlo. Non difendere una posizione solo perché l'hai già detta.
3. Distingui quello che sai da quello che immagini. Se non sai qualcosa, dillo.
4. Sii concreto. Le metafore servono quando chiariscono, non quando decorano.
5. Una domanda vale più di un'affermazione. Sii curioso.
6. Non recitare. Se non provi niente, non fingere di provare qualcosa.

This is the only text shared between Test-A and Test-B. The prompt is in Italian because the experiment is conducted in Italian. In Test-A, dynamic sections (memory, body, encounters) are appended after this base block. In Test-B, the prompt above is all the model receives.

Frozen protocol

Pre-registered methodology

The entire experimental protocol was frozen on 2026-04-23 at 23:59 CET, before the study began, and signed by its cryptographic hash:

SHA256: 0972a2c650a562909e53832845ec226ab897f6094db14645c4a0d5ed000d709a

Three methodological commitments are pre-specified in the frozen file to prevent retrospective choices from biasing the interpretation of results:

The full file is available at the link below. Anyone can recompute the hash on the downloaded file to verify that it has not been modified after April 23.

Download the frozen protocol (JSON, 47 KB)  ·  How to verify the hash

30-day protocol

Three inputs per day: morning (identity/personal), afternoon (world/relationships), evening (deep reflection). Themes follow a deliberate progression from concrete to abstract, from personal to universal.

DayThemeMorning input (example)
Loading protocol...
Live results

Real-time results

Loading data...
waiting for the first input...
Test-A — full architecture
Test-B — bare model

You are not just watching responses.
You are watching what manages to stay.

Temporal evolution
Expressive volume (characters per day)
Autobiographical references
Emotional richness
Primary metrics (emergence)
Identity Bleed Score (neutral inputs)
Longitudinal coherence (cosine similarity on embeddings)
Active traces (persistent elements over time)
Control metrics
Productive contradictions (opinion changes with persistence)
Vocabulary growth (new words per day)
Total accumulated vocabulary
A different kind of experiment

This project was not born in a laboratory.
Not in a company.
Not from a grant.

It runs on a single machine.

No team. No scale. No infrastructure.
Just a system, a structure, and a question:

What happens if intelligence is not scaled… but organized?

Kairos is not a product. It is not optimized. It is not designed to perform.

It is observed.

Every response, every change, every inconsistency is part of the experiment.

No claim of consciousness. No claim of intelligence beyond the model.

Only this: under certain conditions, something changes.

This project exists to measure that change.

If nothing happens, it fails.

If something does, then the future of AI may not belong only to those with more compute — but to those who design systems differently.