Agent Coordination Is a Distributed Systems Problem
Open Claude Code and start working. Yesterday's session is gone: the architecture it understood, the convention it established, the dead end it explored. Two other sessions are running right now on overlapping files, and yours can't see them. Private context, no shared state. So you re-explain, re-discover, and watch agents step on each other's work.
During development of this blog's interactive features, a session found an encoding bug across three call sites, an alist/plist mismatch that silently broke WebSocket communication. It recorded the diagnosis in the task's event log and exited. The next session replayed the log on bootstrap and inherited the fix, with no explicit communication between them. Standard event replay.
That's coordination, not retrieval, not memory, not a better prompt. Multi-session AI agent workflows are concurrent processes with private state, independent failure modes, and shared mutable resources, exhibiting all four properties Waldo et al. identify as distinguishing distributed from local computing: latency, separate memory, concurrency, and partial failure [9]. Lamport's original treatment explicitly includes processes on a single computer [10]. The coordination questions are specific: who else is editing this file right now? Did the session that crashed at 2am leave work half-finished? Which saved patterns actually improve outcomes? These have well-studied solutions in the distributed systems literature [1]. Event sourcing provides durable history; CRDTs provide convergent merge, with correctness proofs that are purely algebraic, no network model in the theorem statement [11].
Concrete coordination problems follow from this distribution: structural queries over task dependencies, temporal queries over decision propagation across sessions, and conflict detection for concurrent writes. Each requires a different computational primitive (graph traversal, event replay, merge semantics) that similarity search cannot provide.
For a solo developer on one repo, simpler approaches work. A curated set of markdown files committed to git gives you persistence and versioning with zero new tooling. We mean that honestly, and if you're not running concurrent sessions it's the right call. But the simple approaches stop working at specific thresholds. git blame can't distinguish sessions when agents share credentials, and giving each agent its own git identity doesn't scale; a team lead spawning four sub-agents that each spawn two more produces a commit history that's noise, and the identity management alone becomes a coordination problem. Without principled scoring, flat context files bloat until they degrade context quality. Graph databases like Neo4j handle typed edge traversal well, but add an external service dependency; relational databases can express it with recursive CTEs, but the queries get awkward fast. We chose JSONL event logs with CRDT merge semantics because they're plain files under version control: no external service to run, no schema migrations, works offline, and grep still works when the tooling breaks.
kli is a coordination layer for AI agent sessions built on event sourcing and conflict-free replicated data types. Sessions append events to shared logs, state is computed by replay, and concurrent writes converge without locks or coordination protocols. The broader approach, treating agent context as an evolving, structured collection rather than a static prompt, draws on recent work in agentic context engineering [2], and the handoff protocol between sessions builds on Horthy's practitioner framework for context engineering in coding agents [8]. Our contribution is recording observations as events in an append-only log, giving each session access to everything previous sessions learned, with CRDT merge semantics when sessions run concurrently. Embedding-based retrieval handles observation search and pattern activation; the event-sourced graph handles the structural, temporal, and coordination queries retrieval can't. One binary, one MCP server exposing 31 tools, seven lifecycle hooks. One external dependency: a local embedding model (ollama) for observation search and pattern retrieval. Everything else is self-contained: plain JSONL files under version control, next to your code.
Three coordination problems
Each has a distinct computational character, and none reduces to similarity search.
Structural queries. "Which tasks block this one? What are the leaf phases of the current plan?" These require walking a graph of typed edges (depends-on, phase-of, related-to), which is transitive reachability over a DAG. Metadata filters can match a single field; they can't follow a chain of depends-on edges three levels deep to find what's actually blocking your work.
Temporal queries. "What happened between the session crash and now? In what order did these decisions propagate across sessions?" These require causal ordering, not timestamp filtering alone. A metadata filter can give you "observations from the last hour," but it can't reconstruct which session's handoff informed which subsequent session's work. That's a property of the event sequence, not of any individual record.
Coordination queries. "Are two sessions about to conflict? Should this session claim exclusive access?" These require liveness detection and merge semantics: knowing which sessions are currently active, what files they're editing, and how to resolve concurrent writes to the same field. No retrieval system, however well-filtered, provides coordination.
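The structural case is the most mechanical to make concrete. Here is a minimal sketch of the traversal it needs: transitive reachability over depends-on edges, assuming edges are stored as (target edge-type) pairs and looked up through a caller-supplied function rather than kli's actual API.

;; Sketch only: "what transitively blocks this task?" as BFS over
;; :depends-on edges. EDGES-FN is a stand-in for the real edge lookup.
(defun blocking-tasks (task-id edges-fn)
  "All tasks reachable from TASK-ID by following :depends-on edges.
EDGES-FN maps a task id to a list of (target edge-type) pairs."
  (let ((seen (make-hash-table :test 'equal))
        (queue (list task-id))
        (result nil))
    (loop while queue do
      (let ((current (pop queue)))
        (dolist (edge (funcall edges-fn current))
          (destructuring-bind (target edge-type) edge
            (when (and (eq edge-type :depends-on)
                       (not (gethash target seen)))
              (setf (gethash target seen) t)
              (push target result)
              (push target queue))))))
    (nreverse result)))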
Halpern and Moses showed that when processes maintain private state, the knowledge hierarchy that collapses in shared-memory systems remains strict. The system is epistemically distributed regardless of physical topology [12]. Agent sessions satisfy this condition exactly: each carries a private context window that no other session can observe. The structure of information access, not the presence of a network, is what creates distribution. Distributed systems research has well-studied solutions for exactly this class of problem.
How it works
State from events
kli stores nothing in a database. Each task is a directory with a JSONL event log, stored by default under .kli/tasks/ in your git root, or wherever KLI_TASKS_DIR points. State is what you get when you replay the log. A fold:
;; The real task-state has 11 CRDT fields. Here's the core idea:
;; create an empty state, walk the events, apply each one.
(defstruct task-state
  (id "")
  (status "pending")
  (observations nil)
  (edges nil)
  (sessions nil))

(defun apply-event (state event)
  "Apply one event to state. Mutates in place."
  (let ((type (first event))
        (data (second event)))
    (case type
      (:task.create (setf (task-state-id state) (getf data :name))
                    (setf (task-state-status state) "active"))
      (:observation (push (getf data :text) (task-state-observations state)))
      (:session.join (pushnew (getf data :session) (task-state-sessions state)
                              :test #'equal))
      (:task.link (push (list (getf data :target) (getf data :edge-type))
                        (task-state-edges state)))
      (:task.update-status (setf (task-state-status state) (getf data :status)))))
  state)

(defun compute-state (events)
  "Fold events into state. No database."
  (let ((state (make-task-state)))
    (dolist (ev events state)
      (apply-event state ev))))

;; Seven events go in. State comes out.
(compute-state
 '((:task.create (:name "fix-auth-redirect"))
   (:session.join (:session "a1b2c3"))
   (:observation (:text "Auth module uses JWT with RS256 signing"))
   (:observation (:text "Rate limiter lives in middleware/rate-limit.ts"))
   (:task.link (:target "implement-rate-limiting" :edge-type :depends-on))
   (:session.join (:session "d4e5f6"))
   (:observation (:text "Found existing refresh logic in auth/refresh.ts"))))
Given an event log $E = [e_1, e_2, \ldots, e_n]$ and initial state $s_0$:
$$s_n = \text{fold}(\texttt{apply-event},\; s_0,\; E)$$
The real apply-task-event in lib/task/state.lisp handles 19 event types: task.create, session.join, observation, task.fork, task.link, task.sever, session.claim, tool.call, and more. Each field of the task-state struct is itself a CRDT: LWW-Registers for status and claims, G-Sets for observations and sessions, OR-Sets for edges and artifacts, LWW-Maps for metadata. But the principle is the fold. Create empty state, iterate, apply.
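To make the field-level CRDTs concrete, here are minimal sketches of the two simplest, a grow-only set and a last-writer-wins register. They illustrate the semantics only; they are not the implementations in lib/crdt/.

;; G-Set: grow-only set. You can only add; merge is set union.
;; Sketch for illustration, not the code in lib/crdt/.
(defstruct g-set (items nil))
(defun gs-add (gs item)
  (pushnew item (g-set-items gs) :test #'equal)
  gs)
(defun gs-merge (a b)
  (make-g-set :items (union (g-set-items a) (g-set-items b) :test #'equal)))

;; LWW-Register: a value plus the timestamp of its last write.
;; Merge keeps whichever write is newer.
(defstruct lww-register (value nil) (timestamp 0))
(defun lww-set (reg value timestamp)
  (when (> timestamp (lww-register-timestamp reg))
    (setf (lww-register-value reg) value
          (lww-register-timestamp reg) timestamp))
  reg)
(defun lww-merge (a b)
  (if (> (lww-register-timestamp b) (lww-register-timestamp a)) b a))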
In practice, this means task data is files under version control: git log is your task history, grep finds observations, and every event carries a timestamp and session ID. No schema migrations, no external service to keep running.
The event log has a second use. At the end of a task, the developer invokes a reflection command that replays the full event log (every tool call, every observation, every dead end) through an LLM agent that evaluates what happened against the patterns that were activated during the session. The LLM proposes new patterns ("prefer meta-tag CSRF injection over middleware modification when session middleware may be absent") and scores existing ones as helpful or harmful, with evidence attached. A human triggers it, and an LLM does the reasoning; there's no automatic extraction happening in the background. But because the event log carries enough structure to reconstruct why decisions were made and what their outcomes were, the reflection can be substantive rather than superficial. The resulting patterns enter a scored graph where future sessions retrieve them by semantic similarity, weighted by a Beta-Binomial model over accumulated feedback: a pattern with 16 helpful votes and 0 harmful has a posterior mean of 0.94, while one with 1 helpful and 2 harmful scores 0.40.
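Those two scores are consistent with a uniform Beta(1, 1) prior over pattern helpfulness (the exact prior is our assumption; the quoted numbers match it): with $h$ helpful and $k$ harmful votes, the posterior mean is

$$\mathbb{E}[\theta \mid h, k] = \frac{h + 1}{h + k + 2},$$

which gives $17/18 \approx 0.94$ for 16 helpful and 0 harmful, and $2/5 = 0.40$ for 1 helpful and 2 harmful.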
Convergent merge
Append-only isn't enough on its own. When two sessions write to the same task, their events need to merge. Consider a concrete scenario: session A removes a stale edge (this task no longer depends on fix-auth-redirect) while session B, working concurrently, adds a new edge to the same task. With a naive set, A's remove could kill B's add. Correct behavior: B's add survives, because A's remove should only affect what A knew about.
This is the problem Observed-Remove Sets solve [1]. Every add carries a unique tag. A remove only tombstones tags it has observed; tags from concurrent adds, not yet observed, survive. Here's the real OR-Set from kli's lib/crdt/or-set.lisp:
;; Observed-Remove Set: the actual CRDT from kli.
;; Each add carries a unique tag. Remove only tombstones observed tags.
;; Concurrent adds always survive concurrent removes.
(defstruct (or-set (:conc-name ors-))
  (elements (make-hash-table :test 'equal))    ; element → list of tags
  (tombstones (make-hash-table :test 'equal))) ; tag → t

(defun ors-add (ors element tag)
  "Add ELEMENT with unique TAG. Concurrent adds with different tags coexist."
  (pushnew tag (gethash element (ors-elements ors) nil) :test #'equal)
  ors)

(defun ors-remove (ors element)
  "Remove ELEMENT by tombstoning all *currently observed* tags.
Tags from concurrent adds — not yet seen — survive."
  (dolist (tag (gethash element (ors-elements ors)))
    (setf (gethash tag (ors-tombstones ors)) t))
  ors)

(defun ors-members (ors)
  "Elements with at least one live (non-tombstoned) tag."
  (let (result)
    (maphash (lambda (element tags)
               (when (some (lambda (tag)
                             (not (gethash tag (ors-tombstones ors))))
                           tags)
                 (push element result)))
             (ors-elements ors))
    result))

;; Session A adds an edge, session B adds a different edge.
;; Session A then removes its edge. B's edge survives.
(let ((edges (make-or-set)))
  (ors-add edges "fix-auth::depends-on" "session-A:evt-1")
  (ors-add edges "add-tests::phase-of" "session-B:evt-2")
  (ors-remove edges "fix-auth::depends-on")
  (ors-members edges))
The merge function for any state-based CRDT must be commutative, associative, and idempotent [1]. This is what gives you convergence without coordination protocols, without locks, without manual conflict resolution. Associativity gives you order-independent three-way merges. And idempotence means replaying a duplicate event (which happens whenever two sessions read overlapping segments of the same log) is harmless.
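The listing above omits the merge itself. A sketch of what it looks like for the OR-Set, written against the structs above rather than the code in lib/crdt/or-set.lisp: union the tag lists per element, union the tombstones. Both unions are commutative, associative, and idempotent, so the merged set is too.

;; Merge sketch for the OR-Set above (not the function in lib/crdt/):
;; union the observed tags per element, union the tombstones.
(defun ors-merge (a b)
  (let ((merged (make-or-set)))
    (dolist (ors (list a b))
      (maphash (lambda (element tags)
                 (dolist (tag tags)
                   (pushnew tag (gethash element (ors-elements merged) nil)
                            :test #'equal)))
               (ors-elements ors))
      (maphash (lambda (tag flag)
                 (declare (ignore flag))
                 (setf (gethash tag (ors-tombstones merged)) t))
               (ors-tombstones ors)))
    merged))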
Not every field needs the full OR-Set machinery. kli matches each field to the CRDT that fits its access pattern: G-Sets for observations and session lists (append-only; you never delete an observation), OR-Sets for edges and artifacts (where add-remove races are real), LWW-Registers for status and claims (last-writer-wins by timestamp), LWW-Maps for metadata, PN-Counters for pattern feedback scoring. Each event carries a vector clock [3] for future causal analysis, though the current implementation orders events by wall-clock timestamp. For co-located sessions this is sufficient, and the clock data means we can wire up causal ordering without replaying history when sessions eventually span machines.
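For when causal ordering does get wired up, the vector clock operations themselves are standard [3]. A sketch using alists keyed by session ID, which is not necessarily kli's on-disk representation:

;; Vector clock sketch: one counter per session, stored as an alist.
;; Representation is illustrative, not kli's event format.
(defun vc-tick (clock session)
  "Increment SESSION's counter in CLOCK, an alist of (session . count)."
  (let ((entry (assoc session clock :test #'equal)))
    (if entry
        (progn (incf (cdr entry)) clock)
        (acons session 1 clock))))

(defun vc-merge (a b)
  "Pointwise max of two clocks."
  (mapcar (lambda (s)
            (cons s (max (or (cdr (assoc s a :test #'equal)) 0)
                         (or (cdr (assoc s b :test #'equal)) 0))))
          (remove-duplicates (mapcar #'car (append a b)) :test #'equal)))

(defun vc-happened-before-p (a b)
  "True when A causally precedes B: A never exceeds B, and B is ahead somewhere."
  (and (every (lambda (entry)
                (<= (cdr entry)
                    (or (cdr (assoc (car entry) b :test #'equal)) 0)))
              a)
       (some (lambda (entry)
               (> (cdr entry)
                  (or (cdr (assoc (car entry) a :test #'equal)) 0)))
             b)))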
Coordination without messages
Most agent coordination systems use messaging: requests, responses, shared state through a broker. kli doesn't require it. Every tool call writes a trace event to the shared log: file path, session ID, timestamp. When a new session starts, the bootstrap process reads those traces and discovers what other sessions have done, with no explicit messaging between them. The information is in the environment.
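A sketch of the simplest check bootstrap can make from those traces, answering "who else touched this file?". The plist event shape is hypothetical, chosen for illustration rather than taken from kli's trace schema:

;; Sketch: group tool-call traces by file and flag files touched by
;; more than one session. The (:session ... :file ...) plist shape is
;; assumed for illustration, not kli's actual trace schema.
(defun file-overlaps (traces)
  "Return (file . sessions) pairs for files touched by multiple sessions."
  (let ((by-file (make-hash-table :test 'equal))
        (conflicts nil))
    (dolist (trace traces)
      (pushnew (getf trace :session)
               (gethash (getf trace :file) by-file nil)
               :test #'equal))
    (maphash (lambda (file sessions)
               (when (> (length sessions) 1)
                 (push (cons file sessions) conflicts)))
             by-file)
    conflicts))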
Heylighen formalizes this pattern as stigmergy: coordination through traces left in a shared medium, where the trace left by one agent's action stimulates subsequent action by another [13]. The term originates with Grassé's study of termite construction [4], and the computational variant is well-established in multi-agent systems research [7]. The deposit-and-respond cycle in kli (agents write events, other agents discover those events during bootstrap and adjust) satisfies the formal definition. Where kli extends biological stigmergy is the analysis layer. Termites respond to local chemical gradients with fixed behavioral rules. kli computes behavioral fingerprints across all session traces and presents similarity scores, what Ricci et al. call cognitive stigmergy: augmenting indirect coordination with computational artifacts that rational agents reason about, rather than merely react to [14].
The deposit-and-discover cycle (agents write events, other agents find them during bootstrap) is stigmergic in the strict sense: best-effort, emergent, no global awareness required. The TQ and PQ queries described below are a different thing: deliberate analysis over the full graph, extending the stigmergic substrate with formal structure.
In practice, the bootstrap process computes a behavioral fingerprint for each prior session: which tools it called and how often, which files it read versus edited, how its observations cluster in embedding space, and how close it is in the task graph to whatever you're working on now.
;; Behavioral fingerprinting: sessions are more than their file lists.
;; The real fingerprint computes a weighted similarity score across
;; five dimensions. The weights are empirically tuned — iterated from
;; an earlier 3-component version (0.3/0.3/0.4) — not derived from
;; first principles. Observation embeddings get the highest weight
;; because they're the semantically richest signal.
(defstruct fingerprint
  session-id
  tools       ; alist of (tool-name . call-count)
  files       ; alist of (file-path . touch-count)
  obs-vector  ; mean embedding of session's observations
  archetype   ; :builder or :observer
  graph-dist) ; BFS distance to current task in edge graph

;; Stubs for the real implementations (loaded from lib/swarm/).
(defun cosine-sim (a b) (declare (ignore a b)) 0.5)
(defun graph-proximity (a b) (declare (ignore a b)) 0.5)
(defun session-freshness (fp) (declare (ignore fp)) 0.5)

(defun fingerprint-similarity (a b)
  "Five-component weighted similarity between session behaviors."
  (+ (* 0.20 (cosine-sim (fingerprint-tools a) (fingerprint-tools b)))
     (* 0.20 (cosine-sim (fingerprint-files a) (fingerprint-files b)))
     (* 0.30 (if (and (fingerprint-obs-vector a) (fingerprint-obs-vector b))
                 (cosine-sim (fingerprint-obs-vector a) (fingerprint-obs-vector b))
                 0.0))
     (* 0.20 (graph-proximity (fingerprint-graph-dist a)
                              (fingerprint-graph-dist b)))
     (* 0.10 (session-freshness a))))

;; Similarity > 0.8 → "doing very similar work — coordinate"
;; Similarity > 0.5 → "related work — check before editing shared files"
;; Similarity < 0.3 → "some overlap in tooling"
Sessions get classified as :builder or :observer based on their tool-to-event ratio: builders make edits, observers mostly read. When two builders overlap on the same files, that's a conflict warning. An observer on the same task gets visibility. The system also detects information flows between sessions: when session B's bootstrap surfaces observations that session A recorded, that's a measurable signal that the coordination substrate is working.
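A sketch of that classification. The tool names treated as edits and the 0.2 ratio below are assumptions for illustration; the real classifier's cutoffs may differ:

;; Sketch: builder vs. observer from tool-call counts. The tool names
;; counted as edits and the 0.2 threshold are assumed values, not kli's.
(defun classify-archetype (tool-counts)
  "TOOL-COUNTS is an alist of (tool-name . call-count), as in the fingerprint struct."
  (let* ((total (reduce #'+ tool-counts :key #'cdr :initial-value 0))
         (edits (reduce #'+ tool-counts
                        :key (lambda (entry)
                               (if (member (car entry) '("Edit" "Write") :test #'equal)
                                   (cdr entry)
                                   0))
                        :initial-value 0)))
    (if (and (plusp total) (> (/ edits total) 0.2))
        :builder
        :observer)))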
Beyond file conflicts, kli uses these traces for orphan pickup. When a session crashes or disconnects, the next session that bootstraps the same task discovers the abandoned phases and can claim them. There's also find-missing-edges, which watches for sessions that keep jumping between two unlinked tasks: if an agent repeatedly transitions between A and B with no connecting edge in the graph, kli suggests adding one.
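A sketch of the missing-edge heuristic. The transition threshold of three is an assumed value; the real find-missing-edges may use different criteria:

;; Sketch: suggest linking two tasks when a session keeps bouncing
;; between them with no edge in the graph. Threshold is assumed.
(defun suggest-missing-edges (task-visits linked-p &key (threshold 3))
  "TASK-VISITS is the ordered list of task IDs a session touched.
LINKED-P tells whether two tasks already share an edge."
  (let ((transitions (make-hash-table :test 'equal))
        (suggestions nil))
    (loop for (a b) on task-visits
          while b
          unless (equal a b)
            do (incf (gethash (sort (list a b) #'string<) transitions 0)))
    (maphash (lambda (pair count)
               (when (and (>= count threshold)
                          (not (funcall linked-p (first pair) (second pair))))
                 (push pair suggestions)))
             transitions)
    suggestions))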
From sessions to teams
Everything above describes a single Claude Code instance. One session deposits traces, reads the environment, reacts. This is the degenerate case.
Claude Code's team feature spawns multiple agents: a team lead delegates to researchers, implementers, testers, all sharing a task list and communicating through direct messages. kli doesn't reimplement any of that. From kli's perspective, a team is a cluster of sessions that happen to share a team-name. When an agent joins a team, Claude Code sets environment variables. kli reads them on session start and emits a session.team-join event instead of a plain session.join:
;; From lib/task/state.lisp — team-join records membership as metadata.
;; A plain session.join only adds a session ID to the G-Set.
;; team-join also writes team identity into the LWW-Map:
;;   key:   "team:session-id"
;;   value: "team-name|agent-name|agent-type"
(defun handle-team-join (event-type data session)
  "Apply a team-join event. Returns the metadata entry it would write."
  (case event-type
    (:session.team-join
     ;; The real code calls (gs-add sessions session) and (lwwm-set metadata ...)
     ;; Here we return what the LWW-Map entry looks like:
     (let ((team-name (getf data :team-name))
           (agent-name (getf data :agent-name))
           (agent-type (getf data :agent-type)))
       (when team-name
         (list :key (format nil "team:~A" session)
               :value (format nil "~A|~A|~A" team-name
                              (or agent-name "") (or agent-type ""))))))))

;; An agent named "researcher" joins the auth-team:
(handle-team-join :session.team-join
                  '(:team-name "auth-team" :agent-name "researcher"
                    :agent-type "explorer")
                  "session-a1b2c3")
Session fingerprints carry the same identity: each one tracks the session's tools with frequency counts, files touched, and a mean observation embedding vector. When a new session bootstraps, the team-name field lets swarm awareness partition active sessions into teams and solo agents, flag cross-team file conflicts, and group same-team builders together.
We should be honest: this is implemented but not validated. The team-join events are emitted, the fingerprint fields are populated, the partitioning code exists. What we haven't done is run a real multi-team workflow end-to-end and confirmed that the coordination signals are actually useful, that a cross-team conflict warning arrives in time to prevent a real collision, that orphaned team work gets correctly attributed and picked up. Between teams, coordination is intended to work the same way as between solo sessions: indirect traces, no explicit handshake. Claude Code's built-in messaging handles communication within a team; kli doesn't touch that. A team working on one task would leave traces (file edits, observations, fingerprints) that another team on a related task discovers during bootstrap.
We conjecture that this scales to multiple teams across a shared task graph, each team a localized cluster of activity, coordination between them emerging from the shared environment rather than messages. Architecturally, the pieces are in place: fingerprints carry team identity, swarm awareness groups by team, file conflict detection crosses team boundaries. We have not tested any of this with real teams. The most concurrent sessions we've run is ten on a single monorepo, concurrent and aware of each other through fingerprints and swarm awareness, but not organized into named teams. Whether the team-aware coordination layer on top of that actually helps in practice is an open question that needs validation.
Querying the graph
The trade-off of event sourcing over a database: no SQL, no joins across tasks. We compensate with two purpose-built query languages that Claude uses through MCP tools, each designed for a specific graph.
TQ (Task Query) traverses the task graph. It's a pipeline language where each step transforms a set of nodes. (-> (active) :enrich (:sort :obs-count) (:take 5)) returns the five most-observed active tasks. Try it yourself on a synthetic project:
Task Query Playground
This is what Claude sees when it queries a task graph. The dataset below is a synthetic project — 18 tasks, 6 phases, typed edges. TQ is a pipeline language: start with a source, chain transformations.
PQ (Playbook Query) operates over the pattern graph, the scored patterns produced by the reflection workflow described above. Retrieval is a hybrid pipeline: semantic similarity against the query embedding first, then spreading activation over a co-application graph that boosts patterns frequently applied together in successful sessions. The activation parameters (decay rate, hop limit, typed relation weights) are empirically tuned, not derived from first principles. In practice, semantic similarity still dominates ranking; the co-application graph acts as a modest boost for neighbors rather than a primary retrieval mechanism. We expect the graph signal to become more useful as the co-application ledger accumulates data, but we haven't proven that yet:
;; Activate retrieves by embedding similarity, then walks the
;; co-application graph. Patterns frequently applied together
;; in successful sessions get boosted.
(-> (activate "file conflict detection" :boost (lisp))
    (:take 5))
Playbook Query Playground
Patterns are reusable conventions with effectiveness scores. PQ queries the pattern graph — filter by domain, sort by helpfulness, find what works. Patterns that hurt get demoted.
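Under the hood, activation is a two-step ranking: score every pattern by embedding similarity, then let each pattern pass a decayed share of its score to the patterns it has co-applied with. A sketch with a single hop and a 0.5 decay, both illustrative rather than kli's tuned parameters:

;; Sketch of hybrid retrieval: similarity first, then one hop of
;; spreading activation over the co-application graph. The single hop
;; and 0.5 decay are illustrative, not kli's tuned parameters.
(defun activate-patterns (query-vec patterns neighbors-fn &key (decay 0.5))
  "PATTERNS is a list of (id . embedding). NEIGHBORS-FN maps a pattern id
to the ids it has co-applied with. Returns (id . score), highest first."
  (let ((scores (make-hash-table :test 'equal)))
    ;; 1. Base score: similarity to the query embedding.
    (dolist (p patterns)
      (setf (gethash (car p) scores) (cosine-sim query-vec (cdr p))))
    ;; 2. Spread: each pattern boosts its co-application neighbors.
    (dolist (p patterns)
      (let ((base (gethash (car p) scores)))
        (dolist (neighbor (funcall neighbors-fn (car p)))
          (incf (gethash neighbor scores 0.0) (* decay base)))))
    (sort (loop for id being the hash-keys of scores
                  using (hash-value score)
                collect (cons id score))
          #'> :key #'cdr)))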
Both are S-expression pipelines: pick a starting set, then filter, sort, take, group, or mutate. They won't replace SQL for analytics; we're aware of that, and honestly for the access patterns agent coordination actually needs, we haven't missed it.
Worked example
The task graph below is kli's own development record, a worked example rather than independent validation. We show it because the data is real, the queries are reproducible, and the coordination patterns are concrete:
The stats above the graph are computed live from the event logs. Here is what coordination looked like in practice.
Session 1 found a critical bug in the SSE event encoding: format-sse-event, ws-send-json, and ws-broadcast-json all had an alist/plist mismatch that silently broke WebSocket communication. The session recorded the bug and its fix across three call sites, then exited with a handoff: "css-cleanup-done-sse-bug-fixed-lol-reactive-dedup-in-progress." Session 2 verified the fix end-to-end, session 3 persisted it to source, and session 4 integrated it into the blog you're reading. Without the task graph carrying observations forward, each new session would have spent its first tool calls re-discovering the same problem: WebSocket messages silently not arriving, the SSE stream looking correct in the logs, the real cause buried in a subtle alist-versus-plist mismatch that only shows up at runtime.
This example comes straight from the event logs. The coordination value here is structural: which session discovered what, in what order, and how that knowledge propagated to subsequent sessions through the task graph. Cosine similarity can't tell you any of that.
What standard tooling can't tell you
Git records what changed. GitHub Issues and PR comments capture some of what was discussed. CI logs show what was tested. None of them preserve the diagnostic chains, dead ends, and design rationale that connect a session's first hypothesis to its final commit. We asked a graph analyst agent, whose tools are task queries, observation search, and timeline traversal, to reconstruct how the interactive features were developed. Here is the transcript, edited for length but preserving real queries and results.
The agent starts by bootstrapping the task, loading its full computed state in one call:
> task_bootstrap("2026-02-11-explore-lol-reactive-blog-features")
Status: completed
Sessions: 4 | Observations: 27 | Edges: 2
phase-of → 2026-02-12-restructure-kli-launch-post
related-to → plan-kleisli-blog-docs-sites
Latest handoff: code-modal-readme-fix-blog-post-needs-revision
Departures: 4 sessions (all graceful)
Four sessions, 27 observations, a child task for the restructure. The agent doesn't read the event log directly; it queries the computed state, which is the fold of all 61 events into a CRDT. Now it searches for bugs by meaning, not by filename:
> obs_search("bug fix encoding mismatch")
[0.89] "JSON encoding bug affected 3 call sites — SSE event
formatting, WebSocket send, WebSocket broadcast. The JSON
library encodes alists as arrays of arrays, not objects.
Without this fix, no server-push functionality works."
[0.84] "HTMX CSRF: runtime never includes tokens in POST.
Middleware validates csrf-token on all non-GET requests.
Every hx-post call fails silently."
[0.79] "sandbox-eval reads only first form. Code blocks with
multiple definitions only evaluate the first."
Three framework-level bugs, ranked by semantic relevance to the query. None of them visible in the final code; the commits show the fixes, not the diagnostic chains that found them. The agent traces how knowledge moved between sessions:
> timeline(limit=12)
[Feb 11 20:10] handoff — "sse-bug-fixed, NOT yet in core copy"
[Feb 11 20:48] handoff — "dedup-complete, fix now canonical"
[Feb 12 11:35] session.join — new session inherits fixed code
[Feb 12 12:09] handoff — "csrf-bug-found, post-body-fix-done"
[Feb 12 12:30] handoff — "csrf-fix-done, meta-tag approach"
[Feb 12 13:24] handoff — "rethink launch post vision"
The first session discovered the encoding bug and flagged it as fixed in only one copy. That same session then consolidated the codebase, making the fix canonical. When the second session joined the next morning, it inherited correct code without needing to know about the bug at all. The structural dependency was already resolved. No coordination protocol. The trace left in the log was enough.
The agent peeks at the child task without switching context:
> task_get("2026-02-12-restructure-kli-launch-post")
Status: completed | Sessions: 1 | Observations: 8
parent: explore-lol-reactive-blog-features
Then searches for design rationale, the kind of reasoning that doesn't survive into source code:
> obs_search("design rationale approach chosen rejected")
[0.91] "CSRF fix: meta-tag approach chosen over modifying
middleware because it degrades gracefully when session
middleware is absent."
[0.87] "Explorer widget removed — it demonstrates HTMX
interactivity but has nothing to do with event sourcing
or agent coordination. Wrong demo for the post."
[0.82] "Expandable code blocks abandoned. CSS max-height
requires pre-existing height, creating a chicken-and-egg
problem. Pivoted to modal overlay."
Each of these is a dead end that led somewhere. The CSRF fix carries reasoning about graceful degradation that no commit message would include. The explorer widget was fully built, verified working, and then cut because it told the wrong story, a judgment call invisible in the diff, which shows the deletion and nothing else. The expandable code blocks were tried first and failed for a specific CSS reason; the commit history shows the modal overlay being added but nothing about what it replaced or why.
This structured trace is also what powers the reflection workflow described above. When a developer triggers reflection at the end of a task, the LLM replays the same event log the graph analyst queried (every tool call, every observation, every abandoned approach). Because the events preserve enough structure to reconstruct why decisions were made and what happened as a result, the reflection can trace the event sequence: session A chose the meta-tag approach for a specific reason, session B inherited the fix and never hit the bug, the approach worked across both contexts. That chain becomes a scored pattern. In future sessions, when a similar problem surfaces, the pattern activates through the hybrid retrieval pipeline: semantic similarity first, then the co-application graph boosts patterns that co-occurred with other effective patterns in similar task contexts. The event log is what makes the reflection substantive rather than generic: "prefer meta-tag CSRF injection" is a useful pattern because the event log carries the evidence for why, not the conclusion alone.
git log tells you what changed. The event log tells you what was tried, what failed, what was learned, and why the final design looks the way it does.
Trade-offs and limitations
No general-purpose queries. TQ and PQ handle the access patterns agent coordination needs and nothing more. They don't support arbitrary joins, aggregations, or ad-hoc analytics. If you need "all tasks where observation text mentions 'performance' grouped by author," you're writing a custom TQ extension or piping the JSONL through jq.
CRDT overhead for simple workflows. For a single developer running one session at a time, CRDTs are unnecessary machinery; you're paying for merge semantics you'll never exercise. Below two concurrent sessions, a plain text file works as well and is frankly easier to reason about.
Append-only growth. Event logs grow without compaction. We haven't implemented snapshotting or log truncation yet, and replay latency grows linearly with event count. Across 987 real tasks (SBCL 2.5.10, Linux x86-64):
| Operation | Scale | Time |
|---|---|---|
| compute-state (CRDT fold) | 200 events | < 1 µs (below timer resolution) |
| compute-state (CRDT fold) | 820 events (largest real task) | < 1 ms |
| elog-load (JSON from disk) | 200 events | 4 ms |
| elog-load (JSON from disk) | 820 events | 12 ms |
| elog-load (synthetic) | 5,000 events | 34 ms |
| compute-state (synthetic) | 5,000 events | 4 ms |
| OR-Set merge | 10k + 10k elements | 3 ms |
JSON deserialization dominates the fold by roughly 100x; the CRDT computation itself is not the bottleneck. We accumulate around 160 observations per day in our monorepo, which puts us near 60k in a year, comfortably within these numbers. Full benchmark results.
File-based storage. The system relies on the filesystem and git for synchronization. We've run it past 1,000 tasks in our own development without hitting a wall, but we don't know where the ceiling is. At some scale, filesystem-level coordination will stop being sufficient and you'd want a proper transport layer. We haven't needed to build one yet.
Last-writer-wins on scalar fields. LWW-Registers resolve conflicts deterministically but not always correctly. If session A marks a task "blocked" because it found a real issue and session B marks it "active" one second later, the later timestamp wins regardless of who had more context. For scalar metadata like status and claims, this is an acceptable trade-off. The alternative is multi-value registers or explicit conflict resolution, which add complexity we haven't needed. But it's a known limitation, not a feature.
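In terms of the LWW-Register sketch from the "State from events" section, the race resolves like this:

;; The blocked/active race above, using the lww-register sketch from
;; earlier. The later write wins on timestamp alone; context is ignored.
(lww-register-value
 (lww-merge (make-lww-register :value "blocked" :timestamp 1739300000)
            (make-lww-register :value "active"  :timestamp 1739300001)))
;; => "active"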
Vector clocks collected, not yet exercised. Each event carries a vector clock for future causal analysis, but the current implementation orders events by wall-clock timestamp. For co-located sessions sharing a filesystem, wall-clock ordering is sufficient. When sessions run on different machines with clock skew, it won't be. The vector clock data is being collected so we can wire up causal ordering without replaying history.
Prior art
Existing tools cluster into three categories, each optimizing for a different piece of the coordination problem.
Observation stores persist what agents learn and retrieve it by similarity. MemGPT [5] and its successor Letta treat context as a tiered storage problem: main context, archival storage, recall storage, with the LLM itself deciding when to page data between tiers. Mem0 and Zep add database-backed persistence (SQL with vector extensions, graph stores) and cross-session retrieval. These systems handle recall well, and recall is genuinely hard; RAG failure modes remain an active research area. But none provide coordination primitives. Two sessions using the same store have no way to detect they're working on the same file.
Graph-based workflow frameworks add structure. LangGraph represents agent workflows as state machines with typed edges and supports checkpoint-based persistence for crash recovery. Microsoft's AutoGen coordinates multi-agent conversations through a graph of message-passing agents. ChatDev assigns predefined roles (CEO, programmer, tester) that communicate through a shared scratchpad, giving agents a common context surface. These systems handle workflow orchestration well ("run agent A, then B, then C") but the workflow topology is defined at design time. When "investigate this bug" turns into "actually this is a design problem" turns into "fork a new task and hand off to a different session," a predetermined state machine can't capture the transition. The emergent structure of agent work, where the graph shape isn't known until the work is done, resists upfront specification.
Durable workflow engines like Temporal and Prefect solve persistence through event sourcing, but they assume you can define the workflow in advance: step one feeds into step two, retry on failure, alert on timeout. This is reasonable for CI pipelines and data processing, but agent sessions don't follow a predefined DAG. The graph shape emerges from the work.
kli occupies a different point in the design space: event-sourced state with CRDT merge semantics, optimized for concurrent sessions that don't know about each other in advance. The closest prior work is Kleppmann et al.'s local-first software manifesto [6], which advocates CRDTs as a foundational technology for collaborative applications that work offline and converge without central servers. Automerge and Yjs are production-grade implementations of this idea for collaborative document editing; kli applies similar convergence guarantees to task-level coordination state rather than document content. The stigmergic coordination model draws on Theraulaz and Bonabeau's synthesis of indirect communication in multi-agent systems [7].
kli is open source under MIT. Code at github.com/kleisli-io/kli.
curl -fsSL https://kli.kleisli.io/install | sh
kli init
If you've solved agent coordination differently (message-passing, shared databases, something we haven't considered) we'd like to hear what trade-offs you found. If you think CRDTs are overkill for this problem, you might be right; the trade-offs section above is our honest assessment of when simpler approaches win.
References
- M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski. "Conflict-Free Replicated Data Types." SSS 2011, LNCS 6976, pp. 386–400. Springer, 2011. ↩
- Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, K. Olukotun. "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models." arXiv:2510.04618, 2025. Accepted at ICLR 2026 (Poster). OpenReview: eC4ygDs02R. ↩
- C. J. Fidge. "Timestamps in Message-Passing Systems That Preserve the Partial Ordering." Proceedings of the 11th Australian Computer Science Conference, 1988. See also F. Mattern, "Virtual Time and Global States of Distributed Systems," 1988. ↩
- P.-P. Grassé. "La reconstruction du nid et les coordinations interindividuelles chez Bellicositermes natalensis et Cubitermes sp." Insectes Sociaux, vol. 6, pp. 41–80, 1959. ↩
- C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, I. Stoica, J. E. Gonzalez. "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560, 2023, revised 2024. ↩
- M. Kleppmann, A. Wiggins, P. van Hardenberg, M. McGranaghan. "Local-First Software: You Own Your Data, in Spite of the Cloud." Onward! 2019. ACM, pp. 154–178. ↩
- G. Theraulaz and E. Bonabeau. "A Brief History of Stigmergy." Artificial Life, vol. 5, no. 2, pp. 97–116, 1999. ↩
- D. Horthy. "Advanced Context Engineering for Coding Agents." HumanLayer, 2025 (technical guide). https://github.com/humanlayer/advanced-context-engineering-for-coding-agents ↩
- J. Waldo, G. Wyant, A. Wollrath, S. Kendall. "A Note on Distributed Computing." Sun Microsystems Laboratories, TR-94-29, 1994. Identifies four properties (latency, separate memory, concurrency, partial failure) that distinguish distributed from local computing. ↩
- L. Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, vol. 21, no. 7, pp. 558–565, 1978. "A single computer can also be viewed as a distributed system." ↩
- V. B. F. Gomes, M. Kleppmann, D. P. Mulligan, A. R. Beresford. "Verifying Strong Eventual Consistency in Distributed Systems." arXiv:1707.01747, 2017. Proves the abstract convergence theorem for CRDTs in Isabelle/HOL without a network model. ↩
- J. Y. Halpern, Y. Moses. "Knowledge and Common Knowledge in a Distributed Environment." Journal of the ACM, vol. 37, no. 3, pp. 549–587, 1990. arXiv:cs/0006009. Shows that private state creates epistemic distribution: the knowledge hierarchy collapses only when processes share memory. ↩
- F. Heylighen. "Stigmergy as a Universal Coordination Mechanism I: Definition and Components." Cognitive Systems Research, vol. 38, pp. 4–13, 2016. ↩
- A. Ricci, A. Omicini, M. Viroli, L. Gardelli, E. Oliva. "Cognitive Stigmergy: Towards a Framework Based on Agents and Artifacts." E4MAS 2006, LNCS 4389, pp. 124–140. Springer, 2007. ↩