KnowledgeRefinery/docs/data-model.md
oho 38a99476d6 Knowledge Refinery: local-first semantic search & 3D concept visualization
2026-02-13 18:09:46 +01:00

Knowledge Refinery Data Model

Entity Relationship

WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (LanceDB)   │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)

Tables

file_assets

Tracks every file in watched volumes. Status progresses through: pending → extracted → chunked → embedded → annotated → conceptualized
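
The status progression for file_assets can be sketched as an ordered list of states. This is a minimal sketch, not the daemon's actual code: the type name AssetStatus and the helper NextStatus are hypothetical, and only the six status values come from this document.

```go
package main

import "fmt"

// AssetStatus mirrors the status values listed for file_assets.
// The type itself is a hypothetical name used for illustration.
type AssetStatus string

const (
	StatusPending        AssetStatus = "pending"
	StatusExtracted      AssetStatus = "extracted"
	StatusChunked        AssetStatus = "chunked"
	StatusEmbedded       AssetStatus = "embedded"
	StatusAnnotated      AssetStatus = "annotated"
	StatusConceptualized AssetStatus = "conceptualized"
)

// statusOrder lists the statuses in the order a file moves through them.
var statusOrder = []AssetStatus{
	StatusPending,
	StatusExtracted,
	StatusChunked,
	StatusEmbedded,
	StatusAnnotated,
	StatusConceptualized,
}

// NextStatus returns the status that follows s.
// It reports false when s is the final status or is not a known status.
func NextStatus(s AssetStatus) (AssetStatus, bool) {
	for i, cur := range statusOrder {
		if cur == s && i+1 < len(statusOrder) {
			return statusOrder[i+1], true
		}
	}
	return "", false
}

func main() {
	// Walk a file through every status in order.
	s := StatusPending
	for {
		fmt.Println(s)
		next, ok := NextStatus(s)
		if !ok {
			break
		}
		s = next
	}
}
```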

content_atoms

Raw content extracted from files. Types: text, image, table, metadata, binary. Each atom has an evidence_anchor linking to its exact source location.

chunks

Deterministic text segments (500-800 tokens). IDs are stable across re-processing. Linked to LanceDB vectors via embedding_id.
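
One way to get IDs that are stable across re-processing is to derive them purely from the chunk's inputs. The sketch below assumes a SHA-256 hash over the asset ID, atom index, chunk index, and text. This scheme and the function name ChunkID are assumptions for illustration; the document does not say how the daemon actually computes chunk IDs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ChunkID derives an ID from the chunk's inputs only, so re-running the
// pipeline on unchanged content yields the same ID.
// Hypothetical scheme: the real ID derivation is not documented here.
func ChunkID(assetID string, atomIndex, chunkIndex int, text string) string {
	h := sha256.New()
	// NUL separators keep adjacent fields from running into each other.
	fmt.Fprintf(h, "%s\x00%d\x00%d\x00%s", assetID, atomIndex, chunkIndex, text)
	// Keep the first 32 hex characters (128 bits) as a compact ID.
	return hex.EncodeToString(h.Sum(nil))[:32]
}

func main() {
	first := ChunkID("abc123", 0, 3, "Some chunk text")
	second := ChunkID("abc123", 0, 3, "Some chunk text")
	fmt.Println(first == second) // same inputs, same ID
}
```

Because no timestamp or random value enters the hash, the ID changes only when the content or its position changes.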

annotations

LLM-generated structured metadata per chunk. Annotations are never overwritten: each new annotation is inserted with is_current=1 and the previous ones are marked is_current=0. Versions are identified by model_id + prompt_id + prompt_version.
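
The append-only versioning rule can be illustrated with a small in-memory sketch. It assumes that "current" is scoped per chunk and that prompt_version is a string; the types Annotation and AnnotationStore are hypothetical stand-ins for the SQLite table, not the daemon's code.

```go
package main

import "fmt"

// Annotation is a hypothetical in-memory stand-in for an annotations row.
type Annotation struct {
	ChunkID       string
	ModelID       string
	PromptID      string
	PromptVersion string
	Payload       string
	IsCurrent     bool
}

// AnnotationStore keeps every annotation ever added; rows are never removed.
type AnnotationStore struct {
	rows []Annotation
}

// Add clears the current flag on earlier annotations of the same chunk,
// then appends a as the new current annotation.
func (s *AnnotationStore) Add(a Annotation) {
	for i := range s.rows {
		if s.rows[i].ChunkID == a.ChunkID {
			s.rows[i].IsCurrent = false
		}
	}
	a.IsCurrent = true
	s.rows = append(s.rows, a)
}

// Current returns the current annotation for chunkID, if there is one.
func (s *AnnotationStore) Current(chunkID string) (Annotation, bool) {
	for _, r := range s.rows {
		if r.ChunkID == chunkID && r.IsCurrent {
			return r, true
		}
	}
	return Annotation{}, false
}

func main() {
	var store AnnotationStore
	store.Add(Annotation{ChunkID: "c1", ModelID: "model-a", PromptID: "summarize", PromptVersion: "1"})
	store.Add(Annotation{ChunkID: "c1", ModelID: "model-b", PromptID: "summarize", PromptVersion: "2"})

	cur, _ := store.Current("c1")
	fmt.Println(len(store.rows), cur.ModelID) // both rows kept; model-b is current
}
```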

concept_nodes

Hierarchical concept clusters derived from embedding similarity. Level 0 = macro concepts, higher levels = finer granularity.
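
To make the level convention concrete, the sketch below models a hierarchy in which each node points at its parent and level 0 nodes have no parent. The struct fields and the helper PathToRoot are assumptions for illustration; the actual concept_nodes columns are not listed in this document.

```go
package main

import "fmt"

// ConceptNode is a hypothetical shape for a concept_nodes row.
type ConceptNode struct {
	ID       string
	ParentID string // empty for level 0 (macro) concepts, by assumption
	Level    int
	Label    string
}

// PathToRoot returns the labels from the macro concept down to the node id.
// The length guard stops the walk if the parent links ever form a cycle.
func PathToRoot(nodes map[string]ConceptNode, id string) []string {
	var path []string
	for id != "" && len(path) <= len(nodes) {
		n, ok := nodes[id]
		if !ok {
			break
		}
		path = append([]string{n.Label}, path...)
		id = n.ParentID
	}
	return path
}

func main() {
	nodes := map[string]ConceptNode{
		"a": {ID: "a", Level: 0, Label: "Science"},
		"b": {ID: "b", ParentID: "a", Level: 1, Label: "Physics"},
		"c": {ID: "c", ParentID: "b", Level: 2, Label: "Optics"},
	}
	fmt.Println(PathToRoot(nodes, "c")) // [Science Physics Optics]
}
```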

graph_edges

Typed, weighted edges: similarity, concept membership, co-occurrence. Each edge stores evidence references back to source chunks.

pipeline_jobs

Crash recovery: tracks job state so processing resumes after restart.
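
A resume step of this kind might look like the sketch below. The job states (pending, running, done), the Job struct, and the Recover function are all assumptions made for illustration; this document does not list the pipeline_jobs columns or states.

```go
package main

import "fmt"

// JobState values are hypothetical; the real pipeline_jobs states are not documented here.
type JobState string

const (
	JobPending JobState = "pending"
	JobRunning JobState = "running"
	JobDone    JobState = "done"
)

// Job is a hypothetical shape for a pipeline_jobs row.
type Job struct {
	ID    int
	Stage string
	State JobState
}

// Recover returns the jobs to re-queue after a restart. Finished jobs are
// skipped. Jobs that were running when the process died are reset to
// pending, since their work may be incomplete.
func Recover(jobs []Job) []Job {
	var resume []Job
	for _, j := range jobs {
		if j.State == JobDone {
			continue
		}
		j.State = JobPending // j is a copy, so the input slice is not modified
		resume = append(resume, j)
	}
	return resume
}

func main() {
	persisted := []Job{
		{ID: 1, Stage: "scan", State: JobDone},
		{ID: 2, Stage: "extract", State: JobRunning},
		{ID: 3, Stage: "chunk", State: JobPending},
	}
	for _, j := range Recover(persisted) {
		fmt.Println(j.ID, j.Stage, j.State)
	}
}
```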

Live Progress State (M8, In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

Live Progress Dict

Per-stage status object returned in the live field of /ingest/status:

{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}

Each stage transitions through pending -> running -> done.
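
The live object shown above maps naturally onto a Go map of small structs. This is a sketch under assumptions: the type StageProgress and the helpers NewLiveProgress and Encode are hypothetical names, and progress_pct is treated as an integer because the example values are whole numbers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StageProgress matches one per-stage entry of the live field.
type StageProgress struct {
	Status      string `json:"status"`
	ProgressPct int    `json:"progress_pct"`
}

// stages lists the six pipeline stages in execution order.
var stages = []string{"scan", "extract", "chunk", "embed", "annotate", "conceptualize"}

// NewLiveProgress returns a progress map with every stage set to pending.
func NewLiveProgress() map[string]StageProgress {
	live := make(map[string]StageProgress, len(stages))
	for _, s := range stages {
		live[s] = StageProgress{Status: "pending"}
	}
	return live
}

// Encode serializes the progress map to JSON.
// encoding/json sorts map keys alphabetically, so the key order in the
// output differs from pipeline order.
func Encode(live map[string]StageProgress) (string, error) {
	b, err := json.Marshal(live)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	live := NewLiveProgress()
	live["scan"] = StageProgress{Status: "done", ProgressPct: 100}
	live["extract"] = StageProgress{Status: "running", ProgressPct: 72}

	out, err := Encode(live)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```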

Activity Log Ring Buffer

A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:

{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}

The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.
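
A fixed-size ring buffer with these properties can be sketched as follows. The type and method names are hypothetical, and returning the recent entries oldest-first is an assumption; the document specifies only the capacity (200), the number of entries returned (50), and the reset behavior.

```go
package main

import "fmt"

// LogEntry matches the activity log entry shape shown above.
type LogEntry struct {
	Timestamp string `json:"timestamp"`
	Message   string `json:"message"`
}

// ActivityLog is a fixed-capacity circular buffer of log entries.
type ActivityLog struct {
	entries []LogEntry
	next    int // index where the next entry will be written
	count   int // number of valid entries, at most len(entries)
}

// NewActivityLog allocates a log that holds at most capacity entries.
func NewActivityLog(capacity int) *ActivityLog {
	return &ActivityLog{entries: make([]LogEntry, capacity)}
}

// Add stores e, overwriting the oldest entry once the buffer is full.
func (l *ActivityLog) Add(e LogEntry) {
	l.entries[l.next] = e
	l.next = (l.next + 1) % len(l.entries)
	if l.count < len(l.entries) {
		l.count++
	}
}

// Recent returns up to n of the most recent entries, oldest first.
func (l *ActivityLog) Recent(n int) []LogEntry {
	if n > l.count {
		n = l.count
	}
	if n < 0 {
		n = 0
	}
	out := make([]LogEntry, 0, n)
	start := (l.next - n + len(l.entries)) % len(l.entries)
	for i := 0; i < n; i++ {
		out = append(out, l.entries[(start+i)%len(l.entries)])
	}
	return out
}

// Reset empties the log, as done at the start of each pipeline run.
func (l *ActivityLog) Reset() {
	l.next, l.count = 0, 0
}

func main() {
	log := NewActivityLog(200)
	for i := 0; i < 250; i++ {
		log.Add(LogEntry{Timestamp: "2026-02-12T10:30:03Z", Message: fmt.Sprintf("event %d", i)})
	}
	recent := log.Recent(50)
	fmt.Println(len(recent), recent[0].Message, recent[len(recent)-1].Message)
	// prints: 50 event 200 event 249
}
```

Memory use stays constant regardless of how many events a run produces, because old entries are overwritten in place.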

Enriched Status Counters

The /ingest/status response includes running totals updated as each stage completes:

| Counter | Description |
|---|---|
| chunk_count | Total chunks produced so far |
| annotation_count | Total annotations generated |
| concept_count | Total concept nodes created |
| edge_count | Total graph edges created |

Evidence Anchors

Every derived artifact links back to source via JSON evidence anchors:

{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}
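
The anchor above could be decoded into a Go struct along these lines. This is a sketch under assumptions: the struct and function names are hypothetical, every field except asset_id is treated as optional (pointer or omitempty) because different source types would plausibly fill different fields, and bbox is read as four numbers without assuming their coordinate order.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EvidenceAnchor mirrors the JSON anchor fields shown above.
// Optional numeric fields are pointers so "absent" differs from zero.
type EvidenceAnchor struct {
	AssetID      string    `json:"asset_id"`
	Page         *int      `json:"page,omitempty"`
	BBox         []float64 `json:"bbox,omitempty"`
	Chapter      string    `json:"chapter,omitempty"`
	Offset       *int      `json:"offset,omitempty"`
	ArchiveChain string    `json:"archive_chain,omitempty"`
	LineStart    *int      `json:"line_start,omitempty"`
	LineEnd      *int      `json:"line_end,omitempty"`
}

// ParseAnchor decodes a JSON evidence anchor string.
func ParseAnchor(raw string) (EvidenceAnchor, error) {
	var a EvidenceAnchor
	err := json.Unmarshal([]byte(raw), &a)
	return a, err
}

func main() {
	raw := `{"asset_id": "abc123", "page": 5, "bbox": [100, 200, 400, 250], "chapter": "Introduction"}`
	a, err := ParseAnchor(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(a.AssetID, *a.Page, a.BBox, a.LineStart == nil)
}
```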

Vector Schema (LanceDB)

| Field | Type | Description |
|---|---|---|
| id | string | Matches chunks.id |
| vector | float32[] | Embedding vector |
| text | string | Chunk text |
| asset_id | string | Source file |
| asset_path | string | File path |
| evidence_anchor | string | JSON anchor |
| topics | string | Comma-separated topics |
| atom_type | string | text/image/etc |
| pipeline_version | string | Version tag |
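
For reference, the schema above could be mirrored in Go as shown below, with a small helper that splits the comma-separated topics field. The struct name VectorRecord and the method TopicList are hypothetical; this is a sketch of the record shape only and does not use any LanceDB API.

```go
package main

import (
	"fmt"
	"strings"
)

// VectorRecord mirrors the vector schema fields listed above.
type VectorRecord struct {
	ID              string    // matches chunks.id
	Vector          []float32 // embedding vector
	Text            string
	AssetID         string
	AssetPath       string
	EvidenceAnchor  string // JSON anchor, stored as a string
	Topics          string // comma-separated topics
	AtomType        string
	PipelineVersion string
}

// TopicList splits the comma-separated Topics field into trimmed,
// non-empty topic names.
func (r VectorRecord) TopicList() []string {
	var out []string
	for _, t := range strings.Split(r.Topics, ",") {
		if t = strings.TrimSpace(t); t != "" {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	r := VectorRecord{ID: "chunk-1", Topics: "search, embeddings, sqlite"}
	fmt.Println(r.TopicList()) // [search embeddings sqlite]
}
```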