KnowledgeRefinery/docs/data-model.md
oho 9dfb9ff684 Update all documentation for Go daemon rewrite
All docs, README, and presentation now reflect the Go daemon architecture:
Python/FastAPI/LanceDB/PyMuPDF references replaced with Go/chi/SQLite/pdftotext.
Updated test counts (97), model names (qwen3-4b-2507), app bundle structure,
installer steps, and tech stack tables.
2026-02-13 19:29:23 +01:00

3.7 KiB

Knowledge Refinery Data Model

Entity Relationship

WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (SQLite)    │
    │    768-dim BLOB       │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)

Tables (SQLite)

file_assets

Tracks every file in watched volumes. Status progresses through: pendingextractedchunkedembeddedannotatedconceptualized

content_atoms

Raw content extracted from files. Types: text, image, table, metadata, binary. Each atom has an evidence_anchor linking to exact source location.

chunks

Deterministic text segments (500-800 tokens). IDs are stable across re-processing. Linked to vectors in chunk_vectors table via chunk ID.

chunk_vectors

Embedding vectors stored as binary BLOBs (768 x float32 = 3072 bytes per vector). Loaded into memory at startup for brute-force cosine similarity search.

Field Type Description
id TEXT PRIMARY KEY Matches chunks.id
vector BLOB 768-dim float32 embedding
text TEXT Chunk text
asset_id TEXT Source file
asset_path TEXT File path
evidence_anchor TEXT JSON anchor
pipeline_version TEXT Version tag

annotations

LLM-generated structured metadata per chunk. Never overwritten - new annotations are added with is_current=1 and previous ones marked is_current=0. Versioned by model_id + prompt_id + prompt_version.

concept_nodes

Hierarchical concept clusters derived from embedding similarity. Level 0 = macro concepts, higher levels = finer granularity.

graph_edges

Typed, weighted edges: similarity, concept membership, co-occurrence. Each edge stores evidence references back to source chunks.

pipeline_jobs

Crash recovery: tracks job state so processing resumes after restart.

Live Progress State (In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

Live Progress Dict

Per-stage status object returned in the live field of /ingest/status:

{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}

Activity Log Ring Buffer

A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries.

Enriched Status Counters

Counter Description
chunk_count Total chunks produced so far
annotation_count Total annotations generated
concept_count Total concept nodes created
edge_count Total graph edges created

Evidence Anchors

Every derived artifact links back to source via JSON evidence anchors:

{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}