Knowledge Refinery Data Model
Entity Relationship
WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (SQLite)    │
    │    768-dim BLOB       │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
             │
             ▼
         ConceptNode
         (hierarchical)
Tables (SQLite)
file_assets
Tracks every file in watched volumes. Status progresses through:
pending → extracted → chunked → embedded → annotated → conceptualized
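The status progression can be sketched as an ordered list with a lookup for the next stage. This is a minimal illustration, not the daemon's code: the `Status` type and the `Next` function are hypothetical names, and only the six status values come from the text above.

```go
package main

import "fmt"

// Status is a hypothetical type for the file_assets status column.
type Status string

// stages lists the statuses in the order given in the document.
var stages = []Status{
	"pending", "extracted", "chunked", "embedded", "annotated", "conceptualized",
}

// Next returns the status that follows s. It reports false when s is the
// final status or is not one of the known statuses.
func Next(s Status) (Status, bool) {
	for i, st := range stages {
		if st == s && i+1 < len(stages) {
			return stages[i+1], true
		}
	}
	return "", false
}

func main() {
	// Walk a file from pending to the final status, printing each step.
	s := Status("pending")
	for {
		fmt.Println(s)
		next, ok := Next(s)
		if !ok {
			break
		}
		s = next
	}
}
```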
content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary. Each atom has an evidence_anchor linking to its exact source location.
chunks
Deterministic text segments of 500-800 tokens. Chunk IDs are stable across re-processing, and each chunk is linked to its vector in the chunk_vectors table by chunk ID.
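One way to get IDs that stay stable across re-processing is to derive them from the chunk's inputs rather than from a counter or timestamp. The sketch below assumes a SHA-256 hash over the asset ID, chunk index, and chunk text, truncated to 16 hex characters. The document does not say how the daemon actually derives IDs, so the scheme and the `ChunkID` name are illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ChunkID derives an ID from the source asset, the chunk's position, and
// its text. The same inputs always yield the same ID.
// Assumption: the hash inputs and the 16-character truncation are
// illustrative, not taken from the daemon.
func ChunkID(assetID string, index int, text string) string {
	h := sha256.New()
	// NUL separators keep ("ab", "c") distinct from ("a", "bc").
	fmt.Fprintf(h, "%s\x00%d\x00%s", assetID, index, text)
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	first := ChunkID("asset-1", 0, "Deterministic chunking example.")
	again := ChunkID("asset-1", 0, "Deterministic chunking example.")
	fmt.Println(first == again) // re-processing the same chunk gives the same ID
}
```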
chunk_vectors
Embedding vectors stored as binary BLOBs (768 x float32 = 3072 bytes per vector). Loaded into memory at startup for brute-force cosine similarity search.
| Field | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | Matches chunks.id |
| vector | BLOB | 768-dim float32 embedding |
| text | TEXT | Chunk text |
| asset_id | TEXT | Source file |
| asset_path | TEXT | File path |
| evidence_anchor | TEXT | JSON anchor |
| pipeline_version | TEXT | Version tag |
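The BLOB layout and the brute-force search can be sketched as follows. This is a minimal illustration under stated assumptions: little-endian byte order is assumed (the document gives only the size), and `EncodeVector`, `DecodeVector`, `Cosine`, and `Nearest` are hypothetical names rather than the daemon's functions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Dim is the embedding dimension stated in the document.
const Dim = 768

// EncodeVector packs a vector into a BLOB, 4 bytes per component.
// Assumption: little-endian byte order.
func EncodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(x))
	}
	return buf
}

// DecodeVector reverses EncodeVector.
func DecodeVector(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(b))
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}

// Cosine returns the cosine similarity of a and b. It returns 0 when the
// lengths differ or either vector has zero norm.
func Cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Nearest scans every in-memory vector and returns the ID with the highest
// cosine similarity to the query, along with that similarity.
func Nearest(query []float32, store map[string][]float32) (string, float64) {
	bestID, best := "", math.Inf(-1)
	for id, v := range store {
		if s := Cosine(query, v); s > best {
			bestID, best = id, s
		}
	}
	return bestID, best
}

func main() {
	v := make([]float32, Dim)
	v[0] = 1
	fmt.Println(len(EncodeVector(v))) // 768 components x 4 bytes each
}
```

A linear scan costs one dot product per stored vector for every query, which is simple and exact but grows with the number of chunks.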
annotations
LLM-generated structured metadata per chunk. Annotations are never overwritten: each new annotation is added with is_current=1, and the previous ones are marked is_current=0. Annotations are versioned by model_id + prompt_id + prompt_version.
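The append-only rule can be illustrated with an in-memory sketch. The `Annotation` struct and `Append` function are hypothetical, and the `Body` field is an assumption. Only `is_current` and the three version fields come from the text above, and the real daemon would apply the same rule to SQLite rows.

```go
package main

import "fmt"

// Annotation mirrors the versioning fields named in the document.
// Body is a hypothetical placeholder for the structured metadata.
type Annotation struct {
	ChunkID       string
	ModelID       string
	PromptID      string
	PromptVersion string
	Body          string
	IsCurrent     bool
}

// Append adds a as the current annotation for its chunk. Older annotations
// for the same chunk are kept and only have their IsCurrent flag cleared.
func Append(rows []Annotation, a Annotation) []Annotation {
	for i := range rows {
		if rows[i].ChunkID == a.ChunkID {
			rows[i].IsCurrent = false
		}
	}
	a.IsCurrent = true
	return append(rows, a)
}

func main() {
	var rows []Annotation
	rows = Append(rows, Annotation{ChunkID: "c1", ModelID: "m", PromptID: "p", PromptVersion: "1"})
	rows = Append(rows, Annotation{ChunkID: "c1", ModelID: "m", PromptID: "p", PromptVersion: "2"})
	for _, r := range rows {
		fmt.Println(r.PromptVersion, r.IsCurrent)
	}
}
```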
concept_nodes
Hierarchical concept clusters derived from embedding similarity. Level 0 = macro concepts, higher levels = finer granularity.
graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence. Each edge stores evidence references back to source chunks.
pipeline_jobs
Crash recovery: tracks job state so processing resumes after restart.
Live Progress State (In-Memory)
During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:
Live Progress Dict
Per-stage status object returned in the live field of /ingest/status:
{
"scan": {"status": "done", "progress_pct": 100},
"extract": {"status": "running", "progress_pct": 72},
"chunk": {"status": "pending", "progress_pct": 0},
"embed": {"status": "pending", "progress_pct": 0},
"annotate": {"status": "pending", "progress_pct": 0},
"conceptualize": {"status": "pending", "progress_pct": 0}
}
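A client could decode this object into a map keyed by stage name. The sketch below is an assumption-laden illustration: `StageProgress`, `LiveProgress`, `ParseLive`, and `Running` are hypothetical names, and `progress_pct` is treated as an integer because the example shows only whole numbers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StageProgress matches one entry of the live progress object.
// Assumption: progress_pct is always a whole number.
type StageProgress struct {
	Status      string `json:"status"`
	ProgressPct int    `json:"progress_pct"`
}

// LiveProgress maps a stage name to its progress.
type LiveProgress map[string]StageProgress

// stageOrder lists the stages in the order shown in the example.
var stageOrder = []string{"scan", "extract", "chunk", "embed", "annotate", "conceptualize"}

// ParseLive decodes the JSON value of the live field.
func ParseLive(data []byte) (LiveProgress, error) {
	var lp LiveProgress
	err := json.Unmarshal(data, &lp)
	return lp, err
}

// Running returns the first stage, in pipeline order, whose status is
// "running". It reports false if no stage is running.
func Running(lp LiveProgress) (string, bool) {
	for _, name := range stageOrder {
		if lp[name].Status == "running" {
			return name, true
		}
	}
	return "", false
}

func main() {
	raw := []byte(`{"scan":{"status":"done","progress_pct":100},
		"extract":{"status":"running","progress_pct":72}}`)
	lp, err := ParseLive(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if name, ok := Running(lp); ok {
		fmt.Println(name, lp[name].ProgressPct)
	}
}
```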
Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries.
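A fixed-size ring buffer of this kind can be sketched in a few lines. The `Ring` type and its methods are hypothetical names, and entries are plain strings for simplicity. The capacity of 200 and the window of 50 come from the text above, and the sketch assumes a capacity greater than zero.

```go
package main

import "fmt"

// Ring is a fixed-capacity circular buffer of log entries. Once full,
// each new entry overwrites the oldest one.
type Ring struct {
	buf   []string
	next  int // index where the next entry will be written
	count int // number of valid entries, at most len(buf)
}

// NewRing allocates a ring with the given capacity, which must be > 0.
func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]string, capacity)}
}

// Add records an entry, overwriting the oldest entry when the ring is full.
func (r *Ring) Add(entry string) {
	r.buf[r.next] = entry
	r.next = (r.next + 1) % len(r.buf)
	if r.count < len(r.buf) {
		r.count++
	}
}

// Recent returns up to n of the newest entries, oldest first.
func (r *Ring) Recent(n int) []string {
	if n > r.count {
		n = r.count
	}
	if n < 0 {
		n = 0
	}
	out := make([]string, 0, n)
	start := (r.next - n + len(r.buf)) % len(r.buf)
	for i := 0; i < n; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	return out
}

func main() {
	log := NewRing(200)
	for i := 0; i < 250; i++ {
		log.Add(fmt.Sprintf("event %d", i))
	}
	recent := log.Recent(50)
	fmt.Println(len(recent), recent[0], recent[len(recent)-1])
}
```

Memory use stays constant no matter how long the pipeline runs, at the cost of discarding events older than the buffer capacity.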
Enriched Status Counters
| Counter | Description |
|---|---|
| chunk_count | Total chunks produced so far |
| annotation_count | Total annotations generated |
| concept_count | Total concept nodes created |
| edge_count | Total graph edges created |
Evidence Anchors
Every derived artifact links back to source via JSON evidence anchors:
{
"asset_id": "abc123...",
"page": 5,
"bbox": [100, 200, 400, 250],
"chapter": "Introduction",
"offset": 1024,
"archive_chain": "docs.zip::papers/paper.pdf::page=5",
"line_start": 42,
"line_end": 58
}
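The anchor maps naturally onto a Go struct with JSON tags. In the sketch below the field names follow the example above, but the `EvidenceAnchor` type, the `ParseAnchor` function, and the choice of which fields are optional (`omitempty`) are assumptions, since the document does not say which keys appear for which file types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EvidenceAnchor mirrors the JSON anchor shown in the document.
// Assumption: every field except asset_id is optional.
type EvidenceAnchor struct {
	AssetID      string    `json:"asset_id"`
	Page         int       `json:"page,omitempty"`
	BBox         []float64 `json:"bbox,omitempty"`
	Chapter      string    `json:"chapter,omitempty"`
	Offset       int       `json:"offset,omitempty"`
	ArchiveChain string    `json:"archive_chain,omitempty"`
	LineStart    int       `json:"line_start,omitempty"`
	LineEnd      int       `json:"line_end,omitempty"`
}

// ParseAnchor decodes an anchor stored as JSON text, for example the
// evidence_anchor column of chunk_vectors.
func ParseAnchor(s string) (EvidenceAnchor, error) {
	var a EvidenceAnchor
	err := json.Unmarshal([]byte(s), &a)
	return a, err
}

func main() {
	a, err := ParseAnchor(`{"asset_id": "abc123...", "page": 5, "line_start": 42, "line_end": 58}`)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("asset %s, page %d, lines %d-%d\n", a.AssetID, a.Page, a.LineStart, a.LineEnd)
}
```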