KnowledgeRefinery/docs/data-model.md

# Knowledge Refinery Data Model

## Entity Relationship

```
WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (SQLite)    │
    │    768-dim BLOB       │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)
```

## Tables (SQLite)

### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`
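
The progression above is linear, so it can be modelled as an ordered list of states. The following is a minimal sketch only: the type name `AssetStatus`, the constants, and `NextStatus` are hypothetical and not taken from the daemon's source.

```go
package main

import "fmt"

// AssetStatus is a hypothetical Go type for the file_assets status column.
type AssetStatus string

const (
	StatusPending        AssetStatus = "pending"
	StatusExtracted      AssetStatus = "extracted"
	StatusChunked        AssetStatus = "chunked"
	StatusEmbedded       AssetStatus = "embedded"
	StatusAnnotated      AssetStatus = "annotated"
	StatusConceptualized AssetStatus = "conceptualized"
)

// statusOrder lists the statuses in the order documented above.
var statusOrder = []AssetStatus{
	StatusPending, StatusExtracted, StatusChunked,
	StatusEmbedded, StatusAnnotated, StatusConceptualized,
}

// NextStatus returns the status that follows s. It returns an error
// if s is unknown or is already the final status.
func NextStatus(s AssetStatus) (AssetStatus, error) {
	for i, cur := range statusOrder {
		if cur != s {
			continue
		}
		if i == len(statusOrder)-1 {
			return "", fmt.Errorf("status %q is final", s)
		}
		return statusOrder[i+1], nil
	}
	return "", fmt.Errorf("unknown status %q", s)
}
```

Keeping the order in a single slice means a stage can only ever advance a file by one step, which makes skipped stages easy to detect.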

### content_atoms
Raw content extracted from files. Atom types: `text`, `image`, `table`, `metadata`, `binary`.
Each atom carries an `evidence_anchor` that links it to its exact source location.

### chunks
Deterministic text segments of 500-800 tokens each. Chunk IDs are stable across re-processing.
Each chunk is linked to its vector in the `chunk_vectors` table by chunk ID.
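
This document does not say how stable IDs are derived. One common way to get IDs that survive re-processing is to hash the inputs that define a chunk. The sketch below assumes the ID depends on the asset ID, the chunk's position within the asset, and the chunk text; the function name `ChunkID`, the choice of SHA-256, and the 32-character truncation are all illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
)

// ChunkID derives a deterministic ID from the asset ID, the chunk's
// index within that asset, and the chunk text. This scheme is an
// assumption for illustration, not the daemon's documented behaviour.
func ChunkID(assetID string, index int, text string) string {
	h := sha256.New()
	h.Write([]byte(assetID))
	h.Write([]byte{0}) // separator so field boundaries stay unambiguous
	h.Write([]byte(strconv.Itoa(index)))
	h.Write([]byte{0})
	h.Write([]byte(text))
	// Keep the first 32 hex characters (128 bits) as a compact ID.
	return hex.EncodeToString(h.Sum(nil))[:32]
}
```

Because the ID is a pure function of its inputs, re-running the chunker over an unchanged file yields the same IDs, so existing rows in `chunk_vectors` still match.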

### chunk_vectors
Embedding vectors are stored as binary BLOBs (768 × float32 = 3,072 bytes per vector).
All vectors are loaded into memory at startup for brute-force cosine similarity search.

| Field | Type | Description |
|-------|------|-------------|
| id | TEXT PRIMARY KEY | Matches chunks.id |
| vector | BLOB | 768-dim float32 embedding |
| text | TEXT | Chunk text |
| asset_id | TEXT | Source file |
| asset_path | TEXT | File path |
| evidence_anchor | TEXT | JSON anchor |
| pipeline_version | TEXT | Version tag |
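
The byte arithmetic above (768 × 4 bytes) and the similarity measure can be sketched as follows. The byte order of the stored BLOB is not stated in this document, so little-endian is an assumption here, and the names `EncodeVector`, `DecodeVector`, and `Cosine` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Dim is the embedding dimension given in this document.
const Dim = 768

// EncodeVector packs float32 values into a BLOB, 4 bytes per value.
// Little-endian byte order is an assumption.
func EncodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// DecodeVector reverses EncodeVector. It rejects BLOBs whose length
// is not a multiple of 4 bytes.
func DecodeVector(blob []byte) ([]float32, error) {
	if len(blob)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(blob))
	}
	v := make([]float32, len(blob)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[4*i:]))
	}
	return v, nil
}

// Cosine returns the cosine similarity of a and b. It returns 0 when
// the lengths differ or when either vector has zero norm.
func Cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

A brute-force search then amounts to calling `Cosine` between the query vector and every decoded vector held in memory, and keeping the highest scores.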

### annotations
LLM-generated structured metadata per chunk. **Never overwritten**: each new annotation
is added with `is_current=1`, and the previous ones are marked `is_current=0`.
Versioned by `model_id` + `prompt_id` + `prompt_version`.
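
The append-only rule can be illustrated with an in-memory sketch. The `Annotation` struct, its `ChunkID` and `Payload` fields, and both helper functions are hypothetical. The sketch also assumes that "current" is tracked per chunk, which this document does not state explicitly.

```go
package main

// Annotation mirrors the versioning columns named above. ChunkID and
// Payload are assumed fields added for illustration.
type Annotation struct {
	ChunkID       string
	ModelID       string
	PromptID      string
	PromptVersion string
	Payload       string
	IsCurrent     bool
}

// AddAnnotation appends a as the current annotation for its chunk and
// clears the is_current flag on earlier annotations for that chunk.
// Existing rows are never removed and their content is never rewritten.
func AddAnnotation(store []Annotation, a Annotation) []Annotation {
	for i := range store {
		if store[i].ChunkID == a.ChunkID {
			store[i].IsCurrent = false
		}
	}
	a.IsCurrent = true
	return append(store, a)
}

// Current returns the current annotation for chunkID, if one exists.
func Current(store []Annotation, chunkID string) (Annotation, bool) {
	for _, a := range store {
		if a.ChunkID == chunkID && a.IsCurrent {
			return a, true
		}
	}
	return Annotation{}, false
}
```

In SQLite the same two steps would typically be an `UPDATE` followed by an `INSERT` inside one transaction, so readers never observe a chunk with zero or two current annotations.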

### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 holds macro concepts; higher levels hold progressively finer-grained concepts.

### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.

### pipeline_jobs
Supports crash recovery by tracking job state so that processing resumes after a restart.

## Live Progress State (In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:

```json
{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}
```
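
A client could decode the payload above with types along these lines. This is a sketch under stated assumptions: `StageProgress`, `LiveProgress`, `ParseLive`, and `CurrentStage` are hypothetical names, and `progress_pct` is assumed to be an integer as in the sample.

```go
package main

import "encoding/json"

// StageProgress matches one per-stage object in the sample payload.
type StageProgress struct {
	Status      string `json:"status"`
	ProgressPct int    `json:"progress_pct"`
}

// LiveProgress maps a stage name to its progress.
type LiveProgress map[string]StageProgress

// stageOrder follows the key order of the sample payload.
var stageOrder = []string{"scan", "extract", "chunk", "embed", "annotate", "conceptualize"}

// ParseLive decodes the value of the `live` field.
func ParseLive(data []byte) (LiveProgress, error) {
	var lp LiveProgress
	if err := json.Unmarshal(data, &lp); err != nil {
		return nil, err
	}
	return lp, nil
}

// CurrentStage returns the first stage, in pipeline order, whose
// status is not "done". It returns "" when every stage is done.
func CurrentStage(lp LiveProgress) string {
	for _, s := range stageOrder {
		if lp[s].Status != "done" {
			return s
		}
	}
	return ""
}
```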

### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries.
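
A ring buffer of this kind can be sketched as below. The type and method names are hypothetical, entries are assumed to be plain strings, and the capacity is assumed to be greater than zero. The values 200 and 50 come from the description above.

```go
package main

// ActivityLog is a fixed-size ring buffer of log entries.
type ActivityLog struct {
	entries []string
	next    int  // index where the next entry will be written
	full    bool // true once the buffer has wrapped at least once
}

// NewActivityLog creates a log holding at most capacity entries.
// capacity must be greater than zero.
func NewActivityLog(capacity int) *ActivityLog {
	return &ActivityLog{entries: make([]string, capacity)}
}

// Add records an entry, overwriting the oldest one when full.
func (l *ActivityLog) Add(entry string) {
	l.entries[l.next] = entry
	l.next = (l.next + 1) % len(l.entries)
	if l.next == 0 {
		l.full = true
	}
}

// Len returns the number of entries currently stored.
func (l *ActivityLog) Len() int {
	if l.full {
		return len(l.entries)
	}
	return l.next
}

// Recent returns up to n of the most recent entries, oldest first.
func (l *ActivityLog) Recent(n int) []string {
	if n < 0 {
		n = 0
	}
	if size := l.Len(); n > size {
		n = size
	}
	out := make([]string, 0, n)
	for i := n; i > 0; i-- {
		idx := (l.next - i + len(l.entries)) % len(l.entries)
		out = append(out, l.entries[idx])
	}
	return out
}
```

With `NewActivityLog(200)`, a status handler would call `Recent(50)` to build its response. Memory use stays bounded no matter how long the pipeline runs.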

### Enriched Status Counters

| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |

## Evidence Anchors

Every derived artifact links back to its source via a JSON evidence anchor:

```json
{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}
```
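
An anchor like the one above could be decoded into a Go struct as sketched here. The struct and function names are hypothetical. The sketch assumes that every field except `asset_id` is optional, since not every source type has pages or line numbers, and that `archive_chain` segments are separated by `::` as the sample suggests.

```go
package main

import (
	"encoding/json"
	"strings"
)

// EvidenceAnchor mirrors the sample JSON. Pointer fields distinguish
// "absent" from a zero value; treating them as optional is an assumption.
type EvidenceAnchor struct {
	AssetID      string    `json:"asset_id"`
	Page         *int      `json:"page,omitempty"`
	BBox         []float64 `json:"bbox,omitempty"`
	Chapter      string    `json:"chapter,omitempty"`
	Offset       *int      `json:"offset,omitempty"`
	ArchiveChain string    `json:"archive_chain,omitempty"`
	LineStart    *int      `json:"line_start,omitempty"`
	LineEnd      *int      `json:"line_end,omitempty"`
}

// ParseAnchor decodes a JSON evidence anchor, such as the value
// stored in an evidence_anchor TEXT column.
func ParseAnchor(s string) (EvidenceAnchor, error) {
	var a EvidenceAnchor
	err := json.Unmarshal([]byte(s), &a)
	return a, err
}

// ChainParts splits archive_chain on "::". The separator is inferred
// from the sample value and is an assumption.
func (a EvidenceAnchor) ChainParts() []string {
	if a.ArchiveChain == "" {
		return nil
	}
	return strings.Split(a.ArchiveChain, "::")
}
```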