KnowledgeRefinery/docs/data-model.md

# Knowledge Refinery Data Model

## Entity Relationship

```
WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (LanceDB)   │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)
```

## Tables

### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`

### content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary.
Each atom has an evidence_anchor linking to exact source location.

### chunks
Deterministic text segments (500-800 tokens). IDs are stable across re-processing.
Linked to LanceDB vectors via embedding_id.

### annotations
LLM-generated structured metadata per chunk. **Never overwritten** - new annotations
are added with `is_current=1` and previous ones marked `is_current=0`.
Versioned by model_id + prompt_id + prompt_version.

### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 = macro concepts, higher levels = finer granularity.

### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.

### pipeline_jobs
Crash recovery: tracks job state so processing resumes after restart.

## Live Progress State (M8, In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:

```json
{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}
```

Each stage transitions through `pending` -> `running` -> `done`.

### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:

```json
{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}
```

The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.

### Enriched Status Counters
The `/ingest/status` response includes running totals updated as each stage completes:

| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |

## Evidence Anchors

Every derived artifact links back to source via JSON evidence anchors:

```json
{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}
```

## Vector Schema (LanceDB)

| Field | Type | Description |
|-------|------|-------------|
| id | string | Matches chunks.id |
| vector | float32[] | Embedding vector |
| text | string | Chunk text |
| asset_id | string | Source file |
| asset_path | string | File path |
| evidence_anchor | string | JSON anchor |
| topics | string | Comma-separated topics |
| atom_type | string | text/image/etc |
| pipeline_version | string | Version tag |
Knowledge Refinery: local-first semantic search & 3D concept visualization macOS app for corpus ingestion, semantic search, and concept universe visualization powered by local LLMs via LM Studio. Architecture: - Go daemon (17MB single binary, zero dependencies) - chi router, pure-Go SQLite, tiktoken tokenizer - 6-stage pipeline: scan → extract → chunk → embed → annotate → conceptualize - Brute-force cosine vector search in memory - 89 tests across 8 packages - SwiftUI app (macOS 15+) - Multi-workspace management with auto-start daemons - Live pipeline progress, search, concept browser - WebGPU 3D universe renderer with Canvas2D fallback - Custom crystal app icon 2026-02-13 17:09:46 +00:00			`# Knowledge Refinery Data Model`

			`## Entity Relationship`

			```
			`WatchedVolume`
			`│`
			`▼ (contains files)`
			`FileAsset ──────────────────┐`
			`│ │`
			`▼ (extracted into) │`
			`ContentAtom │`
			`│ │`
			`▼ (split into) │`
			`Chunk ──────────────────────┤`
			`│ │`
			`├──▶ Vector (LanceDB) │`
			`│ │`
			`├──▶ Annotation │`
			`│ (versioned) │`
			`│ │`
			`└──▶ GraphEdge ◀────────┘`
			`│`
			`▼`
			`ConceptNode`
			`(hierarchical)`
			```

			`## Tables`

			`### file_assets`
			`Tracks every file in watched volumes. Status progresses through:`
			`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`

			`### content_atoms`
			`Raw content extracted from files. Types: text, image, table, metadata, binary.`
			`Each atom has an evidence_anchor linking to exact source location.`

			`### chunks`
			`Deterministic text segments (500-800 tokens). IDs are stable across re-processing.`
			`Linked to LanceDB vectors via embedding_id.`

			`### annotations`
			`LLM-generated structured metadata per chunk. Never overwritten - new annotations`
			are added with `is_current=1` and previous ones marked `is_current=0`.
			`Versioned by model_id + prompt_id + prompt_version.`

			`### concept_nodes`
			`Hierarchical concept clusters derived from embedding similarity.`
			`Level 0 = macro concepts, higher levels = finer granularity.`

			`### graph_edges`
			`Typed, weighted edges: similarity, concept membership, co-occurrence.`
			`Each edge stores evidence references back to source chunks.`

			`### pipeline_jobs`
			`Crash recovery: tracks job state so processing resumes after restart.`

			`## Live Progress State (M8, In-Memory)`

			`During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:`

			`### Live Progress Dict`
			Per-stage status object returned in the `live` field of `/ingest/status`:

			```json
			`{`
			`"scan": {"status": "done", "progress_pct": 100},`
			`"extract": {"status": "running", "progress_pct": 72},`
			`"chunk": {"status": "pending", "progress_pct": 0},`
			`"embed": {"status": "pending", "progress_pct": 0},`
			`"annotate": {"status": "pending", "progress_pct": 0},`
			`"conceptualize": {"status": "pending", "progress_pct": 0}`
			`}`
			```

			Each stage transitions through `pending` -> `running` -> `done`.

			`### Activity Log Ring Buffer`
			`A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:`

			```json
			`{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}`
			```

			`The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.`

			`### Enriched Status Counters`
			The `/ingest/status` response includes running totals updated as each stage completes:

			`\| Counter \| Description \|`
			`\|---------\|-------------\|`
			\| `chunk_count` \| Total chunks produced so far \|
			\| `annotation_count` \| Total annotations generated \|
			\| `concept_count` \| Total concept nodes created \|
			\| `edge_count` \| Total graph edges created \|

			`## Evidence Anchors`

			`Every derived artifact links back to source via JSON evidence anchors:`

			```json
			`{`
			`"asset_id": "abc123...",`
			`"page": 5,`
			`"bbox": [100, 200, 400, 250],`
			`"chapter": "Introduction",`
			`"offset": 1024,`
			`"archive_chain": "docs.zip::papers/paper.pdf::page=5",`
			`"line_start": 42,`
			`"line_end": 58`
			`}`
			```

			`## Vector Schema (LanceDB)`

			`\| Field \| Type \| Description \|`
			`\|-------\|------\|-------------\|`
			`\| id \| string \| Matches chunks.id \|`
			`\| vector \| float32[] \| Embedding vector \|`
			`\| text \| string \| Chunk text \|`
			`\| asset_id \| string \| Source file \|`
			`\| asset_path \| string \| File path \|`
			`\| evidence_anchor \| string \| JSON anchor \|`
			`\| topics \| string \| Comma-separated topics \|`
			`\| atom_type \| string \| text/image/etc \|`
			`\| pipeline_version \| string \| Version tag \|`