KnowledgeRefinery/docs/data-model.md
oho 38a99476d6 Knowledge Refinery: local-first semantic search & 3D concept visualization
macOS app for corpus ingestion, semantic search, and concept universe
visualization powered by local LLMs via LM Studio.

Architecture:
- Go daemon (17MB single binary, zero dependencies)
  - chi router, pure-Go SQLite, tiktoken tokenizer
  - 6-stage pipeline: scan → extract → chunk → embed → annotate → conceptualize
  - Brute-force cosine vector search in memory
  - 89 tests across 8 packages
- SwiftUI app (macOS 15+)
  - Multi-workspace management with auto-start daemons
  - Live pipeline progress, search, concept browser
  - WebGPU 3D universe renderer with Canvas2D fallback
  - Custom crystal app icon
2026-02-13 18:09:46 +01:00


# Knowledge Refinery Data Model
## Entity Relationship
```
WatchedVolume
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (LanceDB)   │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         ConceptNode
         (hierarchical)
```
## Tables
### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` -> `extracted` -> `chunked` -> `embedded` -> `annotated` -> `conceptualized`
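
The status progression can be read as a simple ordered state list. The following is a minimal sketch, assuming statuses only ever move forward one step at a time; the `next_status` helper is hypothetical and not part of the daemon.

```python
from typing import Optional

# Status order as listed above; the helper below is illustrative only.
STATUSES = [
    "pending",
    "extracted",
    "chunked",
    "embedded",
    "annotated",
    "conceptualized",
]


def next_status(current: str) -> Optional[str]:
    """Return the status that follows `current`, or None at the final status."""
    idx = STATUSES.index(current)  # raises ValueError for an unknown status
    return STATUSES[idx + 1] if idx + 1 < len(STATUSES) else None
```
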
### content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary.
Each atom has an evidence_anchor linking to exact source location.
### chunks
Deterministic text segments (500-800 tokens). IDs are stable across re-processing.
Linked to LanceDB vectors via embedding_id.
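
This document does not specify how stable IDs are derived. One common way to get IDs that survive re-processing is to hash the chunk's position and content. The sketch below assumes that approach; the `chunk_id` function and its parameters are hypothetical.

```python
import hashlib


def chunk_id(asset_id: str, atom_index: int, chunk_index: int, text: str) -> str:
    """Derive an ID from the chunk's source position and text (assumed scheme)."""
    # A unit separator keeps the fields unambiguous when joined.
    payload = "\x1f".join([asset_id, str(atom_index), str(chunk_index), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]
```

With a scheme like this, re-running the pipeline on unchanged input yields the same IDs, so existing vectors and annotations still resolve.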
### annotations
LLM-generated structured metadata per chunk. Annotations are **never overwritten**: a new
annotation is inserted with `is_current=1` and the previous one is marked `is_current=0`.
Versioned by model_id + prompt_id + prompt_version.
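
The append-only behaviour can be sketched with the standard `sqlite3` module. The column set here is an assumption built from the fields named above (`payload` and the integer `id` are hypothetical), and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotations (
        id INTEGER PRIMARY KEY,
        chunk_id TEXT NOT NULL,
        model_id TEXT NOT NULL,
        prompt_id TEXT NOT NULL,
        prompt_version TEXT NOT NULL,
        payload TEXT NOT NULL,
        is_current INTEGER NOT NULL DEFAULT 1
    )
    """
)


def add_annotation(conn, chunk_id, model_id, prompt_id, prompt_version, payload):
    """Insert a new current annotation and retire the previous one."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "UPDATE annotations SET is_current = 0 "
            "WHERE chunk_id = ? AND is_current = 1",
            (chunk_id,),
        )
        conn.execute(
            "INSERT INTO annotations "
            "(chunk_id, model_id, prompt_id, prompt_version, payload, is_current) "
            "VALUES (?, ?, ?, ?, ?, 1)",
            (chunk_id, model_id, prompt_id, prompt_version, payload),
        )


add_annotation(conn, "chunk-1", "model-a", "summarize", "1", '{"topics": ["x"]}')
add_annotation(conn, "chunk-1", "model-b", "summarize", "2", '{"topics": ["y"]}')
rows = conn.execute(
    "SELECT model_id, is_current FROM annotations ORDER BY id"
).fetchall()
```

Running both statements in one transaction avoids a window in which a chunk has no current annotation or two of them.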
### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 = macro concepts, higher levels = finer granularity.
### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.
### pipeline_jobs
Crash recovery: tracks job state so processing resumes after restart.
## Live Progress State (M8, In-Memory)
During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:
### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:
```json
{
  "scan": {"status": "done", "progress_pct": 100},
  "extract": {"status": "running", "progress_pct": 72},
  "chunk": {"status": "pending", "progress_pct": 0},
  "embed": {"status": "pending", "progress_pct": 0},
  "annotate": {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}
```
Each stage transitions through `pending` -> `running` -> `done`.
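
A structure of this shape could be maintained roughly as follows. This is a sketch only: the helper names and the done/total bookkeeping are assumptions, not the daemon's actual code.

```python
STAGES = ["scan", "extract", "chunk", "embed", "annotate", "conceptualize"]


def new_live_progress() -> dict:
    """Every stage starts as pending at 0 percent."""
    return {stage: {"status": "pending", "progress_pct": 0} for stage in STAGES}


def update_stage(live: dict, stage: str, done: int, total: int) -> None:
    """Record progress for one stage; a stage with no work counts as done."""
    pct = 100 if total == 0 else min(100, done * 100 // total)
    status = "done" if pct >= 100 else "running"
    live[stage] = {"status": status, "progress_pct": pct}


live = new_live_progress()
update_stage(live, "scan", 47, 47)
update_stage(live, "extract", 36, 50)
```
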
### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:
```json
{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}
```
The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.
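
A bounded log of this kind can be sketched with `collections.deque`, which discards the oldest entry once `maxlen` is reached. The `ActivityLog` class and its method names are hypothetical; only the sizes (200 stored, 50 returned) and the entry shape come from the description above.

```python
from collections import deque
from datetime import datetime, timezone


class ActivityLog:
    """Fixed-size event log; the oldest entries fall off automatically."""

    def __init__(self, capacity: int = 200):
        self._entries = deque(maxlen=capacity)

    def add(self, message: str, now: datetime = None) -> None:
        stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
        self._entries.append({"timestamp": stamp, "message": message})

    def recent(self, limit: int = 50) -> list:
        """Return up to `limit` of the newest entries, oldest first."""
        if limit <= 0:
            return []
        return list(self._entries)[-limit:]

    def reset(self) -> None:
        """Clear the log, as done at the start of a new pipeline run."""
        self._entries.clear()
```
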
### Enriched Status Counters
The `/ingest/status` response includes running totals updated as each stage completes:
| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |
## Evidence Anchors
Every derived artifact links back to source via JSON evidence anchors:
```json
{
  "asset_id": "abc123...",
  "page": 5,
  "bbox": [100, 200, 400, 250],
  "chapter": "Introduction",
  "offset": 1024,
  "archive_chain": "docs.zip::papers/paper.pdf::page=5",
  "line_start": 42,
  "line_end": 58
}
```
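
An anchor like the one above can be modelled as a small record type. The sketch below assumes that only `asset_id` is required and that the other fields are filled in only when they apply to the source format; that optionality is an assumption, not something this document states.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvidenceAnchor:
    asset_id: str
    page: Optional[int] = None
    bbox: Optional[List[int]] = None
    chapter: Optional[str] = None
    offset: Optional[int] = None
    archive_chain: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None

    def to_json(self) -> str:
        """Serialize, omitting fields that are unset."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})

    @classmethod
    def from_json(cls, raw: str) -> "EvidenceAnchor":
        return cls(**json.loads(raw))


anchor = EvidenceAnchor(asset_id="abc123", page=5, line_start=42, line_end=58)
restored = EvidenceAnchor.from_json(anchor.to_json())
```
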
## Vector Schema (LanceDB)
| Field | Type | Description |
|-------|------|-------------|
| id | string | Matches chunks.id |
| vector | float32[] | Embedding vector |
| text | string | Chunk text |
| asset_id | string | Source file |
| asset_path | string | File path |
| evidence_anchor | string | JSON anchor |
| topics | string | Comma-separated topics |
| atom_type | string | text/image/etc |
| pipeline_version | string | Version tag |
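
Records with this shape can be searched by cosine similarity. The following is a minimal brute-force sketch over in-memory records, written in plain Python for illustration; the `search` function is hypothetical and is not the LanceDB query API.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(records, query, k=5):
    """Score every record against the query and return the top k (score, id) pairs."""
    scored = [(cosine(r["vector"], query), r["id"]) for r in records]
    scored.sort(reverse=True)
    return scored[:k]


records = [
    {"id": "a", "vector": [1.0, 0.0]},
    {"id": "b", "vector": [0.0, 1.0]},
    {"id": "c", "vector": [1.0, 1.0]},
]
results = search(records, [1.0, 0.0], k=2)
```

Because `id` matches `chunks.id`, each hit can be joined back to its chunk row and evidence anchor.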