KnowledgeRefinery/docs/architecture.md
oho 38a99476d6 Knowledge Refinery: local-first semantic search & 3D concept visualization
macOS app for corpus ingestion, semantic search, and concept universe
visualization powered by local LLMs via LM Studio.

Architecture:
- Go daemon (17MB single binary, zero dependencies)
  - chi router, pure-Go SQLite, tiktoken tokenizer
  - 6-stage pipeline: scan → extract → chunk → embed → annotate → conceptualize
  - Brute-force cosine vector search in memory
  - 89 tests across 8 packages
- SwiftUI app (macOS 15+)
  - Multi-workspace management with auto-start daemons
  - Live pipeline progress, search, concept browser
  - WebGPU 3D universe renderer with Canvas2D fallback
  - Custom crystal app icon
2026-02-13 18:09:46 +01:00


# Knowledge Refinery Architecture
## Overview
Knowledge Refinery is a local-first macOS application that ingests heterogeneous document corpora, extracts structured knowledge using local LLMs, and provides semantic search and visualization.
## System Components
```
┌──────────────────────────────────┐
│        SwiftUI macOS App         │
│  ┌────────┬────────────┐         │
│  │ Search │ Evidence   │         │
│  │ View   │ Panel (QL) │         │
│  ├────────┼────────────┤         │
│  │Pipeline│ Volume     │         │
│  │Progress│ Manager    │         │
│  │ Panel  │            │         │
│  └────────┴────────────┘         │
│       │ HTTP (localhost)         │
│       │  ┌──────────────────┐    │
│       │  │ 1.5s poll loop   │    │
│       │  │ /ingest/status   │◀─┐ │
│       │  └──────────────────┘  │ │
│       │   auto-stop on done    │ │
│       │                        │ │
│       │  ┌──────────────────┐  │ │
│       │  │ 5s universe      │  │ │
│       │  │ auto-refresh     │──┘ │
│       │  └──────────────────┘    │
└───────┼──────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│     Refinery Daemon (Python)     │
│   Per-workspace on independent   │
│         port + data dir          │
│     ┌──────────────────────┐     │
│     │    FastAPI Server    │     │
│     └──────────┬───────────┘     │
│                ▼                 │
│     ┌──────────────────────┐     │
│     │       Pipeline       │     │
│     │     Orchestrator     │     │
│     └──┬──┬──┬──┬──┬──┬────┘     │
│        │  │  │  │  │  │          │
│        ▼  ▼  ▼  ▼  ▼  ▼          │
│     Scan Extract Chunk Embed     │
│      Annotate Conceptualize      │
│                │                 │
│                ▼                 │
│     ┌──────────────────────┐     │
│     │ Live Progress Dict   │     │
│     │ + Activity Log Ring  │     │
│     │ (200-entry buf)      │     │
│     └──────────────────────┘     │
│                                  │
│     ┌─────────┬───────────┐      │
│     │ SQLite  │ LanceDB   │      │
│     │ (meta)  │ (vectors) │      │
│     └─────────┴───────────┘      │
└───────┼──────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│            LM Studio             │
│         (127.0.0.1:1234)         │
│        Embeddings + Chat         │
└──────────────────────────────────┘
```
## Pipeline Stages
1. **Scan** - Walk directories, compute content hashes, detect changes
2. **Extract** - Produce ContentAtoms with evidence anchors (PDF pages, text lines, etc.)
3. **Chunk** - Deterministic text splitting (500-800 tokens per chunk, 50-token overlap between consecutive chunks)
4. **Embed** - Generate vector embeddings via LM Studio
5. **Annotate** - Structured LLM annotation (topics, entities, claims, sentiment)
6. **Conceptualize** - Build similarity graph and concept clusters
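
The Chunk stage's overlapping windows can be illustrated with a minimal sketch. It assumes the text has already been tokenized into a list (the tokenizer is not named here), and the function name `chunk_tokens` and its parameters are hypothetical. It enforces only the 800-token upper bound and the 50-token overlap; the 500-token lower bound is not modeled, so the final chunk may be shorter.

```python
def chunk_tokens(tokens, max_tokens=800, overlap=50):
    """Split a token list into windows of at most `max_tokens` tokens.

    Each window after the first starts `overlap` tokens before the end of
    the previous one, so the same input always yields the same chunks.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

Because the split depends only on the token sequence and the two parameters, re-running it on unchanged input reproduces the same chunk boundaries, which is the property the deterministic chunk IDs rely on.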
## Data Flow
Files → FileAsset → ContentAtom → Chunk → Embedding (LanceDB) + Annotation (SQLite) → ConceptNode + GraphEdge
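
The last hop, from embeddings to graph edges, might look roughly like the sketch below. It is an illustration under assumptions, not the daemon's actual code: the function names, the tuple shape of an edge, and the 0.8 similarity threshold are all hypothetical, and it compares every pair in plain Python instead of querying LanceDB.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_edges(embeddings, threshold=0.8):
    """Return (chunk_id_a, chunk_id_b, similarity) for each pair of chunks
    whose cosine similarity is at or above `threshold` (an assumed cutoff).

    `embeddings` maps chunk ID -> vector; pairs are visited in sorted ID
    order so the output is stable across runs.
    """
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity >= threshold:
            edges.append((id_a, id_b, similarity))
    return edges
```

The all-pairs loop is quadratic in the number of chunks, so it only suits small corpora; a larger one would more plausibly use nearest-neighbour queries against the vector store.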
## Live Progress Data Flow (M8)
During pipeline execution, the daemon maintains in-memory state that the app polls:
```
Pipeline Orchestrator
  ├──▶ live progress dict (per-stage status: pending/running/done)
  │      stage_name, progress_pct, item_count
  ├──▶ counters: chunk_count, annotation_count, concept_count, edge_count
  └──▶ activity_log ring buffer (200 entries, last 50 returned via API)
         timestamp + message per event

SwiftUI App polling loop (1.5s interval):
  GET /ingest/status ──▶ stages, counters, activity_log
    ├── Pipeline Progress Panel: checkmarks + progress bars per stage
    ├── Animated counters: chunks, vectors, annotations, concepts, edges
    ├── Interaction indicators: App↔Daemon, Daemon↔LM Studio
    ├── Auto-scrolling activity log
    └── Auto-stop polling when pipeline status = idle/done

Universe auto-refresh (5s timer during ingestion):
  GET /universe/snapshot ──▶ mergeUniverse() for incremental node injection
```
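
As a rough illustration of the daemon-side state described above, the sketch below keeps per-stage progress, counters, and a bounded activity log in memory, and builds the payload that `GET /ingest/status` could return. The class name `LiveStatus`, its method names, and the exact dictionary layout are assumptions; only the field names, the 200-entry capacity, and the 50-entry tail come from the description above.

```python
from collections import deque
from datetime import datetime, timezone

STAGES = ["scan", "extract", "chunk", "embed", "annotate", "conceptualize"]


class LiveStatus:
    """In-memory pipeline status (hypothetical shape, for illustration)."""

    def __init__(self, log_capacity=200):
        self.stages = {
            name: {"status": "pending", "progress_pct": 0.0, "item_count": 0}
            for name in STAGES
        }
        self.counters = {
            "chunk_count": 0,
            "annotation_count": 0,
            "concept_count": 0,
            "edge_count": 0,
        }
        # deque(maxlen=...) drops the oldest entry once capacity is reached,
        # which gives ring-buffer behaviour without manual index handling.
        self.activity_log = deque(maxlen=log_capacity)

    def log(self, message):
        self.activity_log.append(
            {"timestamp": datetime.now(timezone.utc).isoformat(), "message": message}
        )

    def update_stage(self, name, status, progress_pct, item_count):
        self.stages[name] = {
            "status": status,
            "progress_pct": progress_pct,
            "item_count": item_count,
        }

    def snapshot(self, log_tail=50):
        """Return a copy of the state with only the newest `log_tail` events."""
        return {
            "stages": {name: dict(stage) for name, stage in self.stages.items()},
            "counters": dict(self.counters),
            "activity_log": list(self.activity_log)[-log_tail:],
        }
```

If stages run on separate threads, access to this object would need a lock; the sketch omits that for brevity.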
## Key Design Decisions
- **LanceDB** over Qdrant: Embedded, no separate server, local-first
- **SQLite** for metadata/graph: Simple, reliable, WAL mode for concurrency
- **Deterministic chunk IDs**: SHA-256(asset_id + anchor + normalized_text_hash)
- **Versioned annotations**: Never overwrite, mark active by pipeline version
- **Evidence-native**: Every derived insight links back to source file + location
- **Fast polling over WebSocket**: 1.5s HTTP polls are simpler and sufficient for pipeline status; avoids connection lifecycle complexity
- **Ring buffer for activity log**: Fixed 200-entry buffer prevents memory growth during long pipeline runs
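
The deterministic chunk ID formula can be sketched as follows. The normalization step (Unicode NFC plus whitespace collapsing) and the unit-separator byte between fields are assumptions added for illustration; the formula above only specifies a SHA-256 over `asset_id`, `anchor`, and the normalized text hash.

```python
import hashlib
import unicodedata


def normalize_text(text):
    """Assumed normalization: NFC form with runs of whitespace collapsed."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def chunk_id(asset_id, anchor, text):
    """SHA-256 over asset ID, evidence anchor, and the normalized text hash.

    The "\x1f" separator is an assumption of this sketch; it keeps
    ("ab", "c") and ("a", "bc") from hashing to the same ID.
    """
    text_hash = hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
    payload = "\x1f".join([asset_id, anchor, text_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With an ID derived this way, re-ingesting an unchanged file yields the same chunk IDs, so existing embeddings and annotations can be matched to their chunks instead of being regenerated.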