A macOS Tahoe application that ingests heterogeneous document corpora,
extracts structured knowledge via local LLMs, and renders an immersive 3D concept universe.
Research papers, notes, e-books, medical images, code, archives — spread across folders with no semantic connections. Keyword search fails. Cloud tools leak your data. You need something that actually understands your corpus.
Keyword search misses semantically related content. "Machine learning" won't find your "neural network" papers.
Cloud-based tools send your sensitive documents, medical records, and proprietary research to third-party servers.
You have thousands of files but no way to see how concepts relate across documents, formats, and domains.
Uses LM Studio running locally — your data never leaves your machine. Embeddings for similarity search, structured annotation for deep understanding.
Every annotation, concept, and search result traces back to an exact location: file, page, chapter, even byte offset within nested archives.
WebGPU-powered visualization renders your knowledge as an interactive 3D graph. Zoom from macro concepts to individual chunks. See how ideas connect.
Archive extraction runs in macOS sandbox-exec. Changed files are detected via content hashing — only new/modified files are reprocessed.
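As a rough illustration of hash-based change detection, the sketch below streams a file's contents through SHA-256 and compares the digest with a previously stored one. The function names (`hashReader`, `hashString`, `needsReprocessing`) are hypothetical and not taken from the daemon's source; only the use of SHA-256 comes from this document.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// hashReader streams r through SHA-256 so large files are never held
// fully in memory. (Hypothetical helper, not the daemon's actual code.)
func hashReader(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashString is a convenience wrapper used by the demo below.
func hashString(s string) (string, error) {
	return hashReader(strings.NewReader(s))
}

// needsReprocessing reports whether the content changed since the
// digest in storedHash was recorded. An empty storedHash means "new file".
func needsReprocessing(storedHash, currentHash string) bool {
	return storedHash != currentHash
}

func main() {
	sum, err := hashString("hello")
	if err != nil {
		panic(err)
	}
	fmt.Println("digest:", sum)
	fmt.Println("new file, reprocess:", needsReprocessing("", sum))
	fmt.Println("unchanged, reprocess:", needsReprocessing(sum, sum))
}
```

In a real scan, `hashReader` would presumably be given an `*os.File` opened on each path in a watched folder.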
Every file flows through six deterministic stages. The pipeline is incremental — unchanged files are never reprocessed.
pending → extracted → chunked → embedded → annotated
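One way to picture the status progression is as an ordered list with a single "advance" operation. This is a minimal sketch that uses only the status names shown above; the `nextStatus` helper is an assumption for illustration and says nothing about how the daemon actually stores or schedules work.

```go
package main

import "fmt"

// statuses lists the per-file states in the order shown in the README.
var statuses = []string{"pending", "extracted", "chunked", "embedded", "annotated"}

// nextStatus returns the status that follows cur. ok is false when cur is
// the final status or is not a recognized status.
func nextStatus(cur string) (next string, ok bool) {
	for i, s := range statuses {
		if s == cur && i+1 < len(statuses) {
			return statuses[i+1], true
		}
	}
	return "", false
}

func main() {
	s := "pending"
	for {
		fmt.Println(s)
		n, ok := nextStatus(s)
		if !ok {
			break
		}
		s = n
	}
}
```

Because each file carries its own status, a restart can resume from wherever a file stopped rather than starting over.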
Priority-sorted extractor registry. Each extractor produces ContentAtoms with evidence anchors. The Tika fallback handles anything the others miss.
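A priority-sorted registry with a catch-all fallback might look roughly like the following. The `Extractor` and `Registry` types, the priority values, and the file-extension check are all hypothetical; the only details taken from this README are the priority ordering and the Tika fallback.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Extractor is a hypothetical, simplified stand-in for a real extractor.
type Extractor struct {
	Name      string
	Priority  int // assumption: higher values are tried first
	CanHandle func(path string) bool
}

// Registry keeps extractors sorted by descending priority.
type Registry struct{ extractors []Extractor }

// Register adds e and re-sorts so Pick can scan front to back.
func (r *Registry) Register(e Extractor) {
	r.extractors = append(r.extractors, e)
	sort.SliceStable(r.extractors, func(i, j int) bool {
		return r.extractors[i].Priority > r.extractors[j].Priority
	})
}

// Pick returns the highest-priority extractor that accepts path.
func (r *Registry) Pick(path string) (Extractor, bool) {
	for _, e := range r.extractors {
		if e.CanHandle(path) {
			return e, true
		}
	}
	return Extractor{}, false
}

// newDemoRegistry registers one specific extractor plus a fallback that
// accepts everything, mirroring the "Tika fallback" idea.
func newDemoRegistry() *Registry {
	r := &Registry{}
	r.Register(Extractor{Name: "tika-fallback", Priority: 0,
		CanHandle: func(string) bool { return true }})
	r.Register(Extractor{Name: "pdf", Priority: 10,
		CanHandle: func(p string) bool {
			return strings.HasSuffix(strings.ToLower(p), ".pdf")
		}})
	return r
}

func main() {
	r := newDemoRegistry()
	for _, p := range []string{"paper.PDF", "scan.dcm"} {
		e, _ := r.Pick(p)
		fmt.Println(p, "->", e.Name)
	}
}
```

Giving the fallback the lowest priority means it only wins when no specialized extractor claims the file.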
Every chunk is annotated by the local LLM, producing structured metadata that enriches search and powers concept formation.
"This text describes various cryptographic techniques, including symmetric and asymmetric encryption algorithms like AES and RSA..."
| Field | Type |
|---|---|
| topics | 2-5 multi-label tags |
| entities | Named entities + type |
| claims | Extracted claims + confidence |
| sentiment | Label + confidence score |
| summary | 1-2 sentence summary |
| quality_flags | Truncated, technical, etc. |
is_current=1, preserving the full audit trail.
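To make the annotation fields listed above concrete, here is a sketch of how they could be modeled and validated in Go. The top-level JSON keys follow the field names in the table; the nested key names (`text`, `type`, `label`, `confidence`) and the `parseAnnotation` helper are assumptions, since the README does not show the actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entity, Claim, and Sentiment use assumed key names for illustration.
type Entity struct {
	Text string `json:"text"`
	Type string `json:"type"`
}

type Claim struct {
	Text       string  `json:"text"`
	Confidence float64 `json:"confidence"`
}

type Sentiment struct {
	Label      string  `json:"label"`
	Confidence float64 `json:"confidence"`
}

// Annotation mirrors the six fields in the table.
type Annotation struct {
	Topics       []string  `json:"topics"`
	Entities     []Entity  `json:"entities"`
	Claims       []Claim   `json:"claims"`
	Sentiment    Sentiment `json:"sentiment"`
	Summary      string    `json:"summary"`
	QualityFlags []string  `json:"quality_flags"`
}

// Validate checks the one constraint the table states: 2-5 topics.
func (a Annotation) Validate() error {
	if n := len(a.Topics); n < 2 || n > 5 {
		return fmt.Errorf("expected 2-5 topics, got %d", n)
	}
	return nil
}

// parseAnnotation decodes an LLM response and validates it.
func parseAnnotation(raw []byte) (Annotation, error) {
	var a Annotation
	if err := json.Unmarshal(raw, &a); err != nil {
		return a, err
	}
	return a, a.Validate()
}

func main() {
	raw := []byte(`{"topics":["cryptography","encryption"],
		"entities":[{"text":"AES","type":"algorithm"}],
		"sentiment":{"label":"neutral","confidence":0.9}}`)
	a, err := parseAnnotation(raw)
	fmt.Println(a.Topics, a.Entities, err)
}
```

Validating the decoded struct before storing it is one way to catch malformed model output early.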
Concepts are rendered as a force-directed 3D graph using WebGPU. Orbit, zoom, and click to explore how your knowledge connects.
| Component | Details |
|---|---|
| Renderer | WebGPU in WKWebView |
| Shaders | WGSL (214 lines) |
| Layout | Force-directed (Velocity Verlet) |
| Nodes | Billboarded quads with circle SDF + glow |
| Camera | Orbit (drag), Pan (right-drag), Zoom (scroll) |

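For readers unfamiliar with the Velocity Verlet integrator named in the layout row, the sketch below shows one integration step. It is written in Go for consistency with the other examples here, although the actual layout presumably runs inside the WebGPU view. The `Node` type and the `centering` force are illustrative assumptions; a real force-directed layout would add edge springs and node repulsion.

```go
package main

import "fmt"

// Vec3 is a minimal 3D vector.
type Vec3 struct{ X, Y, Z float64 }

func (a Vec3) Add(b Vec3) Vec3      { return Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z} }
func (a Vec3) Scale(s float64) Vec3 { return Vec3{a.X * s, a.Y * s, a.Z * s} }

// Node holds position, velocity, and the acceleration from the last step.
type Node struct{ Pos, Vel, Acc Vec3 }

// verletStep advances every node by dt using Velocity Verlet:
//  1. x' = x + v*dt + 0.5*a*dt^2
//  2. a' = accel(x')
//  3. v' = v + 0.5*(a + a')*dt
func verletStep(nodes []Node, dt float64, accel func([]Node) []Vec3) {
	for i := range nodes {
		n := &nodes[i]
		n.Pos = n.Pos.Add(n.Vel.Scale(dt)).Add(n.Acc.Scale(0.5 * dt * dt))
	}
	next := accel(nodes) // accelerations at the updated positions
	for i := range nodes {
		n := &nodes[i]
		n.Vel = n.Vel.Add(n.Acc.Add(next[i]).Scale(0.5 * dt))
		n.Acc = next[i]
	}
}

// centering is a placeholder force that pulls each node toward the origin.
func centering(nodes []Node) []Vec3 {
	out := make([]Vec3, len(nodes))
	for i, n := range nodes {
		out[i] = n.Pos.Scale(-1.0)
	}
	return out
}

func main() {
	nodes := []Node{{Pos: Vec3{X: 1}}, {Pos: Vec3{Y: -2}}}
	for step := 0; step < 3; step++ {
		verletStep(nodes, 0.1, centering)
	}
	fmt.Printf("%+v\n", nodes)
}
```

Compared with plain Euler integration, averaging the old and new accelerations in step 3 tends to keep spring-like systems from gaining energy over time.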
| LOD | Shows | Zoom |
|---|---|---|
| MACRO | Concept clusters only | Distant |
| MID | Concepts + sub-concepts | Medium |
| NEAR | All nodes + all edges | Close |
Every derived artifact links back to its source. The data model captures the full provenance chain from file to concept.
Every ContentAtom, Chunk, and Edge stores a JSON evidence anchor linking to the exact source location.
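A JSON evidence anchor could be shaped along these lines. The README does not show the real schema, so every field name below (`file`, `page`, `chapter`, `byte_offset`, `archive_path`) is a guess based on the locations it mentions: file, page, chapter, and byte offset within nested archives.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EvidenceAnchor is a hypothetical shape for the JSON anchor.
// ByteOffset is a pointer so that a legitimate offset of 0 is still
// serialized, while an absent offset is omitted.
type EvidenceAnchor struct {
	File        string   `json:"file"`
	Page        int      `json:"page,omitempty"`
	Chapter     string   `json:"chapter,omitempty"`
	ByteOffset  *int64   `json:"byte_offset,omitempty"`
	ArchivePath []string `json:"archive_path,omitempty"` // outermost member first
}

// encodeAnchor serializes an anchor for storage alongside a chunk or edge.
func encodeAnchor(a EvidenceAnchor) (string, error) {
	b, err := json.Marshal(a)
	return string(b), err
}

// decodeAnchor parses a stored anchor back into a struct.
func decodeAnchor(s string) (EvidenceAnchor, error) {
	var a EvidenceAnchor
	err := json.Unmarshal([]byte(s), &a)
	return a, err
}

func main() {
	s, err := encodeAnchor(EvidenceAnchor{File: "paper.pdf", Page: 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)

	a, err := decodeAnchor(`{"file":"bundle.zip","byte_offset":0,"archive_path":["inner.tar","notes.txt"]}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(a.File, *a.ByteOffset, a.ArchivePath)
}
```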
| Storage | Details |
|---|---|
| SQLite | Metadata, vectors (BLOB), graph, jobs (WAL mode) |
| Disk | ~/.knowledge-refinery/ |
Zero cloud calls. Localhost-only daemon. Sandboxed extraction. Every security layer works locally.
Daemon binds exclusively to 127.0.0.1 — not reachable from the network. No tokens needed for a single-user local app.
Archive extraction runs in sandbox-exec with: no network, restricted filesystem, CPU and memory limits.
All archive member paths are validated and resolved against the extraction base directory. Traversal attempts are blocked.
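The usual way to block traversal in Go is to join the member name onto the extraction directory and then confirm the result is still inside it. The sketch below shows that pattern; `safeJoin` is a hypothetical name, and this version does not resolve symlinks, which a production extractor would also need to consider.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin resolves an archive member name against base and rejects any
// result that would land outside base. (Hypothetical helper; symlinks
// are not resolved here.)
func safeJoin(base, member string) (string, error) {
	if filepath.IsAbs(member) {
		return "", fmt.Errorf("absolute member path rejected: %q", member)
	}
	// Join cleans the path, collapsing any ".." segments.
	target := filepath.Join(base, member)
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("member escapes extraction dir: %q", member)
	}
	return target, nil
}

func main() {
	base := filepath.Join("tmp", "extract")
	for _, m := range []string{"docs/a.txt", "../../etc/passwd", "a/../../b"} {
		p, err := safeJoin(base, m)
		fmt.Printf("%-20s -> %q err=%v\n", m, p, err)
	}
}
```

Checking the relative path rather than a raw string prefix avoids false matches such as `/tmp/extract-evil` passing a prefix test against `/tmp/extract`.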
Limits enforced: 10,000 max files, 500MB total extracted, 50MB per file, max 3 nesting levels.
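These limits can be enforced with a small running budget that is consulted before each member is written. The sketch uses the four numbers stated above; the type names are hypothetical, "MB" is interpreted as MiB, and the outermost archive is counted as nesting level 1, both of which are assumptions.

```go
package main

import "fmt"

const mib = 1 << 20

// Limits holds the extraction caps listed in the README.
type Limits struct {
	MaxFiles      int
	MaxTotalBytes int64
	MaxFileBytes  int64
	MaxDepth      int
}

// DefaultLimits uses the documented values (MB taken as MiB, an assumption).
var DefaultLimits = Limits{
	MaxFiles:      10000,
	MaxTotalBytes: 500 * mib,
	MaxFileBytes:  50 * mib,
	MaxDepth:      3,
}

// Budget tracks how much of the limits one extraction has consumed.
type Budget struct {
	limits Limits
	files  int
	total  int64
}

func NewBudget(l Limits) *Budget { return &Budget{limits: l} }

// Admit checks one archive member against every limit and, if it fits,
// records it. depth is the archive nesting level (1 = outermost archive).
func (b *Budget) Admit(size int64, depth int) error {
	switch {
	case depth > b.limits.MaxDepth:
		return fmt.Errorf("nesting depth %d exceeds %d", depth, b.limits.MaxDepth)
	case size > b.limits.MaxFileBytes:
		return fmt.Errorf("file of %d bytes exceeds per-file limit", size)
	case b.files+1 > b.limits.MaxFiles:
		return fmt.Errorf("file count exceeds %d", b.limits.MaxFiles)
	case b.total+size > b.limits.MaxTotalBytes:
		return fmt.Errorf("total extracted size exceeds limit")
	}
	b.files++
	b.total += size
	return nil
}

func main() {
	b := NewBudget(DefaultLimits)
	fmt.Println(b.Admit(10*mib, 1)) // fits
	fmt.Println(b.Admit(51*mib, 1)) // too large for a single file
	fmt.Println(b.Admit(1*mib, 4))  // nested too deep
}
```

Since archive headers can misreport sizes, a real extractor would likely also cap the bytes actually copied, for example with `io.LimitReader`.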
LM Studio runs on localhost. Data stays on disk. No telemetry, no cloud APIs, no external network calls. Zero data leakage.
SHA-256 streaming hash for change detection. Deterministic chunk IDs ensure stable references across re-processing.
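Deterministic chunk IDs are typically derived by hashing values that do not change between runs. The recipe below (file content hash plus the chunk's start and end offsets, truncated to 128 bits) is an assumption chosen for illustration; the README does not say which inputs the daemon actually uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkID derives a stable identifier from the file's content hash and
// the chunk's offsets. Same inputs always give the same ID, so references
// to a chunk survive re-processing of an unchanged file.
// (Hypothetical recipe, not the daemon's documented scheme.)
func chunkID(fileHash string, start, end int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d:%d", fileHash, start, end)))
	return hex.EncodeToString(sum[:16]) // 128 bits, 32 hex characters
}

func main() {
	id1 := chunkID("2cf24dba", 0, 512)
	id2 := chunkID("2cf24dba", 0, 512)
	id3 := chunkID("2cf24dba", 512, 1024)
	fmt.Println(id1 == id2, id1 == id3)
}
```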
One dashboard to manage everything. LM Studio monitoring, multi-workspace lifecycle, visual data lake mapping — all from a single window.
LM Studio status card with model names. Workspace grid with start/stop toggles, vector counts, and color-coded cards. "Start All" for one-click launch.
Each workspace gets its own data directory, daemon port (8742, 8743, ...), and SQLite database (metadata + vectors + graph). Independent lifecycle management.
Visual Canvas view showing which folders feed which workspaces. Bézier curves connect data lakes to knowledge bases with color-coded workspace tags.
Direct polling of /v1/models every 5 seconds. Auto-classifies chat vs. embedding models. Independent of daemon status — green means ready.
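The classification step can be sketched as follows, assuming `/v1/models` returns an OpenAI-style list object with a `data` array of `{ "id": ... }` entries. The "contains `embed`" heuristic is a guess, not the app's documented rule. The sketch covers parsing only; the 5-second loop could be driven by a `time.Ticker` around an HTTP GET.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// modelList matches the assumed OpenAI-style response shape.
type modelList struct {
	Data []struct {
		ID string `json:"id"`
	} `json:"data"`
}

// classifyModels splits model IDs into chat and embedding groups using a
// simple name heuristic (assumption: embedding models contain "embed").
func classifyModels(body []byte) (chat, embedding []string, err error) {
	var ml modelList
	if err = json.Unmarshal(body, &ml); err != nil {
		return nil, nil, err
	}
	for _, m := range ml.Data {
		if strings.Contains(strings.ToLower(m.ID), "embed") {
			embedding = append(embedding, m.ID)
		} else {
			chat = append(chat, m.ID)
		}
	}
	return chat, embedding, nil
}

func main() {
	body := []byte(`{"object":"list","data":[
		{"id":"qwen3-4b-2507","object":"model"},
		{"id":"nomic-embed-text-v1.5","object":"model"}]}`)
	chat, emb, err := classifyModels(body)
	fmt.Println(chat, emb, err)
}
```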
Start, stop, restart per-workspace daemons. Environment variables KR_DATA_DIR and KR_PORT injected automatically. Live log capture (last 500 lines).
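On the daemon side, reading the injected variables might look like this sketch. `KR_DATA_DIR`, `KR_PORT`, the 8742 starting port, and the 127.0.0.1 bind address come from this README; the `Config` type, the fallback to 8742 when `KR_PORT` is unset, and the error handling are assumptions.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// Config holds the per-workspace settings injected by the app.
type Config struct {
	DataDir string
	Port    int
}

// loadConfig reads KR_DATA_DIR and KR_PORT through getenv, which makes
// the function easy to exercise without touching the real environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{DataDir: getenv("KR_DATA_DIR"), Port: 8742} // default port is an assumption
	if cfg.DataDir == "" {
		return cfg, fmt.Errorf("KR_DATA_DIR is not set")
	}
	if p := getenv("KR_PORT"); p != "" {
		n, err := strconv.Atoi(p)
		if err != nil || n < 1 || n > 65535 {
			return cfg, fmt.Errorf("invalid KR_PORT %q", p)
		}
		cfg.Port = n
	}
	return cfg, nil
}

// ListenAddr always uses the loopback address, never 0.0.0.0.
func (c Config) ListenAddr() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(c.Port))
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("data dir:", cfg.DataDir, "listen:", cfg.ListenAddr())
}
```

Passing `cfg.ListenAddr()` to `http.ListenAndServe` would then keep the daemon reachable only from the local machine.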
Create workspaces with name, color tag, and folder picker. Native NSOpenPanel for multi-selecting data lake paths including external drives.
| Requirement | Version |
|---|---|
| macOS | Tahoe (26.x) on Apple Silicon |
| Xcode | 26.x or Command Line Tools |
| Go | 1.22+ (from go.dev or Homebrew) |
| LM Studio | From lmstudio.ai (free) |
Validates macOS version, architecture, Xcode tools, Swift, and Go version.
Compiles the Go daemon into a single 17MB binary with zero runtime dependencies.
Swift release build → proper .app with Info.plist, bundled daemon, and WebGPU resources.
Copies the app bundle. Appears in Launchpad and Spotlight immediately.
Green/red status, loaded model names (chat + embedding), port number. Polls directly at /v1/models.
Color-coded cards with daemon toggle, vector count, data lake count. Click "Open" to enter a workspace.
Canvas-drawn Bézier curves connecting folder paths to workspaces. See which folders feed which knowledge bases.
Full NavigationSplitView with semantic search, WebGPU 3D concept graph, hierarchical concept browser with "Why?" explanations.
Pipeline monitoring, volume management, file inventory. Each workspace has its own daemon process and data directory.
Play/stop/restart buttons in the header bar. Live log viewer showing last 500 lines of daemon stdout/stderr.
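Keeping only the most recent 500 lines is a natural fit for a fixed-size ring buffer. The sketch below is written in Go for consistency with the other examples, although the log viewer itself lives in the Swift app; `LogBuffer` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// LogBuffer retains the most recent `capacity` lines. capacity must be > 0.
type LogBuffer struct {
	mu       sync.Mutex
	lines    []string
	start    int // index of the oldest line once the buffer is full
	capacity int
}

func NewLogBuffer(capacity int) *LogBuffer {
	return &LogBuffer{lines: make([]string, 0, capacity), capacity: capacity}
}

// Append adds a line, overwriting the oldest one when the buffer is full.
func (b *LogBuffer) Append(line string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.lines) < b.capacity {
		b.lines = append(b.lines, line)
		return
	}
	b.lines[b.start] = line
	b.start = (b.start + 1) % b.capacity
}

// Lines returns a copy of the retained lines, oldest first.
func (b *LogBuffer) Lines() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]string, 0, len(b.lines))
	out = append(out, b.lines[b.start:]...)
	out = append(out, b.lines[:b.start]...)
	return out
}

func main() {
	buf := NewLogBuffer(500)
	for i := 0; i < 502; i++ {
		buf.Append(fmt.Sprintf("line %d", i))
	}
	got := buf.Lines()
	fmt.Println(len(got), got[0], got[len(got)-1])
}
```

The mutex lets one goroutine feed lines from the daemon's stdout and stderr while another reads a snapshot for display.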
"cryptography encryption"
"This text describes various cryptographic techniques, including symmetric and asymmetric encryption..."
The daemon exposes a REST API on localhost. No endpoint requires authentication, because the daemon binds only to 127.0.0.1 and is not reachable from other machines.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check + LM Studio status (no auth) |
| POST | /volumes/add | Register a folder to watch |
| GET | /volumes/list | List all watched volumes |
| POST | /ingest/start | Trigger full 6-stage pipeline |
| GET | /ingest/status | Pipeline progress + per-stage counts |
| POST | /search | Semantic vector search with annotation enrichment |
| GET | /search/quick?q=... | Quick search via query parameter |
| GET | /universe/snapshot?lod=macro | Graph snapshot at given LOD level |
| POST | /universe/focus | Focus on node and return neighborhood |
| GET | /concepts/list | List all concept clusters |
| GET | /concepts/{id} | Concept detail with member chunks |
| GET | /concepts/{id}/why | "Why this concept?" with evidence chain |
| GET | /evidence/chunk/{id}/annotation | Full annotation for a specific chunk |
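As a starting point for calling the GET endpoints in the table above, the helper below builds correctly escaped URLs. The paths and the `q` and `lod` parameters are taken from the table; the `endpointURL` helper and the choice of port 8742 are illustrative, and response bodies are not modeled because their schema is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
)

// endpointURL builds a URL for a daemon endpoint on the loopback
// interface, escaping query parameters. (Hypothetical helper.)
func endpointURL(port int, path string, params map[string]string) string {
	u := url.URL{
		Scheme: "http",
		Host:   fmt.Sprintf("127.0.0.1:%d", port),
		Path:   path,
	}
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String()
}

func main() {
	fmt.Println(endpointURL(8742, "/health", nil))
	fmt.Println(endpointURL(8742, "/search/quick",
		map[string]string{"q": "cryptography encryption"}))
	fmt.Println(endpointURL(8742, "/universe/snapshot",
		map[string]string{"lod": "macro"}))
}
```

Each resulting string can be passed to `http.Get` while the workspace daemon is running.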
| Component | Technology |
|---|---|
| Framework | SwiftUI (macOS Tahoe / 26.x) |
| Language | Swift 6.2.3 |
| Architecture | Multi-workspace + dashboard |
| 3D Rendering | WebGPU via WKWebView |
| File Preview | QuickLook framework |
| Build | SPM + build.sh → .app bundle |
| Install | make install or install.sh |

| Component | Technology |
|---|---|
| Language | Go 1.22+ (single 17MB binary) |
| HTTP Router | chi/v5 + CORS middleware |
| Storage | SQLite (metadata + vectors + graph, WAL mode) |
| SQLite Driver | modernc.org/sqlite (pure Go, no CGo) |
| LLM Client | net/http → LM Studio |
| PDF Text | pdftotext (poppler-utils) |
| Tokenizer | tiktoken-go (cl100k_base) |
| Math | Pure Go (k-means++, cosine sim) |

| Role | Model |
|---|---|
| Chat | qwen3-4b-2507 |
| Embeddings | nomic-embed-text-v1.5 (768-dim) |
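The cosine similarity listed under "Math" needs nothing beyond the standard library. Below is a minimal sketch over `float32` embedding vectors, such as the 768-dimensional ones named above. The function name and the choice to return 0 for zero-length vectors are assumptions, not the daemon's documented behavior.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b in [-1, 1].
// It returns an error for mismatched dimensions and 0 when either
// vector has zero magnitude (a convention chosen for this sketch).
func cosine(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("dimension mismatch: %d vs %d", len(a), len(b))
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, nil
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb)), nil
}

func main() {
	query := []float32{1, 0, 1}
	docs := map[string][]float32{
		"crypto-notes": {1, 0, 1},
		"unrelated":    {0, 1, 0},
	}
	for name, vec := range docs {
		score, err := cosine(query, vec)
		fmt.Println(name, score, err)
	}
}
```

Accumulating in `float64` keeps rounding error small even when summing across hundreds of dimensions.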
Fully local. Evidence-native. LLM-powered. Installable.
Clone, install, launch — from raw documents to searchable knowledge in minutes.