Local-First · Evidence-Native · LLM-Powered

Knowledge Refinery

A macOS Tahoe application that ingests heterogeneous document corpora,
extracts structured knowledge via local LLMs, and renders an immersive 3D concept universe.

7 Milestones · 6 Pipeline Stages · 30 Unit Tests · 0 Cloud Calls
macOS · SwiftUI · Go · chi · LM Studio · SQLite · WebGPU · Multi-Workspace
The Problem

Your knowledge is scattered, siloed, and unsearchable

Research papers, notes, e-books, medical images, code, archives — spread across folders with no semantic connections. Keyword search fails. Cloud tools leak your data. You need something that actually understands your corpus.

🔍

Search is Broken

Keyword search misses semantically related content. "Machine learning" won't find your "neural network" papers.

🔒

Privacy Concerns

Cloud-based tools send your sensitive documents, medical records, and proprietary research to third-party servers.

📁

No Connections

You have thousands of files but no way to see how concepts relate across documents, formats, and domains.

The Solution

Knowledge Refinery: Your Local Knowledge Engine

🧠

Semantic Understanding via Local LLMs

Uses LM Studio running locally — your data never leaves your machine. Embeddings for similarity search, structured annotation for deep understanding.

🔗

Evidence-Native — Every Insight Links to Source

Every annotation, concept, and search result traces back to an exact location: file, page, chapter, even byte offset within nested archives.

🌌

3D Concept Universe

WebGPU-powered visualization renders your knowledge as an interactive 3D graph. Zoom from macro concepts to individual chunks. See how ideas connect.

🛡

Sandboxed & Incremental

Archive extraction runs in macOS sandbox-exec. Changed files are detected via content hashing — only new/modified files are reprocessed.

Architecture

Three-Tier Local Architecture

📱 SwiftUI App

Search — Semantic vector search
Universe — WebGPU 3D visualization
Concepts — Cluster browser + "Why?"
Ingest — Pipeline monitoring
Volumes — Folder management
Assets — File inventory + Quick Look
HTTP (localhost) 127.0.0.1:8742

⚙️ Go Daemon

chi router with CORS
6-Stage Pipeline orchestrator
SQLite metadata + vectors + graph (WAL)
17MB single binary, zero deps
7 Extractors + fallback chain
OpenAI-Compatible API

🤖 LM Studio

qwen3-4b-2507 — Chat / Annotation
nomic-embed-text-v1.5 — Embeddings
Running on 127.0.0.1:1234
Core Engine

Six-Stage Ingestion Pipeline

Every file flows through six deterministic stages. The pipeline is incremental — unchanged files are never reprocessed.

1. Scan · Walk directories, compute SHA-256 content hashes, detect new/changed files
2. Extract · Produce ContentAtoms with evidence anchors (page, chapter, offset)
3. Chunk · Deterministic 500-800 token splits with 50-token overlap, stable IDs
4. Embed · 768-dim vectors via nomic-embed-text, stored in SQLite
5. Annotate · LLM extracts topics, entities, claims, sentiment, summary
6. Conceptualize · K-means++ clustering, concept labeling, kNN similarity graph
Status tracked per-asset: pending → extracted → chunked → embedded → annotated
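The stable chunk IDs from Stage 3 can be sketched as a content hash over the parent asset ID, the chunk's index, and the chunk text. Only the determinism is stated in the text; the exact field layout below is an assumption for illustration:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkID derives a deterministic identifier from the parent asset's
// ID, the chunk's position, and the chunk text itself. Re-running the
// pipeline over unchanged input yields identical IDs, so downstream
// vectors and annotations keep stable references.
// (The concrete field layout is an assumption for illustration.)
func chunkID(assetID string, index int, text string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%d\x00%s", assetID, index, text)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	id1 := chunkID("abc123", 0, "symmetric encryption ...")
	id2 := chunkID("abc123", 0, "symmetric encryption ...")
	fmt.Println(id1 == id2) // deterministic: same inputs, same ID
}
```

Because the ID depends only on content and position, unchanged files reprocess to identical IDs and can be skipped.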
Format Support

Seven Pluggable Extractors

Priority-sorted extractor registry. Each extractor produces ContentAtoms with evidence anchors. The Tika fallback handles anything the others miss.

📑
PDF
.pdf · Per-page text · pdftotext
Priority 20
📖
EPUB
.epub · OPF spine order · Metadata
Priority 18
🖼
Image + OCR
.png .jpg .heic · macOS Vision
Priority 15
🩻
DICOM
.dcm · Medical metadata · Binary header parsing
Priority 15
📝
Text
.txt .md .html .rtf · Tag stripping
Priority 10
📦
Archive
.zip .tar.gz · Sandboxed · Zip-slip safe
Priority 5
🔄
Tika Fallback
.doc .docx .pptx · textutil · Raw text
Priority 1
Custom
Extend BaseExtractor · Register in registry
Pluggable
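A minimal sketch of a priority-sorted registry with a catch-all fallback, under a deliberately simplified interface (the real BaseExtractor API in daemon-go will differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// Extractor is a simplified stand-in for the real plugin interface.
type Extractor struct {
	Name     string
	Priority int      // higher wins
	Exts     []string // handled extensions; empty = fallback
}

func (e Extractor) CanHandle(path string) bool {
	if len(e.Exts) == 0 {
		return true // fallback extractor accepts anything
	}
	ext := strings.ToLower(filepath.Ext(path))
	for _, x := range e.Exts {
		if x == ext {
			return true
		}
	}
	return false
}

// Registry keeps extractors sorted by descending priority, so the
// first match wins and the priority-1 fallback catches the rest.
type Registry struct{ extractors []Extractor }

func (r *Registry) Register(e Extractor) {
	r.extractors = append(r.extractors, e)
	sort.SliceStable(r.extractors, func(i, j int) bool {
		return r.extractors[i].Priority > r.extractors[j].Priority
	})
}

func (r *Registry) Pick(path string) (Extractor, bool) {
	for _, e := range r.extractors {
		if e.CanHandle(path) {
			return e, true
		}
	}
	return Extractor{}, false
}

func main() {
	var r Registry
	r.Register(Extractor{Name: "tika-fallback", Priority: 1})
	r.Register(Extractor{Name: "pdf", Priority: 20, Exts: []string{".pdf"}})
	e, _ := r.Pick("paper.pdf")
	fmt.Println(e.Name) // pdf
	e, _ = r.Pick("slides.pptx")
	fmt.Println(e.Name) // tika-fallback
}
```

Registration order doesn't matter: the stable sort means a new extractor slots in purely by its priority.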
LLM Intelligence

Rich Structured Annotations

Every chunk is annotated by the local LLM, producing structured metadata that enriches search and powers concept formation.

Topics
cryptography encryption security post-quantum digital signatures
Summary

"This text describes various cryptographic techniques, including symmetric and asymmetric encryption algorithms like AES and RSA..."

Entities
AES RSA ECC NIST SHA-256 EdDSA
Sentiment
⚖ neutral · 0.95

Annotation Fields

Field · Description
topics · 2-5 multi-label tags
entities · Named entities + type
claims · Extracted claims + confidence
sentiment · Label + confidence score
summary · 1-2 sentence summary
quality_flags · Truncated, technical, etc.
Versioned & Immutable: Annotations are never overwritten. New model/prompt versions create new records with is_current=1, preserving the full audit trail.
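The versioned, append-only invariant can be illustrated in memory; the real daemon persists this in SQLite, so the Store type below is purely illustrative:

```go
package main

import "fmt"

// Annotation records are append-only: a new model/prompt version adds
// a record and flips is_current, never overwriting history.
type Annotation struct {
	ChunkID      string
	ModelVersion string
	Summary      string
	IsCurrent    bool
}

// Store stands in for the SQLite annotations table.
type Store struct{ rows []Annotation }

// Supersede appends a new annotation and demotes any previous current
// record for the same chunk, preserving the full audit trail.
func (s *Store) Supersede(a Annotation) {
	for i := range s.rows {
		if s.rows[i].ChunkID == a.ChunkID {
			s.rows[i].IsCurrent = false
		}
	}
	a.IsCurrent = true
	s.rows = append(s.rows, a)
}

// Current returns the single live annotation for a chunk, if any.
func (s *Store) Current(chunkID string) (Annotation, bool) {
	for _, r := range s.rows {
		if r.ChunkID == chunkID && r.IsCurrent {
			return r, true
		}
	}
	return Annotation{}, false
}

func main() {
	var s Store
	s.Supersede(Annotation{ChunkID: "c1", ModelVersion: "v1", Summary: "old"})
	s.Supersede(Annotation{ChunkID: "c1", ModelVersion: "v2", Summary: "new"})
	cur, _ := s.Current("c1")
	fmt.Println(cur.ModelVersion, len(s.rows)) // v2 2
}
```

Both records survive the upgrade; only the is_current flag moves, which is what makes the audit trail possible.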
Visualization

3D Concept Universe

Concepts are rendered as a force-directed 3D graph using WebGPU. Orbit, zoom, and click to explore how your knowledge connects.

AI · Learning · Data · Systems · Security
9 nodes · MACRO · 60 fps

Rendering Engine

Renderer · WebGPU in WKWebView
Shaders · WGSL (214 lines)
Layout · Force-directed (Velocity Verlet)
Nodes · Billboarded quads with circle SDF + glow
Camera · Orbit (drag), Pan (right-drag), Zoom (scroll)

Level-of-Detail System

LOD · Shows · Zoom
MACRO · Concept clusters only · Distant
MID · Concepts + sub-concepts · Medium
NEAR · All nodes + all edges · Close
Data Model

Evidence-Native Knowledge Graph

Every derived artifact links back to its source. The data model captures the full provenance chain from file to concept.

-- Provenance chain:
WatchedVolume
  (contains files)
FileAsset        -- SHA-256 ID, content hash
  (extracted into)
ContentAtom      -- text/image/table/metadata
  (split into)
Chunk            -- deterministic ID, 500-800 tokens
  ├──> Vector     (SQLite BLOB, 768-dim)
  ├──> Annotation (versioned, immutable)
  └──> GraphEdge ──> ConceptNode (hierarchical clusters)

Evidence Anchors

Every ContentAtom, Chunk, and Edge stores a JSON evidence anchor linking to the exact source location.

{
  "asset_id": "abc123...",
  "page": 5,
  "chapter": "Introduction",
  "archive_chain": "docs.zip/paper.pdf",
  "line_start": 42,
  "line_end": 58
}

Storage

SQLite · Metadata, vectors (BLOB), graph, jobs (WAL mode)
Disk · ~/.knowledge-refinery/
Security

Defense in Depth

Zero cloud calls. Localhost-only daemon. Sandboxed extraction. Every security layer works locally.

🔐

Localhost-Only Binding

Daemon binds exclusively to 127.0.0.1 — not reachable from the network. No tokens needed for a single-user local app.

🛡

macOS Sandbox

Archive extraction runs in sandbox-exec with: no network, restricted filesystem, CPU and memory limits.

🚫

Zip-Slip Prevention

All archive member paths are validated and resolved against the extraction base directory. Traversal attempts are blocked.

💣

Archive Bomb Detection

Limits enforced: 10,000 max files, 500MB total extracted, 50MB per file, max 3 nesting levels.
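The zip-slip check boils down to resolving each member path against the extraction base and requiring it to stay inside. The limit constants come from the text; the validation helper itself is an illustrative sketch:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Limits stated in the text: 10,000 files, 500MB total extracted,
// 50MB per file, max 3 nesting levels.
const (
	maxFiles   = 10_000
	maxTotal   = 500 << 20
	maxPerFile = 50 << 20
	maxNesting = 3
)

// safeDest rejects zip-slip attempts: filepath.Join cleans ".."
// segments, so an escaping member resolves outside the base and
// fails the prefix check.
func safeDest(base, member string) (string, error) {
	dest := filepath.Join(base, member)
	if dest != base && !strings.HasPrefix(dest, base+string(filepath.Separator)) {
		return "", fmt.Errorf("zip-slip blocked: %q escapes %q", member, base)
	}
	return dest, nil
}

func main() {
	if _, err := safeDest("/tmp/extract", "../../etc/passwd"); err != nil {
		fmt.Println(err) // traversal attempt is rejected
	}
	dest, _ := safeDest("/tmp/extract", "docs/paper.pdf")
	fmt.Println(dest)
}
```

The bomb limits would be enforced in the same loop: count files, sum extracted bytes, and track nesting depth, aborting as soon as any constant above is exceeded.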

💻

100% Local

LM Studio runs on localhost. Data stays on disk. No telemetry, no cloud APIs, no external network calls. Zero data leakage.

🔍

Content Hashing

SHA-256 streaming hash for change detection. Deterministic chunk IDs ensure stable references across re-processing.

Milestone 7

Master Control App

One dashboard to manage everything. LM Studio monitoring, multi-workspace lifecycle, visual data lake mapping — all from a single window.

📊

Dashboard

LM Studio status card with model names. Workspace grid with start/stop toggles, vector counts, and color-coded cards. "Start All" for one-click launch.

📦

Multi-Workspace

Each workspace gets its own data directory, daemon port (8742, 8743, ...), and SQLite database (metadata + vectors + graph). Independent lifecycle management.

🗺

Data Lake Mapping

Visual Canvas view showing which folders feed which workspaces. Bézier curves connect data lakes to knowledge bases with color-coded workspace tags.

🧠

LM Studio Monitor

Direct polling of /v1/models every 5 seconds. Auto-classifies chat vs. embedding models. Independent of daemon status — green means ready.

🔧

Daemon Lifecycle

Start, stop, restart per-workspace daemons. Environment variables KR_DATA_DIR and KR_PORT injected automatically. Live log capture (last 500 lines).

📁

Workspace Setup

Create workspaces with name, color tag, and folder picker. Native NSOpenPanel for multi-selecting data lake paths including external drives.

WorkspaceConfig.swift
LMStudioMonitor.swift
MasterDashboardView.swift
DataLakeMappingView.swift
WorkspaceDetailView.swift
Installation

From Clone to Running in One Command

Prerequisites

macOS · Tahoe (26.x) on Apple Silicon
Xcode · 26.x or Command Line Tools
Go · 1.22+ (from go.dev or Homebrew)
LM Studio · From lmstudio.ai (free)

What the Installer Does

1. Checks prerequisites
   Validates macOS version, architecture, Xcode tools, Swift, and Go version.

2. Builds Go daemon
   Compiles the Go daemon into a single 17MB binary with zero runtime dependencies.

3. Builds the .app bundle
   Swift release build → proper .app with Info.plist, bundled daemon, and WebGPU resources.

4. Installs to /Applications
   Copies the app bundle. Appears in Launchpad and Spotlight immediately.

One-Line Install

# Clone and install
git clone <repo-url>
cd LongLocalTimeHorizonInfoRetrieval
bash scripts/install.sh

Make Targets

make build       # Build .app bundle to dist/
make install     # Full install to /Applications
make test        # Run all tests
make app-run     # Dev mode (swift run)
make daemon-run  # Run daemon directly
make clean       # Remove build artifacts

App Bundle Structure

Knowledge Refinery.app/
  Contents/
    Info.plist
    MacOS/
      KnowledgeRefinery          # launcher
      KnowledgeRefinery-bin      # Swift binary
    Resources/
      knowledge-refinery-daemon  # Go binary (17MB)
      WebGPU/                    # 3D renderer
        universe.html/js/wgsl
App Walkthrough

Dashboard + Workspace: Two-Level UI

Level 1: Dashboard

📊

LM Studio Card

Green/red status, loaded model names (chat + embedding), port number. Polls directly at /v1/models.

📦

Workspace Grid

Color-coded cards with daemon toggle, vector count, data lake count. Click "Open" to enter a workspace.

🗺

Data Lake Map

Canvas-drawn Bézier curves connecting folder paths to workspaces. See which folders feed which knowledge bases.

Level 2: Workspace

🔎

Search + Universe + Concepts

Full NavigationSplitView with semantic search, WebGPU 3D concept graph, hierarchical concept browser with "Why?" explanations.

Ingest + Volumes + Assets

Pipeline monitoring, volume management, file inventory. Each workspace has its own daemon process and data directory.

🔧

Daemon Controls + Logs

Play/stop/restart buttons in the header bar. Live log viewer showing last 500 lines of daemon stdout/stderr.

🟢 LM Studio (shared) · 🟢 Workspace 1 :8742 · 🟢 Workspace 2 :8743
Each workspace runs its own daemon.
Deep Dive

How Search Works

Query Flow

1 User types: "cryptography encryption"
2 Embed query via nomic-embed-text → 768-dim vector
3 Brute-force cosine similarity search (in-memory vectors)
4 Enrich with annotations (topics, summary, entities)
5 Return ranked results with evidence anchors

Example Result

📄 cryptography_basics.html score: 0.378
LLM SUMMARY

"This text describes various cryptographic techniques, including symmetric and asymmetric encryption..."

ENTITIES
AES RSA ECC NIST SHA-256
⚖ neutral
Key insight: a search for "cryptography" surfaces the right file even when the exact query terms don't appear verbatim in the matching chunks. Semantic similarity beats keyword search.
API Reference

RESTful API Endpoints

The daemon exposes a clean REST API on localhost. All endpoints are accessible without authentication — the daemon only binds to 127.0.0.1.

Method · Endpoint · Description
GET /health · Health check + LM Studio status (no auth)
POST /volumes/add · Register a folder to watch
GET /volumes/list · List all watched volumes
POST /ingest/start · Trigger full 6-stage pipeline
GET /ingest/status · Pipeline progress + per-stage counts
POST /search · Semantic vector search with annotation enrichment
GET /search/quick?q=... · Quick search via query parameter
GET /universe/snapshot?lod=macro · Graph snapshot at given LOD level
POST /universe/focus · Focus on node and return neighborhood
GET /concepts/list · List all concept clusters
GET /concepts/{id} · Concept detail with member chunks
GET /concepts/{id}/why · "Why this concept?" with evidence chain
GET /evidence/chunk/{id}/annotation · Full annotation for a specific chunk
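As an illustration, a quick-search request against the first workspace's default port can be built like this; quickSearchURL is a hypothetical helper, not part of the daemon:

```go
package main

import (
	"fmt"
	"net/url"
)

// quickSearchURL builds the GET /search/quick request against the
// local daemon; 8742 is the first workspace's default port.
func quickSearchURL(port int, query string) string {
	u := url.URL{
		Scheme:   "http",
		Host:     fmt.Sprintf("127.0.0.1:%d", port),
		Path:     "/search/quick",
		RawQuery: url.Values{"q": {query}}.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(quickSearchURL(8742, "cryptography encryption"))
	// http://127.0.0.1:8742/search/quick?q=cryptography+encryption
}
```

Passing the result to http.Get (or curl) is all a client needs: no auth token, since the daemon only binds to 127.0.0.1.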
Technology

Technology Stack

Frontend (macOS App)

Framework · SwiftUI (macOS Tahoe / 26.x)
Language · Swift 6.2.3
Architecture · Multi-workspace + dashboard
3D Rendering · WebGPU via WKWebView
File Preview · QuickLook framework
Build · SPM + build.sh → .app bundle
Install · make install or install.sh

Backend (Go Daemon)

Language · Go 1.22+ (single 17MB binary)
HTTP Router · chi/v5 + CORS middleware
Storage · SQLite (metadata + vectors + graph, WAL mode)
SQLite Driver · modernc.org/sqlite (pure Go, no CGo)
LLM Client · net/http → LM Studio
PDF · pdftotext (poppler-utils)
Tokenizer · tiktoken-go (cl100k_base)
Math · Pure Go (k-means++, cosine sim)

LM Studio Models

Chat · qwen3-4b-2507
Embeddings · nomic-embed-text-v1.5 (768-dim)

Your Knowledge,
Refined.

Fully local. Evidence-native. LLM-powered. Installable.
Clone, install, launch — from raw documents to searchable knowledge in minutes.

97 Tests Passing · 0 Compiler Warnings · ~90 Source Files · 100% Local
📁 daemon-go/
📱 apps/macos/KnowledgeRefinery/
🧪 daemon-go/internal/*/
🛠 scripts/install.sh
git clone <repo> && cd LongLocalTimeHorizonInfoRetrieval && make install
Built with SwiftUI, Go, chi, SQLite, WebGPU, and local LLMs