# Knowledge Refinery Data Model

## Entity Relationship

```
WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (SQLite)    │
    │    768-dim BLOB       │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)
```

## Tables (SQLite)

### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`

### content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary.
Each atom has an evidence_anchor linking to exact source location.

### chunks
Deterministic text segments (500-800 tokens). IDs are stable across re-processing.
Linked to vectors in `chunk_vectors` table via chunk ID.

### chunk_vectors
Embedding vectors stored as binary BLOBs (768 x float32 = 3072 bytes per vector).
Loaded into memory at startup for brute-force cosine similarity search.

| Field | Type | Description |
|-------|------|-------------|
| id | TEXT PRIMARY KEY | Matches chunks.id |
| vector | BLOB | 768-dim float32 embedding |
| text | TEXT | Chunk text |
| asset_id | TEXT | Source file |
| asset_path | TEXT | File path |
| evidence_anchor | TEXT | JSON anchor |
| pipeline_version | TEXT | Version tag |

### annotations
LLM-generated structured metadata per chunk. **Never overwritten** - new annotations
are added with `is_current=1` and previous ones marked `is_current=0`.
Versioned by model_id + prompt_id + prompt_version.

### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 = macro concepts, higher levels = finer granularity.

### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.

### pipeline_jobs
Crash recovery: tracks job state so processing resumes after restart.

## Live Progress State (In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:

```json
{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}
```

### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries.

### Enriched Status Counters

| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |

## Evidence Anchors

Every derived artifact links back to source via JSON evidence anchors:

```json
{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}
```