KnowledgeRefinery/docs/data-model.md

# Knowledge Refinery Data Model

## Entity Relationship

```
WatchedVolume
    │
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (LanceDB)   │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         │
         ▼
    ConceptNode
    (hierarchical)
```

## Tables

### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`
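
The status progression can be modeled as an ordered enumeration. The following is a minimal sketch, not the project's actual code: the `AssetStatus` class and `next_status` helper are hypothetical names, and only the six status strings come from this document.

```python
from enum import Enum
from typing import Optional


class AssetStatus(str, Enum):
    """Status values for file_assets, in pipeline order (names are hypothetical)."""
    PENDING = "pending"
    EXTRACTED = "extracted"
    CHUNKED = "chunked"
    EMBEDDED = "embedded"
    ANNOTATED = "annotated"
    CONCEPTUALIZED = "conceptualized"


# Enum members iterate in definition order, which matches the pipeline order.
_ORDER = list(AssetStatus)


def next_status(current: AssetStatus) -> Optional[AssetStatus]:
    """Return the status that follows `current`, or None at the final stage."""
    idx = _ORDER.index(current)
    return _ORDER[idx + 1] if idx + 1 < len(_ORDER) else None
```

Keeping the order in one place means a stage can only advance to its immediate successor, which makes skipped stages easy to detect.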

### content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary.
Each atom has an `evidence_anchor` that links to its exact source location.

### chunks
Deterministic text segments of 500-800 tokens each. IDs are stable across re-processing.
Each chunk is linked to its LanceDB vector via `embedding_id`.
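
One common way to get IDs that stay stable across re-processing is to hash the chunk's source position and text. This document does not say how IDs are actually derived, so the sketch below is an assumption: the `chunk_id` function, its inputs, and the 32-character truncation are all hypothetical.

```python
import hashlib


def chunk_id(asset_id: str, atom_index: int, chunk_index: int, text: str) -> str:
    """Derive an ID that depends only on the chunk's position and content.

    Hypothetical scheme: the real ID derivation is not specified in this document.
    """
    h = hashlib.sha256()
    for part in (asset_id, str(atom_index), str(chunk_index), text):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()[:32]
```

Because the function has no dependence on time, randomness, or insertion order, re-running the pipeline on unchanged input yields the same IDs.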

### annotations
LLM-generated structured metadata per chunk. **Never overwritten**: each new annotation
is inserted with `is_current=1`, and the previous one is marked `is_current=0`.
Versioned by `model_id` + `prompt_id` + `prompt_version`.
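
The append-only rule can be sketched with the standard `sqlite3` module. Only `is_current`, `model_id`, `prompt_id`, and `prompt_version` are named in this document; the `id`, `chunk_id`, and `payload` columns and the `add_annotation` helper are assumptions made for illustration.

```python
import sqlite3


def add_annotation(conn, chunk_id, model_id, prompt_id, prompt_version, payload):
    """Insert a new current annotation and retire the previous one atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE annotations SET is_current = 0 "
            "WHERE chunk_id = ? AND is_current = 1",
            (chunk_id,),
        )
        conn.execute(
            "INSERT INTO annotations "
            "(chunk_id, model_id, prompt_id, prompt_version, payload, is_current) "
            "VALUES (?, ?, ?, ?, ?, 1)",
            (chunk_id, model_id, prompt_id, prompt_version, payload),
        )


# Hypothetical table layout, in memory for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations ("
    "id INTEGER PRIMARY KEY, chunk_id TEXT, model_id TEXT, prompt_id TEXT, "
    "prompt_version TEXT, payload TEXT, is_current INTEGER)"
)
add_annotation(conn, "c1", "model-a", "summarize", "1", '{"topics": ["x"]}')
add_annotation(conn, "c1", "model-a", "summarize", "2", '{"topics": ["y"]}')

# Both rows survive; only the newest is flagged as current.
rows = conn.execute(
    "SELECT prompt_version, is_current FROM annotations ORDER BY id"
).fetchall()
```

Running both statements in one transaction avoids a window in which a chunk has zero or two current annotations.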

### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 holds macro concepts; higher levels hold progressively finer-grained ones.

### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.

### pipeline_jobs
Supports crash recovery: tracks the state of each job so that processing can resume after a restart.
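
A resume-on-restart step might look like the sketch below. The column names (`id`, `stage`, `state`), the state values, and the `recover_interrupted_jobs` helper are all assumptions; this document only states that job state is tracked.

```python
import sqlite3


def recover_interrupted_jobs(conn: sqlite3.Connection) -> int:
    """Requeue jobs left in 'running' by a crash; return how many were reset."""
    with conn:  # single transaction
        cur = conn.execute(
            "UPDATE pipeline_jobs SET state = 'pending' WHERE state = 'running'"
        )
    return cur.rowcount


# Hypothetical table layout, in memory for demonstration only.
jobs_db = sqlite3.connect(":memory:")
jobs_db.execute(
    "CREATE TABLE pipeline_jobs (id INTEGER PRIMARY KEY, stage TEXT, state TEXT)"
)
jobs_db.executemany(
    "INSERT INTO pipeline_jobs (stage, state) VALUES (?, ?)",
    [("extract", "done"), ("chunk", "running"), ("embed", "pending")],
)
jobs_db.commit()

# Simulates daemon startup after a crash mid-"chunk".
requeued = recover_interrupted_jobs(jobs_db)
```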

## Live Progress State (M8, In-Memory)

During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:

### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:

```json
{
  "scan":          {"status": "done",    "progress_pct": 100},
  "extract":       {"status": "running", "progress_pct": 72},
  "chunk":         {"status": "pending", "progress_pct": 0},
  "embed":         {"status": "pending", "progress_pct": 0},
  "annotate":      {"status": "pending", "progress_pct": 0},
  "conceptualize": {"status": "pending", "progress_pct": 0}
}
```

Each stage transitions through `pending` → `running` → `done`.
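
As an illustration, the structure above could be maintained as follows. The stage names and field names come from the example response; the `new_live_progress` and `update_stage` helpers, and the rule that a stage becomes `done` at 100%, are assumptions.

```python
STAGES = ["scan", "extract", "chunk", "embed", "annotate", "conceptualize"]


def new_live_progress() -> dict:
    """Create the per-stage dict with every stage pending at 0%."""
    return {stage: {"status": "pending", "progress_pct": 0} for stage in STAGES}


def update_stage(live: dict, stage: str, done: int, total: int) -> None:
    """Record progress for one stage; mark it done once all items are processed."""
    # An empty stage (total == 0) counts as complete; otherwise use integer percent.
    pct = 100 if total == 0 else min(100, done * 100 // total)
    live[stage]["progress_pct"] = pct
    live[stage]["status"] = "done" if pct >= 100 else "running"
```

Calling `update_stage(live, "extract", 36, 50)` on a fresh dict sets `extract` to `running` at 72%, matching the example response.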

### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:

```json
{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}
```

The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.
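
In Python, a bounded `collections.deque` gives this behavior directly. The sketch below assumes a hypothetical `ActivityLog` class; the capacity of 200, the API limit of 50, and the entry shape are taken from the description above.

```python
from collections import deque
from datetime import datetime, timezone


class ActivityLog:
    """Fixed-size event log; the oldest entries are dropped once capacity is reached."""

    def __init__(self, capacity: int = 200, api_limit: int = 50):
        self._entries = deque(maxlen=capacity)  # deque discards from the left when full
        self._api_limit = api_limit

    def record(self, message: str) -> None:
        ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        self._entries.append({"timestamp": ts, "message": message})

    def recent(self) -> list:
        """Return the newest entries, oldest first, capped at the API limit."""
        return list(self._entries)[-self._api_limit:]

    def reset(self) -> None:
        """Clear the log, e.g. at the start of a new pipeline execution."""
        self._entries.clear()
```

Memory use is bounded by the capacity regardless of how many events a long run produces.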

### Enriched Status Counters
The `/ingest/status` response includes running totals updated as each stage completes:

| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |

## Evidence Anchors

Every derived artifact links back to its source through a JSON evidence anchor:

```json
{
    "asset_id": "abc123...",
    "page": 5,
    "bbox": [100, 200, 400, 250],
    "chapter": "Introduction",
    "offset": 1024,
    "archive_chain": "docs.zip::papers/paper.pdf::page=5",
    "line_start": 42,
    "line_end": 58
}
```
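
An anchor like the one above can be parsed into a typed record. This sketch assumes that only `asset_id` is always present and that the other fields depend on the source type; this document does not state which fields are required, and the `EvidenceAnchor` class is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvidenceAnchor:
    asset_id: str
    # Assumed optional: which fields apply likely depends on the source file type.
    page: Optional[int] = None
    bbox: Optional[List[float]] = None
    chapter: Optional[str] = None
    offset: Optional[int] = None
    archive_chain: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None

    @classmethod
    def from_json(cls, raw: str) -> "EvidenceAnchor":
        """Parse a stored anchor; unknown keys raise TypeError."""
        return cls(**json.loads(raw))

    def to_json(self) -> str:
        """Serialize, omitting unset fields so the stored anchor stays compact."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})
```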

## Vector Schema (LanceDB)

| Field | Type | Description |
|-------|------|-------------|
| id | string | Matches `chunks.id` |
| vector | float32[] | Embedding vector |
| text | string | Chunk text |
| asset_id | string | Source file |
| asset_path | string | File path |
| evidence_anchor | string | Evidence anchor serialized as JSON |
| topics | string | Comma-separated topics |
| atom_type | string | Content atom type (text, image, table, metadata, binary) |
| pipeline_version | string | Version tag |
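
To show how a chunk maps onto these fields, the sketch below builds one row as a plain dictionary. The `make_vector_row` helper is hypothetical, it does not call LanceDB, and the exact delimiter used for `topics` (a bare comma here) is an assumption.

```python
import json
from typing import Iterable, Sequence


def make_vector_row(
    chunk_id: str,
    vector: Sequence[float],
    text: str,
    asset_id: str,
    asset_path: str,
    anchor: dict,
    topics: Iterable[str],
    atom_type: str,
    pipeline_version: str,
) -> dict:
    """Build one row whose keys match the vector schema table above."""
    return {
        "id": chunk_id,                             # same value as chunks.id
        "vector": [float(x) for x in vector],
        "text": text,
        "asset_id": asset_id,
        "asset_path": asset_path,
        "evidence_anchor": json.dumps(anchor),      # stored as a JSON string
        "topics": ",".join(topics),                 # stored as one delimited string
        "atom_type": atom_type,
        "pipeline_version": pipeline_version,
    }
```

Storing `evidence_anchor` and `topics` as strings keeps every column a primitive type, at the cost of parsing them again on read.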