# Knowledge Refinery Data Model
## Entity Relationship
```
WatchedVolume
    ▼ (contains files)
FileAsset ──────────────────┐
    │                       │
    ▼ (extracted into)      │
ContentAtom                 │
    │                       │
    ▼ (split into)          │
Chunk ──────────────────────┤
    │                       │
    ├──▶ Vector (SQLite)    │
    │    768-dim BLOB       │
    │                       │
    ├──▶ Annotation         │
    │    (versioned)        │
    │                       │
    └──▶ GraphEdge ◀────────┘
         ConceptNode
         (hierarchical)
```
## Tables (SQLite)
### file_assets
Tracks every file in watched volumes. Status progresses through:
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`
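
The progression above can be sketched as an ordered tuple with a helper that returns the following stage. This is a minimal illustration, not the daemon's actual code: the names `STATUSES` and `next_status` are hypothetical, and the sketch assumes statuses only ever move forward one step at a time.

```python
# Status values in the order listed in this document.
STATUSES = (
    "pending",
    "extracted",
    "chunked",
    "embedded",
    "annotated",
    "conceptualized",
)


def next_status(current):
    """Return the status that follows `current`, or None at the final stage."""
    index = STATUSES.index(current)  # raises ValueError for an unknown status
    if index + 1 < len(STATUSES):
        return STATUSES[index + 1]
    return None
```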
### content_atoms
Raw content extracted from files. Types: text, image, table, metadata, binary.
Each atom has an `evidence_anchor` linking to its exact source location.
### chunks
Deterministic text segments (500-800 tokens). IDs are stable across re-processing.
Each chunk is linked to its vector in the `chunk_vectors` table via the chunk ID.
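
One common way to get IDs that stay stable across re-processing is to derive them from a hash of the chunk's inputs. The sketch below illustrates that idea only; the choice of SHA-256, the hashed fields (`asset_id`, `chunk_index`, `text`), and the 32-character truncation are all assumptions, not the scheme this project necessarily uses.

```python
import hashlib


def chunk_id(asset_id, chunk_index, text):
    """Derive a deterministic ID from the chunk's inputs (hypothetical scheme)."""
    digest = hashlib.sha256()
    for part in (asset_id, str(chunk_index), text):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so field boundaries stay unambiguous
    return digest.hexdigest()[:32]
```

Because the ID depends only on its inputs, re-running the pipeline over unchanged content yields the same IDs, so existing rows in `chunk_vectors` can still be matched.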
### chunk_vectors
Embedding vectors stored as binary BLOBs (768 x float32 = 3072 bytes per vector).
Loaded into memory at startup for brute-force cosine similarity search.
| Field | Type | Description |
|-------|------|-------------|
| id | TEXT PRIMARY KEY | Matches chunks.id |
| vector | BLOB | 768-dim float32 embedding |
| text | TEXT | Chunk text |
| asset_id | TEXT | Source file |
| asset_path | TEXT | File path |
| evidence_anchor | TEXT | JSON anchor |
| pipeline_version | TEXT | Version tag |
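
To make the storage format and search concrete, here is a minimal standard-library sketch of packing a 768-dim float32 vector into a 3072-byte BLOB and ranking in-memory vectors by cosine similarity. The function names are hypothetical, and the little-endian byte order is an assumption; the real daemon may use a different encoding or a numeric library.

```python
import math
import struct

DIM = 768
_FORMAT = f"<{DIM}f"  # little-endian float32; byte order is an assumption


def pack_vector(values):
    """Encode a sequence of 768 floats as a 3072-byte BLOB."""
    if len(values) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(values)}")
    return struct.pack(_FORMAT, *values)


def unpack_vector(blob):
    """Decode a 3072-byte BLOB back into a tuple of 768 floats."""
    return struct.unpack(_FORMAT, blob)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, top_k=5):
    """Brute-force scan over {chunk_id: vector}; best matches first."""
    scored = [(cosine(query, vec), cid) for cid, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(cid, score) for score, cid in scored[:top_k]]
```

A brute-force scan is linear in the number of chunks, which is why the vectors are held in memory rather than read from SQLite on every query.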
### annotations
LLM-generated structured metadata per chunk. **Never overwritten**: each new annotation
is inserted with `is_current=1`, and the previous one is marked `is_current=0`.
Versioned by `model_id` + `prompt_id` + `prompt_version`.
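
The append-only behaviour can be illustrated with an in-memory SQLite database. This is a sketch under stated assumptions: only `is_current`, `model_id`, `prompt_id`, and `prompt_version` come from this document, while the remaining columns (`id`, `chunk_id`, `payload`) and the function names are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create a simplified annotations table (column set is an assumption)."""
    conn.execute(
        """
        CREATE TABLE annotations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            chunk_id TEXT NOT NULL,
            model_id TEXT NOT NULL,
            prompt_id TEXT NOT NULL,
            prompt_version TEXT NOT NULL,
            payload TEXT NOT NULL,
            is_current INTEGER NOT NULL DEFAULT 1
        )
        """
    )


def add_annotation(conn, chunk_id, model_id, prompt_id, prompt_version, payload):
    """Retire the current annotation for a chunk, then insert the new one."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "UPDATE annotations SET is_current = 0 "
            "WHERE chunk_id = ? AND is_current = 1",
            (chunk_id,),
        )
        conn.execute(
            "INSERT INTO annotations "
            "(chunk_id, model_id, prompt_id, prompt_version, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (chunk_id, model_id, prompt_id, prompt_version, payload),
        )
```

Running both statements in one transaction avoids a window in which a chunk has zero or two current annotations.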
### concept_nodes
Hierarchical concept clusters derived from embedding similarity.
Level 0 = macro concepts, higher levels = finer granularity.
### graph_edges
Typed, weighted edges: similarity, concept membership, co-occurrence.
Each edge stores evidence references back to source chunks.
### pipeline_jobs
Crash recovery: tracks job state so processing resumes after restart.
## Live Progress State (In-Memory)
During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:
### Live Progress Dict
Per-stage status object returned in the `live` field of `/ingest/status`:
```json
{
"scan": {"status": "done", "progress_pct": 100},
"extract": {"status": "running", "progress_pct": 72},
"chunk": {"status": "pending", "progress_pct": 0},
"embed": {"status": "pending", "progress_pct": 0},
"annotate": {"status": "pending", "progress_pct": 0},
"conceptualize": {"status": "pending", "progress_pct": 0}
}
```
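
A client consuming this object might summarize it as follows. The helper names are hypothetical, and weighting every stage equally in `overall_progress` is an assumption made for illustration; the daemon itself may not compute an overall figure at all.

```python
# Stage names in the order they appear in the `live` object.
STAGES = ("scan", "extract", "chunk", "embed", "annotate", "conceptualize")


def running_stage(live):
    """Return the first stage whose status is "running", or None."""
    for stage in STAGES:
        if live[stage]["status"] == "running":
            return stage
    return None


def overall_progress(live):
    """Average progress_pct across stages, weighting each stage equally."""
    return sum(live[stage]["progress_pct"] for stage in STAGES) / len(STAGES)
```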
### Activity Log Ring Buffer
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries.
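
In Python, a bounded `collections.deque` gives this behaviour directly. The sketch below is illustrative only: the `ActivityLog` class and its method names are hypothetical, and only the sizes (200 stored, 50 returned) come from this document.

```python
from collections import deque


class ActivityLog:
    """Fixed-size ring buffer of pipeline events (hypothetical class)."""

    def __init__(self, capacity=200, api_limit=50):
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=capacity)
        self._api_limit = api_limit

    def record(self, message):
        """Append an event, evicting the oldest one if the buffer is full."""
        self._entries.append(message)

    def recent(self):
        """Return the newest entries, oldest first, capped at api_limit."""
        return list(self._entries)[-self._api_limit:]
```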
### Enriched Status Counters
| Counter | Description |
|---------|-------------|
| `chunk_count` | Total chunks produced so far |
| `annotation_count` | Total annotations generated |
| `concept_count` | Total concept nodes created |
| `edge_count` | Total graph edges created |
## Evidence Anchors
Every derived artifact links back to its source via a JSON evidence anchor:
```json
{
"asset_id": "abc123...",
"page": 5,
"bbox": [100, 200, 400, 250],
"chapter": "Introduction",
"offset": 1024,
"archive_chain": "docs.zip::papers/paper.pdf::page=5",
"line_start": 42,
"line_end": 58
}
```
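
As an illustration of how a consumer might use such an anchor, the sketch below turns one into a short human-readable citation. The function name is hypothetical, and treating every field except `asset_id` as optional is an assumption; the document does not state which fields are required.

```python
import json


def describe_anchor(anchor_json):
    """Build a short citation string from a JSON evidence anchor."""
    anchor = json.loads(anchor_json)
    parts = [f"asset {anchor['asset_id']}"]
    if "page" in anchor:
        parts.append(f"page {anchor['page']}")
    if "line_start" in anchor and "line_end" in anchor:
        parts.append(f"lines {anchor['line_start']}-{anchor['line_end']}")
    return ", ".join(parts)
```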