mirror of
https://github.com/saymrwulf/KnowledgeRefinery.git
synced 2026-05-14 20:47:51 +00:00
128 lines
4 KiB
Markdown
128 lines
4 KiB
Markdown
|
|
# Knowledge Refinery Data Model
|
||
|
|
|
||
|
|
## Entity Relationship
|
||
|
|
|
||
|
|
```
|
||
|
|
WatchedVolume
|
||
|
|
│
|
||
|
|
▼ (contains files)
|
||
|
|
FileAsset ──────────────────┐
|
||
|
|
│ │
|
||
|
|
▼ (extracted into) │
|
||
|
|
ContentAtom │
|
||
|
|
│ │
|
||
|
|
▼ (split into) │
|
||
|
|
Chunk ──────────────────────┤
|
||
|
|
│ │
|
||
|
|
├──▶ Vector (LanceDB) │
|
||
|
|
│ │
|
||
|
|
├──▶ Annotation │
|
||
|
|
│ (versioned) │
|
||
|
|
│ │
|
||
|
|
└──▶ GraphEdge ◀────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
ConceptNode
|
||
|
|
(hierarchical)
|
||
|
|
```
|
||
|
|
|
||
|
|
## Tables
|
||
|
|
|
||
|
|
### file_assets
|
||
|
|
Tracks every file in watched volumes. Status progresses through:
|
||
|
|
`pending` → `extracted` → `chunked` → `embedded` → `annotated` → `conceptualized`
|
||
|
|
|
||
|
|
### content_atoms
|
||
|
|
Raw content extracted from files. Types: text, image, table, metadata, binary.
|
||
|
|
Each atom has an evidence_anchor linking to exact source location.
|
||
|
|
|
||
|
|
### chunks
|
||
|
|
Deterministic text segments (500-800 tokens). IDs are stable across re-processing.
|
||
|
|
Linked to LanceDB vectors via embedding_id.
|
||
|
|
|
||
|
|
### annotations
|
||
|
|
LLM-generated structured metadata per chunk. **Never overwritten** - new annotations
|
||
|
|
are added with `is_current=1` and previous ones marked `is_current=0`.
|
||
|
|
Versioned by model_id + prompt_id + prompt_version.
|
||
|
|
|
||
|
|
### concept_nodes
|
||
|
|
Hierarchical concept clusters derived from embedding similarity.
|
||
|
|
Level 0 = macro concepts, higher levels = finer granularity.
|
||
|
|
|
||
|
|
### graph_edges
|
||
|
|
Typed, weighted edges: similarity, concept membership, co-occurrence.
|
||
|
|
Each edge stores evidence references back to source chunks.
|
||
|
|
|
||
|
|
### pipeline_jobs
|
||
|
|
Crash recovery: tracks job state so processing resumes after restart.
|
||
|
|
|
||
|
|
## Live Progress State (M8, In-Memory)
|
||
|
|
|
||
|
|
During pipeline execution, the daemon maintains ephemeral in-memory structures that are not persisted to SQLite:
|
||
|
|
|
||
|
|
### Live Progress Dict
|
||
|
|
Per-stage status object returned in the `live` field of `/ingest/status`:
|
||
|
|
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"scan": {"status": "done", "progress_pct": 100},
|
||
|
|
"extract": {"status": "running", "progress_pct": 72},
|
||
|
|
"chunk": {"status": "pending", "progress_pct": 0},
|
||
|
|
"embed": {"status": "pending", "progress_pct": 0},
|
||
|
|
"annotate": {"status": "pending", "progress_pct": 0},
|
||
|
|
"conceptualize": {"status": "pending", "progress_pct": 0}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
Each stage transitions through `pending` -> `running` -> `done`.
|
||
|
|
|
||
|
|
### Activity Log Ring Buffer
|
||
|
|
A fixed-size circular buffer (200 entries) that records pipeline events. The API returns the most recent 50 entries. Each entry contains a timestamp and message string:
|
||
|
|
|
||
|
|
```json
|
||
|
|
{"timestamp": "2026-02-12T10:30:03Z", "message": "Found 47 files, 12 new"}
|
||
|
|
```
|
||
|
|
|
||
|
|
The ring buffer prevents unbounded memory growth during long pipeline runs. It is reset at the start of each new pipeline execution.
|
||
|
|
|
||
|
|
### Enriched Status Counters
|
||
|
|
The `/ingest/status` response includes running totals updated as each stage completes:
|
||
|
|
|
||
|
|
| Counter | Description |
|
||
|
|
|---------|-------------|
|
||
|
|
| `chunk_count` | Total chunks produced so far |
|
||
|
|
| `annotation_count` | Total annotations generated |
|
||
|
|
| `concept_count` | Total concept nodes created |
|
||
|
|
| `edge_count` | Total graph edges created |
|
||
|
|
|
||
|
|
## Evidence Anchors
|
||
|
|
|
||
|
|
Every derived artifact links back to source via JSON evidence anchors:
|
||
|
|
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"asset_id": "abc123...",
|
||
|
|
"page": 5,
|
||
|
|
"bbox": [100, 200, 400, 250],
|
||
|
|
"chapter": "Introduction",
|
||
|
|
"offset": 1024,
|
||
|
|
"archive_chain": "docs.zip::papers/paper.pdf::page=5",
|
||
|
|
"line_start": 42,
|
||
|
|
"line_end": 58
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Vector Schema (LanceDB)
|
||
|
|
|
||
|
|
| Field | Type | Description |
|
||
|
|
|-------|------|-------------|
|
||
|
|
| id | string | Matches chunks.id |
|
||
|
|
| vector | float32[] | Embedding vector |
|
||
|
|
| text | string | Chunk text |
|
||
|
|
| asset_id | string | Source file |
|
||
|
|
| asset_path | string | File path |
|
||
|
|
| evidence_anchor | string | JSON anchor |
|
||
|
|
| topics | string | Comma-separated topics |
|
||
|
|
| atom_type | string | text/image/etc |
|
||
|
|
| pipeline_version | string | Version tag |
|