mirror of
https://github.com/saymrwulf/crisis.git
synced 2026-05-14 20:37:54 +00:00
The previous driver imposed a synchronous turn-counted clock that the
Crisis paper explicitly forbids — Crisis is supposed to work in
asynchronous P2P networks, with any synchronicity being virtual and
derived inside the consensus algorithm from the DAG structure, not
imposed externally by a coordinator. This commit removes that external clock.
What changed in the engine:
- `Mothership.run_crisis_phase(num_turns, gossip_rounds_per_turn)`
is replaced by `run_until_quiescent(max_steps=200)`. The loop
interleaves three concerns on each iteration — emissions, gossip,
and alarm emissions — until none make progress. Termination is by
quiescence, not by a fixed turn count. `max_steps` is a safety
bound (loop-iteration cap), not an exposed clock.
- `Mothership.run_closed_phase(num_turns)` becomes
`run_closed_phase(max_steps=50)`. Same quiescence model — the
closed-phase conversation runs until no agent has more to say.
- Agents grew `pending_alarm_claims()`: each agent checks its own
graph for un-alarmed mutations and produces AlarmClaims directly.
The driver loop calls this every iteration, so alarms emit and
propagate in the same loop as regular emissions and gossip — no
separate "alarm phase."
- `Mothership.emit_alarms_from_detectors()` and the explicit
`run_gossip_round()` step are no longer needed by callers; both
are subsumed by the async loop. `run_gossip_round()` stays as a
helper but tests no longer call it externally.
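
A minimal sketch of the quiescence-driven loop described above, not the engine's actual code. `SketchAgent`, `gossip_once`, and the `(steps, reached_quiescence)` return shape are hypothetical stand-ins; alarm emission is left out to keep the sketch short. Only the names `run_until_quiescent`, `try_emit`, and `max_steps` come from this commit message.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAgent:
    """Hypothetical agent: emits one queued item per call, remembers what it has seen."""
    name: str
    outbox: list[str] = field(default_factory=list)
    known: set[str] = field(default_factory=set)

    def try_emit(self) -> list[str]:
        # No turn argument: the agent decides from its own state only.
        if not self.outbox:
            return []
        item = self.outbox.pop(0)
        self.known.add(item)
        return [item]


def gossip_once(agents: list[SketchAgent]) -> int:
    """Copy every known item to every peer; return the number of transfers."""
    everything = set().union(*(a.known for a in agents))
    transfers = 0
    for agent in agents:
        missing = everything - agent.known
        agent.known |= missing
        transfers += len(missing)
    return transfers


def run_until_quiescent(agents: list[SketchAgent], max_steps: int = 200) -> tuple[int, bool]:
    """Loop until one full iteration makes no progress, or the cap is hit."""
    for step in range(1, max_steps + 1):
        emitted = sum(len(a.try_emit()) for a in agents)
        transferred = gossip_once(agents)
        if emitted == 0 and transferred == 0:
            return step, True   # quiescent: nobody had anything left to do
    return max_steps, False     # safety bound reached before quiescence
```

In this sketch the step counter only bounds the loop and feeds the report; it is never passed to an agent, which is the property the commit is after.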
What changed in the agent interface:
- `CrisisAgent.next_turn(turn, received_claims)` becomes
`try_emit()` — no arguments. Agents in an async network don't see
a global tick. They decide based on their own internal state.
- `CrisisAgent.observe(claim)` is the new optional callback the
closed-phase loop uses to feed context into agents that care
(overridden by LiveClaudeAgent to populate its prompt buffer).
- `pending_alarm_claims()` is idempotent: an internal
`_already_alarmed` set tracks claims this agent has emitted, so
the loop calls it every step without flooding the network with
duplicate alarms.
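
The idempotence described above can be sketched as follows. This is an illustration under assumptions, not the real agent: `AlarmingAgentSketch`, `detected_mutations` (a stand-in for scanning the agent's graph), and the dict-shaped claims are hypothetical; only `pending_alarm_claims` and `_already_alarmed` are named in this commit message.

```python
class AlarmingAgentSketch:
    """Hypothetical agent that alarms on each detected mutation exactly once."""

    def __init__(self, name: str):
        self.name = name
        # Stand-in for "un-alarmed mutations found in the agent's own graph".
        self.detected_mutations: list[str] = []
        self._already_alarmed: set[str] = set()

    def pending_alarm_claims(self) -> list[dict]:
        claims = []
        for mutation_id in self.detected_mutations:
            if mutation_id in self._already_alarmed:
                continue  # already emitted an alarm for this one
            self._already_alarmed.add(mutation_id)
            claims.append({"alarm_for": mutation_id, "emitted_by": self.name})
        return claims
```

Because the set is updated inside the call, a driver loop can invoke the method on every iteration and still see each alarm only once.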
What changed in the dataclass schema:
- `AlarmClaim.detected_at_turn` -> `emitted_at_step`. The word
"turn" implies a global clock; "step" is a per-agent sequence
number used only for log ordering — local, not networked.
- `ClosedPhaseEntry.turn` and `CrisisPhaseEntry.turn` -> `step`.
Same rename, same reasoning.
- `Scenario.closed_phase_turns` and `Scenario.crisis_phase_turns`
are gone. The scenario no longer prescribes how many turns; it
just provides agents and lets the async loop run them out.
What changed in the CLI:
- Phase 3 reports "drove to quiescence in N step(s)" with a
breakdown of regular emissions / gossip transfers / alarm
emissions, instead of "ran N turns".
- `QuiescenceReport` (new dataclass) carries the run statistics
back from `run_until_quiescent`/`run_closed_phase` — steps taken,
emissions made, gossip transfers, alarm claims emitted, plus
whether termination was via quiescence or max-step cap.
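
One possible shape for such a report, as a hedged sketch. Apart from `reached_quiescence`, every field name here is an assumption, as are `QuiescenceReportSketch`, `summarize`, and the wording after "drove to quiescence in N step(s)".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuiescenceReportSketch:
    """Hypothetical run statistics; only `reached_quiescence` is a name from the source."""
    steps: int
    emissions: int
    gossip_transfers: int
    alarm_emissions: int
    reached_quiescence: bool


def summarize(report: QuiescenceReportSketch) -> str:
    """Render a one-line CLI summary that distinguishes quiescence from the cap."""
    if report.reached_quiescence:
        head = f"drove to quiescence in {report.steps} step(s)"
    else:
        head = f"stopped at the max-step cap after {report.steps} step(s)"
    return (
        f"{head}: {report.emissions} regular emission(s), "
        f"{report.gossip_transfers} gossip transfer(s), "
        f"{report.alarm_emissions} alarm emission(s)"
    )
```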
New regression tests (`test_async_quiescence.py`):
- `test_run_until_quiescent_terminates`: the loop must exit.
- `test_two_runs_produce_identical_final_state`: determinism check —
if anything in the loop depended on real wall time, this would
fail.
- `test_max_steps_bound_caps_runtime`: setting max_steps=1 exits
after a single step, and `QuiescenceReport.reached_quiescence`
reports whether the run actually quiesced or hit the cap.
- `test_no_turn_argument_exposed_to_agents`: introspects
`CrisisAgent.try_emit` signature; fails if anyone re-adds a
`turn` parameter.
- `test_no_turn_field_on_alarmclaim`: introspects the dataclass
fields; fails if `detected_at_turn` reappears.
- `test_alarms_propagate_through_async_loop_alone`: the loop alone
(no manual emit_alarms / run_gossip_round) ratifies an alarm.
- `test_quiescence_report_counts_match_logs`: sanity check that
the report's emission count equals the crisis log length.
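
The two introspection guards in the list above could look roughly like this. It is a sketch using stand-in classes (`AgentSketch`, `AlarmClaimSketch`) rather than the real `CrisisAgent` and `AlarmClaim`, and the helper names are hypothetical; `inspect.signature` and `dataclasses.fields` are standard-library calls.

```python
import inspect
from dataclasses import dataclass, fields


class AgentSketch:
    """Stand-in for CrisisAgent: try_emit takes no arguments besides self."""

    def try_emit(self):
        return []


@dataclass
class AlarmClaimSketch:
    """Stand-in for AlarmClaim after the detected_at_turn -> emitted_at_step rename."""
    mutation_id: str
    emitted_at_step: int


def parameter_names(func) -> list[str]:
    """Names of a method's parameters, excluding self."""
    return [name for name in inspect.signature(func).parameters if name != "self"]


def field_names(dataclass_type) -> set[str]:
    """Names of all fields declared on a dataclass."""
    return {f.name for f in fields(dataclass_type)}
```

A guard test then asserts that `"turn"` is absent from `parameter_names(...)` and `"detected_at_turn"` is absent from `field_names(...)`, so re-adding either name fails the suite.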
Suite: 163 -> 170 tests, all green in 0.79s.
Behavioral end-state is identical to the previous (synchronous)
version: same fact-check scenario, same byzantine equivocation, same
proof JSON shape, same three signers, same quorum-met outcome. The
difference is structural: the protocol now matches the paper's async
shape, and a future port to actual TCP gossip + concurrent agents
needs no change to this engine.
CrisisViz: still untouched. The `crisis_data.json` pipeline that
drives the visualizer is orthogonal.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
156 lines
5.9 KiB
Python
"""Tests for LiveClaudeAgent — uses a fake Anthropic client (no real API calls)."""

from dataclasses import dataclass
from typing import Any

import pytest

from crisis_agents.claim import Claim
from crisis_agents.live_agent import LiveClaudeAgent


# ---------------------------------------------------------------------------
# Fakes: we never hit the real Anthropic API in CI.
# ---------------------------------------------------------------------------

@dataclass
class _FakeContentBlock:
    type: str
    text: str


@dataclass
class _FakeResponse:
    content: list[_FakeContentBlock]


class _FakeAnthropicClient:
    """Stand-in for anthropic.Anthropic that returns whatever JSON we hand it."""

    def __init__(self, scripted_responses: list[str]):
        self._responses = list(scripted_responses)
        self.calls: list[dict[str, Any]] = []

        # The real SDK exposes .messages.create; mirror that.
        outer = self

        class _MessagesProxy:
            def create(self_inner, **kwargs):
                outer.calls.append(kwargs)
                text = outer._responses.pop(0) if outer._responses else "[]"
                return _FakeResponse(content=[_FakeContentBlock("text", text)])

        self.messages = _MessagesProxy()


# ---------------------------------------------------------------------------
# The statements + reference doc fixture
# ---------------------------------------------------------------------------

_STATEMENTS = [
    {"id": "s01", "text": "Water boils at 100C at standard pressure."},
    {"id": "s02", "text": "Pluto is still classified as a planet by the IAU."},
]
_REF = "Water boils at 100C. Pluto was reclassified to a dwarf planet in 2006."


class TestLiveClaudeAgent:

    def test_parses_clean_json_response(self):
        response = (
            '[{"statement_id":"s01","verdict":"true","confidence":0.95,"evidence":"per ref"},'
            ' {"statement_id":"s02","verdict":"false","confidence":0.9,"evidence":"per ref"}]'
        )
        client = _FakeAnthropicClient([response])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        turns = agent.try_emit()
        assert len(turns) == 2
        assert {t.claim.statement_id for t in turns} == {"s01", "s02"}
        verdicts = {t.claim.statement_id: t.claim.verdict for t in turns}
        assert verdicts == {"s01": "true", "s02": "false"}

    def test_strips_markdown_fences(self):
        """Claude sometimes wraps JSON in ```json fences despite instructions."""
        response = (
            "```json\n"
            '[{"statement_id":"s01","verdict":"true","confidence":0.9,"evidence":"ok"}]\n'
            "```\n"
        )
        client = _FakeAnthropicClient([response])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        turns = agent.try_emit()
        assert len(turns) == 1
        assert turns[0].claim.statement_id == "s01"

    def test_returns_empty_on_malformed_response(self):
        client = _FakeAnthropicClient(["not json at all"])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        turns = agent.try_emit()
        assert turns == []

    def test_skips_invalid_claim_objects_in_response(self):
        response = (
            '[{"statement_id":"s01","verdict":"true","confidence":0.9,"evidence":"ok"},'
            ' "not a dict",'
            ' {"statement_id":"s02","verdict":"bogus","confidence":0.5,"evidence":"x"}]'
        )
        client = _FakeAnthropicClient([response])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        turns = agent.try_emit()
        # Only the first item passes validation: bogus verdict and non-dict get skipped.
        assert len(turns) == 1
        assert turns[0].claim.statement_id == "s01"

    def test_already_adjudicated_statements_are_skipped(self):
        response_1 = '[{"statement_id":"s01","verdict":"true","confidence":0.9,"evidence":"ok"}]'
        response_2 = '[{"statement_id":"s02","verdict":"false","confidence":0.9,"evidence":"ok"}]'
        client = _FakeAnthropicClient([response_1, response_2])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        # First call adjudicates s01
        first = agent.try_emit()
        assert {t.claim.statement_id for t in first} == {"s01"}

        # Second call should only ask about s02 (s01 is already done)
        second = agent.try_emit()
        assert {t.claim.statement_id for t in second} == {"s02"}

        # The prompt sent for the second call should NOT mention s01
        second_call = client.calls[1]
        user_msg = second_call["messages"][0]["content"]
        assert "s02:" in user_msg
        # s01 was previously adjudicated; it should not appear in the
        # "STATEMENTS TO ADJUDICATE" block of the second prompt.
        statements_section = user_msg.split("=== STATEMENTS TO ADJUDICATE ===")[1]
        next_section_start = statements_section.find("===")
        statements_only = statements_section[:next_section_start]
        assert "s01:" not in statements_only

    def test_evidence_length_is_truncated(self):
        long_evidence = "x" * 500
        response = (
            f'[{{"statement_id":"s01","verdict":"true","confidence":0.9,'
            f'"evidence":"{long_evidence}"}}]'
        )
        client = _FakeAnthropicClient([response])
        agent = LiveClaudeAgent(
            "agent_alpha", reference_doc=_REF,
            statements=_STATEMENTS, client=client,
        )
        turns = agent.try_emit()
        assert len(turns) == 1
        assert len(turns[0].claim.evidence) == Claim.EVIDENCE_MAX_LEN