Improve monograph UX and document operator workflow

2026-05-14 20:37:52 +00:00 · 2026-03-29 12:12:14 +02:00 · 640b7fd0a0
commit 640b7fd0a0
parent 7b0ba91114
3 changed files with 301 additions and 70 deletions
--- a/Makefile
+++ b/Makefile
@@ -0,0 +1,59 @@
+PYTHON := .venv/bin/python
+PIP := .venv/bin/pip
+DOMAINS ?= domains.local.txt
+CACHE_TTL ?= 0
+DNS_CACHE_TTL ?= 86400
+MAX_CANDIDATES ?= 10000
+
+.PHONY: bootstrap install init-config inventory purpose consolidated monograph all
+
+bootstrap:
+	python3 -m venv .venv
+	$(PIP) install -r requirements.txt
+
+install: bootstrap
+
+init-config:
+	test -f $(DOMAINS) || cp domains.example.txt $(DOMAINS)
+
+inventory:
+	$(PYTHON) ct_scan.py \
+		--domains-file $(DOMAINS) \
+		--cache-ttl-seconds $(CACHE_TTL) \
+		--max-candidates-per-domain $(MAX_CANDIDATES) \
+		--output output/corpus/current-valid-certificates.md \
+		--latex-output output/corpus/current-valid-certificates.tex \
+		--pdf-output output/corpus/current-valid-certificates.pdf
+
+purpose:
+	$(PYTHON) ct_usage_assessment.py \
+		--domains-file $(DOMAINS) \
+		--cache-ttl-seconds $(CACHE_TTL) \
+		--max-candidates $(MAX_CANDIDATES) \
+		--markdown-output output/corpus/certificate-purpose-assessment.md \
+		--json-output output/corpus/certificate-purpose-assessment.json
+
+consolidated:
+	$(PYTHON) ct_master_report.py \
+		--domains-file $(DOMAINS) \
+		--cache-ttl-seconds $(CACHE_TTL) \
+		--dns-cache-ttl-seconds $(DNS_CACHE_TTL) \
+		--max-candidates-per-domain $(MAX_CANDIDATES) \
+		--markdown-output output/corpus/consolidated-corpus-report.md \
+		--latex-output output/corpus/consolidated-corpus-report.tex \
+		--pdf-output output/corpus/consolidated-corpus-report.pdf
+
+monograph:
+	$(PYTHON) ct_monograph_report.py \
+		--domains-file $(DOMAINS) \
+		--cache-ttl-seconds $(CACHE_TTL) \
+		--dns-cache-ttl-seconds $(DNS_CACHE_TTL) \
+		--max-candidates-per-domain $(MAX_CANDIDATES) \
+		--markdown-output output/corpus/monograph.md \
+		--latex-output output/corpus/monograph.tex \
+		--pdf-output output/corpus/monograph.pdf \
+		--appendix-markdown-output output/corpus/appendix-inventory.md \
+		--appendix-latex-output output/corpus/appendix-inventory.tex \
+		--appendix-pdf-output output/corpus/appendix-inventory.pdf
+
+all: init-config purpose monograph
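
Every target above forwards `MAX_CANDIDATES` to its script as a candidate cap, and the README in this commit states that the scanner stops with an error when the live raw match count exceeds that cap. A minimal sketch of such a guard follows. The helper `enforce_candidate_cap` and the exception `CandidateCapExceeded` are hypothetical names chosen for illustration; the actual check inside `ct_scan.py` is not part of this diff and may be structured differently.

```python
class CandidateCapExceeded(RuntimeError):
    """Hypothetical error type: the cap is lower than the live match count."""


def enforce_candidate_cap(domain: str, raw_match_count: int, cap: int) -> None:
    """Raise instead of allowing a silently truncated result set."""
    if cap <= 0:
        raise ValueError("cap must be a positive integer")
    if raw_match_count > cap:
        raise CandidateCapExceeded(
            f"{domain}: {raw_match_count} raw matches exceed the cap of {cap}; "
            "raise MAX_CANDIDATES and rerun"
        )


# Illustrative values only. In a real run the count would come from the
# live Certificate Transparency query, not from a literal.
enforce_candidate_cap("example.test", raw_match_count=9_500, cap=10_000)
```

Failing before the capped query runs means an operator sees an explicit error and can rerun with, for example, `make monograph MAX_CANDIDATES=20000`, rather than publishing a report built on an incomplete corpus.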
--- a/README.md
+++ b/README.md
@@ -1,8 +1,14 @@
 # Certificate Transparency Search

-This project scans Certificate Transparency for currently valid leaf certificates whose SAN sets contain configured search terms, verifies the certificates locally, inspects revocation state, classifies intended usage from EKU and KeyUsage, and scans the public DNS names exposed by the certificate corpus.
+This project builds a publication-grade report set from Certificate Transparency and public DNS:

-The repository is designed for public source control:
+- it finds currently valid leaf certificates whose SAN values contain configured search terms
+- it verifies locally that the certificates are real leaf certificates rather than CA certificates or precertificates
+- it assesses intended usage from EKU and KeyUsage
+- it scans the DNS names exposed by the SAN corpus
+- it produces readable Markdown, LaTeX, and PDF outputs
+
+The project is designed for public source control:

 - real search terms live only in `domains.local.txt`
 - generated artefacts live only in `output/`
@@ -10,97 +16,213 @@ The repository is designed for public source control:

 None of those paths should be committed.

-## Setup
+## What You Need On A Fresh Machine
+
+### Required software
+
+- `git`
+- `python3`
+- `make`
+- `dig`
+- `xelatex`
+
+### What each dependency is for
+
+- `python3`: runs the scanners and report generators
+- `make`: gives you short repeatable commands instead of long manual command lines
+- `dig`: performs the live DNS scan
+- `xelatex`: compiles the PDF reports
+
+If `xelatex` is missing, the Markdown and LaTeX outputs can still be generated, but the PDF targets will fail.
+
+## Fresh Install On Another Computer
+
+Clone the repository from your chosen remote and enter the directory:

 ```bash
-python3 -m venv .venv
-.venv/bin/python -m pip install -r requirements.txt
-cp domains.example.txt domains.local.txt
+git clone <repository-url>
+cd CertTransparencySearch
 ```

-Edit `domains.local.txt` with the real search terms you want to scan.
+Create the local Python environment and install dependencies:

-## Safety Against Silent Undercounts
+```bash
+make bootstrap
+```

-The scanner now refuses to run if the configured per-domain candidate cap is lower than the live raw match count from crt.sh. This prevents silent truncation when the raw identity set is larger than the cap.
+Create the local-only search-term file:

-## Core Inventory Report
+```bash
+make init-config
+```
+
+Then edit `domains.local.txt` and replace the placeholder values with the real search terms you want to scan.
+
+## Local Search Terms
+
+The tracked file is:
+
+- `domains.example.txt`
+
+The local-only file is:
+
+- `domains.local.txt`
+
+Rules:
+
+- keep real search terms only in `domains.local.txt`
+- do not rename that file unless you also pass `DOMAINS=...` to `make`
+- do not commit it
+
+## One-Command Runs
+
+### Main publication
+
+This is the publication-grade monograph with appendices:
+
+```bash
+make monograph
+```
+
+Outputs:
+
+- `output/corpus/monograph.md`
+- `output/corpus/monograph.tex`
+- `output/corpus/monograph.pdf`
+- `output/corpus/appendix-inventory.md`
+- `output/corpus/appendix-inventory.tex`
+- `output/corpus/appendix-inventory.pdf`
+
+### Supporting purpose assessment
+
+```bash
+make purpose
+```
+
+Outputs:
+
+- `output/corpus/certificate-purpose-assessment.md`
+- `output/corpus/certificate-purpose-assessment.json`
+
+### Shorter executive report
+
+```bash
+make consolidated
+```
+
+Outputs:
+
+- `output/corpus/consolidated-corpus-report.md`
+- `output/corpus/consolidated-corpus-report.tex`
+- `output/corpus/consolidated-corpus-report.pdf`
+
+### Full operator run
+
+This creates the local config if missing, then runs the purpose assessment and the full monograph:
+
+```bash
+make all
+```
+
+## Reproducibility And Run Behaviour
+
+The default `Makefile` values are:
+
+- `DOMAINS=domains.local.txt`
+- `CACHE_TTL=0`
+- `DNS_CACHE_TTL=86400`
+- `MAX_CANDIDATES=10000`
+
+This means:
+
+- Certificate Transparency is refreshed live on every normal run.
+- DNS results are reused for up to one day unless you override the DNS cache TTL.
+- The query cap is high enough for the current corpus, and the scanner refuses to run if the live raw match count exceeds the cap.
+
+If you want to override values:
+
+```bash
+make monograph CACHE_TTL=86400 DNS_CACHE_TTL=86400
+```
+
+Or:
+
+```bash
+make monograph DOMAINS=/path/to/other.local.txt
+```
+
+## Manual Commands
+
+If you do not want to use `make`, the equivalent commands are:
+
+### Inventory appendix source

 ```bash
 .venv/bin/python ct_scan.py \
  --domains-file domains.local.txt \
  --cache-ttl-seconds 0 \
-  --output output/current-valid-certificates.md \
-  --latex-output output/current-valid-certificates.tex \
-  --pdf-output output/current-valid-certificates.pdf
+  --max-candidates-per-domain 10000 \
+  --output output/corpus/current-valid-certificates.md \
+  --latex-output output/corpus/current-valid-certificates.tex \
+  --pdf-output output/corpus/current-valid-certificates.pdf
 ```

-This report is the issuer-first inventory view.
-
-## Purpose Assessment
+### Purpose assessment

 ```bash
 .venv/bin/python ct_usage_assessment.py \
  --domains-file domains.local.txt \
  --cache-ttl-seconds 0 \
-  --markdown-output output/certificate-purpose-assessment.md \
-  --json-output output/certificate-purpose-assessment.json
+  --max-candidates 10000 \
+  --markdown-output output/corpus/certificate-purpose-assessment.md \
+  --json-output output/corpus/certificate-purpose-assessment.json
 ```

-This assessment classifies the current corpus into:
-
-- TLS server only
-- TLS server and client auth
-- client auth only
-- S/MIME only
-- code signing only
-
-## Monograph Report
-
-```bash
-.venv/bin/python ct_monograph_report.py \
-  --domains-file domains.local.txt \
-  --cache-ttl-seconds 0 \
-  --dns-cache-ttl-seconds 86400 \
-  --markdown-output output/corpus/monograph.md \
-  --latex-output output/corpus/monograph.tex \
-  --pdf-output output/corpus/monograph.pdf
-```
-
-This is the main publication-grade document for readers. It combines:
-
-- data-integrity and completeness proof
-- certificate inventory and issuer analysis
-- purpose assessment
-- naming-pattern interpretation
-- public DNS delivery analysis
-- crosswalk between certificate structure and DNS structure
-- confidence and limit statements
-- a full issuer-first inventory appendix embedded into the final PDF
-
-The monograph also emits a standalone appendix inventory in the same output area:
-
-- `output/corpus/appendix-inventory.md`
-- `output/corpus/appendix-inventory.tex`
-- `output/corpus/appendix-inventory.pdf`
-
-## Short Consolidated Report
-
-If you still want the shorter executive version, use:
+### Consolidated report

 ```bash
 .venv/bin/python ct_master_report.py \
  --domains-file domains.local.txt \
  --cache-ttl-seconds 0 \
  --dns-cache-ttl-seconds 86400 \
+  --max-candidates-per-domain 10000 \
  --markdown-output output/corpus/consolidated-corpus-report.md \
  --latex-output output/corpus/consolidated-corpus-report.tex \
  --pdf-output output/corpus/consolidated-corpus-report.pdf
 ```

+### Full monograph
+
+```bash
+.venv/bin/python ct_monograph_report.py \
+  --domains-file domains.local.txt \
+  --cache-ttl-seconds 0 \
+  --dns-cache-ttl-seconds 86400 \
+  --max-candidates-per-domain 10000 \
+  --markdown-output output/corpus/monograph.md \
+  --latex-output output/corpus/monograph.tex \
+  --pdf-output output/corpus/monograph.pdf \
+  --appendix-markdown-output output/corpus/appendix-inventory.md \
+  --appendix-latex-output output/corpus/appendix-inventory.tex \
+  --appendix-pdf-output output/corpus/appendix-inventory.pdf
+```
+
+## Project Structure
+
+- `ct_scan.py`: core CT scan, leaf verification, grouping, and detailed inventory report
+- `ct_usage_assessment.py`: EKU and KeyUsage assessment
+- `ct_dns_utils.py`: DNS scanning and provider-signature logic
+- `ct_master_report.py`: shorter consolidated report
+- `ct_monograph_report.py`: publication-grade monograph with appendices
+- `Makefile`: reproducible operator workflow
+
+## Safety Against Silent Undercounts
+
+The scanner checks the live raw identity-row count before it executes the capped query. If the configured cap is too low, it stops with an error instead of silently returning an incomplete corpus.
+
 ## Public Repo Rules

-- Keep `domains.local.txt` local only.
-- Never commit `output/`.
-- Never commit `.cache/`.
-- If you need a sample config in git, update `domains.example.txt`, not `domains.local.txt`.
+- keep `domains.local.txt` local only
+- never commit `output/`
+- never commit `.cache/`
+- if you need a sample config in git, update `domains.example.txt`, not `domains.local.txt`
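
The `ct_monograph_report.py` change that follows collapses per-issuer rows into issuer families through `build_issuer_family_rows`. The standalone sketch below mirrors that aggregation on invented data so the behaviour can be checked in isolation. The function `family_of` is a stand-in assumption for `short_issuer`, whose body is not shown in this diff, and `collapse_issuers` takes a plain `dict[str, bool]` instead of the report's trust objects.

```python
from collections import Counter


def family_of(issuer_name: str) -> str:
    """Stand-in for short_issuer (assumption): use the O= value as the family."""
    for part in issuer_name.split(","):
        key, _, value = part.strip().partition("=")
        if key == "O":
            return value
    return issuer_name


def collapse_issuers(
    issuer_counts: Counter, major_webpki: dict[str, bool]
) -> list[dict[str, str]]:
    """Group issuer variants by family and sum their certificate counts."""
    families: dict[str, dict] = {}
    for issuer_name, count in issuer_counts.most_common():
        row = families.setdefault(
            family_of(issuer_name),
            {"certificates": 0, "variants": [], "major_webpki": True},
        )
        row["certificates"] += count
        row["variants"].append(issuer_name)
        # A family is marked major WebPKI only if every variant is.
        row["major_webpki"] = row["major_webpki"] and major_webpki[issuer_name]
    # Largest families first, ties broken by case-insensitive family name.
    ordered = sorted(
        families.items(),
        key=lambda kv: (-kv[1]["certificates"], kv[0].casefold()),
    )
    return [
        {
            "family": family,
            "certificates": str(row["certificates"]),
            "variant_count": str(len(row["variants"])),
            "major_webpki": "yes" if row["major_webpki"] else "no",
            "variants": ", ".join(
                name.split("CN=")[-1]
                for name in sorted(row["variants"], key=str.casefold)
            ),
        }
        for family, row in ordered
    ]


# Invented sample data, not taken from any real corpus.
counts = Counter({
    "C=US, O=Example Trust, CN=Example Trust R1": 3,
    "C=US, O=Example Trust, CN=Example Trust E1": 2,
    "C=DE, O=Sample CA, CN=Sample CA G2": 4,
})
trust = {name: True for name in counts}
trust["C=DE, O=Sample CA, CN=Sample CA G2"] = False
rows = collapse_issuers(counts, trust)
```

With this sample, the two `Example Trust` variants merge into one row of 5 certificates that sorts ahead of the single `Sample CA` row of 4. The trust flag is ANDed across variants, so one untrusted variant is enough to mark a whole family `no`.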
--- a/ct_monograph_report.py
+++ b/ct_monograph_report.py
@@ -97,17 +97,61 @@ def short_issuer(issuer_name: str) -> str:
    return issuer_name


+def build_issuer_family_rows(report: dict[str, object]) -> list[dict[str, str]]:
+    issuer_trust = report["issuer_trust"]
+    families: dict[str, dict[str, object]] = {}
+    for issuer_name, count in report["issuer_counts"].most_common():
+        family = short_issuer(issuer_name)
+        row = families.setdefault(
+            family,
+            {
+                "family": family,
+                "certificates": 0,
+                "variants": [],
+                "major_webpki": True,
+            },
+        )
+        row["certificates"] += count
+        row["variants"].append(issuer_name)
+        row["major_webpki"] = bool(row["major_webpki"] and issuer_trust[issuer_name].major_webpki)
+    ordered = sorted(
+        families.values(),
+        key=lambda item: (-int(item["certificates"]), str(item["family"]).casefold()),
+    )
+    result: list[dict[str, str]] = []
+    for item in ordered:
+        variant_labels = [
+            str(name).split("CN=")[-1]
+            for name in sorted(item["variants"], key=str.casefold)
+        ]
+        result.append(
+            {
+                "family": str(item["family"]),
+                "certificates": str(item["certificates"]),
+                "variant_count": str(len(variant_labels)),
+                "major_webpki": "yes" if item["major_webpki"] else "no",
+                "variants": ", ".join(variant_labels),
+            }
+        )
+    return result
+
+
 def render_markdown(args: argparse.Namespace, report: dict[str, object]) -> None:
    args.markdown_output.parent.mkdir(parents=True, exist_ok=True)
    appendix_markdown = args.appendix_markdown_output.read_text(encoding="utf-8")
    hits = report["hits"]
    groups = report["groups"]
    purpose_summary = report["purpose_summary"]
-    issuer_trust = report["issuer_trust"]
    dual_items = [item for item in report["classifications"] if item.category == "tls_server_and_client"]
    issuer_rows = [
-        [short_issuer(issuer_name), str(count), "yes" if issuer_trust[issuer_name].major_webpki else "no"]
-        for issuer_name, count in report["issuer_counts"].most_common()
+        [
+            row["family"],
+            row["certificates"],
+            row["variant_count"],
+            row["major_webpki"],
+            row["variants"],
+        ]
+        for row in build_issuer_family_rows(report)
    ]
    family_rows = [
        [
@@ -194,12 +238,14 @@ def render_markdown(args: argparse.Namespace, report: dict[str, object]) -> None
    lines.append("")
    lines.append("### Issuer Trust Table")
    lines.append("")
-    lines.extend(md_table(["Issuer Family", "Count", "Major WebPKI"], issuer_rows))
+    lines.extend(md_table(["Issuer Family", "Certificates", "Variants", "Major WebPKI", "Issuer Variants Seen"], issuer_rows))
    lines.append("")
    lines.append("**What WebPKI trust means**")
    lines.append("")
    lines.append("A WebPKI-trusted issuer is a certificate authority trusted by mainstream browser and operating-system trust stores for public TLS. That matters because it tells you these certificates are not part of a private PKI hidden inside one organisation. They are intended to be valid in the public Internet trust model.")
    lines.append("")
+    lines.append("This table is intentionally collapsed to issuer families. The detailed issuer variant names remain visible in the appendix inventory, where the reader needs them for forensic precision rather than quick orientation.")
+    lines.append("")
    lines.append("## Chapter 3: Intended Purpose of the Certificates")
    lines.append("")
    lines.append("**Management Summary**")
@@ -332,6 +378,7 @@ def render_latex(args: argparse.Namespace, report: dict[str, object]) -> None:
    groups = report["groups"]
    purpose_summary = report["purpose_summary"]
    issuer_trust = report["issuer_trust"]
+    issuer_family_rows = build_issuer_family_rows(report)
    dual_items = [item for item in report["classifications"] if item.category == "tls_server_and_client"]
    appendix_pdf_path = args.appendix_pdf_output.resolve().as_posix()
    lines: list[str] = [
@@ -450,17 +497,20 @@ def render_latex(args: argparse.Namespace, report: dict[str, object]) -> None:
    lines.extend(
        [
            r"\subsection{Issuer Trust Table}",
-            r"\begin{longtable}{>{\raggedright\arraybackslash}p{0.60\linewidth} >{\raggedleft\arraybackslash}p{0.12\linewidth} >{\raggedleft\arraybackslash}p{0.16\linewidth}}",
+            r"\begin{longtable}{>{\raggedright\arraybackslash}p{0.18\linewidth} >{\raggedleft\arraybackslash}p{0.10\linewidth} >{\raggedleft\arraybackslash}p{0.09\linewidth} >{\raggedleft\arraybackslash}p{0.12\linewidth} >{\raggedright\arraybackslash}p{0.39\linewidth}}",
            r"\toprule",
-            r"Issuer & Count & Major WebPKI \\",
+            r"Issuer Family & Certs & Variants & Major WebPKI & Issuer Variants Seen \\",
            r"\midrule",
        ]
    )
-    for issuer_name, count in report["issuer_counts"].most_common():
+    for row in issuer_family_rows:
        lines.append(
-            rf"{latex_escape(short_issuer(issuer_name))} & {count} & {'yes' if issuer_trust[issuer_name].major_webpki else 'no'} \\"
+            rf"{latex_escape(row['family'])} & {row['certificates']} & {row['variant_count']} & {row['major_webpki']} & {latex_escape(row['variants'])} \\"
        )
    lines.extend([r"\bottomrule", r"\end{longtable}"])
+    lines.append(
+        r"This table is intentionally collapsed to issuer families for readability. The appendix inventory keeps the exact issuer variant names for forensic review."
+    )

    lines.append(r"\section{Intended Purpose of the Certificates}")
    add_summary(