Add teaching docs for analytics code

saymrwulf 2026-04-01 18:57:58 +02:00
parent 08ef4dbc2b
commit 388d435158
14 changed files with 12189 additions and 28 deletions

@@ -19,7 +19,7 @@ The project is designed for public source control:
None of those paths should be committed.
## What You Need On A Fresh Machine
## What You Need On A Fresh macOS Machine
### Required software
@@ -38,7 +38,71 @@ None of those paths should be committed.
If `xelatex` is missing, the Markdown and LaTeX outputs can still be generated, but the PDF targets will fail.
## Fresh Install On Another Computer
### Network access required
- outbound TCP access to `crt.sh:5432`
- public DNS resolution for `dig`
The scanner reads Certificate Transparency data directly from the public `certwatch` PostgreSQL service on `crt.sh` using guest access. If that TCP path is blocked, the certificate part of the run will fail even if normal web browsing works.
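The TCP requirement above can be checked from Python before anything else is installed. This is a minimal sketch, not code from the repository: the helper name `can_reach` and the 5-second default timeout are illustrative assumptions. It only tests whether the TCP path opens; it does not attempt a PostgreSQL login.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical usage): can_reach("crt.sh", 5432)
```

A `False` result here while normal web browsing works usually points to an outbound firewall rule on port 5432 rather than a general connectivity problem.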
## Clean-Room Operator Checklist
Use this sequence if you need to reproduce the same output structure on another Mac without any extra guidance:
1. Install the required macOS system tools.
2. Clone the repository.
3. Create the Python virtual environment and install Python dependencies.
4. Create the local-only config files.
5. Put the real search terms into `domains.local.txt`.
6. Optionally put the focused Subject-CN cohort into `focus_subjects.local.txt`.
7. Run `make monograph`.
8. Read the outputs from `output/corpus/`.
Expected final outputs:
- `output/corpus/monograph.md`
- `output/corpus/monograph.tex`
- `output/corpus/monograph.pdf`
The PDF build no longer depends on macOS-only fonts.
## macOS Install Recipe
Install Apple command-line tools first:
```bash
xcode-select --install
```
If Homebrew is not already installed, install it from `https://brew.sh`, then install the required tools:
```bash
brew install python make
brew install --cask mactex-no-gui
```
Notes:
- `git`, `make`, and `dig` are usually already present once Apple command-line tools are installed.
- `mactex-no-gui` provides `xelatex`.
- If `xelatex` is still not on your `PATH` after installation, open a new shell and re-run `which xelatex`.
## Preflight Checks
Run these checks before the first full build:
```bash
python3 --version
git --version
make --version
dig -v
xelatex --version
nc -vz crt.sh 5432
```
If the last command fails, the CT query layer will not be able to reach the public `certwatch` database.
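The tool checks above can also be scripted. The sketch below is an assumption-labelled illustration rather than part of the project: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and the list simply mirrors the commands in the preflight block.

```python
from __future__ import annotations

import shutil

# Hypothetical list mirroring the preflight commands above.
REQUIRED_TOOLS = ["python3", "git", "make", "dig", "xelatex"]


def missing_tools(names: list[str]) -> list[str]:
    """Return the tool names that cannot be found on PATH, in input order."""
    return [name for name in names if shutil.which(name) is None]
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when every tool is on `PATH`. It only confirms presence, not version, so the `--version` commands remain useful for debugging.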
## Fresh Install On Another Mac
Clone the repository from your chosen remote and enter the directory:
@@ -63,6 +127,26 @@ Then edit `domains.local.txt` and replace the placeholder values with the real s
If you want the monograph to analyse a remembered or suspicious Subject-CN cohort as well, edit `focus_subjects.local.txt` too. The format is one Subject CN per line, optionally followed by analyst notes in parentheses.
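To make the line format concrete, here is a minimal parsing sketch. It is not the project's `load_focus_subjects` implementation, which may differ; `parse_focus_line` is a hypothetical helper, and skipping blank lines and `#` comment lines is an assumption not stated above.

```python
from __future__ import annotations

import re

# Subject CN, then an optional trailing "(analyst note)".
_LINE = re.compile(r"^(?P<cn>[^()]+?)\s*(?:\((?P<note>.*)\))?\s*$")


def parse_focus_line(line: str) -> tuple[str, str] | None:
    """Split one line into (subject_cn, note); return None for lines to skip."""
    text = line.strip()
    # Assumption: blank lines and '#' comments are ignored.
    if not text or text.startswith("#"):
        return None
    match = _LINE.match(text)
    if match is None:
        return None
    return match.group("cn").strip(), (match.group("note") or "").strip()
```

With this sketch, a line such as `portal.example.com (seen in an old ticket)` yields the CN `portal.example.com` and the note `seen in an old ticket`; a line with no parentheses yields an empty note.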
## Fastest End-To-End Run
If the Mac already has the required system tools installed, this is the shortest full path:
```bash
git clone <repository-url>
cd CertTransparencySearch
make bootstrap
make init-config
# edit domains.local.txt
# optionally edit focus_subjects.local.txt
make monograph
```
The canonical results will then be in:
- `output/corpus/monograph.md`
- `output/corpus/monograph.tex`
- `output/corpus/monograph.pdf`
## Local Search Terms
The tracked file is:

@@ -643,9 +643,6 @@ def render_latex(path: Path, report: dict[str, object]) -> None:
r"\usepackage{titlesec}",
r"\usepackage[most]{tcolorbox}",
r"\defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}",
r"\setmainfont{Palatino}",
r"\setsansfont{Avenir Next}",
r"\setmonofont{Menlo}",
r"\definecolor{Ink}{HTML}{17202A}",
r"\definecolor{Muted}{HTML}{667085}",
r"\definecolor{Line}{HTML}{D0D5DD}",

@@ -1703,9 +1703,6 @@ def render_latex(
r"\usepackage[most]{tcolorbox}",
r"\usepackage{pdfpages}",
r"\defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}",
r"\setmainfont{Palatino}",
r"\setsansfont{Avenir Next}",
r"\setmonofont{Menlo}",
r"\definecolor{Ink}{HTML}{17202A}",
r"\definecolor{Muted}{HTML}{667085}",
r"\definecolor{Line}{HTML}{D0D5DD}",
@@ -1723,8 +1720,11 @@ def render_latex(
r"\pagestyle{plain}",
r"\titleformat{\section}{\sffamily\bfseries\LARGE\color{Ink}\raggedright}{\thesection}{0.8em}{}",
r"\titleformat{\subsection}{\sffamily\bfseries\Large\color{Ink}\raggedright}{\thesubsection}{0.8em}{}",
r"\titleformat{\subsubsection}{\sffamily\bfseries\normalsize\color{Ink}\raggedright}{\thesubsubsection}{0.8em}{}",
r"\tcbset{panel/.style={enhanced,breakable,boxrule=0.55pt,arc=3pt,left=9pt,right=9pt,top=8pt,bottom=8pt,colback=white,colframe=Line}}",
r"\newcommand{\SummaryBox}[1]{\begin{tcolorbox}[panel,colback=Panel]#1\end{tcolorbox}}",
r"\newcommand{\SummaryBox}[1]{\begin{tcolorbox}[enhanced,boxrule=0.55pt,arc=3pt,left=9pt,right=9pt,top=8pt,bottom=8pt,colback=Panel,colframe=Line]#1\end{tcolorbox}}",
r"\newcommand{\SoftSubsection}[1]{\Needspace{12\baselineskip}\subsection{#1}}",
r"\newcommand{\SoftSubsubsection}[1]{\Needspace{10\baselineskip}\subsubsection{#1}}",
r"\begin{document}",
r"\begin{titlepage}",
r"\vspace*{16mm}",
@@ -1743,7 +1743,11 @@ def render_latex(
+ rf"{purpose_summary.category_counts.get('tls_server_and_client', 0)} certificates from templates that also permit client-certificate use."
+ r"}",
r"\end{titlepage}",
r"\begingroup",
r"\small",
r"\setlength{\parskip}{2pt}",
r"\tableofcontents",
r"\endgroup",
r"\clearpage",
]
@@ -2744,7 +2748,17 @@ def render_latex(
r"\end{document}",
]
)
args.latex_output.write_text("\n".join(lines) + "\n", encoding="utf-8")
def soften_heading(line: str) -> str:
if line.startswith(r"\subsection{"):
return line.replace(r"\subsection{", r"\SoftSubsection{", 1)
if line.startswith(r"\subsubsection{"):
return line.replace(r"\subsubsection{", r"\SoftSubsubsection{", 1)
return line
args.latex_output.write_text(
"\n".join(soften_heading(line) for line in lines) + "\n",
encoding="utf-8",
)
def main() -> int:

@@ -755,7 +755,7 @@ def build_san_tree_lines(san_entries: list[str]) -> list[str]:
return build_san_tree_lines_with_style(san_entries, ascii_only=False)
def build_san_tree_lines_with_style(san_entries: list[str], ascii_only: bool) -> list[str]:
def build_san_tree_units_with_style(san_entries: list[str], ascii_only: bool) -> list[list[str]]:
dns_entries = sorted({entry[4:] for entry in san_entries if entry.startswith("DNS:")})
other_entries = sorted({entry for entry in san_entries if not entry.startswith("DNS:")})
tree: dict[str, Any] = {}
@@ -784,11 +784,55 @@ def build_san_tree_lines_with_style(san_entries: list[str], ascii_only: bool) ->
lines.extend(render(child, child_prefix))
return lines
lines = render(tree)
units: list[list[str]] = []
for key in sorted(tree.keys(), key=str.casefold):
units.append(render({key: tree[key]}))
for entry in other_entries:
lines.append(f"{'*' if ascii_only else ''} {entry}")
if not lines:
lines.append(f"{'*' if ascii_only else ''} -")
units.append([f"{'*' if ascii_only else ''} {entry}"])
if not units:
units.append([f"{'*' if ascii_only else ''} -"])
return units
def build_san_tree_chunks_with_style(
san_entries: list[str],
ascii_only: bool,
max_lines_per_chunk: int = 24,
) -> list[list[str]]:
chunks: list[list[str]] = []
current_chunk: list[str] = []
current_lines = 0
def flush_current_chunk() -> None:
nonlocal current_chunk, current_lines
if current_chunk:
chunks.append(current_chunk)
current_chunk = []
current_lines = 0
for unit in build_san_tree_units_with_style(san_entries, ascii_only=ascii_only):
if len(unit) > max_lines_per_chunk:
flush_current_chunk()
for start in range(0, len(unit), max_lines_per_chunk):
chunks.append(unit[start : start + max_lines_per_chunk])
continue
if current_chunk and current_lines + len(unit) > max_lines_per_chunk:
flush_current_chunk()
current_chunk.extend(unit)
current_lines += len(unit)
flush_current_chunk()
return chunks
def build_san_tree_lines_with_style(san_entries: list[str], ascii_only: bool) -> list[str]:
lines: list[str] = []
for chunk in build_san_tree_chunks_with_style(
san_entries,
ascii_only=ascii_only,
max_lines_per_chunk=10_000,
):
lines.extend(chunk)
return lines
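Reviewer note: the chunking added in this hunk is a greedy packer. Whole units are kept together while they fit, and only a unit longer than the limit is sliced. The standalone sketch below restates that idea on plain lists so it can be tried in isolation; `pack_units` is a hypothetical name and takes pre-built units instead of SAN entries.

```python
from __future__ import annotations


def pack_units(units: list[list[str]], max_lines: int) -> list[list[str]]:
    """Greedily pack units into chunks of at most `max_lines` lines each."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for unit in units:
        if len(unit) > max_lines:
            # Oversized unit: emit what we have, then slice the unit itself.
            if current:
                chunks.append(current)
                current = []
            for start in range(0, len(unit), max_lines):
                chunks.append(unit[start:start + max_lines])
            continue
        if current and len(current) + len(unit) > max_lines:
            # Unit would overflow the current chunk, so start a new one.
            chunks.append(current)
            current = []
        current.extend(unit)
    if current:
        chunks.append(current)
    return chunks
```

Passing a very large limit, as the rewritten `build_san_tree_lines_with_style` does with `10_000`, packs everything into one chunk unless the input itself exceeds that limit.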
@@ -1033,9 +1077,6 @@ def render_latex_report(
r"\usepackage{fancyvrb}",
r"\usepackage{needspace}",
r"\defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}",
r"\setmainfont{Palatino}",
r"\setsansfont{Avenir Next}",
r"\setmonofont{Menlo}",
r"\definecolor{Ink}{HTML}{17202A}",
r"\definecolor{Muted}{HTML}{667085}",
r"\definecolor{Line}{HTML}{D0D5DD}",
@@ -1071,7 +1112,7 @@ def render_latex_report(
r" issuerpanel/.style={panel,colback=Panel,colframe=Ink!45},",
r" familypanel/.style={panel,colback=AccentSoft,colframe=AccentLine},",
r" subjectpanel/.style={panel,colback=white,colframe=Line},",
r" treepanel/.style={panel,colback=Panel,colframe=AccentLine},",
r" treepanel/.style={enhanced,boxrule=0.55pt,arc=3pt,left=9pt,right=9pt,top=8pt,bottom=8pt,colback=Panel,colframe=AccentLine},",
r"}",
r"\newcommand{\DomainChip}[1]{\tcbox[on line,boxrule=0pt,arc=3pt,left=5pt,right=5pt,top=2pt,bottom=2pt,colback=AccentSoft]{\sffamily\footnotesize\texttt{#1}}}",
r"\newcommand{\MetricChip}[2]{\tcbox[on line,boxrule=0pt,arc=3pt,left=6pt,right=6pt,top=3pt,bottom=3pt,colback=Panel]{\sffamily\footnotesize\textcolor{Muted}{#1}\hspace{0.45em}\textbf{#2}}}",
@@ -1225,6 +1266,11 @@ def render_latex_report(
rf"\newline \textcolor{{Muted}}{{SANs: {len(hit.san_entries)} \quad crt.sh: {latex_escape(crtsh_ids)} \quad {latex_escape(one_line_revocation(hit))}}}",
]
)
tree_chunks = build_san_tree_chunks_with_style(
unique_san_entries,
ascii_only=True,
max_lines_per_chunk=24,
)
lines.extend(
[
r"\end{itemize}",
@@ -1240,18 +1286,35 @@ def render_latex_report(
rf"\textbf{{Dominant zones}}: {latex_escape(', '.join(f'{zone} ({count})' for zone, count in san_summary['top_zones']) if san_summary['top_zones'] else 'none')}",
r"\par",
rf"\textbf{{Repeating host schemas}}: {latex_escape(', '.join(f'{pattern} ({count})' for pattern, count in san_summary['repeating_patterns']) if san_summary['repeating_patterns'] else 'mostly one-off SAN hostnames')}",
r"\end{tcolorbox}",
r"\begin{tcolorbox}[treepanel,title={SAN Structure}]",
r"\begin{Verbatim}[fontsize=\footnotesize]",
]
)
lines.extend(build_san_tree_lines_with_style(unique_san_entries, ascii_only=True))
lines.extend(
[
r"\end{Verbatim}",
(
rf"\par\medskip\textcolor{{Muted}}{{The SAN structure below is shown in {len(tree_chunks)} intact panels so the visual grouping is not broken across a page.}}"
if len(tree_chunks) > 1
else ""
),
r"\end{tcolorbox}",
]
)
for tree_chunk_index, tree_lines in enumerate(tree_chunks, start=1):
tree_title = (
"SAN Structure"
if len(tree_chunks) == 1
else f"SAN Structure ({tree_chunk_index}/{len(tree_chunks)})"
)
tree_needspace = max(12, min(len(tree_lines) + 7, 32))
lines.extend(
[
rf"\Needspace{{{tree_needspace}\baselineskip}}",
rf"\begin{{tcolorbox}}[treepanel,title={{{latex_escape(tree_title)}}}]",
r"\begin{Verbatim}[fontsize=\footnotesize]",
]
)
lines.extend(tree_lines)
lines.extend(
[
r"\end{Verbatim}",
r"\end{tcolorbox}",
]
)
lines.extend(
[

@@ -0,0 +1,44 @@
# teachingNoobs Curriculum
Open each file in VS Code and use Markdown Preview. The intended order is:
1. [ct_scan.md](./ct_scan.md)
Why first: this is the core analytics engine. If you understand this file, you understand where the certificate facts come from.
2. [ct_dns_utils.md](./ct_dns_utils.md)
Why second: this explains how the DNS side was scanned and interpreted.
3. [ct_usage_assessment.md](./ct_usage_assessment.md)
Why third: this explains how certificate purpose was classified from EKU and KeyUsage.
4. [ct_lineage_report.md](./ct_lineage_report.md)
Why fourth: this adds historical time and red-flag logic.
5. [ct_caa_analysis.md](./ct_caa_analysis.md)
Why fifth: this adds the DNS-side issuance-policy layer.
6. [ct_focus_subjects.md](./ct_focus_subjects.md)
Why sixth: this explains the special hand-picked Subject-CN cohort logic.
7. [ct_master_report.md](./ct_master_report.md)
Why seventh: this shows how the current-state analytical layers are stitched into one coherent bundle.
8. [ct_monograph_report.md](./ct_monograph_report.md)
Why last: this is the publishing layer. Read it last because it is about presentation and assembly, not fact extraction.
Suggested reading method:
- Keep the Markdown preview open.
- For each page, read the explanation on the right first.
- Then look left at the code block and see how the explanation maps onto the exact lines.
- Do not try to memorize every helper function on first pass. Focus on the few blocks that move real data from one stage to the next.
- Pay special attention to the new `Flow arrows` panel on the right side. That panel tells you where the block's output goes next.
What matters most:
- In `ct_scan.py`: how raw database rows become verified leaf certificates.
- In `ct_dns_utils.py`: how raw DNS answers become delivery clues.
- In `ct_lineage_report.py`: how the code decides what is a normal renewal versus a red flag.
- In `ct_caa_analysis.py`: how live DNS policy is compared with live certificate coverage.
- In `ct_master_report.py`: how the current-state pieces are combined.
What matters less on first read:
- tiny formatting helpers
- string-wrapping helpers
- Markdown/LaTeX table plumbing
Those are still useful, but they are support code, not the heart of the analytics.

@@ -0,0 +1,451 @@
#!/usr/bin/env python3
from __future__ import annotations
import ast
import html
from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]
OUT_DIR = ROOT / "teachingNoobs"
SOURCE_FILES = [
"ct_scan.py",
"ct_dns_utils.py",
"ct_usage_assessment.py",
"ct_lineage_report.py",
"ct_caa_analysis.py",
"ct_focus_subjects.py",
"ct_master_report.py",
"ct_monograph_report.py",
]
FILE_INTROS = {
"ct_scan.py": (
"Core Certificate Transparency scanner. This file talks to crt.sh's public "
"database, downloads the real certificate bytes, verifies that they are real "
"leaf certificates, groups them into readable families, and can render the "
"full inventory appendix."
),
"ct_dns_utils.py": (
"Public DNS scanner. This file runs dig, follows alias chains, finds public "
"addresses, and collapses raw DNS evidence into readable delivery labels."
),
"ct_usage_assessment.py": (
"Certificate-purpose analyzer. This file looks at EKU and KeyUsage to decide "
"what each certificate is technically allowed to do."
),
"ct_lineage_report.py": (
"Historical analyzer. This file studies expired plus current certificates to "
"find renewals, overlap, drift, and issuance bursts over time."
),
"ct_caa_analysis.py": (
"CAA analyzer. This file resolves live DNS issuance policy and compares it "
"against the public CA families that are actually covering the names today."
),
"ct_focus_subjects.py": (
"Focused-cohort analyzer. This file takes your special hand-picked Subject CN "
"list and compares it against the wider certificate and DNS estate."
),
"ct_master_report.py": (
"Current-state synthesizer. This file combines certificate facts, DNS facts, "
"purpose classification, grouping, and curated examples into one report bundle."
),
"ct_monograph_report.py": (
"Publication builder. This file takes all analytical layers and turns them into "
"the final monograph in Markdown, LaTeX, and PDF."
),
}
FILE_FLOW_STRIPS = {
"ct_scan.py": "domains file -> raw CT query -> parsed leaf certificates -> CN families -> issuer trust -> appendix reports",
"ct_dns_utils.py": "DNS name -> dig answers -> normalized observation -> provider hints -> delivery label",
"ct_usage_assessment.py": "certificate bytes -> EKU and KeyUsage -> purpose label -> summary counts",
"ct_lineage_report.py": "historical CT rows -> historical certificates -> grouped by Subject CN -> overlap and drift checks -> red flags",
"ct_caa_analysis.py": "DNS name -> effective CAA lookup -> allowed CA families -> compare with live cert families",
"ct_focus_subjects.py": "focus-subject file -> cohort entries -> compare against current and historical estate -> bucketed cohort explanation",
"ct_master_report.py": "current CT facts + DNS facts + usage facts -> one current-state report bundle",
"ct_monograph_report.py": "current-state bundle + history + CAA + focused cohort -> Markdown/LaTeX/PDF monograph",
}
BLOCK_NOTES = {
"ct_scan.py": {
"__module__": "Imports, SQL, constants, and shared data shapes for the core CT scanner.",
"DatabaseRecord": "A raw row as it comes back from the crt.sh database before local cleanup.",
"CertificateHit": "The cleaned working object used by the rest of the analytics pipeline.",
"VerificationStats": "A tiny running counter that proves how many rows were kept or rejected.",
"CertificateGroup": "One readable family of related certificates after grouping logic runs.",
"ScanStats": "Top-level summary numbers used in reports.",
"IssuerTrustInfo": "Stores the public-trust picture for one issuer family.",
"connect": "Opens the direct guest PostgreSQL connection to crt.sh's certwatch backend.",
"query_domain": "Runs the main certificate query for one search term and refuses silent undercounting.",
"query_raw_match_count": "Counts how many raw hits exist before the capped query runs.",
"build_hits": "Parses certificate bytes, rejects bad objects, and merges duplicate views of the same cert.",
"build_groups": "Turns a flat certificate list into CN-based families such as exact endpoints or numbered rails.",
"query_issuer_trust": "Checks which issuers are currently trusted for public TLS in the major WebPKI contexts.",
"render_markdown_report": "Writes the raw inventory appendix as readable Markdown.",
"render_latex_report": "Writes the raw inventory appendix as LaTeX for PDF assembly.",
"compile_latex_to_pdf": "Hands LaTeX to XeLaTeX and turns it into a finished PDF file.",
"main": "The standalone command-line entrypoint for the inventory scanner.",
},
"ct_dns_utils.py": {
"__module__": "Shared DNS scanning helpers, cache helpers, and the logic that turns raw DNS answers into platform clues.",
"DnsObservation": "One complete DNS observation for one hostname.",
"scan_name_live": "Runs the live DNS walk for one hostname.",
"scan_name_cached": "Reuses a recent DNS result if possible, otherwise performs the live scan.",
"infer_provider_hints": "Reads the raw DNS trail and pulls out likely platform or vendor clues.",
"infer_stack_signature": "Collapses several low-level DNS clues into one human-readable delivery label.",
"provider_explanations": "Supplies the glossary text used later in the reports.",
},
"ct_usage_assessment.py": {
"__module__": "Purpose-analysis constants and small data shapes for EKU and KeyUsage classification.",
"PurposeClassification": "One certificate plus the usage label assigned to it.",
"AssessmentSummary": "The roll-up numbers that power the purpose chapter.",
"build_classifications": "Walks through all current certificates and labels them by intended usage.",
"summarize": "Compresses the per-certificate labels into counts, templates, and issuer breakdowns.",
"render_markdown": "Writes the standalone purpose report.",
"main": "The standalone command-line entrypoint for the purpose analyzer.",
},
"ct_lineage_report.py": {
"__module__": "Historical query logic, data structures, and red-flag rules for certificate lifecycle analysis.",
"HistoricalCertificate": "One certificate in the full time-based dataset, including expired ones.",
"CnCollisionRow": "A table row for Subject-DN drift or issuer drift under the same Subject CN.",
"SanChangeRow": "A table row that describes SAN-profile change for one Subject CN.",
"OverlapRow": "A table row describing long predecessor/successor overlap.",
"RedFlagRow": "A compact summary row for names worth attention.",
"HistoricalAssessment": "The full historical analysis bundle used by the monograph.",
"query_historical_domain": "Fetches the wider historical corpus for one search term.",
"build_certificates": "Converts raw DB rows into historical working objects.",
"dn_change_rows": "Finds names whose formal Subject DN changed over time.",
"issuer_change_rows": "Finds names whose issuing CA family changed over time.",
"san_change_rows": "Finds names whose SAN bundle changed over time.",
"overlap_rows": "Finds predecessor/successor pairs that overlap too long.",
"build_assessment": "Runs the full historical workflow and returns the finished analytical bundle.",
"render_markdown": "Writes the standalone historical report in Markdown.",
"render_latex": "Writes the standalone historical report in LaTeX.",
"main": "The standalone command-line entrypoint for the historical analyzer.",
},
"ct_caa_analysis.py": {
"__module__": "Data structures and lookup logic for effective CAA policy analysis.",
"CaaObservation": "One resolved CAA result before it is merged with certificate coverage data.",
"CaaNameRow": "One final row that compares DNS policy with current live certificate families.",
"CaaAnalysis": "The full CAA analysis bundle used by the monograph.",
"relevant_caa_live": "Finds the effective live CAA for one name, including inheritance and alias behavior.",
"build_analysis": "Runs CAA across the whole SAN namespace and compares policy with live issuance.",
"rows_for_zone": "Filters the full analysis down to one configured DNS zone.",
},
"ct_focus_subjects.py": {
"__module__": "Rules and data shapes for analyzing the special hand-picked Subject-CN cohort.",
"FocusSubject": "One line from the local focus-subject file.",
"FocusSubjectDetail": "One detailed analytical row for one focused Subject CN.",
"FocusCohortAnalysis": "The full cohort comparison bundle used in the monograph.",
"load_focus_subjects": "Reads the local focus-subject list and any analyst notes attached to it.",
"classify_taxonomy_bucket": "Places a name into the direct-front, platform-anchor, or ambiguous bucket.",
"observed_role": "Tries to describe what role the name appears to play in the public estate.",
"build_analysis": "Runs the full comparison between the focused cohort and the rest of the estate.",
},
"ct_master_report.py": {
"__module__": "Current-state report assembly code that sits above the low-level scanners.",
"ExampleBlock": "A small narrative evidence block used in the naming chapter.",
"load_records": "Loads current CT records for all configured search terms.",
"enrich_dns": "Adds DNS observations and provider clues to the raw SAN-name list.",
"pick_examples": "Chooses a few representative examples that make the naming and DNS story understandable.",
"build_group_digest": "Builds a compact family catalogue used in reports.",
"summarize_for_report": "Creates the big current-state dictionary consumed by the monograph builder.",
"render_markdown": "Writes the shorter consolidated report in Markdown.",
"render_latex": "Writes the shorter consolidated report in LaTeX.",
"main": "The standalone command-line entrypoint for the consolidated current-state report.",
},
"ct_monograph_report.py": {
"__module__": "The orchestration and publishing layer that turns all analytical modules into one publication.",
"render_appendix_inventory": "Generates the hidden full inventory appendix before the main monograph is assembled.",
"append_longtable": "Shared LaTeX helper for readable multi-page tables.",
"render_markdown": "Writes the narrative monograph in Markdown.",
"render_latex": "Writes the narrative monograph in LaTeX.",
"main": "The top-level command-line entrypoint for the complete monograph build.",
},
}
BLOCK_FLOWS = {
"ct_scan.py": {
"Module setup": ("Nothing yet; this is the starting point.", "`connect`, `query_domain`, `build_hits`, and the report renderers use these shared definitions."),
"load_domains": ("Operator's local config file.", "`query_domain` and the higher-level loaders use this cleaned domain list."),
"connect": ("Called by query functions that need live crt.sh data.", "`query_domain`, `query_raw_match_count`, and issuer-trust lookups all depend on this connection."),
"query_raw_match_count": ("A domain string from the local config.", "`query_domain` uses this count to refuse silent undercounting."),
"query_domain": ("A domain plus the safety cap and retry settings.", "`build_hits` receives the raw records returned here."),
"build_hits": ("Raw `DatabaseRecord` rows from crt.sh.", "`build_groups`, purpose analysis, DNS analysis, and CAA analysis all consume these cleaned hits."),
"build_groups": ("The flat list of `CertificateHit` objects.", "The report builders use these groups to turn raw certificate clutter into readable families."),
"query_issuer_trust": ("The cleaned current certificate hits.", "Report builders use this trust view in the certificate chapters and appendix tables."),
"render_markdown_report": ("Current hits, groups, and trust data.", "Produces the Markdown inventory appendix."),
"render_latex_report": ("Current hits, groups, and trust data.", "Produces the LaTeX appendix source that later becomes PDF."),
"compile_latex_to_pdf": ("A finished `.tex` file.", "Produces the human-readable PDF artifact."),
"main": ("CLI arguments from the operator.", "Runs the whole scanner end to end."),
},
"ct_dns_utils.py": {
"Module setup": ("Nothing yet; this is the starting point.", "The later DNS helpers all reuse these imports and small shared helpers."),
"run_dig": ("A hostname and record type.", "`scan_name_live`, `dig_status`, `dig_short`, and `ptr_lookup` all rely on this."),
"scan_name_live": ("One DNS name from a SAN entry.", "`scan_name_cached` returns this result shape to higher-level analytics."),
"scan_name_cached": ("A DNS name plus cache settings.", "`ct_master_report.enrich_dns` uses this for every SAN name in the current corpus."),
"infer_provider_hints": ("One normalized DNS observation.", "`infer_stack_signature` and the report layers use the hints it produces."),
"infer_stack_signature": ("One DNS observation plus provider clues.", "`ct_master_report` uses the resulting label in naming and DNS chapters."),
"provider_explanations": ("The delivery labels used by the report.", "The monograph glossary uses these explanations directly."),
},
"ct_usage_assessment.py": {
"extract_eku_oids": ("One certificate object.", "`classify_purpose` uses these OIDs to decide the category."),
"extract_key_usage_flags": ("One certificate object.", "`build_classifications` stores these flags as supporting evidence."),
"classify_purpose": ("The EKU OID list from one certificate.", "`build_classifications` turns that decision into a per-certificate record."),
"build_classifications": ("The cleaned current hits plus raw records.", "`summarize` compresses these rows into report-level counts."),
"summarize": ("The per-certificate purpose labels.", "Current-state and monograph chapters use the summary counts and templates."),
"main": ("CLI arguments from the operator.", "Runs the standalone purpose analysis end to end."),
},
"ct_lineage_report.py": {
"query_historical_domain": ("A configured search domain.", "`load_records` uses it to build the wider historical corpus."),
"build_certificates": ("Historical `DatabaseRecord` rows.", "`group_by_subject_cn` and all drift checks consume these normalized historical certificates."),
"group_by_subject_cn": ("Historical certificates.", "`dn_change_rows`, `issuer_change_rows`, `san_change_rows`, and `overlap_rows` all work off this grouping."),
"dn_change_rows": ("CN-grouped historical certificates.", "`build_assessment` uses these rows for Subject-DN drift sections."),
"issuer_change_rows": ("CN-grouped historical certificates.", "`build_assessment` uses these rows for CA-family drift sections."),
"san_change_rows": ("CN-grouped historical certificates.", "`build_assessment` uses these rows for SAN-drift sections."),
"overlap_rows": ("CN-grouped historical certificates.", "`build_assessment` turns these into current and past overlap red flags."),
"build_assessment": ("Historical records from all configured domains.", "The monograph and standalone historical reports consume this one big bundle."),
"main": ("CLI arguments from the operator.", "Runs the standalone historical analysis end to end."),
},
"ct_caa_analysis.py": {
"relevant_caa_live": ("One DNS name from the SAN universe.", "`build_analysis` uses this to learn the effective issuance policy per name."),
"allowed_ca_families": ("Raw CAA rows for one effective policy.", "`build_analysis` uses the normalized families for policy-vs-live comparison."),
"build_analysis": ("Current certificate hits and the configured zones.", "The monograph uses this for the CAA chapter and appendix."),
"rows_for_zone": ("The full CAA analysis bundle.", "The monograph uses zone-filtered rows for per-zone policy tables."),
},
"ct_focus_subjects.py": {
"load_focus_subjects": ("The local focus-subject file.", "`build_analysis` uses these parsed cohort entries."),
"classify_taxonomy_bucket": ("One focused Subject CN plus surrounding evidence.", "`build_analysis` uses the bucket label in the focused-cohort chapter."),
"observed_role": ("One focused Subject CN plus public evidence.", "`build_analysis` stores the plain-English role description."),
"build_analysis": ("The focus-subject list, current-state report, and historical assessment.", "The monograph uses the resulting bundle for Chapter 8 and Appendix D."),
},
"ct_master_report.py": {
"load_records": ("Configured domains from the local file.", "`summarize_for_report` uses the returned CT rows as its starting point."),
"enrich_dns": ("The unique SAN DNS names from current hits.", "`summarize_for_report` uses the enriched observations for DNS chapters and examples."),
"pick_examples": ("Current hits, groups, and DNS observations.", "`summarize_for_report` stores the chosen examples for the naming chapter."),
"build_group_digest": ("Current groups plus DNS observations.", "Report builders use the digest in appendices and summary tables."),
"summarize_for_report": ("Current CT rows, DNS observations, issuer trust, and usage facts.", "`ct_monograph_report.main` consumes this as the main current-state input."),
"main": ("CLI arguments from the operator.", "Runs the shorter consolidated current-state report end to end."),
},
"ct_monograph_report.py": {
"render_appendix_inventory": ("The current-state report bundle.", "Creates the hidden appendix files that are later embedded into the monograph."),
"render_markdown": ("Current-state facts, history, CAA, and focused-cohort analysis.", "Produces the main Markdown monograph."),
"render_latex": ("Current-state facts, history, CAA, and focused-cohort analysis.", "Produces the main LaTeX monograph source."),
"main": ("CLI arguments from the operator.", "Runs the full publication pipeline from raw analytics to finished PDF."),
},
}
CURRICULUM = """# teachingNoobs Curriculum
Open each file in VS Code and use Markdown Preview. The intended order is:
1. [ct_scan.md](./ct_scan.md)
Why first: this is the core analytics engine. If you understand this file, you understand where the certificate facts come from.
2. [ct_dns_utils.md](./ct_dns_utils.md)
Why second: this explains how the DNS side was scanned and interpreted.
3. [ct_usage_assessment.md](./ct_usage_assessment.md)
Why third: this explains how certificate purpose was classified from EKU and KeyUsage.
4. [ct_lineage_report.md](./ct_lineage_report.md)
Why fourth: this adds historical time and red-flag logic.
5. [ct_caa_analysis.md](./ct_caa_analysis.md)
Why fifth: this adds the DNS-side issuance-policy layer.
6. [ct_focus_subjects.md](./ct_focus_subjects.md)
Why sixth: this explains the special hand-picked Subject-CN cohort logic.
7. [ct_master_report.md](./ct_master_report.md)
Why seventh: this shows how the current-state analytical layers are stitched into one coherent bundle.
8. [ct_monograph_report.md](./ct_monograph_report.md)
Why last: this is the publishing layer. Read it last because it is about presentation and assembly, not fact extraction.
Suggested reading method:
- Keep the Markdown preview open.
- For each page, read the explanation on the right first.
- Then look left at the code block and see how the explanation maps onto the exact lines.
- Do not try to memorize every helper function on first pass. Focus on the few blocks that move real data from one stage to the next.
- Pay special attention to the new `Flow arrows` panel on the right side. That panel tells you where the block's output goes next.
What matters most:
- In `ct_scan.py`: how raw database rows become verified leaf certificates.
- In `ct_dns_utils.py`: how raw DNS answers become delivery clues.
- In `ct_lineage_report.py`: how the code decides what is a normal renewal versus a red flag.
- In `ct_caa_analysis.py`: how live DNS policy is compared with live certificate coverage.
- In `ct_master_report.py`: how the current-state pieces are combined.
What matters less on first read:
- tiny formatting helpers
- string-wrapping helpers
- Markdown/LaTeX table plumbing
Those are still useful, but they are support code, not the heart of the analytics.
"""
def block_span(node: ast.AST, next_node: ast.AST | None, total_lines: int) -> tuple[int, int]:
    start = min((item.lineno for item in getattr(node, "decorator_list", []) if hasattr(item, "lineno")), default=node.lineno)
    end = getattr(node, "end_lineno", None) or total_lines
    return start, end
def fallback_explanation(file_name: str, block_name: str, kind: str) -> str:
    lower = block_name.lower()
    if kind == "class":
        return "This class is a structured container for one piece of data that later code passes around instead of juggling many loose variables."
    if lower == "parse_args":
        return "This block defines the command-line knobs for the file: input paths, cache settings, output paths, and other runtime switches."
    if lower == "main":
        return "This is the file's entrypoint. It glues the earlier helper blocks together into one end-to-end run."
    if lower.startswith("load_"):
        return "This block loads data from disk, cache, or an earlier stage so later code can work with it."
    if lower.startswith("store_"):
        return "This block saves an intermediate result so the next run can reuse it instead of recomputing everything."
    if lower.startswith("query_"):
        return "This block asks an external source for data and returns it in a shape the rest of the file can use."
    if lower.startswith("extract_"):
        return "This block pulls one specific piece of information out of a larger object."
    if lower.startswith("build_"):
        return "This block constructs a richer higher-level result from simpler inputs."
    if lower.startswith("render_"):
        return "This block turns structured analysis data into human-readable output."
    if lower.startswith("classify_"):
        return "This block applies rules and chooses a category label."
    if lower.startswith("summarize_") or lower == "summarize":
        return "This block compresses many detailed rows into a smaller, easier-to-read summary."
    if lower.startswith("compile_"):
        return "This block hands an intermediate artifact to an external tool so it becomes a finished output file."
    if lower.startswith("group_"):
        return "This block clusters related items together so later code can analyze them as families instead of as isolated rows."
    if lower.startswith("normalize_") or lower.startswith("canonicalize_"):
        return "This block makes values consistent so matching and grouping do not get confused by superficial differences."
    if lower.startswith("pct") or lower in {"utc_iso", "truncate_text", "first_list_item"}:
        return "This is a small helper that keeps the larger analytical code cleaner and easier to reuse."
    return f"This {kind} is one of the building blocks inside `{file_name}`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine."
def explain_block(file_name: str, block_name: str, kind: str) -> str:
    specific = BLOCK_NOTES.get(file_name, {}).get(block_name)
    if specific:
        return specific
    return fallback_explanation(file_name, block_name, kind)
def code_panel(code: str, language: str = "python") -> str:
    escaped = html.escape(code.rstrip())
    return (
        '<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; '
        'color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; '
        'line-height:1.45;"><code class="language-'
        + language
        + '">'
        + escaped
        + "</code></pre>"
    )
def explanation_panel(title: str, text: str) -> str:
    return (
        f"<p><strong>{html.escape(title)}</strong></p>"
        f"<p>{html.escape(text)}</p>"
    )
def flow_panel(file_name: str, block_name: str) -> str:
    upstream, downstream = BLOCK_FLOWS.get(file_name, {}).get(
        block_name,
        (
            "Earlier blocks or operator input feed this block.",
            "Later blocks in the same file or in the next analytical stage consume its output.",
        ),
    )
    return (
        "<p><strong>Flow arrows</strong></p>"
        f"<p>{html.escape(upstream)} &#8594; <strong>{html.escape(block_name)}</strong> &#8594; {html.escape(downstream)}</p>"
    )
def make_doc_for_file(file_name: str) -> str:
    path = ROOT / file_name
    source = path.read_text(encoding="utf-8")
    lines = source.splitlines()
    tree = ast.parse(source, filename=file_name)
    top_nodes = [node for node in tree.body if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))]
    blocks: list[tuple[str, str, str]] = []
    if top_nodes:
        first_start = min(
            (item.lineno for item in getattr(top_nodes[0], "decorator_list", []) if hasattr(item, "lineno")),
            default=top_nodes[0].lineno,
        )
        preamble_end = first_start - 1
        if preamble_end >= 1:
            preamble_code = "\n".join(lines[:preamble_end]).rstrip()
            if preamble_code:
                blocks.append(("Module setup", "module", preamble_code))
    for index, node in enumerate(top_nodes):
        next_node = top_nodes[index + 1] if index + 1 < len(top_nodes) else None
        start, end = block_span(node, next_node, len(lines))
        code = "\n".join(lines[start - 1 : end]).rstrip()
        kind = "class" if isinstance(node, ast.ClassDef) else "function"
        blocks.append((node.name, kind, code))
    page_lines = [
        f"# {file_name}",
        "",
        f"Source file: [`{file_name}`](../{file_name})",
        "",
        FILE_INTROS[file_name],
        "",
        f"Main flow in one line: `{FILE_FLOW_STRIPS[file_name]}`",
        "",
        "How to read this page:",
        "",
        "- left side: the actual source code block",
        "- right side: a plain-English explanation for a beginner",
        "- read from top to bottom because later blocks depend on earlier ones",
        "",
    ]
    for title, kind, code in blocks:
        explanation = explain_block(file_name, "__module__" if kind == "module" else title, kind)
        page_lines.extend(
            [
                f"## {title}",
                "",
                '<table style="width:100%; table-layout:fixed; border-collapse:collapse;">',
                "<tr>",
                '<td style="width:50%; vertical-align:top; padding:8px;">',
                code_panel(code),
                "</td>",
                '<td style="width:50%; vertical-align:top; padding:8px;">',
                explanation_panel("What this block is doing", explanation),
                flow_panel(file_name, title),
                explanation_panel(
                    "How to think about it",
                    "Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?",
                ),
                "</td>",
                "</tr>",
                "</table>",
                "",
            ]
        )
    return "\n".join(page_lines) + "\n"
def main() -> int:
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for file_name in SOURCE_FILES:
        doc_path = OUT_DIR / file_name.replace(".py", ".md")
        doc_path.write_text(make_doc_for_file(file_name), encoding="utf-8")
    (OUT_DIR / "CURRICULUM.md").write_text(CURRICULUM, encoding="utf-8")
    return 0
if __name__ == "__main__":
    raise SystemExit(main())


@ -0,0 +1,531 @@
# ct_caa_analysis.py
Source file: [`ct_caa_analysis.py`](../ct_caa_analysis.py)
CAA analyzer. This file resolves live DNS issuance policy and compares it against the public CA families that are actually covering the names today.
Main flow in one line: `DNS name -> effective CAA lookup -> allowed CA families -> compare with live cert families`
How to read this page:
- left side: the actual source code block
- right side: a plain-English explanation for a beginner
- read from top to bottom because later blocks depend on earlier ones
## Module setup
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">#!/usr/bin/env python3
from __future__ import annotations
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
import ct_dns_utils
import ct_scan</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Data structures and lookup logic for effective CAA policy analysis.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>Module setup</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## CaaObservation
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class CaaObservation:
    name: str
    effective_rr_owner: str | None
    source_kind: str
    source_label: str | None
    aliases_seen: list[str]
    caa_rows: list[tuple[int, str, str]]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One resolved CAA result before it is merged with certificate coverage data.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>CaaObservation</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## CaaNameRow
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class CaaNameRow:
    name: str
    zone: str
    source_kind: str
    effective_rr_owner: str | None
    source_label: str | None
    aliases_seen: list[str]
    issue_values: list[str]
    issuewild_values: list[str]
    iodef_values: list[str]
    allowed_ca_families: list[str]
    current_covering_families: list[str]
    current_covering_subject_cns: list[str]
    current_covering_cert_count: int
    current_multi_family_overlap: bool
    current_policy_mismatch: bool
    mismatch_families: list[str]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One final row that compares DNS policy with current live certificate families.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>CaaNameRow</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## CaaAnalysis
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class CaaAnalysis:
    generated_at_utc: str
    configured_domains: list[str]
    total_names: int
    rows: list[CaaNameRow]
    source_kind_counts: Counter[str]
    zone_counts: Counter[str]
    multi_family_overlap_names: list[str]
    policy_mismatch_names: list[str]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>The full CAA analysis bundle used by the monograph.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>CaaAnalysis</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## normalize_dns_name
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def normalize_dns_name(value: str) -&gt; str:
    value = value.strip()
    if value.upper().startswith(&quot;DNS:&quot;):
        return ct_dns_utils.normalize_name(value[4:])
    return ct_dns_utils.normalize_name(value)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Strips surrounding whitespace and an optional `DNS:` prefix from a SAN entry, then hands the rest to `ct_dns_utils.normalize_name` so every name is compared in one consistent form.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>normalize_dns_name</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
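
The prefix handling above can be tried in isolation. This is a minimal sketch: `ct_dns_utils.normalize_name` is not shown on this page, so `normalize_name_stub` is a hypothetical stand-in that assumes the real helper lowercases the name and drops a trailing dot.

```python
def normalize_name_stub(value: str) -> str:
    # Hypothetical stand-in for ct_dns_utils.normalize_name (assumed behavior).
    return value.strip().rstrip(".").lower()


def normalize_dns_name(value: str) -> str:
    value = value.strip()
    # SAN entries may arrive as "DNS:example.com"; drop the 4-character prefix.
    if value.upper().startswith("DNS:"):
        return normalize_name_stub(value[4:])
    return normalize_name_stub(value)


print(normalize_dns_name("DNS:WWW.Example.COM."))
print(normalize_dns_name("  api.example.com  "))
```

Both spellings of the same host end up as one key, which is what lets later blocks group SAN entries reliably.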
## issuer_family
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def issuer_family(names: set[str]) -&gt; str:
    lowered = &quot; &quot;.join(sorted(names)).lower()
    if &quot;amazon&quot; in lowered:
        return &quot;Amazon&quot;
    if &quot;google trust services&quot; in lowered or &quot;cn=we1&quot; in lowered:
        return &quot;Google Trust Services&quot;
    if &quot;sectigo&quot; in lowered or &quot;comodo&quot; in lowered:
        return &quot;Sectigo/COMODO&quot;
    if any(token in lowered for token in [&quot;digicert&quot;, &quot;quovadis&quot;, &quot;thawte&quot;, &quot;geotrust&quot;, &quot;rapidssl&quot;, &quot;symantec&quot;, &quot;verisign&quot;]):
        return &quot;DigiCert/QuoVadis&quot;
    return &quot;Other&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Maps a certificate's issuer names to one coarse CA family label (Amazon, Google Trust Services, Sectigo/COMODO, DigiCert/QuoVadis, or Other) by case-insensitive substring matching. The checks run in a fixed order, so the first match wins.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>issuer_family</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
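
To see the substring matching in action, the function can be copied into a scratch file and fed a few issuer strings. The issuer strings below are illustrative assumptions, not values taken from a real scan.

```python
def issuer_family(names: set[str]) -> str:
    # Join all issuer names into one lowercase string, then match by substring.
    lowered = " ".join(sorted(names)).lower()
    if "amazon" in lowered:
        return "Amazon"
    if "google trust services" in lowered or "cn=we1" in lowered:
        return "Google Trust Services"
    if "sectigo" in lowered or "comodo" in lowered:
        return "Sectigo/COMODO"
    if any(token in lowered for token in ["digicert", "quovadis", "thawte", "geotrust", "rapidssl", "symantec", "verisign"]):
        return "DigiCert/QuoVadis"
    return "Other"


# Illustrative issuer strings (assumed for this example).
samples = [
    {"C=US, O=Amazon, CN=Amazon RSA 2048 M02"},
    {"C=US, O=Google Trust Services, CN=WE1"},
    {"C=US, O=DigiCert Inc, CN=Thawte TLS RSA CA G1"},
    {"C=US, O=Let's Encrypt, CN=R11"},
]
for sample in samples:
    print(issuer_family(sample))
```

With the code exactly as shown on this page, a Let's Encrypt issuer falls through to `Other`, while `allowed_ca_families` further down can return `Let's Encrypt`. Those two labels never match each other, which is worth keeping in mind when reading mismatch results.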
## classify_zone
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def classify_zone(name: str, configured_domains: list[str]) -&gt; str:
    for domain in sorted(configured_domains, key=len, reverse=True):
        lowered_domain = domain.lower()
        if name == lowered_domain or name.endswith(f&quot;.{lowered_domain}&quot;):
            return lowered_domain
    return &quot;other&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
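<p><strong>What this block is doing</strong></p><p>Assigns a DNS name to the configured domain it belongs to, trying the longest configured domain first so the most specific zone wins. Names outside every configured domain are labeled `other`.</p>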
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>classify_zone</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
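
A short experiment shows why the domains are sorted longest first and why the match requires a leading dot. The domain names are placeholders chosen for the example.

```python
def classify_zone(name: str, configured_domains: list[str]) -> str:
    # Longest configured domain first, so the most specific zone wins.
    for domain in sorted(configured_domains, key=len, reverse=True):
        lowered_domain = domain.lower()
        # Exact match, or a true subdomain (note the leading dot).
        if name == lowered_domain or name.endswith(f".{lowered_domain}"):
            return lowered_domain
    return "other"


zones = ["example.com", "shop.example.com"]  # placeholder configuration
print(classify_zone("pay.shop.example.com", zones))
print(classify_zone("www.example.com", zones))
print(classify_zone("notexample.com", zones))
```

The last call lands in `other` because `notexample.com` merely ends with the same letters as `example.com`; it is not a subdomain of it.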
## cache_path
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def cache_path(cache_dir: Path, name: str) -&gt; Path:
    return cache_dir / ct_dns_utils.cache_key(f&quot;caa-{name}&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Builds the cache file path for one name's CAA lookup by passing a `caa-` prefixed key to `ct_dns_utils.cache_key` and joining the result onto the cache directory.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>cache_path</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## serialize_observation
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def serialize_observation(observation: CaaObservation) -&gt; dict[str, Any]:
    return {
        &quot;name&quot;: observation.name,
        &quot;effective_rr_owner&quot;: observation.effective_rr_owner,
        &quot;source_kind&quot;: observation.source_kind,
        &quot;source_label&quot;: observation.source_label,
        &quot;aliases_seen&quot;: observation.aliases_seen,
        &quot;caa_rows&quot;: [list(row) for row in observation.caa_rows],
    }</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Converts a `CaaObservation` into a plain dictionary so it can be written to the JSON cache. Each CAA row tuple becomes a list along the way.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>serialize_observation</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## deserialize_observation
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def deserialize_observation(payload: dict[str, Any]) -&gt; CaaObservation:
    return CaaObservation(
        name=payload[&quot;name&quot;],
        effective_rr_owner=payload.get(&quot;effective_rr_owner&quot;),
        source_kind=payload[&quot;source_kind&quot;],
        source_label=payload.get(&quot;source_label&quot;),
        aliases_seen=list(payload.get(&quot;aliases_seen&quot;, [])),
        caa_rows=[(int(flag), str(tag), str(value)) for flag, tag, value in payload.get(&quot;caa_rows&quot;, [])],
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Rebuilds a `CaaObservation` from a cached dictionary, restoring each CAA row to an `(int, str, str)` tuple and falling back to `None` or an empty list when an optional field is missing.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>deserialize_observation</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
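
The serialize and deserialize pair exists because JSON has no tuple type. The round trip below is a sketch that assumes the cache stores plain JSON text (suggested by the `load_json_cache` and `store_json_cache` helper names); the sample observation is made up.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any


@dataclass
class CaaObservation:
    name: str
    effective_rr_owner: str | None
    source_kind: str
    source_label: str | None
    aliases_seen: list[str]
    caa_rows: list[tuple[int, str, str]]


def serialize_observation(observation: CaaObservation) -> dict[str, Any]:
    return {
        "name": observation.name,
        "effective_rr_owner": observation.effective_rr_owner,
        "source_kind": observation.source_kind,
        "source_label": observation.source_label,
        "aliases_seen": observation.aliases_seen,
        # Tuples become lists so the payload is plain JSON-friendly data.
        "caa_rows": [list(row) for row in observation.caa_rows],
    }


def deserialize_observation(payload: dict[str, Any]) -> CaaObservation:
    return CaaObservation(
        name=payload["name"],
        effective_rr_owner=payload.get("effective_rr_owner"),
        source_kind=payload["source_kind"],
        source_label=payload.get("source_label"),
        aliases_seen=list(payload.get("aliases_seen", [])),
        # Lists read back from JSON are turned into typed tuples again.
        caa_rows=[(int(flag), str(tag), str(value)) for flag, tag, value in payload.get("caa_rows", [])],
    )


# Made-up observation used only for this round trip.
original = CaaObservation(
    name="www.example.com",
    effective_rr_owner="example.com",
    source_kind="parent",
    source_label="example.com",
    aliases_seen=[],
    caa_rows=[(0, "issue", "amazon.com")],
)
text = json.dumps(serialize_observation(original))
restored = deserialize_observation(json.loads(text))
print(restored == original)
```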
## parse_caa_response
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def parse_caa_response(lines: list[str]) -&gt; tuple[list[tuple[int, str, str]], list[str]]:
    rows: list[tuple[int, str, str]] = []
    aliases: list[str] = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[0].isdigit():
            flag, tag, value = parts
            rows.append((int(flag), tag.lower(), value.strip().strip(&#x27;&quot;&#x27;).lower()))
        elif line.endswith(&quot;.&quot;):
            aliases.append(ct_dns_utils.normalize_name(line))
    return rows, aliases</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Splits the lines returned by the CAA query into two lists: CAA records as `(flag, tag, value)` tuples, with tag and value lowercased and surrounding quotes removed, and alias names, recognized as lines that end in a dot.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>parse_caa_response</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
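
Feeding the parser a few hand-written lines makes the two output lists concrete. The input below imitates short-form `dig` output for a name that is an alias; it is fabricated for illustration, and `normalize_name_stub` is a hypothetical replacement for `ct_dns_utils.normalize_name` with assumed behavior.

```python
def normalize_name_stub(value: str) -> str:
    # Hypothetical stand-in for ct_dns_utils.normalize_name (assumed behavior).
    return value.strip().rstrip(".").lower()


def parse_caa_response(lines: list[str]) -> tuple[list[tuple[int, str, str]], list[str]]:
    rows: list[tuple[int, str, str]] = []
    aliases: list[str] = []
    for line in lines:
        # A CAA record line looks like: <flag> <tag> "<value>"
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[0].isdigit():
            flag, tag, value = parts
            rows.append((int(flag), tag.lower(), value.strip().strip('"').lower()))
        elif line.endswith("."):
            # Anything else that ends in a dot is treated as an alias target.
            aliases.append(normalize_name_stub(line))
    return rows, aliases


# Fabricated sample lines, shaped like short-form dig output.
sample_lines = [
    "alias.example.net.",
    '0 issue "Amazon.com"',
    '0 iodef "mailto:security@example.com"',
]
rows, aliases = parse_caa_response(sample_lines)
print(rows)
print(aliases)
```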
## query_caa_lines
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def query_caa_lines(name: str) -&gt; list[str]:
    output = ct_dns_utils.run_dig(name, &quot;CAA&quot;, short=True)
    return [line.strip() for line in output.splitlines() if line.strip()]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Runs a short-form `dig` CAA query for one name through `ct_dns_utils.run_dig` and returns the non-empty output lines with surrounding whitespace stripped.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>query_caa_lines</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## relevant_caa_live
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def relevant_caa_live(name: str) -&gt; CaaObservation:
    labels = name.rstrip(&quot;.&quot;).lower().split(&quot;.&quot;)
    for index in range(len(labels)):
        candidate = &quot;.&quot;.join(labels[index:])
        rows, aliases = parse_caa_response(query_caa_lines(candidate))
        if rows:
            if index == 0:
                source_kind = &quot;alias_target&quot; if aliases else &quot;exact&quot;
            else:
                source_kind = &quot;parent_alias_target&quot; if aliases else &quot;parent&quot;
            return CaaObservation(
                name=name,
                effective_rr_owner=candidate,
                source_kind=source_kind,
                source_label=aliases[-1] if aliases else candidate,
                aliases_seen=aliases,
                caa_rows=rows,
            )
    return CaaObservation(
        name=name,
        effective_rr_owner=None,
        source_kind=&quot;none&quot;,
        source_label=None,
        aliases_seen=[],
        caa_rows=[],
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Finds the effective live CAA for one name, including inheritance and alias behavior.</p>
<p><strong>Flow arrows</strong></p><p>One DNS name from the SAN universe. &#8594; <strong>relevant_caa_live</strong> &#8594; `build_analysis` uses this to learn the effective issuance policy per name.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
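
The inheritance walk can be explored without touching live DNS. The sketch below keeps the same climbing loop but takes the lookup as a parameter; `find_effective_caa`, `fake_lookup`, and `fake_zone` are hypothetical names introduced for this example, not part of the project.

```python
from __future__ import annotations

from typing import Callable

Rows = list[tuple[int, str, str]]


def find_effective_caa(name: str, lookup: Callable[[str], tuple[Rows, list[str]]]) -> tuple[str | None, str]:
    """Return (owner of the effective CAA set, source kind) for one name."""
    labels = name.rstrip(".").lower().split(".")
    # Try the full name first, then drop the leftmost label each round.
    for index in range(len(labels)):
        candidate = ".".join(labels[index:])
        rows, aliases = lookup(candidate)
        if rows:
            if index == 0:
                return candidate, "alias_target" if aliases else "exact"
            return candidate, "parent_alias_target" if aliases else "parent"
    return None, "none"


# Fake zone data: only the apex publishes a CAA record set.
fake_zone = {"example.com": ([(0, "issue", "amazon.com")], [])}


def fake_lookup(candidate: str) -> tuple[Rows, list[str]]:
    return fake_zone.get(candidate, ([], []))


print(find_effective_caa("a.b.example.com.", fake_lookup))
print(find_effective_caa("example.com", fake_lookup))
print(find_effective_caa("example.org", fake_lookup))
```

The first non-empty answer ends the walk, so a record set on a closer name always shadows one further up the tree. As in the original loop, the climb continues all the way to the single-label suffix before giving up.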
## scan_name_cached
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def scan_name_cached(name: str, cache_dir: Path, ttl_seconds: int) -&gt; CaaObservation:
    key = cache_path(cache_dir, name).name
    cached = ct_dns_utils.load_json_cache(cache_dir, key, ttl_seconds)
    if cached is not None:
        cached.pop(&quot;cached_at&quot;, None)
        return deserialize_observation(cached)
    observation = relevant_caa_live(name)
    ct_dns_utils.store_json_cache(cache_dir, key, serialize_observation(observation))
    return observation</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the CAA observation for one name. It reuses a cached result when `ct_dns_utils.load_json_cache` returns one, and otherwise runs the live lookup and stores the result for the next run.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>scan_name_cached</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
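
The cache-then-compute pattern can be sketched with the standard library alone. The real `ct_dns_utils.load_json_cache` and `store_json_cache` are not shown on this page, so `load_cached`, `store_cached`, and `cached_lookup` below are hypothetical stand-ins. The only detail taken from the source is that a cached payload carries a `cached_at` field; the file naming and expiry rule are assumptions.

```python
from __future__ import annotations

import json
import tempfile
import time
from pathlib import Path
from typing import Any, Callable


def load_cached(cache_dir: Path, key: str, ttl_seconds: int) -> dict[str, Any] | None:
    # Assumed behavior: return the payload only if it is younger than the TTL.
    path = cache_dir / key
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    if time.time() - payload.get("cached_at", 0) > ttl_seconds:
        return None
    return payload


def store_cached(cache_dir: Path, key: str, payload: dict[str, Any]) -> None:
    # Assumed behavior: stamp the payload with the time it was written.
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamped = {**payload, "cached_at": time.time()}
    (cache_dir / key).write_text(json.dumps(stamped), encoding="utf-8")


def cached_lookup(name: str, cache_dir: Path, ttl_seconds: int, compute: Callable[[str], dict[str, Any]]) -> dict[str, Any]:
    key = f"caa-{name}.json"  # hypothetical key format
    cached = load_cached(cache_dir, key, ttl_seconds)
    if cached is not None:
        cached.pop("cached_at", None)  # bookkeeping field, not part of the result
        return cached
    result = compute(name)
    store_cached(cache_dir, key, result)
    return result


calls: list[str] = []


def fake_lookup(name: str) -> dict[str, Any]:
    calls.append(name)  # record every "live" lookup
    return {"name": name, "source_kind": "none"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_lookup("www.example.com", Path(tmp), 3600, fake_lookup)
    second = cached_lookup("www.example.com", Path(tmp), 3600, fake_lookup)
    print(first == second, len(calls))
```

The second call is answered from disk, so the expensive lookup runs only once per name within the TTL window.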
## allowed_ca_families
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def allowed_ca_families(caa_rows: list[tuple[int, str, str]]) -&gt; list[str]:
    families: set[str] = set()
    for _flag, tag, value in caa_rows:
        if tag != &quot;issue&quot;:
            continue
        normalized = value[:-1] if value.endswith(&quot;.&quot;) else value
        if any(token in normalized for token in [&quot;amazon.com&quot;, &quot;amazontrust.com&quot;, &quot;awstrust.com&quot;, &quot;amazonaws.com&quot;, &quot;aws.amazon.com&quot;]):
            families.add(&quot;Amazon&quot;)
        if any(token in normalized for token in [&quot;sectigo.com&quot;, &quot;comodoca.com&quot;, &quot;comodo.com&quot;]):
            families.add(&quot;Sectigo/COMODO&quot;)
        if any(token in normalized for token in [&quot;digicert.com&quot;, &quot;digicert.ne.jp&quot;, &quot;thawte.com&quot;, &quot;geotrust.com&quot;, &quot;rapidssl.com&quot;, &quot;symantec.com&quot;, &quot;quovadisglobal.com&quot;, &quot;digitalcertvalidation.com&quot;]):
            families.add(&quot;DigiCert/QuoVadis&quot;)
        if &quot;pki.goog&quot; in normalized:
            families.add(&quot;Google Trust Services&quot;)
        if &quot;letsencrypt.org&quot; in normalized:
            families.add(&quot;Let&#x27;s Encrypt&quot;)
        if any(token in normalized for token in [&quot;telia.com&quot;, &quot;telia.fi&quot;, &quot;telia.se&quot;]):
            families.add(&quot;Telia&quot;)
    return sorted(families)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Translates the `issue` values of a CAA record set into CA family labels by substring matching on known CA domains. Rows with any other tag, including `issuewild` and `iodef`, are skipped here.</p>
<p><strong>Flow arrows</strong></p><p>Raw CAA rows for one effective policy. &#8594; <strong>allowed_ca_families</strong> &#8594; `build_analysis` uses the normalized families for policy-vs-live comparison.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
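
The same idea can be written in a table-driven form, which makes it easier to see which rows count. This is a condensed sketch, not the project's function: `FAMILY_TOKENS` and `allowed_families_sketch` are hypothetical names, and only a subset of the token lists is included.

```python
# Subset of the family-to-domain mapping, restructured as a lookup table.
FAMILY_TOKENS = {
    "Amazon": ["amazon.com", "amazontrust.com", "awstrust.com", "amazonaws.com"],
    "Google Trust Services": ["pki.goog"],
    "Let's Encrypt": ["letsencrypt.org"],
}


def allowed_families_sketch(caa_rows: list[tuple[int, str, str]]) -> list[str]:
    families: set[str] = set()
    for _flag, tag, value in caa_rows:
        if tag != "issue":
            continue  # issuewild and iodef rows do not contribute
        normalized = value[:-1] if value.endswith(".") else value
        for family, tokens in FAMILY_TOKENS.items():
            if any(token in normalized for token in tokens):
                families.add(family)
    return sorted(families)


rows = [
    (0, "issue", "amazon.com"),
    (0, "issue", "letsencrypt.org; validationmethods=dns-01"),
    (0, "issuewild", "pki.goog"),
    (0, "iodef", "mailto:security@example.com"),
]
print(allowed_families_sketch(rows))
print(allowed_families_sketch([(0, "issue", ";")]))
```

An `issue ";"` record produces an empty list here. Since `build_analysis` as shown only reports a mismatch when the allowed list is non-empty, it treats that record the same way as a name with no CAA policy at all.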
## issue_values
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def issue_values(caa_rows: list[tuple[int, str, str]], tag: str) -&gt; list[str]:
    return sorted({value for _flag, row_tag, value in caa_rows if row_tag == tag})</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the sorted, de-duplicated values of every CAA row whose tag matches the requested one. Despite its name, it is used for `issue`, `issuewild`, and `iodef` alike.</p>
<p><strong>Flow arrows</strong></p><p>Earlier blocks or operator input feed this block. &#8594; <strong>issue_values</strong> &#8594; Later blocks in the same file or in the next analytical stage consume its output.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
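
A quick call per tag shows the filtering and de-duplication. The rows are invented for the example.

```python
def issue_values(caa_rows: list[tuple[int, str, str]], tag: str) -> list[str]:
    # A set comprehension removes duplicates; sorted() makes the order stable.
    return sorted({value for _flag, row_tag, value in caa_rows if row_tag == tag})


# Invented rows, including one duplicate issue entry.
rows = [
    (0, "issue", "letsencrypt.org"),
    (0, "issue", "amazon.com"),
    (0, "issue", "amazon.com"),
    (0, "issuewild", ";"),
]
print(issue_values(rows, "issue"))
print(issue_values(rows, "issuewild"))
print(issue_values(rows, "iodef"))
```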
## build_analysis
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def build_analysis(
    hits: list[ct_scan.CertificateHit],
    configured_domains: list[str],
    cache_dir: Path,
    ttl_seconds: int,
) -&gt; CaaAnalysis:
    names = sorted(
        {
            normalize_dns_name(entry)
            for hit in hits
            for entry in hit.san_entries
            if normalize_dns_name(entry)
        }
    )
    coverage: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for hit in hits:
        family = issuer_family(hit.issuer_names)
        subject_cn = normalize_dns_name(hit.subject_cn)
        for entry in hit.san_entries:
            coverage[normalize_dns_name(entry)].append((subject_cn, family))
    rows: list[CaaNameRow] = []
    for name in names:
        observation = scan_name_cached(name, cache_dir, ttl_seconds)
        allowed_families = allowed_ca_families(observation.caa_rows)
        current_families = sorted({family for _subject, family in coverage[name]})
        mismatch_families = sorted(family for family in current_families if allowed_families and family not in allowed_families)
        rows.append(
            CaaNameRow(
                name=name,
                zone=classify_zone(name, configured_domains),
                source_kind=observation.source_kind,
                effective_rr_owner=observation.effective_rr_owner,
                source_label=observation.source_label,
                aliases_seen=observation.aliases_seen,
                issue_values=issue_values(observation.caa_rows, &quot;issue&quot;),
                issuewild_values=issue_values(observation.caa_rows, &quot;issuewild&quot;),
                iodef_values=issue_values(observation.caa_rows, &quot;iodef&quot;),
                allowed_ca_families=allowed_families,
                current_covering_families=current_families,
                current_covering_subject_cns=sorted({subject for subject, _family in coverage[name]}),
                current_covering_cert_count=len(coverage[name]),
                current_multi_family_overlap=len(current_families) &gt; 1,
                current_policy_mismatch=bool(mismatch_families),
                mismatch_families=mismatch_families,
            )
        )
    return CaaAnalysis(
        generated_at_utc=ct_scan.utc_iso(datetime.now(UTC)),
        configured_domains=sorted(configured_domains),
        total_names=len(rows),
        rows=rows,
        source_kind_counts=Counter(row.source_kind for row in rows),
        zone_counts=Counter(row.zone for row in rows),
        multi_family_overlap_names=sorted(row.name for row in rows if row.current_multi_family_overlap),
        policy_mismatch_names=sorted(row.name for row in rows if row.current_policy_mismatch),
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
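<p><strong>What this block is doing</strong></p><p>Looks up the CAA policy for every distinct SAN name in the current hits, then compares the CA families that policy allows with the CA families that actually issued the covering certificates. Each name becomes one row that records the policy, the covering certificates, whether more than one CA family covers the name, and whether any covering family falls outside the policy.</p>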
<p><strong>Flow arrows</strong></p><p>Current certificate hits and the configured zones. &#8594; <strong>build_analysis</strong> &#8594; The monograph uses this for the CAA chapter and appendix.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
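
The mismatch rule inside `build_analysis` is a single comprehension, and it is easy to misread. The sketch below isolates it in a hypothetical helper named `mismatch_families`; the helper and the CA family labels are assumptions made for illustration, while the comprehension itself is copied from the listing.

```python
def mismatch_families(allowed: list[str], current: list[str]) -> list[str]:
    # Hypothetical helper wrapping one line of build_analysis.
    # An empty `allowed` list means "no CAA restriction", so nothing can mismatch.
    return sorted(family for family in current if allowed and family not in allowed)


# Policy allows only one family, but two families currently cover the name.
restricted = mismatch_families(["DigiCert"], ["DigiCert", "Let's Encrypt"])

# No restriction published: every covering family is acceptable.
unrestricted = mismatch_families([], ["DigiCert", "Let's Encrypt"])

# Policy allows more families than are in use: still compliant.
compliant = mismatch_families(["DigiCert", "Let's Encrypt"], ["DigiCert"])

print(restricted, unrestricted, compliant)
```

The `allowed and ...` guard is what keeps names without any CAA restriction out of the mismatch list.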
## rows_for_zone
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def rows_for_zone(analysis: CaaAnalysis, zone: str) -&gt; list[CaaNameRow]:
    return [row for row in analysis.rows if row.zone == zone]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Filters the full analysis down to one configured DNS zone.</p>
<p><strong>Flow arrows</strong></p><p>The full CAA analysis bundle. &#8594; <strong>rows_for_zone</strong> &#8594; The monograph uses zone-filtered rows for per-zone policy tables.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## policy_counter
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def policy_counter(rows: list[CaaNameRow]) -&gt; Counter[tuple[str, ...]]:
    counter: Counter[tuple[str, ...]] = Counter()
    for row in rows:
        key = tuple(row.allowed_ca_families) if row.allowed_ca_families else (&quot;UNRESTRICTED&quot;,)
        counter[key] += 1
    return counter</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Counts how many names share each CAA policy. The key is the tuple of allowed CA families; names with no allowed families are grouped under the single key `(&quot;UNRESTRICTED&quot;,)`.</p>
<p><strong>Flow arrows</strong></p><p>A list of `CaaNameRow` objects, for example the output of `rows_for_zone`. &#8594; <strong>policy_counter</strong> &#8594; A `Counter` keyed by policy tuple that later report stages can read.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
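
A small run shows how `policy_counter` groups names. The `CaaNameRow` class below is a deliberately reduced, hypothetical stand-in with only the two fields the function touches (the real dataclass has many more), and the host names and CA families are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CaaNameRow:
    # Hypothetical stand-in: only the fields policy_counter needs.
    name: str
    allowed_ca_families: list[str] = field(default_factory=list)


def policy_counter(rows: list[CaaNameRow]) -> Counter[tuple[str, ...]]:
    counter: Counter[tuple[str, ...]] = Counter()
    for row in rows:
        # Lists are not hashable, so the policy is turned into a tuple key.
        key = tuple(row.allowed_ca_families) if row.allowed_ca_families else ("UNRESTRICTED",)
        counter[key] += 1
    return counter


rows = [
    CaaNameRow("a.example.com", ["DigiCert"]),
    CaaNameRow("b.example.com", ["DigiCert"]),
    CaaNameRow("c.example.com"),  # no CAA restriction
    CaaNameRow("d.example.com", ["DigiCert", "Let's Encrypt"]),
]
counts = policy_counter(rows)
for policy, total in counts.most_common():
    print(", ".join(policy), total)
```

Because the key is the whole tuple, a name that allows two families is counted separately from names that allow only one of them.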
## serialize_analysis
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def serialize_analysis(analysis: CaaAnalysis) -&gt; dict[str, Any]:
    return {
        &quot;generated_at_utc&quot;: analysis.generated_at_utc,
        &quot;configured_domains&quot;: analysis.configured_domains,
        &quot;total_names&quot;: analysis.total_names,
        &quot;rows&quot;: [asdict(row) for row in analysis.rows],
        &quot;source_kind_counts&quot;: dict(analysis.source_kind_counts),
        &quot;zone_counts&quot;: dict(analysis.zone_counts),
        &quot;multi_family_overlap_names&quot;: analysis.multi_family_overlap_names,
        &quot;policy_mismatch_names&quot;: analysis.policy_mismatch_names,
    }</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Converts the `CaaAnalysis` result into a plain dictionary of built-in types: each row dataclass becomes a dict via `asdict`, and each `Counter` becomes a regular `dict`. That shape is suitable for writing out as JSON.</p>
<p><strong>Flow arrows</strong></p><p>The finished `CaaAnalysis` from `build_analysis`. &#8594; <strong>serialize_analysis</strong> &#8594; A plain dictionary that later stages can save or pass on without knowing the dataclass types.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>

View file

@ -0,0 +1,501 @@
# ct_dns_utils.py
Source file: [`ct_dns_utils.py`](../ct_dns_utils.py)
Public DNS scanner. This file runs dig, follows alias chains, finds public addresses, and collapses raw DNS evidence into readable delivery labels.
Main flow in one line: `DNS name -> dig answers -> normalized observation -> provider hints -> delivery label`
How to read this page:
- left side: the actual source code block
- right side: a plain-English explanation for a beginner
- read from top to bottom because later blocks depend on earlier ones
## Module setup
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">#!/usr/bin/env python3
from __future__ import annotations
import hashlib
import ipaddress
import json
import re
import subprocess
import time
from dataclasses import asdict, dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
import ct_scan</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Shared DNS scanning helpers, cache helpers, and the logic that turns raw DNS answers into platform clues.</p>
<p><strong>Flow arrows</strong></p><p>Nothing yet; this is the starting point. &#8594; <strong>Module setup</strong> &#8594; The later DNS helpers all reuse these imports and small shared helpers.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## DnsObservation
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class DnsObservation:
    original_name: str
    original_status: str
    cname_chain: list[str]
    terminal_name: str
    terminal_status: str
    a_records: list[str]
    aaaa_records: list[str]
    ptr_records: list[str]
    classification: str
    stack_signature: str
    provider_hints: list[str]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One complete DNS observation for one hostname: the name that was asked about, the alias chain that was followed, the addresses found at the end, and the labels derived from them.</p>
<p><strong>Flow arrows</strong></p><p>`scan_name_live` fills in the fields. &#8594; <strong>DnsObservation</strong> &#8594; `infer_provider_hints`, `infer_stack_signature`, and `scan_name_cached` read or rebuild it.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## normalize_name
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def normalize_name(name: str) -&gt; str:
    return name.rstrip(&quot;.&quot;).lower()</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Strips trailing dots and lowercases the name, so `WWW.Example.COM.` and `www.example.com` compare as equal during matching and grouping.</p>
<p><strong>Flow arrows</strong></p><p>Any raw hostname or dig answer line. &#8594; <strong>normalize_name</strong> &#8594; `dig_short`, `parse_answer_section`, `scan_name_live`, and `ptr_lookup` all normalize through it.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## cache_key
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def cache_key(value: str) -&gt; str:
    digest = hashlib.sha256(value.encode(&quot;utf-8&quot;)).hexdigest()[:16]
    slug = re.sub(r&quot;[^a-z0-9.-]+&quot;, &quot;-&quot;, value.lower()).strip(&quot;-&quot;)
    slug = slug[:80] or &quot;item&quot;
    return f&quot;v1-{slug}-{digest}.json&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Builds a filesystem-safe cache filename from any string: a `v1-` version prefix, a readable slug of at most 80 characters, and the first 16 hex digits of the SHA-256 digest, so two values that produce the same slug still get different files.</p>
<p><strong>Flow arrows</strong></p><p>A DNS name or a `ptr-&lt;ip&gt;` string. &#8594; <strong>cache_key</strong> &#8594; `scan_name_cached` and `ptr_lookup` pass the filename to `load_json_cache` and `store_json_cache`.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
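
The shape of the filenames `cache_key` produces is easier to see with a few inputs. The function below is copied from the listing; the input strings are arbitrary examples. The digest part is not shown as literal output here because it depends on the SHA-256 of each input.

```python
import hashlib
import re


def cache_key(value: str) -> str:
    # The digest is computed from the value exactly as given (case-sensitive).
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
    # The slug is lowercased and anything outside a-z, 0-9, ".", "-" becomes "-".
    slug = re.sub(r"[^a-z0-9.-]+", "-", value.lower()).strip("-")
    slug = slug[:80] or "item"
    return f"v1-{slug}-{digest}.json"


plain = cache_key("WWW.Example.com")
lowered = cache_key("www.example.com")
wildcard = cache_key("*.example.com")
symbols_only = cache_key("***")

for key in (plain, lowered, wildcard, symbols_only):
    print(key)
```

Two details are worth noticing. The slug is lowercased but the digest is not, so `WWW.Example.com` and `www.example.com` share a slug yet map to different cache files unless the caller normalizes the name first. And an input with no usable characters falls back to the slug `item`, so the filename is never empty.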
## load_json_cache
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def load_json_cache(cache_dir: Path, key: str, ttl_seconds: int) -&gt; dict[str, Any] | None:
    path = cache_dir / key
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding=&quot;utf-8&quot;))
    cached_at = datetime.fromisoformat(payload[&quot;cached_at&quot;].replace(&quot;Z&quot;, &quot;+00:00&quot;))
    age = time.time() - cached_at.astimezone(UTC).timestamp()
    if age &gt; ttl_seconds:
        return None
    return payload</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Reads one cache file and returns its payload only if the `cached_at` timestamp is no older than `ttl_seconds`. A missing or expired file returns `None`, which tells the caller to do the live lookup instead.</p>
<p><strong>Flow arrows</strong></p><p>A cache directory, a filename from `cache_key`, and a TTL. &#8594; <strong>load_json_cache</strong> &#8594; `scan_name_cached` and `ptr_lookup` check the result before running dig.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## store_json_cache
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def store_json_cache(cache_dir: Path, key: str, payload: dict[str, Any]) -&gt; None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    enriched = dict(payload)
    enriched[&quot;cached_at&quot;] = ct_scan.utc_iso(datetime.now(UTC))
    (cache_dir / key).write_text(json.dumps(enriched, indent=2, sort_keys=True), encoding=&quot;utf-8&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Creates the cache directory if needed, copies the payload, stamps the copy with a `cached_at` UTC timestamp, and writes it as sorted, indented JSON. The caller's dictionary is not modified.</p>
<p><strong>Flow arrows</strong></p><p>A result dictionary from a live lookup. &#8594; <strong>store_json_cache</strong> &#8594; `load_json_cache` reads the file back on a later run.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
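
The two cache helpers are meant to be used as a pair, so a round trip is the clearest illustration. This is a minimal sketch under stated assumptions: `utc_iso` is a guessed stand-in for `ct_scan.utc_iso` (which is not shown on this page), assumed to emit a `Z`-suffixed ISO timestamp; `timezone.utc` and `Optional` replace `datetime.UTC` and `| None` only so the sketch also runs on older Python versions; the file names and payload are invented.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional

UTC = timezone.utc  # same object the listing imports as datetime.UTC


def utc_iso(moment: datetime) -> str:
    # Assumed stand-in for ct_scan.utc_iso: second precision, "Z" suffix.
    return moment.astimezone(UTC).strftime("%Y-%m-%dT%H:%M:%SZ")


def store_json_cache(cache_dir: Path, key: str, payload: dict[str, Any]) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    enriched = dict(payload)  # copy, so the caller's dict stays untouched
    enriched["cached_at"] = utc_iso(datetime.now(UTC))
    (cache_dir / key).write_text(json.dumps(enriched, indent=2, sort_keys=True), encoding="utf-8")


def load_json_cache(cache_dir: Path, key: str, ttl_seconds: int) -> Optional[dict[str, Any]]:
    path = cache_dir / key
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    cached_at = datetime.fromisoformat(payload["cached_at"].replace("Z", "+00:00"))
    age = time.time() - cached_at.astimezone(UTC).timestamp()
    if age > ttl_seconds:
        return None
    return payload


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "dns-cache"
    original = {"answers": ["host.example.com"]}
    store_json_cache(cache_dir, "v1-demo.json", original)
    fresh = load_json_cache(cache_dir, "v1-demo.json", ttl_seconds=3600)
    expired = load_json_cache(cache_dir, "v1-demo.json", ttl_seconds=-1)
    absent = load_json_cache(cache_dir, "v1-other.json", ttl_seconds=3600)

print(fresh is not None, expired, absent)
```

The negative TTL is only a trick to force the "expired" branch without waiting. Callers such as `scan_name_cached` remove `cached_at` again before rebuilding their own objects from the payload.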
## run_dig
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def run_dig(name: str, rrtype: str, short: bool) -&gt; str:
    cmd = [&quot;dig&quot;, &quot;+time=2&quot;, &quot;+tries=1&quot;]
    if short:
        cmd.append(&quot;+short&quot;)
    else:
        cmd.extend([&quot;+noall&quot;, &quot;+comments&quot;, &quot;+answer&quot;])
    cmd.extend([name, rrtype])
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Builds and runs one `dig` command with a 2-second timeout and a single try. With `short=True` it asks for `+short` output (bare answers only); otherwise it asks for the comment header plus the answer section. Because `check=False` is used, a failing dig does not raise; the function returns whatever was written to standard output.</p>
<p><strong>Flow arrows</strong></p><p>A hostname and record type. &#8594; <strong>run_dig</strong> &#8594; `scan_name_live`, `dig_status`, and `dig_short` rely on this. `ptr_lookup` builds its own `dig -x` command instead.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## dig_status
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def dig_status(name: str, rrtype: str = &quot;A&quot;) -&gt; str:
    output = run_dig(name, rrtype, short=False)
    match = re.search(r&quot;status:\s*([A-Z]+)&quot;, output)
    if match:
        return match.group(1)
    if output.strip():
        return &quot;NOERROR&quot;
    return &quot;UNKNOWN&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Extracts the DNS response code (for example `NOERROR` or `NXDOMAIN`) from the `status:` field in dig's comment header. If no status field is found, any non-empty output is treated as `NOERROR` and empty output as `UNKNOWN`.</p>
<p><strong>Flow arrows</strong></p><p>A hostname and record type (default `A`). &#8594; <strong>dig_status</strong> &#8594; `scan_name_live` stores the result as the observation's status.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
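
The parsing half of `dig_status` can be tried without touching the network. The sketch below moves it into a hypothetical helper, `status_from_output`, that takes the text directly; the helper name and the sample strings are assumptions for illustration, while the three-way decision is the one from the listing.

```python
import re


def status_from_output(output: str) -> str:
    # Hypothetical helper: the same decision logic as dig_status, minus the dig call.
    match = re.search(r"status:\s*([A-Z]+)", output)
    if match:
        return match.group(1)
    if output.strip():
        return "NOERROR"  # some text came back, but no status field was found
    return "UNKNOWN"      # nothing came back at all


# Hand-written lines in the style of dig's comment header; not captured output.
nxdomain = status_from_output(";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345")
servfail = status_from_output(";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54321")
no_header = status_from_output("example.com. 300 IN A 192.0.2.10")
diagnostic = status_from_output(";; some diagnostic text without a status field")
empty = status_from_output("   \n")

print(nxdomain, servfail, no_header, diagnostic, empty)
```

The fourth case is the one to keep in mind when reading results: any non-empty text that lacks a `status:` field is reported as `NOERROR`, so that label does not by itself prove a successful answer.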
## dig_short
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def dig_short(name: str, rrtype: str) -&gt; list[str]:
    output = run_dig(name, rrtype, short=True)
    return [normalize_name(line) for line in output.splitlines() if line.strip()]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Runs dig in `+short` mode and returns one normalized entry per non-empty output line.</p>
<p><strong>Flow arrows</strong></p><p>A hostname and record type. &#8594; <strong>dig_short</strong> &#8594; Any caller that only needs the bare answer values, without TTLs or record types.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## parse_answer_section
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def parse_answer_section(output: str) -&gt; list[tuple[str, str]]:
    in_answer = False
    parsed: list[tuple[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(&quot;;; ANSWER SECTION:&quot;):
            in_answer = True
            continue
        if not in_answer or line.startswith(&quot;;;&quot;):
            continue
        match = re.match(r&quot;^\S+\s+\d+\s+IN\s+(\S+)\s+(.+)$&quot;, line)
        if not match:
            continue
        rrtype, rdata = match.groups()
        parsed.append((rrtype.upper(), normalize_name(rdata)))
    return parsed</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Walks dig's full output, skips everything before the `;; ANSWER SECTION:` marker, and turns each answer line of the form `name TTL IN TYPE data` into a `(TYPE, data)` pair. Comment lines and lines that do not match that shape are ignored.</p>
<p><strong>Flow arrows</strong></p><p>Full dig output from `run_dig(..., short=False)`. &#8594; <strong>parse_answer_section</strong> &#8594; `scan_name_live` splits the pairs into the CNAME chain and the address records.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
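
To see what `parse_answer_section` extracts, the sketch below runs it on a hand-written text that imitates dig's answer-section layout. Both functions are copied from the listings; the sample text, host names, and addresses are invented and are not captured dig output.

```python
import re


def normalize_name(name: str) -> str:
    return name.rstrip(".").lower()


def parse_answer_section(output: str) -> list[tuple[str, str]]:
    in_answer = False
    parsed: list[tuple[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        # Ignore everything before the marker and every comment line.
        if not in_answer or line.startswith(";;"):
            continue
        # Expected shape: owner, TTL, class IN, record type, record data.
        match = re.match(r"^\S+\s+\d+\s+IN\s+(\S+)\s+(.+)$", line)
        if not match:
            continue
        rrtype, rdata = match.groups()
        parsed.append((rrtype.upper(), normalize_name(rdata)))
    return parsed


sample_output = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242

;; ANSWER SECTION:
www.example.com.    300  IN  CNAME  Edge.Example.NET.
edge.example.net.   60   IN  A      192.0.2.10
;; Query time: 12 msec
"""

pairs = parse_answer_section(sample_output)
print(pairs)
```

The CNAME target comes back lowercased and without its trailing dot, which is what lets `scan_name_live` compare chain entries reliably.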
## is_ip_address
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def is_ip_address(value: str) -&gt; bool:
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns `True` if the string parses as an IPv4 or IPv6 address, using the standard-library `ipaddress` module, and `False` otherwise.</p>
<p><strong>Flow arrows</strong></p><p>One record value from a parsed answer. &#8594; <strong>is_ip_address</strong> &#8594; `scan_name_live` uses it to keep only real addresses in `a_records` and `aaaa_records`.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## classify_observation
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def classify_observation(chain: list[str], terminal_status: str, a_records: list[str], aaaa_records: list[str]) -&gt; str:
    has_addresses = bool(a_records or aaaa_records)
    if chain and has_addresses:
        return &quot;cname_to_address&quot;
    if chain and not has_addresses:
        return &quot;dangling_cname&quot;
    if has_addresses:
        return &quot;direct_address&quot;
    if terminal_status == &quot;NXDOMAIN&quot;:
        return &quot;nxdomain&quot;
    if terminal_status == &quot;NOERROR&quot;:
        return &quot;no_data&quot;
    return &quot;other&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Chooses one of six labels by checking, in order: a CNAME chain that ends in addresses (`cname_to_address`), a CNAME chain with no addresses (`dangling_cname`), addresses without a chain (`direct_address`), and then, when there is neither chain nor address, `nxdomain`, `no_data`, or `other` depending on the terminal status.</p>
<p><strong>Flow arrows</strong></p><p>The CNAME chain, terminal status, and address lists. &#8594; <strong>classify_observation</strong> &#8594; `scan_name_live` stores the label, and `infer_stack_signature` branches on it.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
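
Because `classify_observation` is a chain of ordered checks, one input per branch covers the whole function. The function is copied from the listing; the host names and addresses are placeholder values from documentation ranges.

```python
def classify_observation(chain: list[str], terminal_status: str, a_records: list[str], aaaa_records: list[str]) -> str:
    has_addresses = bool(a_records or aaaa_records)
    if chain and has_addresses:
        return "cname_to_address"
    if chain and not has_addresses:
        return "dangling_cname"
    if has_addresses:
        return "direct_address"
    if terminal_status == "NXDOMAIN":
        return "nxdomain"
    if terminal_status == "NOERROR":
        return "no_data"
    return "other"


# One (chain, status, A records, AAAA records) case per branch, in check order.
cases = [
    (["edge.example.net"], "NOERROR", ["192.0.2.10"], []),
    (["gone.example.net"], "NXDOMAIN", [], []),
    ([], "NOERROR", [], ["2001:db8::1"]),
    ([], "NXDOMAIN", [], []),
    ([], "NOERROR", [], []),
    ([], "SERVFAIL", [], []),
]
labels = [classify_observation(*case) for case in cases]
for case, label in zip(cases, labels):
    print(label, case)
```

Note the second case: once a CNAME chain exists and no address was found, the label is `dangling_cname` regardless of the status code, because the chain checks run before the status checks.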
## infer_provider_hints
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def infer_provider_hints(observation: DnsObservation) -&gt; list[str]:
    text = &quot; &quot;.join(
        [
            observation.original_name,
            *observation.cname_chain,
            observation.terminal_name,
            *observation.ptr_records,
        ]
    ).lower()
    hints: list[str] = []
    if &quot;campaign.adobe.com&quot; in text:
        hints.append(&quot;Adobe Campaign&quot;)
    if &quot;cloudfront.net&quot; in text:
        hints.append(&quot;AWS CloudFront&quot;)
    if &quot;elb.amazonaws.com&quot; in text or &quot;compute.amazonaws.com&quot; in text:
        hints.append(&quot;AWS&quot;)
    if &quot;apigee.net&quot; in text or &quot;googleusercontent.com&quot; in text:
        hints.append(&quot;Google Apigee&quot;)
    if &quot;pegacloud.net&quot; in text or &quot;.pega.net&quot; in text:
        hints.append(&quot;Pega Cloud&quot;)
    if &quot;useinfinite.io&quot; in text:
        hints.append(&quot;Infinite / agency alias&quot;)
    if any(ip.startswith(&quot;13.107.&quot;) for ip in observation.a_records) or any(ip.startswith(&quot;2620:1ec:&quot;) for ip in observation.aaaa_records):
        hints.append(&quot;Microsoft Edge&quot;)
    if not hints:
        hints.append(&quot;Unclassified&quot;)
    return hints</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Reads the raw DNS trail and pulls out likely platform or vendor clues.</p>
<p><strong>Flow arrows</strong></p><p>One normalized DNS observation. &#8594; <strong>infer_provider_hints</strong> &#8594; `infer_stack_signature` and the report layers use the hints it produces.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## infer_stack_signature
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def infer_stack_signature(observation: DnsObservation) -&gt; str:
    hints = infer_provider_hints(observation)
    if observation.classification == &quot;nxdomain&quot;:
        return &quot;No public DNS (NXDOMAIN)&quot;
    if observation.classification == &quot;no_data&quot;:
        return &quot;No public address data&quot;
    if &quot;Adobe Campaign&quot; in hints and &quot;AWS CloudFront&quot; in hints:
        return &quot;Adobe Campaign -&gt; AWS CloudFront&quot;
    if &quot;Adobe Campaign&quot; in hints and &quot;AWS&quot; in hints:
        return &quot;Adobe Campaign -&gt; AWS ALB&quot;
    if &quot;Adobe Campaign&quot; in hints and observation.a_records:
        return &quot;Adobe Campaign direct IP&quot;
    if &quot;AWS CloudFront&quot; in hints:
        return &quot;AWS CloudFront&quot;
    if &quot;Google Apigee&quot; in hints:
        return &quot;Google Apigee&quot;
    if &quot;Pega Cloud&quot; in hints and &quot;AWS&quot; in hints:
        return &quot;Pega Cloud -&gt; AWS ALB&quot;
    if &quot;Infinite / agency alias&quot; in hints and observation.classification == &quot;dangling_cname&quot;:
        return &quot;Dangling agency alias&quot;
    if &quot;Microsoft Edge&quot; in hints:
        return &quot;Direct Microsoft edge&quot;
    if &quot;AWS&quot; in hints:
        return &quot;Direct AWS&quot;
    if observation.classification == &quot;direct_address&quot;:
        return &quot;Direct address (provider unclear)&quot;
    if observation.classification == &quot;cname_to_address&quot;:
        return &quot;CNAME to address (provider unclear)&quot;
    return hints[0]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Collapses several low-level DNS clues into one human-readable delivery label.</p>
<p><strong>Flow arrows</strong></p><p>One DNS observation plus provider clues. &#8594; <strong>infer_stack_signature</strong> &#8594; `ct_master_report` uses the resulting label in naming and DNS chapters.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## scan_name_live
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def scan_name_live(name: str) -&gt; DnsObservation:
    name = normalize_name(name)
    a_output = run_dig(name, &quot;A&quot;, short=False)
    aaaa_output = run_dig(name, &quot;AAAA&quot;, short=False)
    original_status = dig_status(name, &quot;A&quot;)
    a_answers = parse_answer_section(a_output)
    aaaa_answers = parse_answer_section(aaaa_output)
    chain: list[str] = []
    for rrtype, rdata in a_answers + aaaa_answers:
        if rrtype == &quot;CNAME&quot; and rdata not in chain:
            chain.append(rdata)
    a_records = sorted({rdata for rrtype, rdata in a_answers if rrtype == &quot;A&quot; and is_ip_address(rdata)})
    aaaa_records = sorted({rdata for rrtype, rdata in aaaa_answers if rrtype == &quot;AAAA&quot; and is_ip_address(rdata)})
    terminal_name = chain[-1] if chain else name
    terminal_status = original_status
    observation = DnsObservation(
        original_name=name,
        original_status=original_status,
        cname_chain=chain,
        terminal_name=terminal_name,
        terminal_status=terminal_status,
        a_records=a_records,
        aaaa_records=aaaa_records,
        ptr_records=[],
        classification=classify_observation(chain, terminal_status, a_records, aaaa_records),
        stack_signature=&quot;&quot;,
        provider_hints=[],
    )
    observation.provider_hints = infer_provider_hints(observation)
    observation.stack_signature = infer_stack_signature(observation)
    return observation</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Runs the live DNS walk for one hostname.</p>
<p><strong>Flow arrows</strong></p><p>One DNS name from a SAN entry. &#8594; <strong>scan_name_live</strong> &#8594; `scan_name_cached` returns this result shape to higher-level analytics.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## scan_name_cached
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def scan_name_cached(name: str, cache_dir: Path, ttl_seconds: int) -&gt; DnsObservation:
    key = cache_key(name)
    cached = load_json_cache(cache_dir, key, ttl_seconds)
    if cached is not None:
        payload = dict(cached)
        payload.pop(&quot;cached_at&quot;, None)
        return DnsObservation(**payload)
    observation = scan_name_live(name)
    store_json_cache(cache_dir, key, asdict(observation))
    return observation</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Reuses a recent DNS result if possible, otherwise performs the live scan.</p>
<p><strong>Flow arrows</strong></p><p>A DNS name plus cache settings. &#8594; <strong>scan_name_cached</strong> &#8594; `ct_master_report.enrich_dns` uses this for every SAN name in the current corpus.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## ptr_lookup
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def ptr_lookup(ip: str, cache_dir: Path, ttl_seconds: int) -&gt; list[str]:
    key = cache_key(f&quot;ptr-{ip}&quot;)
    cached = load_json_cache(cache_dir, key, ttl_seconds)
    if cached is not None:
        return list(cached.get(&quot;answers&quot;, []))
    output = subprocess.run(
        [&quot;dig&quot;, &quot;+time=2&quot;, &quot;+tries=1&quot;, &quot;+short&quot;, &quot;-x&quot;, ip, &quot;PTR&quot;],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    answers = [normalize_name(line) for line in output.splitlines() if line.strip()]
    store_json_cache(cache_dir, key, {&quot;answers&quot;: answers})
    return answers</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Looks up the reverse-DNS (PTR) names for one IP address. A cached answer that is still within `ttl_seconds` is reused. Otherwise the function runs `dig -x` with a 2-second timeout and a single try, normalizes each non-empty output line, and stores the result in the cache.</p>
<p><strong>Flow arrows</strong></p><p>An IP address plus cache settings. &#8594; <strong>ptr_lookup</strong> &#8594; A list of normalized PTR hostnames, which may be empty.</p>
<p><strong>How to think about it</strong></p><p>This is the same cache-then-live pattern as `scan_name_cached`, applied to reverse lookups. Because `check=False` is set, a non-zero exit from `dig` does not raise an error: whatever `stdout` contains is parsed and cached, including an empty answer.</p>
</td>
</tr>
</table>
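
It can help to see what `dig -x` actually asks for. A reverse lookup queries the PTR record of a special name built from the IP address, and Python's standard `ipaddress` module can construct that name. The sketch below is illustrative only: `reverse_name` and `parse_short_output` are hypothetical helpers, and the trailing-dot and lower-casing cleanup is an assumption about what `normalize_name` might do, not a copy of it.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Return the DNS name whose PTR record a reverse lookup queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def parse_short_output(output: str) -> list[str]:
    """Clean `dig +short` style output: one answer per non-empty line.

    Assumption: answers are lower-cased and lose their trailing dot.
    """
    return [
        line.strip().rstrip(".").lower()
        for line in output.splitlines()
        if line.strip()
    ]


print(reverse_name("192.0.2.10"))  # -> 10.2.0.192.in-addr.arpa
print(parse_short_output("Host.Example.COM.\n\n"))  # -> ['host.example.com']
```

IPv6 addresses work the same way, except that the reverse name is built from individual hex digits and ends in `ip6.arpa`.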
## provider_explanations
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def provider_explanations() -&gt; dict[str, str]:
    return {
        &quot;Adobe Campaign&quot;: &quot;A marketing and communication platform often used to send customer messages, email journeys, and campaign traffic. In DNS terms, it can sit in front of cloud infrastructure rather than hosting the final application by itself.&quot;,
        &quot;AWS&quot;: &quot;Amazon Web Services, a large public cloud platform. In this report it usually means the endpoint ultimately lands on Amazon-hosted compute or load-balancing infrastructure.&quot;,
        &quot;AWS ALB&quot;: &quot;AWS Application Load Balancer. A traffic-distribution front door that sends incoming web requests to one or more backend services.&quot;,
        &quot;AWS CloudFront&quot;: &quot;Amazon&#x27;s global content-delivery and edge network. It is often used to front websites, APIs, and static assets close to users.&quot;,
        &quot;Google Apigee&quot;: &quot;An API gateway and API-management layer. If a hostname lands here, it usually means the public endpoint is being governed as an API product rather than being exposed directly from an application server.&quot;,
        &quot;Pega Cloud&quot;: &quot;A managed hosting platform for Pega applications and workflow systems. It often fronts case-management or process-heavy applications.&quot;,
        &quot;Microsoft Edge&quot;: &quot;Microsoft-operated edge infrastructure. In DNS this usually means the public name lands on Microsoft&#x27;s front-door network rather than directly on a private application host.&quot;,
        &quot;Infinite / agency alias&quot;: &quot;A third-party aliasing pattern typically used by an agency or service intermediary. It points traffic onward to the actual delivery platform.&quot;,
        &quot;CNAME&quot;: &quot;A DNS alias record. It says one hostname is really another hostname, rather than directly mapping to an IP address.&quot;,
        &quot;A record&quot;: &quot;A DNS record that maps a hostname to an IPv4 address.&quot;,
        &quot;AAAA record&quot;: &quot;A DNS record that maps a hostname to an IPv6 address.&quot;,
        &quot;PTR record&quot;: &quot;A reverse-DNS record. It maps an IP address back to a hostname and is useful as a provider clue, not as proof of ownership.&quot;,
        &quot;NXDOMAIN&quot;: &quot;A DNS response meaning the name does not exist publicly.&quot;,
    }</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Supplies the glossary text used later in the reports.</p>
<p><strong>Flow arrows</strong></p><p>The delivery labels used by the report. &#8594; <strong>provider_explanations</strong> &#8594; The monograph glossary uses these explanations directly.</p>
<p><strong>How to think about it</strong></p><p>This is a lookup table, not logic: nothing is computed here. To change how a provider or record type is described in the glossary, edit the text in this dictionary.</p>
</td>
</tr>
</table>

@ -0,0 +1,960 @@
# ct_focus_subjects.py
Source file: [`ct_focus_subjects.py`](../ct_focus_subjects.py)
Focused-cohort analyzer. This file takes your special hand-picked Subject CN list and compares it against the wider certificate and DNS estate.
Main flow in one line: `focus-subject file -> cohort entries -> compare against current and historical estate -> bucketed cohort explanation`
How to read this page:
- left side: the actual source code block
- right side: a plain-English explanation for a beginner
- read from top to bottom because later blocks depend on earlier ones
## Module setup
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">#!/usr/bin/env python3
from __future__ import annotations
import re
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from statistics import median
import ct_dns_utils
import ct_lineage_report
import ct_master_report
import ct_scan
ENVIRONMENT_HINTS = {
    &quot;alpha&quot;,
    &quot;beta&quot;,
    &quot;dev&quot;,
    &quot;qa&quot;,
    &quot;uat&quot;,
    &quot;sit&quot;,
    &quot;stage&quot;,
    &quot;stg&quot;,
    &quot;preprod&quot;,
    &quot;prod&quot;,
    &quot;release&quot;,
    &quot;squads&quot;,
    &quot;sandbox&quot;,
}
VENDOR_HINTS = {
    &quot;vendor&quot;,
    &quot;external&quot;,
    &quot;hoster&quot;,
    &quot;product&quot;,
    &quot;mitek&quot;,
    &quot;scrive&quot;,
    &quot;pega&quot;,
}
IDENTITY_HINTS = {
    &quot;id&quot;,
    &quot;idp&quot;,
    &quot;identity&quot;,
    &quot;auth&quot;,
    &quot;sso&quot;,
    &quot;online&quot;,
    &quot;mail&quot;,
    &quot;email&quot;,
    &quot;secmail&quot;,
    &quot;chat&quot;,
    &quot;appointment&quot;,
    &quot;appointments&quot;,
}
CUSTOMER_HINTS = {
    &quot;brand&quot;,
    &quot;branding&quot;,
    &quot;campaign&quot;,
    &quot;experience&quot;,
    &quot;welcome&quot;,
    &quot;thankyou&quot;,
    &quot;gifts&quot;,
    &quot;investment&quot;,
    &quot;client&quot;,
    &quot;customers&quot;,
    &quot;information&quot;,
    &quot;club&quot;,
    &quot;risk&quot;,
}</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Imports the sibling modules and defines four keyword sets (`ENVIRONMENT_HINTS`, `VENDOR_HINTS`, `IDENTITY_HINTS`, `CUSTOMER_HINTS`) that are used to guess what a focused Subject CN is for.</p>
<p><strong>Flow arrows</strong></p><p>No runtime input; these are constants. &#8594; <strong>Module setup</strong> &#8594; `analyst_theme` intersects these sets with the words found in each Subject CN and analyst note.</p>
<p><strong>How to think about it</strong></p><p>The sets are plain word lists. What matters is the order in which `analyst_theme` tests them: a name that matches both an environment hint and a customer hint is labeled as an environment anchor, because that set is checked first.</p>
</td>
</tr>
</table>
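
The way these hint sets are applied can be shown with a small sketch. It uses the same tokenizing regular expression and set intersection as `analyst_theme`, but with shortened hint sets and only two of the four checks. The names `tokens_of` and `first_matching_theme` and the `"unclassified"` fallback label are assumptions made for this example.

```python
import re

# Shortened copies of two hint sets, for illustration only.
ENVIRONMENT_HINTS = {"dev", "qa", "prod"}
CUSTOMER_HINTS = {"campaign", "welcome"}


def tokens_of(text: str) -> set[str]:
    """Split text into lower-case runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def first_matching_theme(text: str) -> str:
    """Return the theme of the first hint set that shares a token with the text."""
    tokens = tokens_of(text)
    if ENVIRONMENT_HINTS & tokens:
        return "environment or platform anchor"
    if CUSTOMER_HINTS & tokens:
        return "customer proposition or campaign front"
    return "unclassified"  # hypothetical fallback, not taken from the source


print(first_matching_theme("dev-campaign.example.com"))
print(first_matching_theme("development.example.com"))
```

Matching is on whole tokens, so `development` does not count as `dev`. The first call matches both sets and is still reported as an environment anchor because that check comes first.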
## FocusSubject
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class FocusSubject:
    subject_cn: str
    analyst_note: str</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One parsed line from the local focus-subject file: the lower-cased Subject CN and the optional analyst note that followed it in parentheses.</p>
<p><strong>Flow arrows</strong></p><p>`load_focus_subjects` builds one of these per accepted line. &#8594; <strong>FocusSubject</strong> &#8594; `analyst_theme` reads both fields to pick a theme.</p>
<p><strong>How to think about it</strong></p><p>This is a plain record with no behavior. It only gives the two pieces of a line stable names so later code does not have to re-parse text.</p>
</td>
</tr>
</table>
## FocusSubjectDetail
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class FocusSubjectDetail:
    subject_cn: str
    analyst_note: str
    analyst_theme: str
    taxonomy_bucket: str
    taxonomy_reason: str
    observed_role: str
    basket_status: str
    current_direct_certificates: int
    historical_direct_certificates: int
    current_non_focus_san_carriers: int
    historical_non_focus_san_carriers: int
    current_revoked_certificates: int
    current_not_revoked_certificates: int
    current_dns_outcome: str
    current_dns_classification: str
    current_issuer_families: str
    historical_issuer_families: str
    current_san_size_span: str
    historical_san_size_span: str
    max_direct_to_carrier_overlap_days: int
    carrier_subjects: str
    current_red_flags: str
    past_red_flags: str</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One detailed analytical row for one focused Subject CN: its theme and taxonomy bucket, certificate counts, revocation counts, DNS outcome, issuer families, SAN sizes, and red flags.</p>
<p><strong>Flow arrows</strong></p><p>Current and historical certificate and DNS findings for one Subject CN. &#8594; <strong>FocusSubjectDetail</strong> &#8594; Collected in the `details`, `notables`, and `transition_rows` lists of `FocusCohortAnalysis`.</p>
<p><strong>How to think about it</strong></p><p>Many fields come in pairs: a `current_*` field and a `historical_*` twin, so the report can contrast today with the past. Fields typed `str`, such as `current_issuer_families` and `carrier_subjects`, hold display text rather than lists.</p>
</td>
</tr>
</table>
## FocusCohortAnalysis
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class FocusCohortAnalysis:
    focus_subjects: list[FocusSubject]
    details: list[FocusSubjectDetail]
    provided_subjects_count: int
    historically_seen_subjects_count: int
    current_direct_subjects_count: int
    current_carried_only_subjects_count: int
    historical_non_focus_carried_subjects_count: int
    unseen_subjects: list[str]
    current_focus_certificate_count: int
    current_rest_certificate_count: int
    focus_revoked_current_count: int
    focus_not_revoked_current_count: int
    rest_revoked_current_count: int
    rest_not_revoked_current_count: int
    focus_revoked_share: str
    rest_revoked_share: str
    focus_median_san_entries: int
    focus_average_san_entries: str
    rest_median_san_entries: int
    rest_average_san_entries: str
    focus_multi_zone_certificate_count: int
    rest_multi_zone_certificate_count: int
    focus_current_subject_dns_classes: Counter[str]
    rest_current_subject_dns_classes: Counter[str]
    focus_current_subject_dns_stacks: Counter[str]
    rest_current_subject_dns_stacks: Counter[str]
    focus_current_issuer_families: Counter[str]
    rest_current_issuer_families: Counter[str]
    focus_current_red_flag_subjects: int
    focus_past_red_flag_subjects: int
    focus_any_red_flag_subjects: int
    bucket_counts: Counter[str]
    notables: list[FocusSubjectDetail]
    transition_rows: list[FocusSubjectDetail]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>The full cohort comparison bundle used in the monograph: the parsed subjects, one detail row per subject, and summary figures for the focused cohort next to the same figures for the rest of the estate.</p>
<p><strong>Flow arrows</strong></p><p>The focus subjects, their detail rows, and the counts computed over them. &#8594; <strong>FocusCohortAnalysis</strong> &#8594; The monograph reads this bundle for its cohort comparison.</p>
<p><strong>How to think about it</strong></p><p>Most fields come as `focus_*` and `rest_*` pairs: the same measurement taken once for the hand-picked cohort and once for everything else. Shares and averages are typed `str`, the kind of ready-to-print text that `pct` and `average_text` return.</p>
</td>
</tr>
</table>
## load_focus_subjects
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def load_focus_subjects(path: Path) -&gt; list[FocusSubject]:
    if not path.exists():
        return []
    subjects: list[FocusSubject] = []
    seen: set[str] = set()
    for raw_line in path.read_text(encoding=&quot;utf-8&quot;).splitlines():
        line = raw_line.strip()
        if not line or line.startswith(&quot;#&quot;):
            continue
        match = re.match(r&quot;^(?P&lt;cn&gt;[^()]+?)(?:\s*\((?P&lt;meta&gt;.*)\))?$&quot;, line)
        if not match:
            continue
        subject_cn = match.group(&quot;cn&quot;).strip().lower()
        if subject_cn in seen:
            continue
        seen.add(subject_cn)
        subjects.append(
            FocusSubject(
                subject_cn=subject_cn,
                analyst_note=(match.group(&quot;meta&quot;) or &quot;&quot;).strip(),
            )
        )
    return subjects</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Reads the local focus-subject list and any analyst notes attached to it.</p>
<p><strong>Flow arrows</strong></p><p>The local focus-subject file. &#8594; <strong>load_focus_subjects</strong> &#8594; `build_analysis` uses these parsed cohort entries.</p>
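<p><strong>How to think about it</strong></p><p>Each line is a Subject CN, optionally followed by a note in parentheses. Blank lines and `#` comments are skipped, lines that do not fit the pattern are silently ignored, names are lower-cased, and a repeated name keeps only its first note. A missing file gives an empty list rather than an error.</p>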
</td>
</tr>
</table>
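
The line format is easiest to understand by running the regular expression on a few sample lines. The sketch below reuses the pattern from `load_focus_subjects`; the wrapper `parse_line` and the sample hostnames are made up for this example.

```python
from __future__ import annotations

import re

# Same pattern as in load_focus_subjects: a name without parentheses,
# optionally followed by a note wrapped in parentheses.
LINE_PATTERN = re.compile(r"^(?P<cn>[^()]+?)(?:\s*\((?P<meta>.*)\))?$")


def parse_line(line: str) -> tuple[str, str] | None:
    """Return (subject_cn, analyst_note), or None if the line does not fit."""
    match = LINE_PATTERN.match(line.strip())
    if not match:
        return None
    return match.group("cn").strip().lower(), (match.group("meta") or "").strip()


print(parse_line("Login.Example.com (customer portal)"))
# -> ('login.example.com', 'customer portal')
print(parse_line("plain.example.com"))
# -> ('plain.example.com', '')
print(parse_line("(only a note)"))
# -> None
```

A line with text after the closing parenthesis, such as `a.example.com (note) extra`, also fails to match, so it would be skipped by the loader.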
## dns_names
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def dns_names(san_entries: list[str]) -&gt; set[str]:
    return {entry[4:].lower() for entry in san_entries if entry.startswith(&quot;DNS:&quot;)}</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Extracts the DNS hostnames from a certificate&#x27;s SAN list. Only entries that begin with `DNS:` are kept; the prefix is cut off and the name is lower-cased. Returning a set removes duplicates.</p>
<p><strong>Flow arrows</strong></p><p>A list of SAN entries such as `DNS:www.example.com`. &#8594; <strong>dns_names</strong> &#8594; A set of lower-cased hostnames.</p>
<p><strong>How to think about it</strong></p><p>This is a filter plus a cleanup. SAN entries of any other type are ignored, and the `DNS:` prefix check is case-sensitive.</p>
</td>
</tr>
</table>
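
A short run shows the filtering and de-duplication. The function body is copied from the block above; the sample SAN entries are invented for this example.

```python
def dns_names(san_entries: list[str]) -> set[str]:
    return {entry[4:].lower() for entry in san_entries if entry.startswith("DNS:")}


sample = [
    "DNS:WWW.Example.com",
    "DNS:www.example.com",      # same name in different case, collapses into one
    "IP Address:192.0.2.1",     # not a DNS entry, ignored
    "dns:lower.example.com",    # lower-case prefix does not match "DNS:"
]
print(dns_names(sample))  # -> {'www.example.com'}
```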
## overlap_days
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def overlap_days(
    left_start,
    left_end,
    right_start,
    right_end,
) -&gt; int:
    start = max(left_start, right_start)
    end = min(left_end, right_end)
    if end &lt;= start:
        return 0
    return max(1, (end - start).days)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Measures how many days two time windows overlap. The overlap runs from the later of the two starts to the earlier of the two ends. If that interval is empty, the result is 0.</p>
<p><strong>Flow arrows</strong></p><p>Start and end timestamps for two periods. &#8594; <strong>overlap_days</strong> &#8594; A whole number of days.</p>
<p><strong>How to think about it</strong></p><p>Any real overlap counts as at least one day: `max(1, ...)` rounds an overlap shorter than 24 hours up to 1, while windows that only touch at a single instant return 0.</p>
</td>
</tr>
</table>
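
A worked example makes the three cases concrete. The function is copied from the block above; the dates are invented, and plain `datetime` values are assumed as inputs because the source does not annotate the parameter types.

```python
from datetime import datetime


def overlap_days(left_start, left_end, right_start, right_end) -> int:
    start = max(left_start, right_start)
    end = min(left_end, right_end)
    if end <= start:
        return 0
    return max(1, (end - start).days)


# Windows share 1 Feb to 1 Mar 2023, which is 28 days.
print(overlap_days(datetime(2023, 1, 1), datetime(2023, 3, 1),
                   datetime(2023, 2, 1), datetime(2023, 4, 1)))  # -> 28

# Six hours of overlap is rounded up to one day.
print(overlap_days(datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 12),
                   datetime(2023, 1, 1, 6), datetime(2023, 1, 2, 0)))  # -> 1

# Windows that only touch do not overlap.
print(overlap_days(datetime(2023, 1, 1), datetime(2023, 1, 10),
                   datetime(2023, 1, 10), datetime(2023, 1, 20)))  # -> 0
```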
## pct
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def pct(count: int, total: int) -&gt; str:
    if total &lt;= 0:
        return &quot;0.0%&quot;
    return f&quot;{(count / total) * 100:.1f}%&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Formats `count / total` as a percentage string with one decimal place. A zero or negative total returns `0.0%` instead of dividing by zero.</p>
<p><strong>Flow arrows</strong></p><p>A count and a total. &#8594; <strong>pct</strong> &#8594; Text such as `25.0%`.</p>
<p><strong>How to think about it</strong></p><p>The result is text, so it is meant for display in the report, not for further arithmetic.</p>
</td>
</tr>
</table>
## short_issuer_family
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def short_issuer_family(issuer_name: str) -&gt; str:
    lowered = issuer_name.lower()
    if &quot;amazon&quot; in lowered:
        return &quot;Amazon&quot;
    if &quot;sectigo&quot; in lowered or &quot;comodo&quot; in lowered:
        return &quot;Sectigo/COMODO&quot;
    if &quot;google trust services&quot; in lowered or &quot;cn=we1&quot; in lowered:
        return &quot;Google Trust Services&quot;
    return &quot;Other&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Collapses a full issuer name into one of four family labels: `Amazon`, `Sectigo/COMODO`, `Google Trust Services`, or `Other`. Matching is a case-insensitive substring test.</p>
<p><strong>Flow arrows</strong></p><p>An issuer name string from a certificate. &#8594; <strong>short_issuer_family</strong> &#8594; A short family label.</p>
<p><strong>How to think about it</strong></p><p>The checks run top to bottom and the first match wins. The `cn=we1` test is a second way to recognize the Google Trust Services family from the issuer&#x27;s common name.</p>
</td>
</tr>
</table>
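
The substring matching can be checked against a few issuer strings. The function is copied from the block above. The issuer strings are illustrative samples written for this example, not values taken from the scanner's data.

```python
def short_issuer_family(issuer_name: str) -> str:
    lowered = issuer_name.lower()
    if "amazon" in lowered:
        return "Amazon"
    if "sectigo" in lowered or "comodo" in lowered:
        return "Sectigo/COMODO"
    if "google trust services" in lowered or "cn=we1" in lowered:
        return "Google Trust Services"
    return "Other"


samples = [
    "C=US, O=Amazon, CN=Amazon RSA 2048 M02",
    "C=GB, O=Sectigo Limited, CN=Sectigo RSA Domain Validation Secure Server CA",
    "C=US, O=Google Trust Services, CN=WE1",
    "C=US, O=Example CA, CN=Example Issuing CA",
]
for issuer in samples:
    print(short_issuer_family(issuer), "<-", issuer)
```

Because these are substring tests, any issuer string that merely contains one of the keywords lands in that family, and a string containing both `amazon` and `sectigo` is reported as `Amazon`.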
## median_int
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def median_int(values: list[int]) -&gt; int:
    if not values:
        return 0
    return int(median(values))</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the median of a list of integers as an integer, or 0 for an empty list.</p>
<p><strong>Flow arrows</strong></p><p>A list of integers, for example SAN-entry counts. &#8594; <strong>median_int</strong> &#8594; One integer.</p>
<p><strong>How to think about it</strong></p><p>For an even number of values, `statistics.median` averages the two middle values, so the result can end in .5. `int()` then cuts off the fraction, turning 2.5 into 2.</p>
</td>
</tr>
</table>
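
The truncation is easy to miss, so here is a short demonstration. The function is copied from the block above; the input lists are invented.

```python
from statistics import median


def median_int(values: list[int]) -> int:
    if not values:
        return 0
    return int(median(values))


print(median_int([5, 1, 3]))      # odd count: the middle value -> 3
print(median_int([2, 3, 4, 10]))  # even count: median is 3.5, truncated -> 3
print(median_int([]))             # empty list -> 0
```

If rounding to the nearest integer were wanted instead, the result would differ for inputs like the second one. The source uses truncation, so reported medians never overstate the middle value.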
## average_text
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def average_text(values: list[int]) -&gt; str:
    if not values:
        return &quot;0.0&quot;
    return f&quot;{(sum(values) / len(values)):.1f}&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the arithmetic mean of a list of integers as text with one decimal place, or `0.0` for an empty list.</p>
<p><strong>Flow arrows</strong></p><p>A list of integers. &#8594; <strong>average_text</strong> &#8594; Text such as `3.5`.</p>
<p><strong>How to think about it</strong></p><p>This is the companion of `median_int`. The empty-list guard prevents a division by zero, and the result is already formatted for display.</p>
</td>
</tr>
</table>
## san_size_span
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def san_size_span(current_hits: list[ct_scan.CertificateHit]) -&gt; str:
    sizes = sorted({len(hit.san_entries) for hit in current_hits})
    if not sizes:
        return &quot;-&quot;
    if len(sizes) == 1:
        return str(sizes[0])
    return &quot;, &quot;.join(str(value) for value in sizes[:4]) + (&quot;&quot; if len(sizes) &lt;= 4 else f&quot;, ... (+{len(sizes) - 4} more)&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Summarizes how many SAN entries the current certificates carry. It collects the distinct sizes, sorts them, and prints up to four of them, followed by `... (+N more)` when there are more. No certificates gives `-`.</p>
<p><strong>Flow arrows</strong></p><p>A list of `ct_scan.CertificateHit` objects. &#8594; <strong>san_size_span</strong> &#8594; Text such as `2, 5, 9`.</p>
<p><strong>How to think about it</strong></p><p>Despite the word span in the name, the output lists the distinct sizes rather than a minimum and a maximum. Duplicate sizes are removed by the set before sorting.</p>
</td>
</tr>
</table>
## historical_san_size_span
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def historical_san_size_span(certificates: list[ct_lineage_report.HistoricalCertificate]) -&gt; str:
    sizes = sorted({len(certificate.san_entries) for certificate in certificates})
    if not sizes:
        return &quot;-&quot;
    if len(sizes) == 1:
        return str(sizes[0])
    return &quot;, &quot;.join(str(value) for value in sizes[:4]) + (&quot;&quot; if len(sizes) &lt;= 4 else f&quot;, ... (+{len(sizes) - 4} more)&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Does for historical certificates what `san_size_span` does for current ones: lists up to four distinct SAN-entry counts, with a `... (+N more)` suffix when there are more, or `-` when there are no certificates.</p>
<p><strong>Flow arrows</strong></p><p>A list of `ct_lineage_report.HistoricalCertificate` objects. &#8594; <strong>historical_san_size_span</strong> &#8594; Text such as `1, 3`.</p>
<p><strong>How to think about it</strong></p><p>The logic is identical to `san_size_span`; only the input type differs.</p>
</td>
</tr>
</table>
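
Since `san_size_span` and `historical_san_size_span` differ only in their input type, the shared formatting can be sketched once. This is one possible way to express it, not how the source is organized: `format_size_span`, `size_span`, and `CertificateStub` are hypothetical names, and the stub stands in for any object that has a `san_entries` list.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CertificateStub:
    """Hypothetical stand-in for CertificateHit or HistoricalCertificate."""
    san_entries: list[str] = field(default_factory=list)


def format_size_span(sizes: set[int], shown: int = 4) -> str:
    """List up to `shown` distinct sizes in ascending order."""
    ordered = sorted(sizes)
    if not ordered:
        return "-"
    if len(ordered) == 1:
        return str(ordered[0])
    text = ", ".join(str(value) for value in ordered[:shown])
    if len(ordered) > shown:
        text += f", ... (+{len(ordered) - shown} more)"
    return text


def size_span(certificates) -> str:
    """Works for any objects that expose san_entries."""
    return format_size_span({len(cert.san_entries) for cert in certificates})


stubs = [CertificateStub(["DNS:x"] * n) for n in (1, 2, 2, 3, 5, 8, 13)]
print(size_span(stubs))  # -> 1, 2, 3, 5, ... (+2 more)
```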
## summarize_names
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def summarize_names(values: set[str], limit: int = 4) -&gt; str:
    if not values:
        return &quot;-&quot;
    ordered = sorted(values, key=str.casefold)
    if len(ordered) &lt;= limit:
        return &quot;, &quot;.join(ordered)
    return &quot;, &quot;.join(ordered[:limit]) + f&quot;, ... (+{len(ordered) - limit} more)&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Turns a set of names into one short line: sorted without regard to case, joined with commas, and cut off after `limit` names with a `... (+N more)` suffix. An empty set gives `-`.</p>
<p><strong>Flow arrows</strong></p><p>A set of names and an optional limit (default 4). &#8594; <strong>summarize_names</strong> &#8594; One line of text that fits in a table cell.</p>
<p><strong>How to think about it</strong></p><p>The goal is a predictable, compact cell value. Sorting with `str.casefold` keeps the order stable between runs even when names differ only in capitalization.</p>
</td>
</tr>
</table>
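
A quick run shows both the short and the truncated form. The function is copied from the block above; the hostnames are invented.

```python
def summarize_names(values: set[str], limit: int = 4) -> str:
    if not values:
        return "-"
    ordered = sorted(values, key=str.casefold)
    if len(ordered) <= limit:
        return ", ".join(ordered)
    return ", ".join(ordered[:limit]) + f", ... (+{len(ordered) - limit} more)"


names = {"b.example.com", "A.example.com", "c.example.com"}
print(summarize_names(names))           # -> A.example.com, b.example.com, c.example.com
print(summarize_names(names, limit=2))  # -> A.example.com, b.example.com, ... (+1 more)
print(summarize_names(set()))           # -> -
```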
## zone_count_from_sans
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def zone_count_from_sans(san_entries: list[str]) -&gt; int:
    return len(
        {
            ct_scan.san_tail_split(entry[4:])[1]
            for entry in san_entries
            if entry.startswith(&quot;DNS:&quot;)
        }
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Counts how many distinct DNS zones a certificate&#x27;s SAN list spans. Each `DNS:` entry is passed to `ct_scan.san_tail_split`, the second element of each result is collected into a set, and the size of that set is returned.</p>
<p><strong>Flow arrows</strong></p><p>A list of SAN entries. &#8594; <strong>zone_count_from_sans</strong> &#8594; The number of distinct zones, 0 if there are no DNS entries.</p>
<p><strong>How to think about it</strong></p><p>What counts as a zone is decided entirely by `ct_scan.san_tail_split`. This function only removes duplicates and counts.</p>
</td>
</tr>
</table>
## max_san_count_current
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def max_san_count_current(hits: list[ct_scan.CertificateHit]) -&gt; int:
    return max((len(hit.san_entries) for hit in hits), default=0)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the largest SAN-entry count among the given current certificates, or 0 when the list is empty.</p>
<p><strong>Flow arrows</strong></p><p>A list of `ct_scan.CertificateHit` objects. &#8594; <strong>max_san_count_current</strong> &#8594; One integer.</p>
<p><strong>How to think about it</strong></p><p>`default=0` is what makes an empty list safe. Without it, `max()` raises `ValueError` when given nothing to compare.</p>
</td>
</tr>
</table>
## max_san_count_historical
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def max_san_count_historical(certificates: list[ct_lineage_report.HistoricalCertificate]) -&gt; int:
    return max((len(certificate.san_entries) for certificate in certificates), default=0)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the largest SAN-entry count among the given historical certificates, or 0 when the list is empty.</p>
<p><strong>Flow arrows</strong></p><p>A list of `ct_lineage_report.HistoricalCertificate` objects. &#8594; <strong>max_san_count_historical</strong> &#8594; One integer.</p>
<p><strong>How to think about it</strong></p><p>This is the historical twin of `max_san_count_current`; only the input type differs.</p>
</td>
</tr>
</table>
## max_zone_count_current
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def max_zone_count_current(hits: list[ct_scan.CertificateHit]) -&gt; int:
    return max((zone_count_from_sans(hit.san_entries) for hit in hits), default=0)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the largest zone count, as measured by `zone_count_from_sans`, among the given current certificates, or 0 when the list is empty.</p>
<p><strong>Flow arrows</strong></p><p>A list of `ct_scan.CertificateHit` objects. &#8594; <strong>max_zone_count_current</strong> &#8594; One integer.</p>
<p><strong>How to think about it</strong></p><p>A result above 1 means at least one certificate covers names in more than one zone.</p>
</td>
</tr>
</table>
## bucket_sort_key
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def bucket_sort_key(value: str) -&gt; tuple[int, str]:
    order = {
        &quot;direct_front_door&quot;: 0,
        &quot;platform_matrix_anchor&quot;: 1,
        &quot;ambiguous_legacy&quot;: 2,
    }
    return (order.get(value, 99), value)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Gives each taxonomy bucket a fixed position for sorting: `direct_front_door` first, then `platform_matrix_anchor`, then `ambiguous_legacy`. Unknown buckets sort after these, in alphabetical order.</p>
<p><strong>Flow arrows</strong></p><p>A bucket key string. &#8594; <strong>bucket_sort_key</strong> &#8594; A `(position, name)` tuple to pass as a sort key.</p>
<p><strong>How to think about it</strong></p><p>The key is a tuple, so Python sorts by the number first and uses the bucket name only to break ties. Every unknown bucket gets the number 99.</p>
</td>
</tr>
</table>
## taxonomy_bucket_label
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def taxonomy_bucket_label(bucket: str) -&gt; str:
    return {
        &quot;direct_front_door&quot;: &quot;Front-door direct name&quot;,
        &quot;platform_matrix_anchor&quot;: &quot;Platform-anchor matrix name&quot;,
        &quot;ambiguous_legacy&quot;: &quot;Ambiguous or legacy residue&quot;,
    }.get(bucket, bucket)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Translates an internal bucket identifier into the reader-facing label. An identifier that is not in the table is returned unchanged.</p>
<p><strong>Flow arrows</strong></p><p>A bucket identifier such as <code>direct_front_door</code>. &#8594; <strong>taxonomy_bucket_label</strong> &#8594; Code that prints the bucket for human readers (the call site is not shown in this section).</p>
<p><strong>How to think about it</strong></p><p>This is a lookup table written as a function. The <code>.get(bucket, bucket)</code> fallback means a newly added bucket never raises an error; it simply shows up under its raw identifier until a label is added.</p>
</td>
</tr>
</table>
## analyst_theme
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def analyst_theme(subject: FocusSubject) -&gt; str:
    tokens = set(re.findall(r&quot;[a-z0-9]+&quot;, f&quot;{subject.subject_cn} {subject.analyst_note}&quot;.lower()))
    if ENVIRONMENT_HINTS &amp; tokens:
        return &quot;environment or platform anchor&quot;
    if VENDOR_HINTS &amp; tokens:
        return &quot;vendor or product integration&quot;
    if IDENTITY_HINTS &amp; tokens:
        return &quot;identity, messaging, or service front&quot;
    if CUSTOMER_HINTS &amp; tokens:
        return &quot;customer proposition or campaign front&quot;
    left_label = subject.subject_cn.split(&quot;.&quot;)[0].lower()
    if re.fullmatch(r&quot;\d+&quot;, left_label) or re.fullmatch(r&quot;[a-z]{2,6}\d{1,4}&quot;, left_label):
        return &quot;opaque or legacy label&quot;
    return &quot;human-named branded or service endpoint&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
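<p><strong>What this block is doing</strong></p><p>Assigns a coarse theme to a focus subject from the words in its Subject CN and analyst note. It checks the hint sets in a fixed order (environment, vendor, identity, customer) and returns the theme of the first set that matches. If no hint matches, a left-most label that is all digits, or a few letters followed by digits, is called opaque; everything else is treated as a human-named endpoint.</p>
<p><strong>Flow arrows</strong></p><p>One <code>FocusSubject</code> (Subject CN plus analyst note). &#8594; <strong>analyst_theme</strong> &#8594; <code>build_analysis</code> stores the result in the <code>analyst_theme</code> field of each detail row.</p>
<p><strong>How to think about it</strong></p><p>Order matters: a name that contains both an environment word and a vendor word is labeled as an environment anchor because that test runs first. Also note that the opaque-label pattern here (<code>[a-z]{2,6}\d{1,4}</code>) differs from the one in <code>classify_taxonomy_bucket</code> (<code>[a-z]{1,4}\d{1,4}</code>), so the two functions can disagree about whether a label is opaque.</p>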
</td>
</tr>
</table>
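
The hint checks in `analyst_theme` rely on set intersection. The sketch below shows that mechanism in isolation. The two hint sets and the `tokenize` helper are assumptions made up for this example; the real `ENVIRONMENT_HINTS` and `VENDOR_HINTS` are defined elsewhere in `ct_focus_subjects.py` and may contain different words.

```python
import re

# Hypothetical hint sets, invented for this example only.
ENVIRONMENT_HINTS = {"dev", "test", "staging", "prod"}
VENDOR_HINTS = {"salesforce", "zendesk"}


def tokenize(subject_cn: str, analyst_note: str) -> set[str]:
    # Same tokenization as analyst_theme: lowercase everything, then keep
    # runs of letters and digits. Dots, hyphens, and spaces act as separators.
    return set(re.findall(r"[a-z0-9]+", f"{subject_cn} {analyst_note}".lower()))


tokens = tokenize("Test-Portal.example.com", "Zendesk help centre")

# "&" on two sets returns the words they share. An empty result is falsy,
# which is why the function can write "if ENVIRONMENT_HINTS & tokens:".
environment_match = ENVIRONMENT_HINTS & tokens
vendor_match = VENDOR_HINTS & tokens

print(sorted(tokens))
print(environment_match, vendor_match)
```

Matching works on whole tokens only. With these sets, a label such as `testing` would not match the hint `test`, because `testing` is a single token.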
## classify_taxonomy_bucket
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def classify_taxonomy_bucket(
    subject: FocusSubject,
    current_hits: list[ct_scan.CertificateHit],
    historical_hits: list[ct_lineage_report.HistoricalCertificate],
    current_carriers: list[ct_scan.CertificateHit],
    historical_carriers: list[ct_lineage_report.HistoricalCertificate],
) -&gt; tuple[str, str]:
    tokens = set(re.findall(r&quot;[a-z0-9]+&quot;, f&quot;{subject.subject_cn} {subject.analyst_note}&quot;.lower()))
    left_label = subject.subject_cn.split(&quot;.&quot;)[0].lower()
    opaque_label = bool(
        re.fullmatch(r&quot;\d+&quot;, left_label)
        or re.fullmatch(r&quot;[a-z]{1,4}\d{1,4}&quot;, left_label)
    )
    current_direct_exists = bool(current_hits)
    historical_direct_exists = bool(historical_hits)
    max_current_sans = max_san_count_current(current_hits)
    max_historical_sans = max_san_count_historical(historical_hits)
    max_any_sans = max(max_current_sans, max_historical_sans)
    max_current_zones = max_zone_count_current(current_hits)
    carrier_only_today = not current_direct_exists and bool(current_carriers)
    carrier_only_history = (not current_direct_exists and not historical_direct_exists and bool(historical_carriers))
    environment_signal = bool(ENVIRONMENT_HINTS &amp; tokens)
    if max_any_sans &gt;= 20:
        return (
            &quot;platform_matrix_anchor&quot;,
            &quot;Large SAN matrix coverage indicates an umbrella certificate for a managed platform slice rather than one standalone public front door.&quot;,
        )
    if carrier_only_today or carrier_only_history:
        return (
            &quot;ambiguous_legacy&quot;,
            &quot;This name now appears mainly as a carried SAN passenger or as historical residue, so it no longer behaves like a stable standalone certificate front.&quot;,
        )
    if current_direct_exists and max_any_sans &lt;= 4 and max_current_zones &lt;= 1 and not opaque_label and not environment_signal:
        return (
            &quot;direct_front_door&quot;,
            &quot;Small direct certificates, single-zone scope, and a human-readable service label fit the pattern of a branded or service-facing public entry point.&quot;,
        )
    if historical_direct_exists and not current_direct_exists and max_any_sans &lt;= 4 and not opaque_label:
        return (
            &quot;ambiguous_legacy&quot;,
            &quot;The historical certificates look like a simple direct front, but there is no current direct certificate anymore, which makes this mostly migration residue rather than a live front-door pattern.&quot;,
        )
    if max_any_sans &lt;= 4 and opaque_label:
        return (
            &quot;ambiguous_legacy&quot;,
            &quot;The direct certificate shape is small and simple, but the left-most label is too opaque to treat as a clear branded or service-front naming pattern.&quot;,
        )
    if environment_signal and max_any_sans &lt;= 19:
        return (
            &quot;ambiguous_legacy&quot;,
            &quot;Environment-style wording is present, but the SAN coverage is not broad enough to prove a full platform-matrix certificate role.&quot;,
        )
    if max_any_sans &gt; 4:
        return (
            &quot;ambiguous_legacy&quot;,
            &quot;Direct issuance exists, but the SAN set is broader or more variable than a simple one-service front, which leaves the role mixed.&quot;,
        )
    return (
        &quot;ambiguous_legacy&quot;,
        &quot;The evidence is mixed or too thin to place this name cleanly in one of the stronger bucket patterns.&quot;,
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
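<p><strong>What this block is doing</strong></p><p>Places a name into the direct-front, platform-anchor, or ambiguous bucket and returns a one-sentence reason together with the bucket identifier.</p>
<p><strong>Flow arrows</strong></p><p>One focused Subject CN plus surrounding evidence. &#8594; <strong>classify_taxonomy_bucket</strong> &#8594; <code>build_analysis</code> uses the bucket label in the focused-cohort chapter.</p>
<p><strong>How to think about it</strong></p><p>Read the function as a ladder of rules checked from top to bottom; the first rule that matches wins. A largest SAN count of 20 or more always means <code>platform_matrix_anchor</code>. <code>direct_front_door</code> requires all of the following at once: a current direct certificate, at most 4 SAN entries on the largest direct certificate, at most one zone, a readable left-most label, and no environment wording. Every other path ends in <code>ambiguous_legacy</code>, and the reason text says which rule caught the name.</p>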
</td>
</tr>
</table>
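
The opaque-label test appears twice in this file with slightly different patterns: `analyst_theme` allows 2 to 6 letters before the digits, while `classify_taxonomy_bucket` allows 1 to 4. The sketch below compares the two on a few sample labels. The helper names `is_opaque_for_theme` and `is_opaque_for_bucket` and the sample labels are hypothetical; only the regular expressions are taken from the source above.

```python
import re


def is_opaque_for_theme(left_label: str) -> bool:
    # Pattern from analyst_theme: all digits, or 2-6 letters then 1-4 digits.
    return bool(
        re.fullmatch(r"\d+", left_label)
        or re.fullmatch(r"[a-z]{2,6}\d{1,4}", left_label)
    )


def is_opaque_for_bucket(left_label: str) -> bool:
    # Pattern from classify_taxonomy_bucket: all digits, or 1-4 letters then 1-4 digits.
    return bool(
        re.fullmatch(r"\d+", left_label)
        or re.fullmatch(r"[a-z]{1,4}\d{1,4}", left_label)
    )


# Hypothetical left-most labels, chosen to hit the edges of both patterns.
comparison = {
    label: (is_opaque_for_theme(label), is_opaque_for_bucket(label))
    for label in ["12345", "a1", "srv01", "portal7", "login"]
}

for label, (theme_view, bucket_view) in comparison.items():
    print(f"{label:8} theme={theme_view!s:5} bucket={bucket_view}")
```

Labels such as `a1` and `portal7` fall on different sides of the two patterns, so a subject can carry the theme "opaque or legacy label" without being treated as opaque by the bucket rules, and the other way around. Whether that difference is intended is not stated in the source.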
## observed_role
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def observed_role(
    subject: FocusSubject,
    current_hits: list[ct_scan.CertificateHit],
    current_carriers: list[ct_scan.CertificateHit],
    historical_carriers: list[ct_lineage_report.HistoricalCertificate],
    observation: ct_dns_utils.DnsObservation,
) -&gt; str:
    tokens = set(re.findall(r&quot;[a-z0-9]+&quot;, f&quot;{subject.subject_cn} {subject.analyst_note}&quot;.lower()))
    if not current_hits and current_carriers:
        return &quot;carried today inside another certificate&quot;
    if not current_hits and historical_carriers:
        return &quot;historical carried alias or retired passenger&quot;
    if not current_hits:
        return &quot;not seen in the CT corpus&quot;
    max_san_entries = max(len(hit.san_entries) for hit in current_hits)
    if max_san_entries &gt;= 20 or (ENVIRONMENT_HINTS &amp; tokens):
        return &quot;platform matrix or environment anchor&quot;
    revoked = sum(1 for hit in current_hits if hit.revocation_status == &quot;revoked&quot;)
    if revoked &gt;= 3:
        return &quot;high-churn direct service front&quot;
    if VENDOR_HINTS &amp; tokens:
        return &quot;direct vendor or product integration front&quot;
    if IDENTITY_HINTS &amp; tokens:
        return &quot;direct service or identity front&quot;
    if CUSTOMER_HINTS &amp; tokens:
        return &quot;direct branded or customer proposition front&quot;
    if observation.classification in {&quot;direct_address&quot;, &quot;cname_to_address&quot;}:
        return &quot;direct standalone service front&quot;
    return &quot;standalone branded or service endpoint&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
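<p><strong>What this block is doing</strong></p><p>Describes, as a short plain-English phrase, what role the name appears to play in the public estate.</p>
<p><strong>Flow arrows</strong></p><p>One focused Subject CN plus public evidence (current direct hits, carriers, and the DNS observation). &#8594; <strong>observed_role</strong> &#8594; <code>build_analysis</code> stores the plain-English role description.</p>
<p><strong>How to think about it</strong></p><p>The rules run in order and the first match wins. Names without a current direct certificate are handled first. After that come the platform check (20 or more SAN entries, or an environment word), the churn check (3 or more revoked current certificates), the vendor, identity, and customer hints, and finally the DNS classification. Because the function never receives historical direct certificates, a name whose only evidence is a historical direct certificate is also reported as &quot;not seen in the CT corpus&quot;.</p>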
</td>
</tr>
</table>
## basket_status
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def basket_status(
    current_hits: list[ct_scan.CertificateHit],
    current_carriers: list[ct_scan.CertificateHit],
    historical_hits: list[ct_lineage_report.HistoricalCertificate],
    historical_carriers: list[ct_lineage_report.HistoricalCertificate],
) -&gt; str:
    if current_hits and current_carriers:
        return &quot;current direct-and-carried overlap&quot;
    if current_hits:
        return &quot;current direct subject certificate&quot;
    if current_carriers:
        return &quot;current SAN passenger only&quot;
    if historical_hits and historical_carriers:
        return &quot;historical direct-and-carried only&quot;
    if historical_hits:
        return &quot;historical direct only&quot;
    if historical_carriers:
        return &quot;historical SAN passenger only&quot;
    return &quot;not seen&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
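<p><strong>What this block is doing</strong></p><p>Summarizes where a name shows up in the CT data: as the subject of its own certificate (direct), as a SAN entry inside another subject's certificate (carried), both, or not at all. Current evidence always takes priority over historical evidence.</p>
<p><strong>Flow arrows</strong></p><p>Four lists for one subject: current direct hits, current carriers, historical direct certificates, historical carriers. &#8594; <strong>basket_status</strong> &#8594; <code>build_analysis</code> stores the status string and later collects every subject whose status is &quot;not seen&quot; into <code>unseen_subjects</code>.</p>
<p><strong>How to think about it</strong></p><p>An empty list is falsy in Python, so each <code>if</code> simply asks &quot;is there at least one certificate of this kind?&quot;. The checks run from the strongest current evidence down to the weakest historical evidence, and the first one that succeeds decides the answer.</p>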
</td>
</tr>
</table>
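
A small sketch of the truthiness checks in `basket_status`. The function body matches the block above, but the type hints are loosened to plain lists and the certificate objects are replaced by placeholder strings, both of which are simplifications for this example.

```python
def basket_status(
    current_hits: list[object],
    current_carriers: list[object],
    historical_hits: list[object],
    historical_carriers: list[object],
) -> str:
    # An empty list is falsy, so each test asks "is there at least one?".
    if current_hits and current_carriers:
        return "current direct-and-carried overlap"
    if current_hits:
        return "current direct subject certificate"
    if current_carriers:
        return "current SAN passenger only"
    if historical_hits and historical_carriers:
        return "historical direct-and-carried only"
    if historical_hits:
        return "historical direct only"
    if historical_carriers:
        return "historical SAN passenger only"
    return "not seen"


# Placeholder strings stand in for real certificate objects.
both_now = basket_status(["cert-a"], ["carrier-x"], [], [])
current_wins = basket_status(["cert-a"], [], ["old-cert"], ["old-carrier"])
history_only = basket_status([], [], ["old-cert"], [])
nothing = basket_status([], [], [], [])

print(both_now, current_wins, history_only, nothing, sep="\n")
```

The second call shows the priority rule: once a current direct certificate exists, the historical lists no longer affect the result.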
## red_flag_text
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def red_flag_text(row_lookup: dict[str, str], subject_cn: str) -&gt; str:
    return row_lookup.get(subject_cn.lower(), &quot;-&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Looks up the red-flag text for one Subject CN in a prepared dictionary, lowercasing the name first. Returns <code>-</code> when the subject has no flags.</p>
<p><strong>Flow arrows</strong></p><p>A lookup dictionary keyed by lowercased Subject CN, plus one Subject CN. &#8594; <strong>red_flag_text</strong> &#8594; <code>build_analysis</code> fills the <code>current_red_flags</code> and <code>past_red_flags</code> fields of each detail row.</p>
<p><strong>How to think about it</strong></p><p>The <code>-</code> placeholder is more than decoration. <code>build_analysis</code> later tests <code>!= &quot;-&quot;</code> to decide whether a subject counts as red-flagged when it ranks the notables.</p>
</td>
</tr>
</table>
## build_analysis
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def build_analysis(
    subjects: list[FocusSubject],
    report: dict[str, object],
    assessment: ct_lineage_report.HistoricalAssessment,
    dns_cache_dir: Path,
    dns_cache_ttl_seconds: int,
) -&gt; FocusCohortAnalysis | None:
    if not subjects:
        return None
    focus_set = {subject.subject_cn for subject in subjects}
    current_hits = report[&quot;hits&quot;]
    current_by_cn: dict[str, list[ct_scan.CertificateHit]] = {}
    for hit in current_hits:
        current_by_cn.setdefault(hit.subject_cn.lower(), []).append(hit)
    historical_by_cn: dict[str, list[ct_lineage_report.HistoricalCertificate]] = {}
    for certificate in assessment.certificates:
        historical_by_cn.setdefault(certificate.subject_cn.lower(), []).append(certificate)
    non_focus_current = [hit for hit in current_hits if hit.subject_cn.lower() not in focus_set]
    non_focus_historical = [certificate for certificate in assessment.certificates if certificate.subject_cn.lower() not in focus_set]
    observation_by_name = report[&quot;observation_by_name&quot;]
    detail_rows: list[FocusSubjectDetail] = []
    transition_rows: list[FocusSubjectDetail] = []
    current_red_flag_lookup = {row.subject_cn.lower(): row.flags for row in assessment.current_red_flag_rows}
    past_red_flag_lookup = {row.subject_cn.lower(): row.flags for row in assessment.past_red_flag_rows}
    for subject in subjects:
        current_direct = current_by_cn.get(subject.subject_cn, [])
        historical_direct = historical_by_cn.get(subject.subject_cn, [])
        current_carriers = [hit for hit in non_focus_current if subject.subject_cn in dns_names(hit.san_entries)]
        historical_carriers = [
            certificate
            for certificate in non_focus_historical
            if subject.subject_cn in dns_names(certificate.san_entries)
        ]
        observation = observation_by_name.get(subject.subject_cn) or ct_dns_utils.scan_name_cached(
            subject.subject_cn,
            dns_cache_dir,
            dns_cache_ttl_seconds,
        )
        current_issuer_families = Counter(
            short_issuer_family(ct_scan.primary_issuer_name(hit))
            for hit in current_direct
        )
        historical_issuer_families = Counter(
            certificate.issuer_family
            for certificate in historical_direct
        )
        max_overlap = 0
        for direct_certificate in historical_direct:
            for carrier_certificate in historical_carriers:
                max_overlap = max(
                    max_overlap,
                    overlap_days(
                        direct_certificate.validity_not_before,
                        direct_certificate.effective_not_after,
                        carrier_certificate.validity_not_before,
                        carrier_certificate.effective_not_after,
                    ),
                )
        taxonomy_bucket, taxonomy_reason = classify_taxonomy_bucket(
            subject,
            current_direct,
            historical_direct,
            current_carriers,
            historical_carriers,
        )
        detail = FocusSubjectDetail(
            subject_cn=subject.subject_cn,
            analyst_note=subject.analyst_note or &quot;-&quot;,
            analyst_theme=analyst_theme(subject),
            taxonomy_bucket=taxonomy_bucket,
            taxonomy_reason=taxonomy_reason,
            observed_role=observed_role(subject, current_direct, current_carriers, historical_carriers, observation),
            basket_status=basket_status(current_direct, current_carriers, historical_direct, historical_carriers),
            current_direct_certificates=len(current_direct),
            historical_direct_certificates=len(historical_direct),
            current_non_focus_san_carriers=len(current_carriers),
            historical_non_focus_san_carriers=len(historical_carriers),
            current_revoked_certificates=sum(1 for hit in current_direct if hit.revocation_status == &quot;revoked&quot;),
            current_not_revoked_certificates=sum(1 for hit in current_direct if hit.revocation_status == &quot;not_revoked&quot;),
            current_dns_outcome=observation.stack_signature,
            current_dns_classification=observation.classification,
            current_issuer_families=&quot;, &quot;.join(
                f&quot;{name} ({count})&quot;
                for name, count in current_issuer_families.most_common()
            ) or &quot;-&quot;,
            historical_issuer_families=&quot;, &quot;.join(
                f&quot;{name} ({count})&quot;
                for name, count in historical_issuer_families.most_common()
            ) or &quot;-&quot;,
            current_san_size_span=san_size_span(current_direct),
            historical_san_size_span=historical_san_size_span(historical_direct),
            max_direct_to_carrier_overlap_days=max_overlap,
            carrier_subjects=summarize_names({hit.subject_cn for hit in current_carriers} | {certificate.subject_cn for certificate in historical_carriers}),
            current_red_flags=red_flag_text(current_red_flag_lookup, subject.subject_cn),
            past_red_flags=red_flag_text(past_red_flag_lookup, subject.subject_cn),
        )
        detail_rows.append(detail)
        if detail.current_non_focus_san_carriers or detail.historical_non_focus_san_carriers:
            transition_rows.append(detail)
    focus_current_hits = [hit for hit in current_hits if hit.subject_cn.lower() in focus_set]
    rest_current_hits = [hit for hit in current_hits if hit.subject_cn.lower() not in focus_set]
    def zone_count(hit: ct_scan.CertificateHit) -&gt; int:
        return len({ct_scan.san_tail_split(entry[4:])[1] for entry in hit.san_entries if entry.startswith(&quot;DNS:&quot;)})
    focus_current_subject_names = sorted({hit.subject_cn.lower() for hit in focus_current_hits})
    rest_current_subject_names = sorted({hit.subject_cn.lower() for hit in rest_current_hits})
    def observation_for_subject(name: str) -&gt; ct_dns_utils.DnsObservation:
        return observation_by_name.get(name) or ct_dns_utils.scan_name_cached(name, dns_cache_dir, dns_cache_ttl_seconds)
    focus_current_subject_observations = [observation_for_subject(name) for name in focus_current_subject_names]
    rest_current_subject_observations = [observation_for_subject(name) for name in rest_current_subject_names]
    focus_current_issuer_families = Counter(
        short_issuer_family(ct_scan.primary_issuer_name(hit))
        for hit in focus_current_hits
    )
    rest_current_issuer_families = Counter(
        short_issuer_family(ct_scan.primary_issuer_name(hit))
        for hit in rest_current_hits
    )
    current_red_flag_subjects = {row.subject_cn.lower() for row in assessment.current_red_flag_rows}
    past_red_flag_subjects = {row.subject_cn.lower() for row in assessment.past_red_flag_rows}
    notables = sorted(
        detail_rows,
        key=lambda item: (
            bucket_sort_key(item.taxonomy_bucket),
            -(
                (item.current_revoked_certificates &gt; 0)
                + (item.current_non_focus_san_carriers &gt; 0)
                + (item.historical_non_focus_san_carriers &gt; 0)
                + (item.current_red_flags != &quot;-&quot;)
                + (item.past_red_flags != &quot;-&quot;)
            ),
            -item.current_direct_certificates,
            item.subject_cn,
        ),
    )[:10]
    return FocusCohortAnalysis(
        focus_subjects=subjects,
        details=sorted(detail_rows, key=lambda item: (bucket_sort_key(item.taxonomy_bucket), item.subject_cn.casefold())),
        provided_subjects_count=len(subjects),
        historically_seen_subjects_count=sum(
            1
            for item in detail_rows
            if item.historical_direct_certificates &gt; 0 or item.historical_non_focus_san_carriers &gt; 0
        ),
        current_direct_subjects_count=sum(1 for item in detail_rows if item.current_direct_certificates &gt; 0),
        current_carried_only_subjects_count=sum(
            1
            for item in detail_rows
            if item.current_direct_certificates == 0 and item.current_non_focus_san_carriers &gt; 0
        ),
        historical_non_focus_carried_subjects_count=sum(
            1
            for item in detail_rows
            if item.historical_non_focus_san_carriers &gt; 0
        ),
        unseen_subjects=[item.subject_cn for item in detail_rows if item.basket_status == &quot;not seen&quot;],
        current_focus_certificate_count=len(focus_current_hits),
        current_rest_certificate_count=len(rest_current_hits),
        focus_revoked_current_count=sum(1 for hit in focus_current_hits if hit.revocation_status == &quot;revoked&quot;),
        focus_not_revoked_current_count=sum(1 for hit in focus_current_hits if hit.revocation_status == &quot;not_revoked&quot;),
        rest_revoked_current_count=sum(1 for hit in rest_current_hits if hit.revocation_status == &quot;revoked&quot;),
        rest_not_revoked_current_count=sum(1 for hit in rest_current_hits if hit.revocation_status == &quot;not_revoked&quot;),
        focus_revoked_share=pct(
            sum(1 for hit in focus_current_hits if hit.revocation_status == &quot;revoked&quot;),
            len(focus_current_hits),
        ),
        rest_revoked_share=pct(
            sum(1 for hit in rest_current_hits if hit.revocation_status == &quot;revoked&quot;),
            len(rest_current_hits),
        ),
        focus_median_san_entries=median_int([len(hit.san_entries) for hit in focus_current_hits]),
        focus_average_san_entries=average_text([len(hit.san_entries) for hit in focus_current_hits]),
        rest_median_san_entries=median_int([len(hit.san_entries) for hit in rest_current_hits]),
        rest_average_san_entries=average_text([len(hit.san_entries) for hit in rest_current_hits]),
        focus_multi_zone_certificate_count=sum(1 for hit in focus_current_hits if zone_count(hit) &gt; 1),
        rest_multi_zone_certificate_count=sum(1 for hit in rest_current_hits if zone_count(hit) &gt; 1),
        focus_current_subject_dns_classes=Counter(observation.classification for observation in focus_current_subject_observations),
        rest_current_subject_dns_classes=Counter(observation.classification for observation in rest_current_subject_observations),
        focus_current_subject_dns_stacks=Counter(observation.stack_signature for observation in focus_current_subject_observations),
        rest_current_subject_dns_stacks=Counter(observation.stack_signature for observation in rest_current_subject_observations),
        focus_current_issuer_families=focus_current_issuer_families,
        rest_current_issuer_families=rest_current_issuer_families,
        focus_current_red_flag_subjects=sum(1 for subject in subjects if subject.subject_cn in current_red_flag_subjects),
        focus_past_red_flag_subjects=sum(1 for subject in subjects if subject.subject_cn in past_red_flag_subjects),
        focus_any_red_flag_subjects=sum(
            1
            for subject in subjects
            if subject.subject_cn in current_red_flag_subjects or subject.subject_cn in past_red_flag_subjects
        ),
        bucket_counts=Counter(item.taxonomy_bucket for item in detail_rows),
        notables=notables,
        transition_rows=sorted(
            transition_rows,
            key=lambda item: (
                -(item.current_non_focus_san_carriers + item.historical_non_focus_san_carriers),
                -item.max_direct_to_carrier_overlap_days,
                item.subject_cn.casefold(),
            ),
        ),
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
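<p><strong>What this block is doing</strong></p><p>Runs the full comparison between the focused cohort and the rest of the estate. Returns <code>None</code> when no focus subjects were supplied.</p>
<p><strong>Flow arrows</strong></p><p>The focus-subject list, current-state report, and historical assessment. &#8594; <strong>build_analysis</strong> &#8594; The monograph uses the resulting bundle for Chapter 8 and Appendix D.</p>
<p><strong>How to think about it</strong></p><p>Read it in three passes. First, it indexes current and historical certificates by lowercased Subject CN and separates focus from non-focus certificates. Second, for each focus subject it collects direct certificates and carriers, measures the longest validity overlap between a historical direct certificate and a historical carrier, classifies the subject, and builds one <code>FocusSubjectDetail</code> row. Third, it computes cohort-versus-rest statistics and returns everything in one <code>FocusCohortAnalysis</code>. One detail to watch: the index keys are lowercased, but the lookups and <code>focus_set</code> use <code>subject.subject_cn</code> as-is, so the code relies on focus Subject CNs already being lowercase when they reach this function.</p>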
</td>
</tr>
</table>
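
The `notables` ranking in `build_analysis` adds up comparison results inside the sort key. The sketch below isolates that trick. The `Row` dataclass, the `signal_count` helper, and the sample rows are hypothetical stand-ins with fewer fields than `FocusSubjectDetail`, and the bucket rank that comes first in the real key is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical, reduced stand-in for FocusSubjectDetail.
    subject_cn: str
    current_revoked_certificates: int
    current_non_focus_san_carriers: int
    current_red_flags: str
    current_direct_certificates: int


def signal_count(row: Row) -> int:
    # In Python, True counts as 1 and False as 0, so adding comparisons
    # counts how many warning signals a row shows.
    return (
        (row.current_revoked_certificates > 0)
        + (row.current_non_focus_san_carriers > 0)
        + (row.current_red_flags != "-")
    )


rows = [
    Row("quiet.example.com", 0, 0, "-", 5),
    Row("busy.example.com", 2, 1, "-", 1),
    Row("flagged.example.com", 0, 0, "short-lived", 3),
]

# Negating a number sorts it in descending order: most signals first,
# then most direct certificates, then the name as a stable tie-breaker.
ranked = sorted(
    rows,
    key=lambda row: (-signal_count(row), -row.current_direct_certificates, row.subject_cn),
)
top_two = ranked[:2]

print([row.subject_cn for row in ranked])
```

The slice at the end mirrors the `[:10]` in the real code: the list is fully sorted first and only then cut to the top entries.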

File diff suppressed because it is too large Load diff

2168
teachingNoobs/ct_scan.md Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,645 @@
# ct_usage_assessment.py
Source file: [`ct_usage_assessment.py`](../ct_usage_assessment.py)
Certificate-purpose analyzer. This file looks at EKU and KeyUsage to decide what each certificate is technically allowed to do.
Main flow in one line: `certificate bytes -> EKU and KeyUsage -> purpose label -> summary counts`
How to read this page:
- left side: the actual source code block
- right side: a plain-English explanation for a beginner
- read from top to bottom because later blocks depend on earlier ones
## Module setup
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">#!/usr/bin/env python3
from __future__ import annotations
import argparse
import hashlib
import json
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass
from datetime import UTC, datetime
from pathlib import Path
from cryptography import x509
from cryptography.x509.oid import ExtensionOID
import ct_scan
SERVER_AUTH_OID = &quot;1.3.6.1.5.5.7.3.1&quot;
CLIENT_AUTH_OID = &quot;1.3.6.1.5.5.7.3.2&quot;
CODE_SIGNING_OID = &quot;1.3.6.1.5.5.7.3.3&quot;
EMAIL_PROTECTION_OID = &quot;1.3.6.1.5.5.7.3.4&quot;
TIME_STAMPING_OID = &quot;1.3.6.1.5.5.7.3.8&quot;
OCSP_SIGNING_OID = &quot;1.3.6.1.5.5.7.3.9&quot;
ANY_EXTENDED_KEY_USAGE_OID = &quot;2.5.29.37.0&quot;
EKU_LABELS = {
    SERVER_AUTH_OID: &quot;serverAuth&quot;,
    CLIENT_AUTH_OID: &quot;clientAuth&quot;,
    CODE_SIGNING_OID: &quot;codeSigning&quot;,
    EMAIL_PROTECTION_OID: &quot;emailProtection&quot;,
    TIME_STAMPING_OID: &quot;timeStamping&quot;,
    OCSP_SIGNING_OID: &quot;OCSPSigning&quot;,
    ANY_EXTENDED_KEY_USAGE_OID: &quot;anyExtendedKeyUsage&quot;,
}</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Imports, followed by the Extended Key Usage (EKU) object identifiers this file recognizes and a table that maps each identifier to its conventional short name, such as <code>serverAuth</code>.</p>
<p><strong>Flow arrows</strong></p><p>Nothing is computed here. &#8594; <strong>Module setup</strong> &#8594; Later functions in this file refer to these constants by name.</p>
<p><strong>How to think about it</strong></p><p>An OID is a dotted number that identifies one concept in X.509, for example one permitted use of a certificate. Giving each OID a named constant keeps unexplained number strings out of the logic that follows.</p>
</td>
</tr>
</table>
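
A minimal sketch of what the `EKU_LABELS` table is useful for: turning raw OID strings into readable names. The `describe_ekus` helper is hypothetical and is not taken from `ct_usage_assessment.py`; the table is shortened to two entries, and `1.2.3.4` is a made-up OID used to show the fallback.

```python
# Shortened copy of the table from the block above.
EKU_LABELS = {
    "1.3.6.1.5.5.7.3.1": "serverAuth",
    "1.3.6.1.5.5.7.3.2": "clientAuth",
}


def describe_ekus(oids: list[str]) -> str:
    # Hypothetical helper: known OIDs become short names, unknown OIDs are
    # shown as-is, and an empty list becomes the "-" placeholder.
    return ", ".join(EKU_LABELS.get(oid, oid) for oid in sorted(oids)) or "-"


server_and_client = describe_ekus(["1.3.6.1.5.5.7.3.2", "1.3.6.1.5.5.7.3.1"])
with_unknown = describe_ekus(["1.3.6.1.5.5.7.3.1", "1.2.3.4"])
no_eku = describe_ekus([])

print(server_and_client)
print(with_unknown)
print(no_eku)
```

Sorting the OIDs before joining gives the same text for the same set of usages regardless of the order in which they were read, which helps when such strings are later counted as templates.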
## PurposeClassification
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class PurposeClassification:
    fingerprint_sha256: str
    subject_cn: str
    issuer_name: str
    category: str
    eku_oids: list[str]
    key_usage_flags: list[str]
    valid_from_utc: str
    valid_to_utc: str
    matched_domains: list[str]
    san_dns_names: list[str]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>One certificate plus the usage label assigned to it.</p>
<p><strong>Flow arrows</strong></p><p>Fields read from one certificate. &#8594; <strong>PurposeClassification</strong> &#8594; Later code in this file works with a list of these records.</p>
<p><strong>How to think about it</strong></p><p><code>category</code> holds the assigned purpose label. <code>eku_oids</code> and <code>key_usage_flags</code> keep the raw evidence next to it, so a reader can check why a certificate received its label. The remaining fields identify the certificate and the names it covers.</p>
</td>
</tr>
</table>
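
The module imports `asdict` from `dataclasses`, which suggests that records like this are converted to plain dictionaries before being written as JSON. The sketch below shows that conversion under that assumption. `PurposeRecord` is a hypothetical, trimmed-down version of `PurposeClassification`, and the category value `tls_server_only` is invented for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PurposeRecord:
    # Hypothetical subset of the PurposeClassification fields.
    subject_cn: str
    category: str
    eku_oids: list[str]


record = PurposeRecord(
    subject_cn="www.example.com",
    category="tls_server_only",  # invented label, not from the source
    eku_oids=["1.3.6.1.5.5.7.3.1"],
)

# asdict walks the dataclass and returns ordinary dicts and lists,
# which json.dumps can serialize without a custom encoder.
as_dict = asdict(record)
as_json = json.dumps(as_dict, indent=2)

print(as_json)
```

Because every field in `PurposeClassification` is a string or a list of strings, the same two calls would work on the full class without extra handling.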
## AssessmentSummary
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">@dataclass
class AssessmentSummary:
    generated_at_utc: str
    source_cache_domains: list[str]
    unique_leaf_certificates: int
    category_counts: dict[str, int]
    eku_templates: dict[str, int]
    key_usage_templates: dict[str, int]
    issuer_breakdown: dict[str, dict[str, int]]
    validity_start_years: dict[str, dict[str, int]]
    san_type_counts: dict[str, int]
    subject_cn_in_dns_san_count: int
    subject_cn_not_in_dns_san_count: int
    dual_eku_subject_cns_with_server_only_sibling: list[str]
    dual_eku_subject_cns_without_server_only_sibling: list[str]</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>The roll-up numbers that power the purpose chapter: totals per category, per EKU and KeyUsage combination, per issuer, and per validity start year, plus a few SAN statistics.</p>
<p><strong>Flow arrows</strong></p><p>The classified certificates. &#8594; <strong>AssessmentSummary</strong> &#8594; The Markdown and JSON assessment outputs named in <code>parse_args</code> (the code that writes them is not shown in this section).</p>
<p><strong>How to think about it</strong></p><p>Every field is a plain number, list, or dictionary of counts. Nothing here holds a certificate object, which keeps the summary easy to serialize and easy to read in a report.</p>
</td>
</tr>
</table>
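
Fields such as `category_counts` and `issuer_breakdown` are typed as plain dictionaries of counts. The sketch below shows one way such roll-ups can be built with `Counter` and `defaultdict`, both of which the module imports. The sample observations and the category names are made up, and the real file may compute these fields differently.

```python
from collections import Counter, defaultdict

# Hypothetical (issuer, category) pairs, one per certificate.
observations = [
    ("Example CA", "server_only"),
    ("Example CA", "server_and_client"),
    ("Other CA", "server_only"),
    ("Example CA", "server_only"),
]

# Flat count: how many certificates fall into each category.
category_counts = Counter(category for _, category in observations)

# Nested count: one Counter per issuer, created on first use.
issuer_breakdown: dict[str, Counter] = defaultdict(Counter)
for issuer, category in observations:
    issuer_breakdown[issuer][category] += 1

# Convert to plain dicts so the values match dict[str, int] and
# dict[str, dict[str, int]] and serialize cleanly to JSON.
category_counts_plain = dict(category_counts)
issuer_breakdown_plain = {
    issuer: dict(counts) for issuer, counts in issuer_breakdown.items()
}

print(category_counts_plain)
print(issuer_breakdown_plain)
```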
## utc_now_iso
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def utc_now_iso() -&gt; str:
    return datetime.now(UTC).isoformat(timespec=&quot;seconds&quot;).replace(&quot;+00:00&quot;, &quot;Z&quot;)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Returns the current UTC time as an ISO 8601 string with second precision and a trailing <code>Z</code>, for example <code>2026-04-01T16:57:58Z</code>.</p>
<p><strong>Flow arrows</strong></p><p>The system clock. &#8594; <strong>utc_now_iso</strong> &#8594; A timestamp field such as <code>generated_at_utc</code> in <code>AssessmentSummary</code> (the call site is not shown in this section).</p>
<p><strong>How to think about it</strong></p><p><code>isoformat</code> on a UTC-aware datetime ends in <code>+00:00</code>; the <code>replace</code> call swaps that suffix for the shorter <code>Z</code>. The <code>UTC</code> constant imported from <code>datetime</code> requires Python 3.11 or later.</p>
</td>
</tr>
</table>
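
To see the formatting without depending on the current time, the sketch below applies the same two steps to a fixed moment. The `to_iso_z` helper is hypothetical. It uses `timezone.utc` instead of `datetime.UTC` so that it also runs on Python versions older than 3.11; the two refer to the same timezone object.

```python
from datetime import datetime, timezone


def to_iso_z(moment: datetime) -> str:
    # Hypothetical helper: same formatting steps as utc_now_iso, but the
    # moment is passed in so that the result is reproducible.
    return moment.isoformat(timespec="seconds").replace("+00:00", "Z")


# A fixed UTC timestamp with microseconds, which timespec="seconds" drops.
fixed = datetime(2026, 4, 1, 16, 57, 58, 123456, tzinfo=timezone.utc)
stamp = to_iso_z(fixed)

print(stamp)  # 2026-04-01T16:57:58Z
```

The `replace` step only has an effect on UTC-aware values. A datetime in another offset would keep its own suffix, for example `+02:00`.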
## parse_args
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def parse_args() -&gt; argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=&quot;Assess certificate intended usage from EKU and KeyUsage.&quot;
    )
    parser.add_argument(
        &quot;--domains-file&quot;,
        type=Path,
        default=Path(&quot;domains.local.txt&quot;),
        help=&quot;Configurable list of search domains, one per line.&quot;,
    )
    parser.add_argument(
        &quot;--cache-dir&quot;,
        type=Path,
        default=Path(&quot;.cache/ct-search&quot;),
        help=&quot;Directory used by ct_scan.py for cached CT results.&quot;,
    )
    parser.add_argument(
        &quot;--cache-ttl-seconds&quot;,
        type=int,
        default=86400,
        help=&quot;Reuse cached CT results up to this age before refreshing from crt.sh.&quot;,
    )
    parser.add_argument(
        &quot;--max-candidates&quot;,
        type=int,
        default=10000,
        help=&quot;Maximum raw crt.sh identity rows to inspect per configured domain.&quot;,
    )
    parser.add_argument(
        &quot;--attempts&quot;,
        type=int,
        default=3,
        help=&quot;Retry attempts for live crt.sh database queries.&quot;,
    )
    parser.add_argument(
        &quot;--markdown-output&quot;,
        type=Path,
        default=Path(&quot;output/certificate-purpose-assessment.md&quot;),
        help=&quot;Human-readable assessment output.&quot;,
    )
    parser.add_argument(
        &quot;--json-output&quot;,
        type=Path,
        default=Path(&quot;output/certificate-purpose-assessment.json&quot;),
        help=&quot;Machine-readable assessment output.&quot;,
    )
    parser.add_argument(
        &quot;--verbose&quot;,
        action=&quot;store_true&quot;,
        help=&quot;Print refresh activity to stderr.&quot;,
    )
    return parser.parse_args()</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>This block defines the command-line options for the file: the domains file, the cache directory and TTL, the candidate and retry limits, the Markdown and JSON output paths, and a verbose switch. Every option has a default, so the script also runs with no arguments.</p>
<p><strong>Flow arrows</strong></p><p>Command-line arguments from the operator. &#8594; <strong>parse_args</strong> &#8594; <code>main</code> reads the parsed values from the returned <code>argparse.Namespace</code>.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
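
The default-versus-override behaviour described above can be tried without touching the real script. The following is a minimal sketch, assuming a hypothetical `build_parser` helper that copies only three of the options; passing an explicit list to `parse_args` is a standard `argparse` feature and keeps the example independent of `sys.argv`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical three-option subset of the real parser, for illustration only.
    parser = argparse.ArgumentParser(description="Demo subset of the CLI.")
    parser.add_argument("--cache-dir", type=Path, default=Path(".cache/ct-search"))
    parser.add_argument("--cache-ttl-seconds", type=int, default=86400)
    parser.add_argument("--verbose", action="store_true")
    return parser


# An empty list means "no arguments given", so every default applies.
defaults = build_parser().parse_args([])

# Dashes in option names become underscores on the namespace.
overridden = build_parser().parse_args(["--cache-ttl-seconds", "0", "--verbose"])

print(defaults.cache_dir, defaults.cache_ttl_seconds, defaults.verbose)
print(overridden.cache_ttl_seconds, overridden.verbose)
```

Note that `type=Path` converts the raw string for you, which is why `main` can later call `.parent.mkdir(...)` directly on the output arguments.
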
## load_records
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def load_records(
    domains: list[str],
    cache_dir: Path,
    cache_ttl_seconds: int,
    max_candidates: int,
    attempts: int,
    verbose: bool,
) -&gt; list[ct_scan.DatabaseRecord]:
    all_records: list[ct_scan.DatabaseRecord] = []
    for domain in domains:
        records = ct_scan.load_cached_records(cache_dir, domain, cache_ttl_seconds, max_candidates)
        if records is None:
            records = ct_scan.query_domain(domain, max_candidates=max_candidates, attempts=attempts, verbose=verbose)
            ct_scan.store_cached_records(cache_dir, domain, max_candidates=max_candidates, records=records)
        all_records.extend(records)
    return all_records</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>For each configured domain, this block first asks the on-disk cache for records. If the cache returns <code>None</code>, it queries crt.sh live and stores the result back in the cache. The records for all domains end up in one combined list.</p>
<p><strong>Flow arrows</strong></p><p>The domain list and cache settings from <code>main</code>. &#8594; <strong>load_records</strong> &#8594; <code>ct_scan.build_hits</code> and <code>build_classifications</code> consume the combined record list.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
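
The cache-first flow in `load_records` can be sketched without `ct_scan`. In this illustration the function `load_with_cache`, the in-memory `cache` dictionary, and `fake_query` are all hypothetical stand-ins; only the control flow mirrors the block above.

```python
from typing import Callable, Optional


def load_with_cache(
    keys: list[str],
    read_cache: Callable[[str], Optional[list[str]]],
    query_live: Callable[[str], list[str]],
    write_cache: Callable[[str, list[str]], None],
) -> list[str]:
    combined: list[str] = []
    for key in keys:
        rows = read_cache(key)
        if rows is None:  # cache miss: fall back to the live source
            rows = query_live(key)
            write_cache(key, rows)
        combined.extend(rows)
    return combined


# Hypothetical stand-ins for the on-disk cache and the live crt.sh query.
cache: dict[str, list[str]] = {"example.com": ["cached-row"]}
live_calls: list[str] = []


def fake_query(domain: str) -> list[str]:
    live_calls.append(domain)
    return [f"live-row-for-{domain}"]


result = load_with_cache(
    ["example.com", "example.org"],
    read_cache=cache.get,
    query_live=fake_query,
    write_cache=cache.__setitem__,
)
print(result)
print(live_calls)
```

The test is `is None` rather than plain truthiness, so an empty cached list still counts as a hit and does not trigger a new live query.
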
## extract_eku_oids
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def extract_eku_oids(cert: x509.Certificate) -&gt; list[str]:
    try:
        extension = cert.extensions.get_extension_for_oid(ExtensionOID.EXTENDED_KEY_USAGE)
    except x509.ExtensionNotFound:
        return []
    return sorted(oid.dotted_string for oid in extension.value)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>This block reads the Extended Key Usage (EKU) extension from one certificate and returns its OIDs as a sorted list of dotted strings. A certificate without the extension yields an empty list.</p>
<p><strong>Flow arrows</strong></p><p>One certificate object. &#8594; <strong>extract_eku_oids</strong> &#8594; <code>classify_purpose</code> uses these OIDs to decide the category.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## extract_key_usage_flags
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def extract_key_usage_flags(cert: x509.Certificate) -&gt; list[str]:
    try:
        key_usage = cert.extensions.get_extension_for_oid(ExtensionOID.KEY_USAGE).value
    except x509.ExtensionNotFound:
        return []
    flags: list[str] = []
    for attribute in (
        &quot;digital_signature&quot;,
        &quot;content_commitment&quot;,
        &quot;key_encipherment&quot;,
        &quot;data_encipherment&quot;,
        &quot;key_agreement&quot;,
        &quot;key_cert_sign&quot;,
        &quot;crl_sign&quot;,
    ):
        if getattr(key_usage, attribute):
            flags.append(attribute)
    if key_usage.key_agreement:
        if key_usage.encipher_only:
            flags.append(&quot;encipher_only&quot;)
        if key_usage.decipher_only:
            flags.append(&quot;decipher_only&quot;)
    return flags</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
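<p><strong>What this block is doing</strong></p><p>This block reads the KeyUsage extension from one certificate and returns the names of the flags that are set. <code>encipher_only</code> and <code>decipher_only</code> are checked only when <code>key_agreement</code> is set, because those two bits are only meaningful together with key agreement. A certificate without the extension yields an empty list.</p>
<p><strong>Flow arrows</strong></p><p>One certificate object. &#8594; <strong>extract_key_usage_flags</strong> &#8594; <code>build_classifications</code> stores these flags as supporting evidence.</p>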
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
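
The nested `if key_usage.key_agreement:` guard is easy to overlook. To my knowledge, the `cryptography` library raises `ValueError` when `encipher_only` or `decipher_only` is read on a KeyUsage value whose `key_agreement` is false, so the guard is what keeps the function from crashing. The sketch below imitates that behaviour with a hypothetical `FakeKeyUsage` class so that it runs without any third-party package; it is not the library's own code.

```python
class FakeKeyUsage:
    """Hypothetical stand-in that mimics the guarded property."""

    def __init__(self, key_agreement: bool, encipher_only: bool = False) -> None:
        self.key_agreement = key_agreement
        self._encipher_only = encipher_only

    @property
    def encipher_only(self) -> bool:
        # Assumed behaviour: the value is undefined without key agreement.
        if not self.key_agreement:
            raise ValueError("encipher_only is undefined unless key_agreement is true")
        return self._encipher_only


def safe_flags(key_usage: FakeKeyUsage) -> list[str]:
    flags: list[str] = []
    if key_usage.key_agreement:
        flags.append("key_agreement")
        # Only touch the dependent bit once key agreement is known to be set.
        if key_usage.encipher_only:
            flags.append("encipher_only")
    return flags


print(safe_flags(FakeKeyUsage(key_agreement=False)))
print(safe_flags(FakeKeyUsage(key_agreement=True, encipher_only=True)))
```
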
## classify_purpose
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def classify_purpose(eku_oids: list[str]) -&gt; str:
    eku_set = set(eku_oids)
    has_server = SERVER_AUTH_OID in eku_set or ANY_EXTENDED_KEY_USAGE_OID in eku_set
    has_client = CLIENT_AUTH_OID in eku_set or ANY_EXTENDED_KEY_USAGE_OID in eku_set
    has_code_signing = CODE_SIGNING_OID in eku_set
    has_email = EMAIL_PROTECTION_OID in eku_set
    if not eku_oids:
        return &quot;no_eku&quot;
    if has_server and not has_client and not has_code_signing and not has_email:
        return &quot;tls_server_only&quot;
    if has_server and has_client and not has_code_signing and not has_email:
        return &quot;tls_server_and_client&quot;
    if has_client and not has_server and not has_code_signing and not has_email:
        return &quot;client_auth_only&quot;
    if has_email and not has_server and not has_client and not has_code_signing:
        return &quot;smime_only&quot;
    if has_code_signing and not has_server and not has_client and not has_email:
        return &quot;code_signing_only&quot;
    return &quot;mixed_or_other&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
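<p><strong>What this block is doing</strong></p><p>This block maps the EKU OID list to exactly one category label. <code>ANY_EXTENDED_KEY_USAGE_OID</code> counts as both server and client authentication, an empty list becomes <code>no_eku</code>, and any combination that matches none of the explicit rules falls through to <code>mixed_or_other</code>.</p>
<p><strong>Flow arrows</strong></p><p>The EKU OID list from one certificate. &#8594; <strong>classify_purpose</strong> &#8594; <code>build_classifications</code> turns that decision into a per-certificate record.</p>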
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
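
A few worked inputs make the rules easier to check. The sketch below is a table-driven restatement of the same decision logic, not the code from the file. It assumes the constants hold the standard X.509 OID values shown here, since their definitions are not part of this section.

```python
# Assumed values: the standard OIDs for these extended key usages.
SERVER_AUTH_OID = "1.3.6.1.5.5.7.3.1"
CLIENT_AUTH_OID = "1.3.6.1.5.5.7.3.2"
CODE_SIGNING_OID = "1.3.6.1.5.5.7.3.3"
EMAIL_PROTECTION_OID = "1.3.6.1.5.5.7.3.4"
ANY_EXTENDED_KEY_USAGE_OID = "2.5.29.37.0"

# Key order: (server, client, code signing, email).
RULES = {
    (True, False, False, False): "tls_server_only",
    (True, True, False, False): "tls_server_and_client",
    (False, True, False, False): "client_auth_only",
    (False, False, False, True): "smime_only",
    (False, False, True, False): "code_signing_only",
}


def classify_purpose(eku_oids: list[str]) -> str:
    if not eku_oids:
        return "no_eku"
    eku_set = set(eku_oids)
    any_eku = ANY_EXTENDED_KEY_USAGE_OID in eku_set
    purposes = (
        SERVER_AUTH_OID in eku_set or any_eku,
        CLIENT_AUTH_OID in eku_set or any_eku,
        CODE_SIGNING_OID in eku_set,
        EMAIL_PROTECTION_OID in eku_set,
    )
    # Every combination without an explicit rule is "mixed_or_other".
    return RULES.get(purposes, "mixed_or_other")


for oids in (
    [],
    [SERVER_AUTH_OID],
    [ANY_EXTENDED_KEY_USAGE_OID],
    [SERVER_AUTH_OID, "1.2.3.4"],  # unrecognised OIDs do not change the label
    [SERVER_AUTH_OID, EMAIL_PROTECTION_OID],
    ["1.2.3.4"],
):
    print(oids, "->", classify_purpose(oids))
```

Two cases are worth noticing: a certificate carrying only the any-purpose OID lands in `tls_server_and_client`, and a certificate whose EKU list contains only unrecognised OIDs lands in `mixed_or_other`, not `no_eku`.
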
## format_eku_template
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def format_eku_template(eku_oids: list[str]) -&gt; str:
    if not eku_oids:
        return &quot;(none)&quot;
    return &quot;, &quot;.join(EKU_LABELS.get(oid, oid) for oid in eku_oids)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>This helper in <code>ct_usage_assessment.py</code> turns an EKU OID list into one display string. Known OIDs are replaced by their label from <code>EKU_LABELS</code>, unknown OIDs are kept as dotted strings, and an empty list becomes <code>(none)</code>.</p>
<p><strong>Flow arrows</strong></p><p>The EKU OID list of one certificate. &#8594; <strong>format_eku_template</strong> &#8594; <code>summarize</code> counts the resulting strings as templates, and <code>render_markdown</code> prints them.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## format_key_usage_template
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def format_key_usage_template(flags: list[str]) -&gt; str:
    if not flags:
        return &quot;(missing)&quot;
    return &quot;, &quot;.join(flags)</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>This helper in <code>ct_usage_assessment.py</code> joins the KeyUsage flag names into one display string. An empty list becomes <code>(missing)</code>.</p>
<p><strong>Flow arrows</strong></p><p>The KeyUsage flag list of one certificate. &#8594; <strong>format_key_usage_template</strong> &#8594; <code>summarize</code> counts the resulting strings as templates, and <code>render_markdown</code> prints them.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
## build_classifications
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def build_classifications(
    hits: list[ct_scan.CertificateHit],
    records: list[ct_scan.DatabaseRecord],
) -&gt; list[PurposeClassification]:
    certificates_by_fingerprint: dict[str, x509.Certificate] = {}
    for record in records:
        cert = x509.load_der_x509_certificate(record.certificate_der)
        is_leaf, _reason = ct_scan.is_leaf_certificate(cert)
        if not is_leaf:
            continue
        fingerprint_sha256 = hashlib.sha256(record.certificate_der).hexdigest()
        certificates_by_fingerprint.setdefault(fingerprint_sha256, cert)
    results: list[PurposeClassification] = []
    for hit in hits:
        cert = certificates_by_fingerprint[hit.fingerprint_sha256]
        san_dns_names = sorted(entry[4:] for entry in hit.san_entries if entry.startswith(&quot;DNS:&quot;))
        results.append(
            PurposeClassification(
                fingerprint_sha256=hit.fingerprint_sha256,
                subject_cn=hit.subject_cn,
                issuer_name=ct_scan.primary_issuer_name(hit),
                category=classify_purpose(extract_eku_oids(cert)),
                eku_oids=extract_eku_oids(cert),
                key_usage_flags=extract_key_usage_flags(cert),
                valid_from_utc=ct_scan.utc_iso(hit.validity_not_before),
                valid_to_utc=ct_scan.utc_iso(hit.validity_not_after),
                matched_domains=sorted(hit.matched_domains),
                san_dns_names=san_dns_names,
            )
        )
    results.sort(
        key=lambda item: (
            item.category,
            item.subject_cn.casefold(),
            item.valid_from_utc,
            item.fingerprint_sha256,
        )
    )
    return results</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
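<p><strong>What this block is doing</strong></p><p>Parses every raw record, keeps the leaf certificates indexed by SHA-256 fingerprint, and then builds one <code>PurposeClassification</code> per hit with its category, EKU OIDs, KeyUsage flags, validity window, and DNS SAN names. The result is sorted by category, Subject CN, validity start, and fingerprint.</p>
<p><strong>Flow arrows</strong></p><p>The cleaned current hits plus raw records. &#8594; <strong>build_classifications</strong> &#8594; <code>summarize</code> compresses these rows into report-level counts.</p>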
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
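
Two small idioms carry most of `build_classifications`: indexing by fingerprint with `setdefault`, and stripping the `DNS:` prefix from SAN entries. The following sketch shows both with made-up data. The byte strings are not real DER certificates, and the exact format of non-DNS SAN entries in `ct_scan` is an assumption.

```python
import hashlib

# Placeholder byte strings standing in for DER-encoded certificates.
raw_certs = [b"der-bytes-A", b"der-bytes-B", b"der-bytes-A"]

by_fingerprint: dict[str, bytes] = {}
for der in raw_certs:
    fingerprint = hashlib.sha256(der).hexdigest()
    # setdefault keeps the first value seen, so duplicates collapse to one entry.
    by_fingerprint.setdefault(fingerprint, der)

# Hypothetical SAN entries; only the "DNS:" prefix is taken from the source.
san_entries = ["DNS:www.example.com", "IP Address:192.0.2.1", "DNS:api.example.com"]

# entry[4:] drops the four characters of "DNS:"; other entry types are skipped.
san_dns_names = sorted(entry[4:] for entry in san_entries if entry.startswith("DNS:"))

print(len(by_fingerprint))
print(san_dns_names)
```

Because the lookup `certificates_by_fingerprint[hit.fingerprint_sha256]` in the real function uses plain indexing, it relies on every hit having a matching leaf record; a hit without one would raise `KeyError`.
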
## summarize
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def summarize(classifications: list[PurposeClassification], domains: list[str]) -&gt; AssessmentSummary:
    category_counts = Counter(item.category for item in classifications)
    eku_templates = Counter(format_eku_template(item.eku_oids) for item in classifications)
    key_usage_templates = Counter(format_key_usage_template(item.key_usage_flags) for item in classifications)
    issuer_breakdown: dict[str, Counter[str]] = defaultdict(Counter)
    validity_start_years: dict[str, Counter[str]] = defaultdict(Counter)
    san_type_counts: Counter[str] = Counter()
    subject_cn_in_dns_san_count = 0
    subject_cn_not_in_dns_san_count = 0
    categories_by_canonical_cn: dict[str, set[str]] = defaultdict(set)
    for item in classifications:
        issuer_breakdown[item.category][item.issuer_name] += 1
        validity_start_years[item.category][item.valid_from_utc[:4]] += 1
        san_type_counts[&quot;DNSName&quot;] += len(item.san_dns_names)
        if item.subject_cn in set(item.san_dns_names):
            subject_cn_in_dns_san_count += 1
        else:
            subject_cn_not_in_dns_san_count += 1
        categories_by_canonical_cn[ct_scan.canonicalize_subject_cn(item.subject_cn)].add(item.category)
    dual_with_server_only = sorted(
        canonical_cn
        for canonical_cn, values in categories_by_canonical_cn.items()
        if &quot;tls_server_and_client&quot; in values and &quot;tls_server_only&quot; in values
    )
    dual_without_server_only = sorted(
        canonical_cn
        for canonical_cn, values in categories_by_canonical_cn.items()
        if values == {&quot;tls_server_and_client&quot;}
    )
    return AssessmentSummary(
        generated_at_utc=utc_now_iso(),
        source_cache_domains=domains,
        unique_leaf_certificates=len(classifications),
        category_counts=dict(category_counts),
        eku_templates=dict(eku_templates.most_common()),
        key_usage_templates=dict(key_usage_templates.most_common()),
        issuer_breakdown={category: dict(counter.most_common()) for category, counter in issuer_breakdown.items()},
        validity_start_years={
            category: dict(sorted(counter.items()))
            for category, counter in validity_start_years.items()
        },
        san_type_counts=dict(san_type_counts),
        subject_cn_in_dns_san_count=subject_cn_in_dns_san_count,
        subject_cn_not_in_dns_san_count=subject_cn_not_in_dns_san_count,
        dual_eku_subject_cns_with_server_only_sibling=dual_with_server_only,
        dual_eku_subject_cns_without_server_only_sibling=dual_without_server_only,
    )</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>Compresses the per-certificate labels into counts, templates, and issuer breakdowns.</p>
<p><strong>Flow arrows</strong></p><p>The per-certificate purpose labels. &#8594; <strong>summarize</strong> &#8594; Current-state and monograph chapters use the summary counts and templates.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
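
The nested counting and the "sibling" detection in `summarize` are the least obvious parts, so here is a minimal sketch of just those two ideas. The rows are invented, and the tuple layout is a simplification of the real `PurposeClassification` records.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (canonical Subject CN, category, issuer name).
rows = [
    ("a.example.com", "tls_server_only", "Issuer X"),
    ("a.example.com", "tls_server_and_client", "Issuer Y"),
    ("b.example.com", "tls_server_and_client", "Issuer Y"),
    ("c.example.com", "tls_server_only", "Issuer X"),
]

# defaultdict(Counter) creates the inner Counter on first access,
# so a two-level tally needs no explicit initialisation.
issuer_breakdown: dict[str, Counter] = defaultdict(Counter)
categories_by_cn: dict[str, set[str]] = defaultdict(set)
for canonical_cn, category, issuer in rows:
    issuer_breakdown[category][issuer] += 1
    categories_by_cn[canonical_cn].add(category)

# A dual-EKU name "has a sibling" if the same name also appears as server-only.
with_sibling = sorted(
    cn
    for cn, cats in categories_by_cn.items()
    if "tls_server_and_client" in cats and "tls_server_only" in cats
)
# Exact set equality: the name appears in the dual-EKU bucket and nowhere else.
without_sibling = sorted(
    cn for cn, cats in categories_by_cn.items() if cats == {"tls_server_and_client"}
)

print(dict(issuer_breakdown["tls_server_and_client"]))
print(with_sibling, without_sibling)
```

Because the second list uses exact set equality, a name that appears as dual-EKU and also in some third category (for example `mixed_or_other`) would be reported in neither list.
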
## render_markdown
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def render_markdown(summary: AssessmentSummary, classifications: list[PurposeClassification]) -&gt; str:
    lines: list[str] = []
    lines.append(&quot;# Certificate Purpose Assessment&quot;)
    lines.append(&quot;&quot;)
    lines.append(f&quot;Generated at: `{summary.generated_at_utc}`&quot;)
    lines.append(f&quot;Configured domains: `{&#x27;, &#x27;.join(summary.source_cache_domains)}`&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## Headline Verdict&quot;)
    lines.append(&quot;&quot;)
    lines.append(f&quot;- Unique current leaf certificates assessed: **{summary.unique_leaf_certificates}**&quot;)
    lines.append(f&quot;- TLS server only: **{summary.category_counts.get(&#x27;tls_server_only&#x27;, 0)}**&quot;)
    lines.append(f&quot;- TLS server and client auth: **{summary.category_counts.get(&#x27;tls_server_and_client&#x27;, 0)}**&quot;)
    lines.append(f&quot;- Client auth only: **{summary.category_counts.get(&#x27;client_auth_only&#x27;, 0)}**&quot;)
    lines.append(f&quot;- S/MIME only: **{summary.category_counts.get(&#x27;smime_only&#x27;, 0)}**&quot;)
    lines.append(f&quot;- Code signing only: **{summary.category_counts.get(&#x27;code_signing_only&#x27;, 0)}**&quot;)
    lines.append(f&quot;- Mixed or other: **{summary.category_counts.get(&#x27;mixed_or_other&#x27;, 0)}**&quot;)
    lines.append(f&quot;- No EKU: **{summary.category_counts.get(&#x27;no_eku&#x27;, 0)}**&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## What This Means&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;- The corpus contains **only TLS-capable certificates**. There are no client-only, S/MIME, or code-signing certificates.&quot;)
    lines.append(&quot;- All SAN entries seen in this corpus are DNS names.&quot;)
    lines.append(f&quot;- Subject CN appears literally in a DNS SAN for **{summary.subject_cn_in_dns_san_count} of {summary.unique_leaf_certificates}** certificates.&quot;)
    lines.append(&quot;- The only ambiguity is whether to keep or set aside the certificates whose EKU allows both `serverAuth` and `clientAuth`.&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## Rework Options&quot;)
    lines.append(&quot;&quot;)
    lines.append(f&quot;- Keep the full operational server corpus: **{summary.unique_leaf_certificates}** certificates.&quot;)
    lines.append(f&quot;- Keep only strict server-auth certificates: **{summary.category_counts.get(&#x27;tls_server_only&#x27;, 0)}** certificates.&quot;)
    lines.append(f&quot;- Create a review bucket for dual-EKU certificates: **{summary.category_counts.get(&#x27;tls_server_and_client&#x27;, 0)}** certificates.&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## EKU Templates&quot;)
    lines.append(&quot;&quot;)
    for template, count in summary.eku_templates.items():
        lines.append(f&quot;- `{template}`: {count}&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## KeyUsage Templates&quot;)
    lines.append(&quot;&quot;)
    for template, count in summary.key_usage_templates.items():
        lines.append(f&quot;- `{template}`: {count}&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## Issuer Breakdown&quot;)
    lines.append(&quot;&quot;)
    for category in sorted(summary.issuer_breakdown):
        lines.append(f&quot;### `{category}`&quot;)
        lines.append(&quot;&quot;)
        for issuer_name, count in summary.issuer_breakdown[category].items():
            lines.append(f&quot;- `{issuer_name}`: {count}&quot;)
        lines.append(&quot;&quot;)
    lines.append(&quot;## Time Pattern&quot;)
    lines.append(&quot;&quot;)
    dual_years = set(summary.validity_start_years.get(&quot;tls_server_and_client&quot;, {}))
    server_years = set(summary.validity_start_years.get(&quot;tls_server_only&quot;, {}))
    if dual_years and len(dual_years) == 1:
        lines.append(
            f&quot;- The dual-EKU bucket is entirely composed of certificates whose current validity starts in **{next(iter(sorted(dual_years)))}**.&quot;
        )
    if dual_years and server_years and dual_years != server_years:
        lines.append(&quot;- The year split suggests at least some change in issuance policy over time.&quot;)
    else:
        lines.append(&quot;- Time alone does not prove a migration. The stronger signal is the template split by issuer and EKU.&quot;)
    lines.append(&quot;&quot;)
    for category in sorted(summary.validity_start_years):
        year_counts = &quot;, &quot;.join(f&quot;{year}: {count}&quot; for year, count in summary.validity_start_years[category].items())
        lines.append(f&quot;- `{category}`: {year_counts}&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## Interpretation&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;- The `tls_server_and_client` certificates still look like hostname certificates, not user or robot identity certificates.&quot;)
    lines.append(&quot;- Evidence: public DNS-style Subject CNs, DNS-only SANs, public WebPKI server-auth issuers, and no email or personal-name SAN material.&quot;)
    lines.append(&quot;- The most plausible reading is **legacy or permissive server certificate templates** that also included `clientAuth`, not a separate client-certificate estate.&quot;)
    lines.append(&quot;&quot;)
    lines.append(&quot;## Dual-EKU Hostname Overlap&quot;)
    lines.append(&quot;&quot;)
    lines.append(
        f&quot;- Dual-EKU subject CN families that also have a strict server-only sibling: **{len(summary.dual_eku_subject_cns_with_server_only_sibling)}**&quot;
    )
    lines.append(
        f&quot;- Dual-EKU subject CN families that currently appear only in the dual-EKU bucket: **{len(summary.dual_eku_subject_cns_without_server_only_sibling)}**&quot;
    )
    lines.append(&quot;&quot;)
    if summary.dual_eku_subject_cns_with_server_only_sibling:
        lines.append(&quot;### Dual-EKU Families With Server-Only Siblings&quot;)
        lines.append(&quot;&quot;)
        for subject_cn in summary.dual_eku_subject_cns_with_server_only_sibling:
            lines.append(f&quot;- `{subject_cn}`&quot;)
        lines.append(&quot;&quot;)
    if summary.dual_eku_subject_cns_without_server_only_sibling:
        lines.append(&quot;### Dual-EKU Families Without Server-Only Siblings&quot;)
        lines.append(&quot;&quot;)
        for subject_cn in summary.dual_eku_subject_cns_without_server_only_sibling:
            lines.append(f&quot;- `{subject_cn}`&quot;)
        lines.append(&quot;&quot;)
    lines.append(&quot;## Detailed Dual-EKU Certificates&quot;)
    lines.append(&quot;&quot;)
    dual_items = [item for item in classifications if item.category == &quot;tls_server_and_client&quot;]
    if not dual_items:
        lines.append(&quot;- None&quot;)
        lines.append(&quot;&quot;)
    else:
        for item in dual_items:
            dns_sample = &quot;, &quot;.join(item.san_dns_names[:8])
            if len(item.san_dns_names) &gt; 8:
                dns_sample += &quot;, ...&quot;
            lines.append(f&quot;### `{item.subject_cn}`&quot;)
            lines.append(&quot;&quot;)
            lines.append(f&quot;- Issuer: `{item.issuer_name}`&quot;)
            lines.append(f&quot;- Validity: `{item.valid_from_utc}` to `{item.valid_to_utc}`&quot;)
            lines.append(f&quot;- Matched search domains: `{&#x27;, &#x27;.join(item.matched_domains)}`&quot;)
            lines.append(f&quot;- EKU: `{format_eku_template(item.eku_oids)}`&quot;)
            lines.append(f&quot;- KeyUsage: `{format_key_usage_template(item.key_usage_flags)}`&quot;)
            lines.append(f&quot;- DNS SAN count: `{len(item.san_dns_names)}`&quot;)
            lines.append(f&quot;- DNS SAN sample: `{dns_sample}`&quot;)
            lines.append(&quot;&quot;)
    return &quot;\n&quot;.join(lines) + &quot;\n&quot;</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
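<p><strong>What this block is doing</strong></p><p>Builds the standalone purpose report as one Markdown string: headline counts, EKU and KeyUsage templates, issuer and year breakdowns, and a detail section for each dual-EKU certificate. It does not write the file itself.</p>
<p><strong>Flow arrows</strong></p><p>The summary and the per-certificate classifications. &#8594; <strong>render_markdown</strong> &#8594; <code>main</code> writes the returned string to the Markdown output path.</p>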
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
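
The whole renderer follows one pattern: append strings to a list, then join once at the end. A reduced sketch of that pattern, with a hypothetical `render_counts` helper and made-up counts, looks like this:

```python
def render_counts(title: str, counts: dict[str, int]) -> str:
    lines: list[str] = []
    lines.append(f"# {title}")
    lines.append("")  # an empty string becomes a blank line after joining
    for label, count in counts.items():
        lines.append(f"- `{label}`: {count}")
    lines.append("")
    # One join at the end avoids repeated string concatenation.
    return "\n".join(lines) + "\n"


report = render_counts("Demo", {"tls_server_only": 3, "no_eku": 1})
print(report)
```

Keeping the renderer free of file I/O means it can be checked by comparing strings, and the caller decides where the text ends up.
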
## main
<table style="width:100%; table-layout:fixed; border-collapse:collapse;">
<tr>
<td style="width:50%; vertical-align:top; padding:8px;">
<pre style="margin:0; padding:14px; overflow-x:auto; background:#111827; color:#e5e7eb; border-radius:10px; border:1px solid #374151; font-size:12px; line-height:1.45;"><code class="language-python">def main() -&gt; int:
    args = parse_args()
    domains = ct_scan.load_domains(args.domains_file)
    records = load_records(
        domains=domains,
        cache_dir=args.cache_dir,
        cache_ttl_seconds=args.cache_ttl_seconds,
        max_candidates=args.max_candidates,
        attempts=args.attempts,
        verbose=args.verbose,
    )
    hits, verification = ct_scan.build_hits(records)
    classifications = build_classifications(hits, records)
    summary = summarize(classifications, domains)
    markdown_payload = render_markdown(summary, classifications)
    json_payload = {
        &quot;summary&quot;: asdict(summary),
        &quot;verification&quot;: asdict(verification),
        &quot;classifications&quot;: [asdict(item) for item in classifications],
    }
    args.markdown_output.parent.mkdir(parents=True, exist_ok=True)
    args.json_output.parent.mkdir(parents=True, exist_ok=True)
    args.markdown_output.write_text(markdown_payload, encoding=&quot;utf-8&quot;)
    args.json_output.write_text(json.dumps(json_payload, indent=2, sort_keys=True), encoding=&quot;utf-8&quot;)
    return 0</code></pre>
</td>
<td style="width:50%; vertical-align:top; padding:8px;">
<p><strong>What this block is doing</strong></p><p>The standalone command-line entrypoint for the purpose analyzer.</p>
<p><strong>Flow arrows</strong></p><p>CLI arguments from the operator. &#8594; <strong>main</strong> &#8594; Runs the standalone purpose analysis end to end and writes the Markdown report and the JSON payload to the configured output paths.</p>
<p><strong>How to think about it</strong></p><p>Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?</p>
</td>
</tr>
</table>
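
The last four lines of `main` turn dataclasses into files. The sketch below reproduces that step in isolation, using a hypothetical `DemoSummary` dataclass and a temporary directory so it leaves nothing behind; the real `AssessmentSummary` has many more fields.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoSummary:
    # Hypothetical two-field stand-in for the real summary dataclass.
    unique_leaf_certificates: int
    category_counts: dict[str, int]


summary = DemoSummary(unique_leaf_certificates=2, category_counts={"tls_server_only": 2})

with tempfile.TemporaryDirectory() as tmp:
    json_output = Path(tmp) / "output" / "demo.json"
    # parents=True creates missing directories; exist_ok=True makes reruns safe.
    json_output.parent.mkdir(parents=True, exist_ok=True)
    # asdict converts the dataclass (recursively) into plain dicts and lists.
    payload = {"summary": asdict(summary)}
    json_output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    loaded = json.loads(json_output.read_text(encoding="utf-8"))

print(loaded)
```

`sort_keys=True` makes the key order in the JSON file independent of dictionary insertion order, which helps when comparing the output of two runs.
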