mirror of https://github.com/saymrwulf/CertTransparencySearch.git synced 2026-05-14 20:37:52 +00:00

saymrwulf 388d435158 Add teaching docs for analytics code

2026-04-01 18:57:58 +02:00

ct_dns_utils.py

Source file: ct_dns_utils.py

Public DNS scanner. This file runs dig, follows alias chains, finds public addresses, and collapses raw DNS evidence into readable delivery labels.

Main flow in one line: DNS name -> dig answers -> normalized observation -> provider hints -> delivery label

How to read this page:

left side: the actual source code block
right side: a plain-English explanation for a beginner
read from top to bottom because later blocks depend on earlier ones

Module setup

#!/usr/bin/env python3
from __future__ import annotations
import hashlib
import ipaddress
import json
import re
import subprocess
import time
from dataclasses import asdict, dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
import ct_scan

What this block is doing

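Imports and shared setup for the DNS scanning helpers, the cache helpers, and the logic that turns raw DNS answers into platform clues. `UTC` from `datetime` needs Python 3.11 or newer. `ct_scan` is another module from the same project; in the code shown on this page it is used only for `ct_scan.utc_iso`.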

Flow arrows

Nothing yet; this is the starting point. → Module setup → The later DNS helpers all reuse these imports and small shared helpers.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?

DnsObservation

@dataclass
class DnsObservation:
    original_name: str
    original_status: str
    cname_chain: list[str]
    terminal_name: str
    terminal_status: str
    a_records: list[str]
    aaaa_records: list[str]
    ptr_records: list[str]
    classification: str
    stack_signature: str
    provider_hints: list[str]

What this block is doing

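One complete DNS observation for one hostname: the name as queried and its response status, the CNAME chain and the name it ends on, the IPv4 and IPv6 addresses found, any reverse-DNS names, and three derived fields (`classification`, `stack_signature`, `provider_hints`).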

Flow arrows

Earlier blocks or operator input feed this block. → DnsObservation → Later blocks in the same file or in the next analytical stage consume its output.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
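
A minimal sketch of what one filled-in observation might look like and how it turns into a plain dict for caching. The hostnames and the address are illustrative (reserved documentation values), not project data, and the dataclass is repeated so the snippet runs on its own.

```python
from dataclasses import asdict, dataclass


@dataclass
class DnsObservation:
    original_name: str
    original_status: str
    cname_chain: list[str]
    terminal_name: str
    terminal_status: str
    a_records: list[str]
    aaaa_records: list[str]
    ptr_records: list[str]
    classification: str
    stack_signature: str
    provider_hints: list[str]


# Illustrative values only: example.com names and a 192.0.2.0/24 address.
observation = DnsObservation(
    original_name="www.example.com",
    original_status="NOERROR",
    cname_chain=["edge.example.net"],
    terminal_name="edge.example.net",
    terminal_status="NOERROR",
    a_records=["192.0.2.10"],
    aaaa_records=[],
    ptr_records=[],
    classification="cname_to_address",
    stack_signature="CNAME to address (provider unclear)",
    provider_hints=["Unclassified"],
)

# asdict() produces a plain dict that can be written out as JSON.
record = asdict(observation)
print(sorted(record))
```

Because every field is a string or a list of strings, the dict survives a JSON round trip unchanged and can be passed back as `DnsObservation(**record)`.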

normalize_name

def normalize_name(name: str) -> str:
    return name.rstrip(".").lower()

What this block is doing

Strips any trailing dots and lowercases the name, so `WWW.Example.com.` and `www.example.com` compare equal when names are matched and grouped.

Flow arrows

Earlier blocks or operator input feed this block. → normalize_name → Later blocks in the same file or in the next analytical stage consume its output.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
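
A quick check of the behaviour, using made-up sample names. The function body is copied from the block above.

```python
def normalize_name(name: str) -> str:
    return name.rstrip(".").lower()


# Three spellings that DNS treats as the same name.
samples = ["WWW.Example.COM.", "www.example.com", "www.example.com.."]
normalized = {normalize_name(sample) for sample in samples}
print(normalized)
```

All three spellings collapse to one entry, because `rstrip(".")` removes every trailing dot, not just one.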

cache_key

def cache_key(value: str) -> str:
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
    slug = re.sub(r"[^a-z0-9.-]+", "-", value.lower()).strip("-")
    slug = slug[:80] or "item"
    return f"v1-{slug}-{digest}.json"

What this block is doing

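Builds a filesystem-safe cache filename from any string. The result is `v1-`, then a readable slug of at most 80 characters taken from the lowercased value, then the first 16 hex characters of the SHA-256 digest of the original value, then `.json`. The digest keeps two different values with the same slug from sharing a file.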

Flow arrows

Earlier blocks or operator input feed this block. → cache_key → Later blocks in the same file or in the next analytical stage consume its output.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
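
The shape of the key is easiest to see by running it. The inputs below are illustrative; the function body is copied from the block above.

```python
import hashlib
import re


def cache_key(value: str) -> str:
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
    slug = re.sub(r"[^a-z0-9.-]+", "-", value.lower()).strip("-")
    slug = slug[:80] or "item"
    return f"v1-{slug}-{digest}.json"


mixed_case = cache_key("WWW.Example.com")
lower_case = cache_key("www.example.com")
url_like = cache_key("https://Example.com/a b")
print(mixed_case)
print(lower_case)
print(url_like)
```

Note that the slug is lowercased but the digest is computed from the original string, so `WWW.Example.com` and `www.example.com` share a slug yet get different keys. Callers that want one cache entry per hostname would need to normalize the name before calling `cache_key`.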

load_json_cache

def load_json_cache(cache_dir: Path, key: str, ttl_seconds: int) -> dict[str, Any] | None:
    path = cache_dir / key
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    cached_at = datetime.fromisoformat(payload["cached_at"].replace("Z", "+00:00"))
    age = time.time() - cached_at.astimezone(UTC).timestamp()
    if age > ttl_seconds:
        return None
    return payload

What this block is doing

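Reads one cache file and returns its payload only if the `cached_at` timestamp is no older than `ttl_seconds`. A missing or expired file returns `None`. A file with invalid JSON or without a `cached_at` field raises an exception instead of being treated as a cache miss.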

Flow arrows

Earlier blocks or operator input feed this block. → load_json_cache → Later blocks in the same file or in the next analytical stage consume its output.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?

store_json_cache

def store_json_cache(cache_dir: Path, key: str, payload: dict[str, Any]) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    enriched = dict(payload)
    enriched["cached_at"] = ct_scan.utc_iso(datetime.now(UTC))
    (cache_dir / key).write_text(json.dumps(enriched, indent=2, sort_keys=True), encoding="utf-8")

What this block is doing

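Saves an intermediate result as pretty-printed JSON so the next run can reuse it instead of recomputing everything. It creates the cache directory if needed and adds a `cached_at` UTC timestamp to a copy of the payload, leaving the caller's dict unchanged.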

Flow arrows

Earlier blocks or operator input feed this block. → store_json_cache → Later blocks in the same file or in the next analytical stage consume its output.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
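
A minimal round-trip sketch of the two cache helpers working together, written against a temporary directory. `utc_iso` here is a hypothetical stand-in for `ct_scan.utc_iso`, which is not shown on this page; it is assumed to produce a UTC timestamp with a `Z` suffix. `timezone.utc` replaces `UTC` so the sketch also runs on Python versions before 3.11.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


def utc_iso(moment: datetime) -> str:
    # Assumption: stand-in for ct_scan.utc_iso, emitting e.g. 2026-04-01T16:57:58Z.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def store_json_cache(cache_dir: Path, key: str, payload: dict[str, Any]) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    enriched = dict(payload)  # copy, so the caller's dict is not modified
    enriched["cached_at"] = utc_iso(datetime.now(timezone.utc))
    (cache_dir / key).write_text(json.dumps(enriched, indent=2, sort_keys=True), encoding="utf-8")


def load_json_cache(cache_dir: Path, key: str, ttl_seconds: int) -> Optional[dict[str, Any]]:
    path = cache_dir / key
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    cached_at = datetime.fromisoformat(payload["cached_at"].replace("Z", "+00:00"))
    age = time.time() - cached_at.timestamp()
    return None if age > ttl_seconds else payload


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "dns"
    original = {"answers": ["edge.example.net"]}
    store_json_cache(cache_dir, "v1-demo.json", original)
    fresh = load_json_cache(cache_dir, "v1-demo.json", ttl_seconds=3600)
    expired = load_json_cache(cache_dir, "v1-demo.json", ttl_seconds=-1)
    missing = load_json_cache(cache_dir, "v1-absent.json", ttl_seconds=3600)
```

A negative TTL is used only to force the expired branch without waiting. The loaded payload still carries `cached_at`, which is why callers that rebuild an object from it have to drop that field first.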

run_dig

def run_dig(name: str, rrtype: str, short: bool) -> str:
    cmd = ["dig", "+time=2", "+tries=1"]
    if short:
        cmd.append("+short")
    else:
        cmd.extend(["+noall", "+comments", "+answer"])
    cmd.extend([name, rrtype])
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

What this block is doing

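Builds and runs one `dig` command with a 2-second timeout and a single try, then returns its standard output. With `short=True` it asks for bare answers (`+short`); otherwise it asks for only the header comments and the answer section. Because `check=False`, a nonzero exit from `dig` does not raise; the caller gets whatever was printed, possibly an empty string.

Flow arrows

A hostname and record type. → run_dig → `scan_name_live`, `dig_status`, and `dig_short` all rely on this. `ptr_lookup` does not; it builds its own `dig -x` command.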

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
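
To see which command each mode produces without actually calling `dig`, the list-building part can be pulled out on its own. `build_dig_command` is a hypothetical helper for illustration; the file itself builds the list inline inside `run_dig`.

```python
def build_dig_command(name: str, rrtype: str, short: bool) -> list[str]:
    # Same argument order as run_dig: options first, then name, then record type.
    cmd = ["dig", "+time=2", "+tries=1"]
    if short:
        cmd.append("+short")
    else:
        cmd.extend(["+noall", "+comments", "+answer"])
    cmd.extend([name, rrtype])
    return cmd


short_cmd = build_dig_command("www.example.com", "A", short=True)
long_cmd = build_dig_command("www.example.com", "AAAA", short=False)
print(" ".join(short_cmd))
print(" ".join(long_cmd))
```

Passing the command as a list rather than a shell string means the hostname is handed to `dig` as a single argument and is never interpreted by a shell.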

dig_status

def dig_status(name: str, rrtype: str = "A") -> str:
    output = run_dig(name, rrtype, short=False)
    match = re.search(r"status:\s*([A-Z]+)", output)
    if match:
        return match.group(1)
    if output.strip():
        return "NOERROR"
    return "UNKNOWN"

What this block is doing

Runs `dig` in long form and pulls the response status (for example `NOERROR` or `NXDOMAIN`) out of the header comment. If there is output but no status line it assumes `NOERROR`; if there is no output at all it returns `UNKNOWN`.

Flow arrows

A hostname and record type (default `A`). → dig_status → `scan_name_live` stores the result as the observation's status.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
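
The three outcomes can be tried on canned text. `status_from_output` is a hypothetical helper that holds only the parsing half of `dig_status`, and the header line is a hand-written sample in the style of `dig` output, not a captured response.

```python
import re


def status_from_output(output: str) -> str:
    # Same decision order as dig_status, minus the call to run_dig.
    match = re.search(r"status:\s*([A-Z]+)", output)
    if match:
        return match.group(1)
    if output.strip():
        return "NOERROR"
    return "UNKNOWN"


header = ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242"
answer_only = "www.example.com. 300 IN A 192.0.2.10"

from_header = status_from_output(header)
from_answer = status_from_output(answer_only)
from_nothing = status_from_output("   \n")
print(from_header, from_answer, from_nothing)
```

The `UNKNOWN` case is what a timeout usually looks like from the caller's side when `dig` prints nothing to standard output.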

dig_short

def dig_short(name: str, rrtype: str) -> list[str]:
    output = run_dig(name, rrtype, short=True)
    return [normalize_name(line) for line in output.splitlines() if line.strip()]

What this block is doing

Runs `dig +short` and returns every non-empty output line, normalized. Depending on the record type and any aliases in the answer, the entries can be IP addresses or hostnames, so callers should not assume one or the other.

Flow arrows

A hostname and record type. → dig_short → A list of normalized answer strings for later blocks to consume.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
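
A small sketch of the line handling on canned text. `lines_from_short_output` is a hypothetical name for the list comprehension inside `dig_short`, and the sample string imitates `+short` output for a name with one alias.

```python
def normalize_name(name: str) -> str:
    return name.rstrip(".").lower()


def lines_from_short_output(output: str) -> list[str]:
    # Keep every non-blank line, normalized the same way as hostnames.
    return [normalize_name(line) for line in output.splitlines() if line.strip()]


sample = "Edge.Example.NET.\n192.0.2.10\n\n"
answers = lines_from_short_output(sample)
print(answers)
```

The alias target and the address arrive in the same list, which is why the long-form parser is used when the record type of each answer matters.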

parse_answer_section

def parse_answer_section(output: str) -> list[tuple[str, str]]:
    in_answer = False
    parsed: list[tuple[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if not in_answer or line.startswith(";;"):
            continue
        match = re.match(r"^\S+\s+\d+\s+IN\s+(\S+)\s+(.+)$", line)
        if not match:
            continue
        rrtype, rdata = match.groups()
        parsed.append((rrtype.upper(), normalize_name(rdata)))
    return parsed

What this block is doing

Parses long-form `dig` output into `(record type, value)` pairs. It skips everything before the `;; ANSWER SECTION:` marker and every `;;` comment line, then matches lines of the form `name TTL IN TYPE data`. Values are lowercased and lose their trailing dot, so CNAME targets come out in the same form that `normalize_name` gives hostnames.

Flow arrows

Long-form `dig` output from `run_dig`. → parse_answer_section → `scan_name_live` reads the pairs to build the CNAME chain and the address lists.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
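
Running the parser on a canned response shows what comes out. The sample text is hand-written in the style of `dig +noall +comments +answer` output with illustrative names; both function bodies are copied from this page.

```python
import re


def normalize_name(name: str) -> str:
    return name.rstrip(".").lower()


def parse_answer_section(output: str) -> list[tuple[str, str]]:
    in_answer = False
    parsed: list[tuple[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if not in_answer or line.startswith(";;"):
            continue
        match = re.match(r"^\S+\s+\d+\s+IN\s+(\S+)\s+(.+)$", line)
        if not match:
            continue
        rrtype, rdata = match.groups()
        parsed.append((rrtype.upper(), normalize_name(rdata)))
    return parsed


SAMPLE = """\
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.example.com.   300  IN  CNAME  Edge.Example.NET.
edge.example.net.  60   IN  A      192.0.2.10
"""

pairs = parse_answer_section(SAMPLE)
print(pairs)
```

The owner name and TTL are discarded; only the record type and its data survive, which is all the later chain-building step needs.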

is_ip_address

def is_ip_address(value: str) -> bool:
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

What this block is doing

Returns `True` when the string parses as an IPv4 or IPv6 address and `False` otherwise. It exists so that a stray hostname or malformed value in an answer is never counted as an address.

Flow arrows

One answer value. → is_ip_address → `scan_name_live` uses it to filter the A and AAAA records.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
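
A short check with made-up values, including one that looks like an address but is out of range. The function body is copied from the block above.

```python
import ipaddress


def is_ip_address(value: str) -> bool:
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


values = ["192.0.2.10", "2001:db8::1", "edge.example.net", "999.0.2.10", ""]
results = [is_ip_address(value) for value in values]
print(dict(zip(values, results)))
```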

classify_observation

def classify_observation(chain: list[str], terminal_status: str, a_records: list[str], aaaa_records: list[str]) -> str:
    has_addresses = bool(a_records or aaaa_records)
    if chain and has_addresses:
        return "cname_to_address"
    if chain and not has_addresses:
        return "dangling_cname"
    if has_addresses:
        return "direct_address"
    if terminal_status == "NXDOMAIN":
        return "nxdomain"
    if terminal_status == "NOERROR":
        return "no_data"
    return "other"

What this block is doing

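Applies six rules in order and returns the first label that fits: `cname_to_address`, `dangling_cname`, `direct_address`, `nxdomain`, `no_data`, or `other`. Order matters: a name with a CNAME chain but no addresses is labelled `dangling_cname` even when the status is `NXDOMAIN`.

Flow arrows

The CNAME chain, terminal status, and address lists. → classify_observation → `scan_name_live` stores the label, and `infer_stack_signature` branches on it.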

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
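
One input per label makes the rule order concrete. The inputs are illustrative; the function body is copied from the block above.

```python
def classify_observation(chain: list[str], terminal_status: str, a_records: list[str], aaaa_records: list[str]) -> str:
    has_addresses = bool(a_records or aaaa_records)
    if chain and has_addresses:
        return "cname_to_address"
    if chain and not has_addresses:
        return "dangling_cname"
    if has_addresses:
        return "direct_address"
    if terminal_status == "NXDOMAIN":
        return "nxdomain"
    if terminal_status == "NOERROR":
        return "no_data"
    return "other"


# (chain, terminal_status, a_records, aaaa_records)
cases = [
    (["edge.example.net"], "NOERROR", ["192.0.2.10"], []),
    (["gone.example.net"], "NXDOMAIN", [], []),
    ([], "NOERROR", [], ["2001:db8::1"]),
    ([], "NXDOMAIN", [], []),
    ([], "NOERROR", [], []),
    ([], "SERVFAIL", [], []),
]
labels = [classify_observation(*case) for case in cases]
print(labels)
```

The second case is the one worth remembering: the alias exists but leads nowhere, so the chain check wins before the `NXDOMAIN` status is ever looked at.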

infer_provider_hints

def infer_provider_hints(observation: DnsObservation) -> list[str]:
    text = " ".join(
        [
            observation.original_name,
            *observation.cname_chain,
            observation.terminal_name,
            *observation.ptr_records,
        ]
    ).lower()
    hints: list[str] = []
    if "campaign.adobe.com" in text:
        hints.append("Adobe Campaign")
    if "cloudfront.net" in text:
        hints.append("AWS CloudFront")
    if "elb.amazonaws.com" in text or "compute.amazonaws.com" in text:
        hints.append("AWS")
    if "apigee.net" in text or "googleusercontent.com" in text:
        hints.append("Google Apigee")
    if "pegacloud.net" in text or ".pega.net" in text:
        hints.append("Pega Cloud")
    if "useinfinite.io" in text:
        hints.append("Infinite / agency alias")
    if any(ip.startswith("13.107.") for ip in observation.a_records) or any(ip.startswith("2620:1ec:") for ip in observation.aaaa_records):
        hints.append("Microsoft Edge")
    if not hints:
        hints.append("Unclassified")
    return hints

What this block is doing

Reads the raw DNS trail and pulls out likely platform or vendor clues.

Flow arrows

One normalized DNS observation. → infer_provider_hints → `infer_stack_signature` and the report layers use the hints it produces.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
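
The matching is plain substring search over every name in the trail. Below is a reduced, table-driven sketch of that idea; it is not the file's implementation. `RULES` and `hints_from_names` are hypothetical names, only three of the file's rules are reproduced, and the hostnames are made up.

```python
# (substrings to look for, hint to emit); a subset of the rules in the file.
RULES = [
    (("campaign.adobe.com",), "Adobe Campaign"),
    (("cloudfront.net",), "AWS CloudFront"),
    (("elb.amazonaws.com", "compute.amazonaws.com"), "AWS"),
]


def hints_from_names(names: list[str]) -> list[str]:
    # Join every name into one lowercase string and test each rule against it.
    text = " ".join(names).lower()
    hints = [hint for needles, hint in RULES if any(needle in text for needle in needles)]
    return hints or ["Unclassified"]


trail = ["news.example.com", "example.campaign.adobe.com", "d111111abcdef8.cloudfront.net"]
matched = hints_from_names(trail)
unmatched = hints_from_names(["www.example.com"])
lookalike = hints_from_names(["notcloudfront.net.example.org"])
print(matched, unmatched, lookalike)
```

The last case shows the limit of substring matching: a name that merely contains `cloudfront.net` also produces the hint, so the output is a clue to follow up, not proof of where a name is hosted.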

infer_stack_signature

def infer_stack_signature(observation: DnsObservation) -> str:
    hints = infer_provider_hints(observation)
    if observation.classification == "nxdomain":
        return "No public DNS (NXDOMAIN)"
    if observation.classification == "no_data":
        return "No public address data"
    if "Adobe Campaign" in hints and "AWS CloudFront" in hints:
        return "Adobe Campaign -> AWS CloudFront"
    if "Adobe Campaign" in hints and "AWS" in hints:
        return "Adobe Campaign -> AWS ALB"
    if "Adobe Campaign" in hints and observation.a_records:
        return "Adobe Campaign direct IP"
    if "AWS CloudFront" in hints:
        return "AWS CloudFront"
    if "Google Apigee" in hints:
        return "Google Apigee"
    if "Pega Cloud" in hints and "AWS" in hints:
        return "Pega Cloud -> AWS ALB"
    if "Infinite / agency alias" in hints and observation.classification == "dangling_cname":
        return "Dangling agency alias"
    if "Microsoft Edge" in hints:
        return "Direct Microsoft edge"
    if "AWS" in hints:
        return "Direct AWS"
    if observation.classification == "direct_address":
        return "Direct address (provider unclear)"
    if observation.classification == "cname_to_address":
        return "CNAME to address (provider unclear)"
    return hints[0]

What this block is doing

Collapses several low-level DNS clues into one human-readable delivery label.

Flow arrows

One DNS observation plus provider clues. → infer_stack_signature → `ct_master_report` uses the resulting label in naming and DNS chapters.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
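
The function is a first-match-wins ladder, so the order of the checks decides the label. The sketch below keeps only a few rungs to show that effect. `label_for` is a hypothetical reduced version that takes the classification and hints directly; the omitted vendor rules behave the same way.

```python
def label_for(classification: str, hints: list[str]) -> str:
    # Reduced copy of the ladder: DNS-level outcomes first, then combinations,
    # then single providers, then generic fallbacks.
    if classification == "nxdomain":
        return "No public DNS (NXDOMAIN)"
    if classification == "no_data":
        return "No public address data"
    if "Adobe Campaign" in hints and "AWS CloudFront" in hints:
        return "Adobe Campaign -> AWS CloudFront"
    if "AWS CloudFront" in hints:
        return "AWS CloudFront"
    if "AWS" in hints:
        return "Direct AWS"
    if classification == "direct_address":
        return "Direct address (provider unclear)"
    if classification == "cname_to_address":
        return "CNAME to address (provider unclear)"
    return hints[0]


combined = label_for("cname_to_address", ["Adobe Campaign", "AWS CloudFront"])
single = label_for("cname_to_address", ["AWS CloudFront"])
no_dns = label_for("nxdomain", ["AWS"])
generic = label_for("direct_address", ["Unclassified"])
fallback = label_for("dangling_cname", ["Unclassified"])
print(combined, single, no_dns, generic, fallback, sep="\n")
```

Two-provider combinations are tested before single providers, and the DNS-level outcomes come before any provider rule, which is why the `nxdomain` input keeps its label even though an `AWS` hint is present.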

scan_name_live

def scan_name_live(name: str) -> DnsObservation:
    name = normalize_name(name)
    a_output = run_dig(name, "A", short=False)
    aaaa_output = run_dig(name, "AAAA", short=False)
    original_status = dig_status(name, "A")
    a_answers = parse_answer_section(a_output)
    aaaa_answers = parse_answer_section(aaaa_output)
    chain: list[str] = []
    for rrtype, rdata in a_answers + aaaa_answers:
        if rrtype == "CNAME" and rdata not in chain:
            chain.append(rdata)
    a_records = sorted({rdata for rrtype, rdata in a_answers if rrtype == "A" and is_ip_address(rdata)})
    aaaa_records = sorted({rdata for rrtype, rdata in aaaa_answers if rrtype == "AAAA" and is_ip_address(rdata)})
    terminal_name = chain[-1] if chain else name
    terminal_status = original_status
    observation = DnsObservation(
        original_name=name,
        original_status=original_status,
        cname_chain=chain,
        terminal_name=terminal_name,
        terminal_status=terminal_status,
        a_records=a_records,
        aaaa_records=aaaa_records,
        ptr_records=[],
        classification=classify_observation(chain, terminal_status, a_records, aaaa_records),
        stack_signature="",
        provider_hints=[],
    )
    observation.provider_hints = infer_provider_hints(observation)
    observation.stack_signature = infer_stack_signature(observation)
    return observation

What this block is doing

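Runs the live DNS walk for one hostname: one `dig` query for A, one for AAAA, and a third (through `dig_status`) to read the response status. CNAME targets from both answers form the alias chain, and the last one becomes `terminal_name`. In this version `terminal_status` is simply a copy of `original_status`, the terminal name is not queried separately, and `ptr_records` is left empty.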

Flow arrows

One DNS name from a SAN entry. → scan_name_live → `scan_name_cached` returns this result shape to higher-level analytics.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
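
The middle of the function, between parsing and building the observation, can be tried without any DNS traffic. `summarize_answers` is a hypothetical helper that reproduces only that step on already-parsed pairs; it leaves out the `is_ip_address` filter for brevity, and the records are illustrative.

```python
def summarize_answers(
    name: str,
    a_answers: list[tuple[str, str]],
    aaaa_answers: list[tuple[str, str]],
) -> dict[str, object]:
    # Collect CNAME targets in first-seen order, without duplicates.
    chain: list[str] = []
    for rrtype, rdata in a_answers + aaaa_answers:
        if rrtype == "CNAME" and rdata not in chain:
            chain.append(rdata)
    # De-duplicate and sort the addresses (sorted as text, not numerically).
    a_records = sorted({rdata for rrtype, rdata in a_answers if rrtype == "A"})
    aaaa_records = sorted({rdata for rrtype, rdata in aaaa_answers if rrtype == "AAAA"})
    return {
        "cname_chain": chain,
        "terminal_name": chain[-1] if chain else name,
        "a_records": a_records,
        "aaaa_records": aaaa_records,
    }


a_answers = [("CNAME", "edge.example.net"), ("A", "192.0.2.20"), ("A", "192.0.2.10")]
aaaa_answers = [("CNAME", "edge.example.net"), ("AAAA", "2001:db8::1")]
summary = summarize_answers("www.example.com", a_answers, aaaa_answers)
print(summary)
```

The same alias appears in both the A and the AAAA answer, so the `not in chain` check is what keeps it from being listed twice.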

scan_name_cached

def scan_name_cached(name: str, cache_dir: Path, ttl_seconds: int) -> DnsObservation:
    key = cache_key(name)
    cached = load_json_cache(cache_dir, key, ttl_seconds)
    if cached is not None:
        payload = dict(cached)
        payload.pop("cached_at", None)
        return DnsObservation(**payload)
    observation = scan_name_live(name)
    store_json_cache(cache_dir, key, asdict(observation))
    return observation

What this block is doing

Reuses a recent DNS result if possible, otherwise performs the live scan.

Flow arrows

A DNS name plus cache settings. → scan_name_cached → `ct_master_report.enrich_dns` uses this for every SAN name in the current corpus.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
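
The pattern here is cache-aside: look in the cache, fall back to the expensive call, then store the result. A minimal sketch with the live scan replaced by a fake so it runs offline. `MiniObservation`, `scan_cached`, and `fake_scan` are hypothetical names, the TTL check is omitted, and the filename is simplified compared with `cache_key`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class MiniObservation:
    # Cut-down stand-in for DnsObservation.
    original_name: str
    a_records: list[str]


def scan_cached(name: str, cache_dir: Path, scan_live: Callable[[str], MiniObservation]) -> MiniObservation:
    path = cache_dir / f"{name}.json"
    if path.exists():
        payload = json.loads(path.read_text(encoding="utf-8"))
        payload.pop("cached_at", None)  # not a dataclass field, so drop it if present
        return MiniObservation(**payload)
    observation = scan_live(name)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(observation)), encoding="utf-8")
    return observation


calls: list[str] = []


def fake_scan(name: str) -> MiniObservation:
    # Records each call so the test can see how often the "live" path ran.
    calls.append(name)
    return MiniObservation(original_name=name, a_records=["192.0.2.10"])


with tempfile.TemporaryDirectory() as tmp:
    first = scan_cached("www.example.com", Path(tmp), fake_scan)
    second = scan_cached("www.example.com", Path(tmp), fake_scan)

print(len(calls), first == second)
```

The second call is served from disk, so the fake scanner runs only once, and the rebuilt object compares equal to the original because dataclasses compare field by field.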

ptr_lookup

def ptr_lookup(ip: str, cache_dir: Path, ttl_seconds: int) -> list[str]:
    key = cache_key(f"ptr-{ip}")
    cached = load_json_cache(cache_dir, key, ttl_seconds)
    if cached is not None:
        return list(cached.get("answers", []))
    output = subprocess.run(
        ["dig", "+time=2", "+tries=1", "+short", "-x", ip, "PTR"],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    answers = [normalize_name(line) for line in output.splitlines() if line.strip()]
    store_json_cache(cache_dir, key, {"answers": answers})
    return answers

What this block is doing

Reverse-DNS lookup for one IP address, with caching. It returns the cached `answers` list when a fresh cache entry exists; otherwise it runs `dig -x` directly (not through `run_dig`), normalizes each PTR name, stores the list, and returns it.

Flow arrows

One IP address plus cache settings. → ptr_lookup → A list of PTR hostnames that later blocks or the next analytical stage can use as provider clues.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
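
`dig -x` accepts a plain IP address and builds the reverse-lookup name itself. To see what that name looks like, the standard library can compute it; this is a side illustration and not something the file does. `reverse_name` is a hypothetical helper and the addresses are reserved documentation values.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    # reverse_pointer gives the in-addr.arpa / ip6.arpa name a PTR query asks for.
    return ipaddress.ip_address(ip).reverse_pointer


v4 = reverse_name("192.0.2.10")
v6 = reverse_name("2001:db8::1")
print(v4)
print(v6)
```

For IPv4 the octets are reversed under `in-addr.arpa`; for IPv6 every hex digit of the fully expanded address is reversed under `ip6.arpa`.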

provider_explanations

def provider_explanations() -> dict[str, str]:
    return {
        "Adobe Campaign": "A marketing and communication platform often used to send customer messages, email journeys, and campaign traffic. In DNS terms, it can sit in front of cloud infrastructure rather than hosting the final application by itself.",
        "AWS": "Amazon Web Services, a large public cloud platform. In this report it usually means the endpoint ultimately lands on Amazon-hosted compute or load-balancing infrastructure.",
        "AWS ALB": "AWS Application Load Balancer. A traffic-distribution front door that sends incoming web requests to one or more backend services.",
        "AWS CloudFront": "Amazon's global content-delivery and edge network. It is often used to front websites, APIs, and static assets close to users.",
        "Google Apigee": "An API gateway and API-management layer. If a hostname lands here, it usually means the public endpoint is being governed as an API product rather than being exposed directly from an application server.",
        "Pega Cloud": "A managed hosting platform for Pega applications and workflow systems. It often fronts case-management or process-heavy applications.",
        "Microsoft Edge": "Microsoft-operated edge infrastructure. In DNS this usually means the public name lands on Microsoft's front-door network rather than directly on a private application host.",
        "Infinite / agency alias": "A third-party aliasing pattern typically used by an agency or service intermediary. It points traffic onward to the actual delivery platform.",
        "CNAME": "A DNS alias record. It says one hostname is really another hostname, rather than directly mapping to an IP address.",
        "A record": "A DNS record that maps a hostname to an IPv4 address.",
        "AAAA record": "A DNS record that maps a hostname to an IPv6 address.",
        "PTR record": "A reverse-DNS record. It maps an IP address back to a hostname and is useful as a provider clue, not as proof of ownership.",
        "NXDOMAIN": "A DNS response meaning the name does not exist publicly.",
    }

What this block is doing

Supplies the glossary text used later in the reports.

Flow arrows

The delivery labels used by the report. → provider_explanations → The monograph glossary uses these explanations directly.

How to think about it

Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
