ct_scan.py
Source file: ct_scan.py
Core Certificate Transparency scanner. This file talks to crt.sh's public database, downloads the real certificate bytes, verifies that they are real leaf certificates, groups them into readable families, and can render the full inventory appendix.
Main flow in one line: domains file -> raw CT query -> parsed leaf certificates -> CN families -> issuer trust -> appendix reports
How to read this page:
- left side: the actual source code block
- right side: a plain-English explanation for a beginner
- read from top to bottom because later blocks depend on earlier ones
Module setup
What this block is doing: Imports, SQL, constants, and shared data shapes for the core CT scanner.
Flow arrows: Nothing yet; this is the starting point. → Module setup → `connect`, `query_domain`, `build_hits`, and the report renderers use these shared definitions.
How to think about it: Treat this block as one small station in a pipeline. Ask: what comes in here, what gets changed here, and what comes out for the next block?
DatabaseRecord
What this block is doing: A raw row as it comes back from the crt.sh database before local cleanup.
Flow arrows: Earlier blocks or operator input feed this block. → DatabaseRecord → Later blocks in the same file or in the next analytical stage consume its output.
CertificateHit
What this block is doing: The cleaned working object used by the rest of the analytics pipeline.
Flow arrows: Earlier blocks or operator input feed this block. → CertificateHit → Later blocks in the same file or in the next analytical stage consume its output.
VerificationStats
What this block is doing: A tiny running counter that proves how many rows were kept or rejected.
Flow arrows: Earlier blocks or operator input feed this block. → VerificationStats → Later blocks in the same file or in the next analytical stage consume its output.
CertificateGroup
What this block is doing: One readable family of related certificates after the grouping logic runs.
Flow arrows: Earlier blocks or operator input feed this block. → CertificateGroup → Later blocks in the same file or in the next analytical stage consume its output.
ScanStats
What this block is doing: Top-level summary numbers used in reports.
Flow arrows: Earlier blocks or operator input feed this block. → ScanStats → Later blocks in the same file or in the next analytical stage consume its output.
IssuerTrustInfo
What this block is doing: Stores the public-trust picture for one issuer family.
Flow arrows: Earlier blocks or operator input feed this block. → IssuerTrustInfo → Later blocks in the same file or in the next analytical stage consume its output.
load_domains
What this block is doing: This block loads data from disk, cache, or an earlier stage so later code can work with it.
Flow arrows: Operator's local config file. → load_domains → `query_domain` and the higher-level loaders use this cleaned domain list.
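As a rough illustration of what "a cleaned domain list" could mean, here is a minimal sketch. The file format (one domain per line, `#` comments) and the name `load_domains_sketch` are assumptions for this example; the real `load_domains` in `ct_scan.py` may accept a different format.

```python
from pathlib import Path


def load_domains_sketch(path):
    """Read one domain per line and return a cleaned, de-duplicated list.

    Hypothetical stand-in for load_domains; the real file format is not
    shown on this page, so '#' comments and blank lines are assumptions.
    """
    domains = []
    seen = set()
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        # Drop trailing comments, surrounding whitespace, and a trailing dot.
        line = raw_line.split("#", 1)[0].strip().lower().rstrip(".")
        if line and line not in seen:
            seen.add(line)
            domains.append(line)
    return domains
```

Keeping the first-seen order makes later reports list domains in the order the operator wrote them.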
escape_like
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → escape_like → Later blocks in the same file or in the next analytical stage consume its output.
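The page does not show what `escape_like` does, but the name suggests escaping SQL `LIKE` wildcards so that a search term is matched literally. A minimal sketch under that assumption (the function name and the backslash escape character are choices made for this example):

```python
def escape_like_sketch(term, escape_char="\\"):
    """Escape LIKE wildcards so '%' and '_' in the term match literally.

    Hypothetical stand-in for escape_like. The escape character itself is
    doubled first so that it cannot be mistaken for an escape later.
    """
    return (
        term.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )
```

Without this step, a domain containing an underscore would also match any single character in that position, which could pull in unrelated rows.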
utc_iso
What this block is doing: This is a small helper that keeps the larger analytical code cleaner and easier to reuse.
Flow arrows: Earlier blocks or operator input feed this block. → utc_iso → Later blocks in the same file or in the next analytical stage consume its output.
serialize_datetime
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → serialize_datetime → Later blocks in the same file or in the next analytical stage consume its output.
parse_datetime
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → parse_datetime → Later blocks in the same file or in the next analytical stage consume its output.
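The three date helpers (`utc_iso`, `serialize_datetime`, `parse_datetime`) are not shown here. One common way to make timestamps safe to store and read back is a fixed UTC text format, sketched below. The format string and the `_sketch` names are assumptions, not the file's actual choices.

```python
from datetime import datetime, timezone

ISO_FORMAT_SKETCH = "%Y-%m-%dT%H:%M:%SZ"  # assumed format, not confirmed by the source


def utc_iso_sketch(value):
    """Format a datetime as UTC text; naive values are assumed to be UTC already."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime(ISO_FORMAT_SKETCH)


def parse_datetime_sketch(text):
    """Parse the text produced by utc_iso_sketch back into an aware datetime."""
    return datetime.strptime(text, ISO_FORMAT_SKETCH).replace(tzinfo=timezone.utc)
```

Converting everything to UTC before writing means two certificates can be compared by date without worrying about which timezone each value came from.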
cache_path
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → cache_path → Later blocks in the same file or in the next analytical stage consume its output.
record_to_cache_payload
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → record_to_cache_payload → Later blocks in the same file or in the next analytical stage consume its output.
record_from_cache_payload
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → record_from_cache_payload → Later blocks in the same file or in the next analytical stage consume its output.
load_cached_records
What this block is doing: This block loads data from disk, cache, or an earlier stage so later code can work with it.
Flow arrows: Earlier blocks or operator input feed this block. → load_cached_records → Later blocks in the same file or in the next analytical stage consume its output.
store_cached_records
What this block is doing: This block saves an intermediate result so the next run can reuse it instead of recomputing everything.
Flow arrows: Earlier blocks or operator input feed this block. → store_cached_records → Later blocks in the same file or in the next analytical stage consume its output.
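The cache helpers (`cache_path`, `load_cached_records`, `store_cached_records`) work as a set: pick a file name, write a result, and read it back on the next run. The sketch below shows one way this can be done with JSON files. The file naming, the payload keys, and the age check are all assumptions for illustration, and records are plain dictionaries here rather than `DatabaseRecord` objects.

```python
import hashlib
import json
import time
from pathlib import Path


def cache_path_sketch(cache_dir, domain):
    """Derive a stable, filesystem-safe file name from the search term."""
    digest = hashlib.sha256(domain.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


def store_cached_records_sketch(cache_dir, domain, records):
    """Write records to a temp file first, then rename, so readers never see half a file."""
    path = cache_path_sketch(cache_dir, domain)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"domain": domain, "stored_at": time.time(), "records": records}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)


def load_cached_records_sketch(cache_dir, domain, max_age_seconds):
    """Return cached records, or None when the cache is missing, corrupt, or too old."""
    path = cache_path_sketch(cache_dir, domain)
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    if payload.get("domain") != domain:
        return None
    if time.time() - payload.get("stored_at", 0) > max_age_seconds:
        return None
    return payload["records"]
```

Returning `None` for every failure case lets the caller treat "no usable cache" as one simple branch that falls back to a live query.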
connect
What this block is doing: Opens the direct guest PostgreSQL connection to crt.sh's certwatch backend.
Flow arrows: Called by query functions that need live crt.sh data. → connect → `query_domain`, `query_raw_match_count`, and issuer-trust lookups all depend on this connection.
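For orientation, a guest connection of this kind might look like the sketch below. It assumes the `psycopg2` driver (the page does not say which driver `ct_scan.py` uses) and crt.sh's publicly documented guest access on port 5432; the timeout value and the `_sketch` names are illustrative.

```python
# Publicly documented guest access for crt.sh; no password is required.
CRT_SH_PARAMS_SKETCH = {
    "host": "crt.sh",
    "port": 5432,
    "dbname": "certwatch",
    "user": "guest",
}


def connect_sketch(connect_timeout=10):
    """Open a read-only style guest session (assumes psycopg2 is installed)."""
    import psycopg2  # third-party driver; an assumption, not confirmed by the source

    conn = psycopg2.connect(connect_timeout=connect_timeout, **CRT_SH_PARAMS_SKETCH)
    # Autocommit avoids leaving a transaction open on a shared public server.
    conn.autocommit = True
    return conn
```

The driver import sits inside the function so that the rest of a script can still be loaded on a machine where the driver is missing.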
query_domain
What this block is doing: Runs the main certificate query for one search term and refuses silent undercounting.
Flow arrows: A domain plus the safety cap and retry settings. → query_domain → `build_hits` receives the raw records returned here.
query_raw_match_count
What this block is doing: Counts how many raw hits exist before the capped query runs.
Flow arrows: A domain string from the local config. → query_raw_match_count → `query_domain` uses this count to refuse silent undercounting.
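The "refuse silent undercounting" idea can be sketched without a database: count first, and stop loudly if the count is larger than the cap, instead of quietly returning a truncated list. The function and exception names below are hypothetical, and the two callables stand in for the real `query_raw_match_count` and the capped query.

```python
class TooManyMatchesSketch(RuntimeError):
    """Raised when the raw match count exceeds the safety cap."""


def query_with_cap_sketch(domain, count_matches, fetch_rows, max_rows):
    """Run the capped query only when it cannot silently drop rows.

    count_matches(domain) -> int and fetch_rows(domain, limit) -> list are
    placeholders for the real database calls.
    """
    total = count_matches(domain)
    if total > max_rows:
        raise TooManyMatchesSketch(
            f"{domain}: {total} raw matches exceed the cap of {max_rows}; "
            "raise the cap or narrow the search term"
        )
    return fetch_rows(domain, max_rows)
```

A usage example with stub callables:

```python
rows = query_with_cap_sketch(
    "example.com",
    count_matches=lambda d: 3,
    fetch_rows=lambda d, limit: ["row1", "row2", "row3"][:limit],
    max_rows=10,
)
```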
row_to_record
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → row_to_record → Later blocks in the same file or in the next analytical stage consume its output.
extract_san_entries
What this block is doing: This block pulls one specific piece of information out of a larger object.
Flow arrows: Earlier blocks or operator input feed this block. → extract_san_entries → Later blocks in the same file or in the next analytical stage consume its output.
format_general_name
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → format_general_name → Later blocks in the same file or in the next analytical stage consume its output.
extract_common_name
What this block is doing: This block pulls one specific piece of information out of a larger object.
Flow arrows: Earlier blocks or operator input feed this block. → extract_common_name → Later blocks in the same file or in the next analytical stage consume its output.
has_precertificate_poison
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → has_precertificate_poison → Later blocks in the same file or in the next analytical stage consume its output.
is_leaf_certificate
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → is_leaf_certificate → Later blocks in the same file or in the next analytical stage consume its output.
revocation_fields
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → revocation_fields → Later blocks in the same file or in the next analytical stage consume its output.
revocation_priority
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → revocation_priority → Later blocks in the same file or in the next analytical stage consume its output.
build_hits
What this block is doing: Parses certificate bytes, rejects bad objects, and merges duplicate views of the same cert.
Flow arrows: Raw `DatabaseRecord` rows from crt.sh. → build_hits → `build_groups`, purpose analysis, DNS analysis, and CAA analysis all consume these cleaned hits.
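The "reject, then merge duplicates" shape can be shown in a few lines. This sketch does not parse X.509 at all: it treats each row as `(row_id, der_bytes)`, uses a caller-supplied check in place of real certificate validation, and merges rows by SHA-256 fingerprint. How the real `build_hits` identifies "the same cert" is not shown on this page, so the fingerprint key is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class HitSketch:
    """Simplified stand-in for CertificateHit."""
    fingerprint: str
    der: bytes
    source_ids: list = field(default_factory=list)


def build_hits_sketch(rows, is_acceptable=lambda der: True):
    """Return (hits, stats): one hit per distinct certificate, plus keep/reject counts."""
    hits = {}
    kept = rejected = 0
    for row_id, der in rows:
        # Empty bytes or a failed check means the row is counted and skipped.
        if not der or not is_acceptable(der):
            rejected += 1
            continue
        kept += 1
        fingerprint = hashlib.sha256(der).hexdigest()
        hit = hits.setdefault(fingerprint, HitSketch(fingerprint, der))
        hit.source_ids.append(row_id)  # remember every row that showed this cert
    return list(hits.values()), {"kept": kept, "rejected": rejected}
```

Counting rejected rows, rather than dropping them quietly, is what lets a report state how many rows were kept and how many were thrown away.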
canonicalize_subject_cn
What this block is doing: This block makes values consistent so matching and grouping do not get confused by superficial differences.
Flow arrows: Earlier blocks or operator input feed this block. → canonicalize_subject_cn → Later blocks in the same file or in the next analytical stage consume its output.
normalize_counter_pattern
What this block is doing: This block makes values consistent so matching and grouping do not get confused by superficial differences.
Flow arrows: Earlier blocks or operator input feed this block. → normalize_counter_pattern → Later blocks in the same file or in the next analytical stage consume its output.
UnionFind
What this block is doing: As its name indicates, this class is a union-find (disjoint-set) structure: it tracks which items belong to the same set and merges sets on demand, rather than simply holding one piece of data.
Flow arrows: Earlier blocks or operator input feed this block. → UnionFind → Later blocks in the same file or in the next analytical stage consume its output.
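For readers new to the idea, a union-find (disjoint-set) structure answers one question quickly: "are these two items in the same group?" The sketch below is a generic textbook version with path compression and union by size. It is not the class from `ct_scan.py`, whose methods are not shown on this page.

```python
class UnionFindSketch:
    """Generic disjoint-set structure; items are added lazily on first use."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, item):
        """Return the representative of the set that contains item."""
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
            return item
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        """Merge the sets containing a and b; the larger set absorbs the smaller."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a

    def groups(self):
        """Return the current sets as lists of items."""
        result = {}
        for item in self.parent:
            result.setdefault(self.find(item), []).append(item)
        return list(result.values())
```

The useful property for grouping work is that links are transitive: if A is joined to B and B to C, then A and C land in the same set without anyone comparing them directly.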
build_groups
What this block is doing: Turns a flat certificate list into CN-based families such as exact endpoints or numbered rails.
Flow arrows: The flat list of `CertificateHit` objects. → build_groups → The report builders use these groups to turn raw certificate clutter into readable families.
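To make "numbered rails" concrete: names such as `node01.example.com` and `node02.example.com` differ only by a counter, so replacing every digit run with a placeholder gives them the same key. The sketch below groups common names that way. The `#` placeholder and the function names are assumptions, and the real `build_groups` may apply additional rules.

```python
import re
from collections import defaultdict


def normalize_counter_pattern_sketch(name):
    """Replace each run of digits with '#' so numbered siblings share one key."""
    return re.sub(r"\d+", "#", name.lower())


def group_by_counter_pattern_sketch(common_names):
    """Map each normalized pattern to the common names that produced it."""
    families = defaultdict(list)
    for cn in common_names:
        families[normalize_counter_pattern_sketch(cn)].append(cn)
    return dict(families)
```

A name with no digits normalizes to itself, so an exact endpoint simply becomes a family of one.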
describe_group_basis
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → describe_group_basis → Later blocks in the same file or in the next analytical stage consume its output.
primary_issuer_name
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → primary_issuer_name → Later blocks in the same file or in the next analytical stage consume its output.
query_issuer_trust
What this block is doing: Checks which issuers are currently trusted for public TLS in the major WebPKI contexts.
Flow arrows: The cleaned current certificate hits. → query_issuer_trust → Report builders use this trust view in the certificate chapters and appendix tables.
status_marker
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → status_marker → Later blocks in the same file or in the next analytical stage consume its output.
one_line_revocation
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → one_line_revocation → Later blocks in the same file or in the next analytical stage consume its output.
san_tail_split
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → san_tail_split → Later blocks in the same file or in the next analytical stage consume its output.
build_san_tree_lines
What this block is doing: This block constructs a richer higher-level result from simpler inputs.
Flow arrows: Earlier blocks or operator input feed this block. → build_san_tree_lines → Later blocks in the same file or in the next analytical stage consume its output.
build_san_tree_units_with_style
What this block is doing: This block constructs a richer higher-level result from simpler inputs.
Flow arrows: Earlier blocks or operator input feed this block. → build_san_tree_units_with_style → Later blocks in the same file or in the next analytical stage consume its output.
build_san_tree_chunks_with_style
What this block is doing: This block constructs a richer higher-level result from simpler inputs.
Flow arrows: Earlier blocks or operator input feed this block. → build_san_tree_chunks_with_style → Later blocks in the same file or in the next analytical stage consume its output.
build_san_tree_lines_with_style
What this block is doing: This block constructs a richer higher-level result from simpler inputs.
Flow arrows: Earlier blocks or operator input feed this block. → build_san_tree_lines_with_style → Later blocks in the same file or in the next analytical stage consume its output.
group_hits_by_issuer
What this block is doing: This block clusters related items together so later code can analyze them as families instead of as isolated rows.
Flow arrows: Earlier blocks or operator input feed this block. → group_hits_by_issuer → Later blocks in the same file or in the next analytical stage consume its output.
latex_escape
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → latex_escape → Later blocks in the same file or in the next analytical stage consume its output.
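Certificate names often contain characters that LaTeX treats as commands (for example `_` or `&`), so text has to be escaped before it goes into a `.tex` file. The sketch below shows the usual approach; the exact replacement table used by the real `latex_escape` is not shown here, so this one is an assumption.

```python
# Characters with special meaning in LaTeX and a literal replacement for each.
_LATEX_SPECIALS_SKETCH = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape_sketch(text):
    """Escape LaTeX special characters in one pass over the input."""
    return "".join(_LATEX_SPECIALS_SKETCH.get(ch, ch) for ch in text)
```

Working character by character in a single pass avoids a common bug with chained `replace` calls, where the braces added by one replacement get escaped again by the next.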
summarize_san_patterns
What this block is doing: This block compresses many detailed rows into a smaller, easier-to-read summary.
Flow arrows: Earlier blocks or operator input feed this block. → summarize_san_patterns → Later blocks in the same file or in the next analytical stage consume its output.
latex_status_badge
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → latex_status_badge → Later blocks in the same file or in the next analytical stage consume its output.
latex_webpki_badge
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → latex_webpki_badge → Later blocks in the same file or in the next analytical stage consume its output.
render_markdown_report
What this block is doing: Writes the raw inventory appendix as readable Markdown.
Flow arrows: Current hits, groups, and trust data. → render_markdown_report → Produces the Markdown inventory appendix.
render_latex_report
What this block is doing: Writes the raw inventory appendix as LaTeX for PDF assembly.
Flow arrows: Current hits, groups, and trust data. → render_latex_report → Produces the LaTeX appendix source that later becomes PDF.
cleanup_latex_auxiliary_files
What this block is doing: This function is one of the building blocks inside `ct_scan.py`. It exists so the file can do one narrow job at a time instead of one giant unreadable routine.
Flow arrows: Earlier blocks or operator input feed this block. → cleanup_latex_auxiliary_files → Later blocks in the same file or in the next analytical stage consume its output.
compile_latex_to_pdf
What this block is doing: Hands LaTeX to XeLaTeX and turns it into a finished PDF file.
Flow arrows: A finished `.tex` file. → compile_latex_to_pdf → Produces the human-readable PDF artifact.
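Handing a file to XeLaTeX usually means running the `xelatex` program as a subprocess. The sketch below shows one plausible shape; the two-pass default and the `_sketch` names are assumptions, while `-interaction=nonstopmode` and `-halt-on-error` are standard engine options that stop the compiler from waiting for keyboard input.

```python
import subprocess
from pathlib import Path


def xelatex_command_sketch(tex_path):
    """Build the command line; the file name is relative to its own directory."""
    return [
        "xelatex",
        "-interaction=nonstopmode",  # never stop to ask the user a question
        "-halt-on-error",            # fail fast instead of producing a broken PDF
        Path(tex_path).name,
    ]


def compile_latex_to_pdf_sketch(tex_path, passes=2):
    """Run XeLaTeX in the .tex file's directory and return the expected PDF path.

    Two passes is an assumption; a second pass lets LaTeX resolve
    cross-references and tables of contents written during the first.
    """
    tex_path = Path(tex_path).resolve()
    for _ in range(passes):
        subprocess.run(
            xelatex_command_sketch(tex_path),
            cwd=tex_path.parent,
            check=True,           # raise CalledProcessError on a failed compile
            capture_output=True,  # keep the very chatty log off the terminal
        )
    return tex_path.with_suffix(".pdf")
```

Running inside the file's own directory keeps the auxiliary files (`.aux`, `.log`, and so on) next to the source, which is where a cleanup step would expect to find them.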
parse_args
What this block is doing: This block defines the command-line knobs for the file: input paths, cache settings, output paths, and other runtime switches.
Flow arrows: Earlier blocks or operator input feed this block. → parse_args → Later blocks in the same file or in the next analytical stage consume its output.
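To show what "command-line knobs" look like in practice, here is a small `argparse` sketch covering the four kinds of option named above. Every flag name and default value is hypothetical; the real options of `ct_scan.py` are not listed on this page.

```python
import argparse


def parse_args_sketch(argv=None):
    """Parse illustrative options; all flag names here are made up for the example."""
    parser = argparse.ArgumentParser(description="CT inventory scanner (sketch)")
    # Input path
    parser.add_argument("--domains", default="domains.txt",
                        help="file with one search domain per line")
    # Cache settings
    parser.add_argument("--cache-dir", default=".cache",
                        help="directory for cached query results")
    parser.add_argument("--cache-ttl-seconds", type=int, default=86400,
                        help="maximum age of a cache entry before it is ignored")
    parser.add_argument("--no-cache", action="store_true",
                        help="always query live data")
    # Safety cap
    parser.add_argument("--max-rows", type=int, default=10000,
                        help="refuse to run a query that would exceed this many rows")
    # Output paths
    parser.add_argument("--markdown-output", default=None,
                        help="where to write the Markdown appendix")
    parser.add_argument("--latex-output", default=None,
                        help="where to write the LaTeX appendix")
    return parser.parse_args(argv)
```

Accepting `argv` as a parameter, instead of always reading `sys.argv`, makes the parser easy to exercise from a test or an interactive session.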
main
What this block is doing: The standalone command-line entrypoint for the inventory scanner.
Flow arrows: CLI arguments from the operator. → main → Runs the whole scanner end to end.