CertTransparencySearch/README.md

262 lines
6.9 KiB
Markdown
Raw Normal View History

2026-03-29 09:40:06 +00:00
# Certificate Transparency Search
This project builds a publication-grade report set from Certificate Transparency and public DNS:
2026-03-29 09:40:06 +00:00
- it finds currently valid leaf certificates whose SAN values contain configured search terms
- it verifies locally that the certificates are real leaf certificates rather than CA certificates or precertificates
- it assesses intended usage from EKU and KeyUsage
- it scans the DNS names exposed by the SAN corpus
- it produces readable Markdown, LaTeX, and PDF outputs
The project is designed for public source control:
2026-03-29 09:40:06 +00:00
- real search terms live only in `domains.local.txt`
- generated artefacts live only in `output/`
- caches live only in `.cache/`
None of those paths should be committed.
## What You Need On A Fresh Machine
### Required software
- `git`
- `python3`
- `make`
- `dig`
- `xelatex`
### What each dependency is for
- `python3`: runs the scanners and report generators
- `make`: gives you short repeatable commands instead of long manual command lines
- `dig`: performs the live DNS scan
- `xelatex`: compiles the PDF reports
If `xelatex` is missing, the Markdown and LaTeX outputs can still be generated, but the PDF targets will fail.
## Fresh Install On Another Computer
Clone the repository from your chosen remote and enter the directory:
2026-03-29 09:40:06 +00:00
```bash
git clone <repository-url>
cd CertTransparencySearch
2026-03-29 09:40:06 +00:00
```
Create the local Python environment and install dependencies:
2026-03-29 09:40:06 +00:00
```bash
make bootstrap
```
Create the local-only search-term file:
```bash
make init-config
```
Then edit `domains.local.txt` and replace the placeholder values with the real search terms you want to scan.
## Local Search Terms
The tracked file is:
- `domains.example.txt`
The local-only file is:
- `domains.local.txt`
Rules:
- keep real search terms only in `domains.local.txt`
- do not rename that file unless you also pass `DOMAINS=...` to `make`
- do not commit it
2026-03-29 09:40:06 +00:00
## One-Command Runs
2026-03-29 09:40:06 +00:00
### Main publication
This is the publication-grade monograph with appendices:
2026-03-29 09:40:06 +00:00
```bash
make monograph
2026-03-29 09:40:06 +00:00
```
Outputs:
2026-03-29 09:40:06 +00:00
- `output/corpus/monograph.md`
- `output/corpus/monograph.tex`
- `output/corpus/monograph.pdf`
- `output/corpus/appendix-inventory.md`
- `output/corpus/appendix-inventory.tex`
- `output/corpus/appendix-inventory.pdf`
### Supporting purpose assessment
2026-03-29 09:40:06 +00:00
```bash
make purpose
2026-03-29 09:40:06 +00:00
```
Outputs:
2026-03-29 09:40:06 +00:00
- `output/corpus/certificate-purpose-assessment.md`
- `output/corpus/certificate-purpose-assessment.json`
2026-03-29 09:40:06 +00:00
### Historical lineage analysis
This report extends the analysis across current and expired certificates to study:
- repeated issuance under the same Subject CN
- Subject CN with different Subject DN over time
- Subject CN with different issuing CA or vendor over time
- Subject CN with different SAN profiles over time
- issuance bursts and step-change start dates
```bash
make lineage
```
Outputs:
- `output/corpus/certificate-lineage-report.md`
- `output/corpus/certificate-lineage-report.tex`
- `output/corpus/certificate-lineage-report.pdf`
### Shorter executive report
2026-03-29 09:40:06 +00:00
```bash
make consolidated
2026-03-29 09:40:06 +00:00
```
Outputs:
2026-03-29 09:40:06 +00:00
- `output/corpus/consolidated-corpus-report.md`
- `output/corpus/consolidated-corpus-report.tex`
- `output/corpus/consolidated-corpus-report.pdf`
2026-03-29 09:58:30 +00:00
### Full operator run
2026-03-29 09:58:30 +00:00
This creates the local config if missing, then runs the purpose assessment, historical lineage analysis, and the full monograph:
2026-03-29 09:58:30 +00:00
```bash
make all
```
2026-03-29 09:58:30 +00:00
## Reproducibility And Run Behaviour
The default `Makefile` values are:
- `DOMAINS=domains.local.txt`
- `CACHE_TTL=0`
- `DNS_CACHE_TTL=86400`
- `MAX_CANDIDATES=10000`
This means:
- Certificate Transparency is refreshed live on every normal run.
- DNS results are reused for up to one day unless you override the DNS cache TTL.
- The query cap is high enough for the current corpus and the scanner will refuse to run if the live raw match count exceeds the cap.
If you want to override values:
```bash
make monograph CACHE_TTL=86400 DNS_CACHE_TTL=86400
```
Or:
```bash
make monograph DOMAINS=/path/to/other.local.txt
```
## Manual Commands
If you do not want to use `make`, the equivalent commands are:
### Inventory appendix source
```bash
.venv/bin/python ct_scan.py \
--domains-file domains.local.txt \
--cache-ttl-seconds 0 \
--max-candidates-per-domain 10000 \
--output output/corpus/current-valid-certificates.md \
--latex-output output/corpus/current-valid-certificates.tex \
--pdf-output output/corpus/current-valid-certificates.pdf
```
### Purpose assessment
```bash
.venv/bin/python ct_usage_assessment.py \
--domains-file domains.local.txt \
--cache-ttl-seconds 0 \
--max-candidates 10000 \
--markdown-output output/corpus/certificate-purpose-assessment.md \
--json-output output/corpus/certificate-purpose-assessment.json
```
### Consolidated report
2026-03-29 09:58:30 +00:00
```bash
.venv/bin/python ct_master_report.py \
--domains-file domains.local.txt \
--cache-ttl-seconds 0 \
--dns-cache-ttl-seconds 86400 \
--max-candidates-per-domain 10000 \
2026-03-29 09:58:30 +00:00
--markdown-output output/corpus/consolidated-corpus-report.md \
--latex-output output/corpus/consolidated-corpus-report.tex \
--pdf-output output/corpus/consolidated-corpus-report.pdf
```
2026-03-29 09:40:06 +00:00
### Historical lineage report
```bash
.venv/bin/python ct_lineage_report.py \
--domains-file domains.local.txt \
--cache-ttl-seconds 0 \
--max-candidates-per-domain 10000 \
--markdown-output output/corpus/certificate-lineage-report.md \
--latex-output output/corpus/certificate-lineage-report.tex \
--pdf-output output/corpus/certificate-lineage-report.pdf
```
### Full monograph
```bash
.venv/bin/python ct_monograph_report.py \
--domains-file domains.local.txt \
--cache-ttl-seconds 0 \
--dns-cache-ttl-seconds 86400 \
--max-candidates-per-domain 10000 \
--markdown-output output/corpus/monograph.md \
--latex-output output/corpus/monograph.tex \
--pdf-output output/corpus/monograph.pdf \
--appendix-markdown-output output/corpus/appendix-inventory.md \
--appendix-latex-output output/corpus/appendix-inventory.tex \
--appendix-pdf-output output/corpus/appendix-inventory.pdf
```
## Project Structure
- `ct_scan.py`: core CT scan, leaf verification, grouping, and detailed inventory report
- `ct_usage_assessment.py`: EKU and KeyUsage assessment
- `ct_lineage_report.py`: historical Subject CN, Subject DN, issuer, SAN, and issuance-burst analysis
- `ct_dns_utils.py`: DNS scanning and provider-signature logic
- `ct_master_report.py`: shorter consolidated report
- `ct_monograph_report.py`: publication-grade monograph with appendices
- `Makefile`: reproducible operator workflow
## Safety Against Silent Undercounts
The scanner checks the live raw identity-row count before it executes the capped query. If the configured cap is too low, it stops with an error instead of silently returning an incomplete corpus.
2026-03-29 09:40:06 +00:00
## Public Repo Rules
- keep `domains.local.txt` local only
- never commit `output/`
- never commit `.cache/`
- if you need a sample config in git, update `domains.example.txt`, not `domains.local.txt`